Monitoring and Alerting
Know the moment your service goes down, before your users have to tell you
Monitoring is not optional
A running service doesn't mean all is well. CPU pegged at 100%, a memory leak, a disk filling up: you won't know about any of it until users tell you "it's broken", and by then you're on the back foot.
Set up monitoring and alerting so that when something goes wrong, you know before your users do. That's how operations should be run.
😰 The service died overnight and nobody noticed
The server quietly went down on Friday evening after everyone left; on Monday morning the complaints poured in. You log in and find the disk full: a runaway log file ate all the space. With alerting in place, you would have been notified Friday night and fixed it in five minutes.
Prometheus + Grafana + alert notifications
Prometheus collects the metrics, Grafana turns them into dashboards, and alert rules push a notification to your phone or inbox the moment CPU spikes, memory runs low, or the service dies. The whole stack is open source and free, and deploys with a single Docker command.
Deploy the monitoring stack in one command
Use docker-compose to bring up Prometheus, Grafana, and Node Exporter together:
Deploy the monitoring stack
# Create the monitoring directory
mkdir -p ~/monitoring && cd ~/monitoring
# Create the config files (shown below)
# Then start everything with one command
docker compose up -d
# Open the Grafana dashboard
# http://your-server-ip:3001
# Log in as admin with the password set via GF_SECURITY_ADMIN_PASSWORD
# in docker-compose.yml (admin / admin if that variable is unset)
docker-compose.yml — monitoring stack
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: always
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: always
    ports:
      - "3001:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_secure_password

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: always
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: always
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  prometheus_data:
  grafana_data:
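Once the stack from the compose file above is up, it is worth confirming that each component actually responds before building dashboards. A quick sanity check, assuming the default ports above and that you run it on the server itself:

```shell
# All four containers should show as running
docker compose ps

# Prometheus liveness endpoint — should report that the server is healthy
curl -s http://localhost:9090/-/healthy

# Node Exporter should already be serving metrics
curl -s http://localhost:9100/metrics | head -n 3

# Grafana health check — returns a small JSON including database status
curl -s http://localhost:3001/api/health
```

If any of these fail, `docker compose logs <service>` usually points at the problem (a bad config mount is the most common cause).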
Prometheus configuration
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  # Prometheus monitoring itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Server system metrics via Node Exporter
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

  # The OpenClaw service (must be reachable from the Prometheus container,
  # i.e. on the same Docker network; otherwise use the host IP)
  - job_name: "openclaw"
    static_configs:
      - targets: ["openclaw:3000"]
    metrics_path: "/metrics"
    scrape_interval: 10s
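Before restarting Prometheus after editing this file, you can validate it with promtool, which ships inside the prom/prometheus image (the paths here match the volume mounts in the compose file above):

```shell
# Validate the main config — this also checks any rule_files it references
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Then apply the change by restarting the container
docker compose restart prometheus
```

A failed check prints the offending line, which beats discovering a typo only when the container refuses to start.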
Alert rules
alert_rules.yml — alert rules
groups:
  - name: openclaw_alerts
    rules:
      # CPU above 80% for 5 consecutive minutes
      - alert: HighCPU
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU has been above 80% for 5 minutes, currently {{ $value }}%"

      # Memory usage above 90%
      - alert: HighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Memory almost exhausted"
          description: "Memory usage is above 90%, currently {{ $value }}%"

      # Disk usage above 85%
      - alert: DiskAlmostFull
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space running low"
          description: "Root partition usage is above 85%, currently {{ $value }}%"

      # OpenClaw service is down
      - alert: OpenClawDown
        expr: up{job="openclaw"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OpenClaw service unavailable"
          description: "OpenClaw has not responded for over 1 minute"
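The DiskAlmostFull expression is simply "used fraction of the root filesystem, as a percentage". You can sanity-check what value the rule would see with plain df (a rough local equivalent; df and node_exporter may differ by a percent or two because of reserved blocks):

```shell
#!/bin/sh
# Rough local equivalent of the DiskAlmostFull expression:
# extract the root partition's usage percentage from df
usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')

# Apply the same 85% threshold the alert rule uses
if [ "$usage" -gt 85 ]; then
  echo "ALERT: root partition at ${usage}%"
else
  echo "OK: root partition at ${usage}%"
fi
```

`df --output=pcent` is GNU coreutils syntax; on BSD/macOS use `df -h /` and read the capacity column instead.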
Alert notification setup
How do you get notified when an alert fires? Email and Slack are supported natively (email_configs / slack_configs); DingTalk and WeCom bots expect their own payload format, so they are usually wired up through a small adapter (for DingTalk, e.g. prometheus-webhook-dingtalk) sitting between Alertmanager and the bot:
alertmanager.yml — notification configuration
global:
  resolve_timeout: 5m

route:
  group_by: ["alertname"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: "default"

receivers:
  - name: "default"
    email_configs:
      - to: "your-email@example.com"
        from: "alertmanager@example.com"
        smarthost: "smtp.example.com:587"
        auth_username: "alertmanager@example.com"
        auth_password: "your-email-password"
    webhook_configs:
      # DingTalk bot — point this at a webhook adapter such as
      # prometheus-webhook-dingtalk; posting Alertmanager's payload
      # directly to the DingTalk robot URL will not work
      - url: "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"
        send_resolved: true
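To verify the notification path end to end without waiting for a real incident, you can POST a synthetic alert straight to Alertmanager's v2 API (the alertname here is made up for the test; it will be routed through whichever receiver matches your route tree):

```shell
# Fire a fake alert — it should appear in the Alertmanager UI at
# http://localhost:9093 and trigger the configured email/webhook
# after group_wait elapses
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels": {"alertname": "ManualTest", "severity": "warning"},
        "annotations": {"summary": "Test notification from curl"}
      }]'
```

If nothing arrives, check `docker compose logs alertmanager` for SMTP or webhook errors before suspecting the rules themselves.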