Monitoring and Alerting

Know the moment your service goes down, before users have to tell you

Monitoring is not optional

A running service does not mean all is well. CPU pegged? You don't know. Memory leaking? You don't know. Disk full? You don't know. By the time users report "it's broken," you are already on the back foot.

Set up a monitoring and alerting system so that when a problem hits, you know before your users do. That is what proper operations looks like.

😰 The service died overnight and nobody noticed

The server quietly went down after hours on Friday; on Monday morning a pile of users reported "it's broken." You log in and find the disk is full: a runaway log file ate all the space. With monitoring and alerting in place, you would have gotten a notification on Friday night and fixed it in five minutes.

Prometheus + Grafana + alert notifications

Prometheus collects the metrics, Grafana turns them into dashboards, and alerting rules push a notification to your phone or inbox the moment CPU runs hot, memory runs low, or the service goes down. The whole stack is open source and free, and deploys with one Docker command.

Deploy the monitoring stack in one command

Bring up Prometheus, Grafana, and Node Exporter together with docker-compose:

Deploy the monitoring stack
# Create a directory for the monitoring stack
mkdir -p ~/monitoring && cd ~/monitoring

# Create the config files (see below)
# Then start everything in one command
docker compose up -d

# Open the Grafana dashboard
# http://your-server-ip:3001
# Login: admin / the password set via GF_SECURITY_ADMIN_PASSWORD below
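Once the containers are up, a quick sanity check against each component's health endpoint confirms the stack is actually serving (this assumes the default ports from the compose file below and a running deployment):

```shell
# Each endpoint should return HTTP 200 once the stack is healthy
curl -fsS http://localhost:9090/-/healthy        # Prometheus
curl -fsS http://localhost:9093/-/healthy        # Alertmanager
curl -fsS http://localhost:9100/metrics | head -n 3   # Node Exporter raw metrics
curl -fsS http://localhost:3001/api/health       # Grafana health JSON
```

If any of these fail, `docker compose logs <service>` is the first place to look.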
docker-compose.yml: monitoring stack
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: always
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: always
    ports:
      - "3001:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_secure_password

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: always
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: always
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  prometheus_data:
  grafana_data:

Prometheus configuration

prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  # Monitor Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Server-level system metrics
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

  # Monitor the OpenClaw service
  - job_name: "openclaw"
    static_configs:
      - targets: ["openclaw:3000"]
    metrics_path: "/metrics"
    scrape_interval: 10s
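Prometheus refuses to start on a config it cannot parse, so it is worth validating the file first. The promtool utility ships inside the prom/prometheus image, which means no extra install is needed (this assumes the config files live in ~/monitoring as set up above):

```shell
# Validate prometheus.yml (and the rule files it references)
# without starting a server
docker run --rm -v ~/monitoring:/config prom/prometheus:latest \
  promtool check config /config/prometheus.yml
```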

Alert rules

alert_rules.yml: alert rules
groups:
  - name: openclaw_alerts
    rules:
      # CPU usage above 80% for 5 minutes
      - alert: HighCPU
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU has been above 80% for 5 minutes, currently at {{ $value }}%"

      # Memory usage above 90%
      - alert: HighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Memory almost exhausted"
          description: "Memory usage is above 90%, currently at {{ $value }}%"

      # Disk usage above 85%
      - alert: DiskAlmostFull
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space running low"
          description: "Root partition usage is above 85%, currently at {{ $value }}%"

      # OpenClaw service is down
      - alert: OpenClawDown
        expr: up{job="openclaw"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OpenClaw service unavailable"
          description: "OpenClaw has not responded for over 1 minute"
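To see what the HighMemory expression actually measures, the same arithmetic can be reproduced from /proc/meminfo on any Linux host. This is a local sanity check of the formula, not part of the monitoring stack itself:

```shell
# (1 - MemAvailable / MemTotal) * 100, the same formula as the HighMemory alert
awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "%.1f\n", (1 - a/t) * 100}' /proc/meminfo
```

The printed percentage is what Prometheus compares against the 90% threshold.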

Alert notification setup

How do you get notified when an alert fires? By email, or pushed to DingTalk / WeCom / Slack:

alertmanager.yml: notification config
global:
  resolve_timeout: 5m

route:
  group_by: ["alertname"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: "default"

receivers:
  - name: "default"
    email_configs:
      - to: "your-email@example.com"
        from: "alertmanager@example.com"
        smarthost: "smtp.example.com:587"
        auth_username: "alertmanager@example.com"
        auth_password: "your-email-password"
    webhook_configs:
      # DingTalk bot
      - url: "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"
        send_resolved: true
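One caveat on the DingTalk entry above: Alertmanager posts its own webhook JSON, which DingTalk's robot API does not understand, so in practice a translation bridge such as prometheus-webhook-dingtalk usually sits in between. A sketch of how that could look; the service name, image, and port are assumptions based on that project's conventions, not part of the original setup:

```yaml
# Hypothetical addition to docker-compose.yml: a bridge that reformats
# Alertmanager webhooks into DingTalk's message format
  dingtalk-bridge:
    image: timonwong/prometheus-webhook-dingtalk:latest
    restart: always
    ports:
      - "8060:8060"

# alertmanager.yml would then point webhook_configs at the bridge
# instead of directly at oapi.dingtalk.com, e.g.:
#   webhook_configs:
#     - url: "http://dingtalk-bridge:8060/dingtalk/webhook1/send"
```

The bridge holds the actual DingTalk access token in its own config, keeping it out of alertmanager.yml.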