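Preface

"You can't improve what you can't measure." (attributed to Peter Drucker)

In production, discovering a server outage after 5 minutes versus receiving an alert within 5 seconds makes a world of difference to the business. A solid monitoring and alerting system is the operations team's eyes and ears. This article builds a modern monitoring stack from scratch on Prometheus + Grafana + Loki, covering metrics collection, visualization, alert notification, and log monitoring end to end.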


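1. Monitoring Overview: The Three Pillars

Modern observability rests on three pillars:

| Pillar | Meaning | Representative tools |
|---|---|---|
| Metrics | Aggregatable numeric time-series data | Prometheus, InfluxDB, VictoriaMetrics |
| Logs | Discrete event records | Loki, ELK Stack, ClickHouse |
| Traces | The full path of a request through a distributed system | Jaeger, Zipkin, Tempo |

The three complement each other:

  • Metrics tell you what happened (CPU spiked to 95%)

  • Logs tell you why it happened (the OOM killer terminated a process)

  • Traces tell you where it happened (which microservice call timed out)

This article focuses on the first two: metrics monitoring (the Prometheus ecosystem) and log monitoring (Loki).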


2. Prometheus: The Time-Series Engine

2.1 Architecture Overview

                  ┌──────────────┐
                  │ Alertmanager │
                  └──────┬───────┘
                         │ push alerts
┌───────────────┐  ┌─────┴──────┐  ┌──────────────┐
│ node_exporter │─▶│ Prometheus │◀─│ App Exporter │
└───────────────┘  └─────┬──────┘  └──────────────┘
                         │
                  ┌──────┴───────┐
                  │   Grafana    │
                  └──────────────┘

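Prometheus uses a pull model: it actively scrapes metrics from its targets over HTTP instead of passively receiving pushed data. This pairs naturally with service discovery (Prometheus decides what to scrape), turns a failed scrape into an immediate signal that a target is down, and keeps the overall architecture simpler.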
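
To make the pull model concrete, here is a minimal Python sketch of what a scrape target looks like from the target's side: an HTTP endpoint that renders its current values in the Prometheus text exposition format whenever it is asked. The metric name `demo_requests_total`, port 8000, and the helper names are assumptions made for illustration; a real service would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real application would update these as it works.
COUNTERS = {"demo_requests_total": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics; everything else is a 404."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, waiting for Prometheus to scrape http://<host>:<port>/metrics."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; the target never needs to know where Prometheus lives, which is the main practical difference from a push-based design.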

2.2 Installation and Deployment

Option 1: binary install (recommended for production)

# Download Prometheus
PROM_VERSION="2.52.0"
wget https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz
tar xzf prometheus-${PROM_VERSION}.linux-amd64.tar.gz
sudo mv prometheus-${PROM_VERSION}.linux-amd64/{prometheus,promtool} /usr/local/bin/
sudo mkdir -p /etc/prometheus /var/lib/prometheus

# Create a dedicated user and give it ownership of the config and data directories
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

Configuration file /etc/prometheus/prometheus.yml

global:
  scrape_interval: 15s          # global scrape interval
  evaluation_interval: 15s      # rule evaluation interval
  scrape_timeout: 10s           # scrape timeout (must not exceed scrape_interval)

# Alerting rule files
rule_files:
  - /etc/prometheus/rules/*.yml

# Alertmanager address
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Scrape targets
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets:
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
          - '192.168.1.12:9100'
        labels:
          env: production
          region: cn-east

  # File-based service discovery (add targets dynamically)
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
        refresh_interval: 30s
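
The file-based discovery job above reads JSON files that list target groups. As a rough sketch of how such a file could be generated, the Python below builds one target group and serializes it; the host addresses, the `staging` label, and the function names are hypothetical and only illustrate the file format.

```python
import json


def build_target_groups(hosts, port=9100, labels=None):
    """Return a file_sd document: a list of target groups."""
    targets = [f"{host}:{port}" for host in hosts]
    return [{"targets": targets, "labels": dict(labels or {})}]


def render_file_sd(groups):
    """Serialize target groups to the JSON text that file_sd_configs reads."""
    return json.dumps(groups, indent=2, sort_keys=True)


if __name__ == "__main__":
    groups = build_target_groups(
        ["192.168.1.20", "192.168.1.21"],  # hypothetical new hosts
        labels={"env": "staging"},
    )
    print(render_file_sd(groups))
```

Saving the output as a file such as /etc/prometheus/targets/nodes.json is enough for Prometheus to pick up the new targets without a reload; writing to a temporary file and renaming it avoids Prometheus reading a half-written file.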

Create the systemd service:

# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring
After=network-online.target
Wants=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --web.enable-lifecycle \
  --web.enable-admin-api
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Reload systemd and start the service:

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus

Option 2: Docker Compose (quick setup)

# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.52.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

volumes:
  prometheus_data:

2.3 PromQL in Practice

PromQL is the Prometheus query language, and knowing it well is the key to using Prometheus effectively.

Basic queries:

# CPU utilization (averaged across all cores)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilization
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk utilization
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# Inbound network throughput (MiB/s)
rate(node_network_receive_bytes_total{device="eth0"}[5m]) / 1024 / 1024

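Advanced techniques:

# Increase: total bytes written to disk over the past hour
increase(node_disk_written_bytes_total[1h])

# Prediction: extrapolate the last 6 hours to estimate free disk space 24 hours from now
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600)

# Ranking: the 5 instances with the highest CPU utilization
topk(5, 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))

# Comparison: current load minus the load at the same time yesterday
node_load1 - node_load1 offset 1d

# Aggregation: HTTP request rate summed by environment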
sum by(env) (rate(http_requests_total[5m]))
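
Most of the queries above lean on rate() and increase(), so it helps to see roughly what they compute. The Python sketch below is a simplified model, not the Prometheus implementation: it handles counter resets but omits the extrapolation to the window boundaries that Prometheus also applies. The sample data and function names are made up for illustration.

```python
def counter_increase(samples):
    """Total increase over (timestamp, value) samples, treating a drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter only goes up; a lower value means the process restarted from zero.
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Per-second average increase between the first and last sample in the window."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


if __name__ == "__main__":
    # Scraped every 15s; the counter resets between t=15 and t=30.
    window = [(0, 100), (15, 130), (30, 10), (45, 40)]
    print(counter_increase(window), simple_rate(window))
```

The reset handling is why rate() should be applied to the raw counter before any aggregation: summing first would hide the reset of an individual series.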

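Function quick reference:

| Function | Purpose | Example |
|---|---|---|
| rate() | Per-second average rate of increase | rate(http_requests_total[5m]) |
| increase() | Total increase over a range | increase(counter[1h]) |
| irate() | Instantaneous rate from the last two samples (more sensitive) | irate(http_requests_total[5m]) |
| avg_over_time() | Average over a range | avg_over_time(temp[1h]) |
| histogram_quantile() | Quantile estimation | histogram_quantile(0.99, ...) |
| predict_linear() | Linear prediction | predict_linear(disk_free[6h], 86400) |
| label_replace() | Label rewriting | label_replace(up, "host", "$1", "instance", "(.*):.*") |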


3. node_exporter: Host Metrics Collection

3.1 Installation and Configuration

# Download and install
NODE_VERSION="1.8.0"
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_VERSION}/node_exporter-${NODE_VERSION}.linux-amd64.tar.gz
tar xzf node_exporter-${NODE_VERSION}.linux-amd64.tar.gz
sudo mv node_exporter-${NODE_VERSION}.linux-amd64/node_exporter /usr/local/bin/

# Create the systemd service
sudo useradd --no-create-home --shell /bin/false node_exporter
cat <<'EOF' | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.tcpstat \
  --web.listen-address=:9100
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

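3.2 Key Collectors

node_exporter enables most collectors by default. The systemd, processes, and tcpstat collectors below are disabled by default and need an explicit flag; textfile is enabled by default but only reads files once a directory is configured:

| Collector | What it collects | How to enable |
|---|---|---|
| systemd | systemd unit states | --collector.systemd |
| processes | Process state statistics | --collector.processes |
| tcpstat | TCP connection states | --collector.tcpstat |
| textfile | Custom metric files | --collector.textfile.directory=/var/lib/node_exporter/textfile |

Custom metric example (textfile collector):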

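# Create the textfile directory (node_exporter must be started with
# --collector.textfile.directory=/var/lib/node_exporter/textfile)
sudo mkdir -p /var/lib/node_exporter/textfile

# Script that periodically regenerates the custom metric file
cat <<'EOF' | sudo tee /usr/local/bin/custom_metrics.sh > /dev/null
#!/bin/bash
# Number of packages with pending updates (all upgradable packages, not only security fixes)
UPDATES=$(apt list --upgradable 2>/dev/null | grep -c upgradable)
# Write to a temp file and rename it so node_exporter never reads a partial file
OUT=/var/lib/node_exporter/textfile/pending_updates.prom
echo "node_pending_updates ${UPDATES}" > "${OUT}.$$" && mv "${OUT}.$$" "${OUT}"
EOF
sudo chmod +x /usr/local/bin/custom_metrics.sh

# Run it hourly via cron
echo "0 * * * * root /usr/local/bin/custom_metrics.sh" | sudo tee /etc/cron.d/custom_metrics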
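
The same textfile mechanism works from any language. As a sketch under stated assumptions, the Python below formats one gauge with HELP and TYPE lines and writes it atomically; the metric name `demo_backup_last_success_timestamp_seconds` and the function names are hypothetical, and the directory is whatever was passed to --collector.textfile.directory.

```python
import os
import tempfile


def format_metric(name, value, help_text, metric_type="gauge"):
    """Return one metric in the text exposition format, with HELP and TYPE lines."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name} {value}\n"
    )


def write_prom_file(directory, filename, content):
    """Write content to directory/filename via a temp file and an atomic rename."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # mkstemp creates the file with mode 0600; node_exporter usually runs as another user.
    os.chmod(tmp_path, 0o644)
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()  # stand-in for the real textfile directory
    text = format_metric(
        "demo_backup_last_success_timestamp_seconds",  # hypothetical metric
        1700000000,
        "Unix time of the last successful backup.",
    )
    print(write_prom_file(demo_dir, "backup.prom", text))
```

The temporary file ends in .tmp, so the collector, which only reads *.prom files, ignores it until the rename completes.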

3.3 Common Host Metrics Quick Reference

# CPU
node_cpu_seconds_total              # CPU time (per mode)
node_load1 / node_load5 / node_load15  # system load averages

# Memory
node_memory_MemTotal_bytes          # total memory
node_memory_MemAvailable_bytes      # available memory
node_memory_Buffers_bytes           # buffers
node_memory_Cached_bytes            # page cache

# Disk
node_filesystem_size_bytes          # filesystem size
node_filesystem_avail_bytes         # available space
node_disk_io_time_seconds_total     # time spent doing disk I/O
node_disk_read_bytes_total          # bytes read

# Network
node_network_receive_bytes_total    # bytes received
node_network_transmit_bytes_total   # bytes transmitted
node_network_receive_errs_total     # receive errors

# System
node_time_seconds                   # current timestamp
node_boot_time_seconds              # boot timestamp
node_filefd_allocated               # allocated file descriptors

4. Grafana: Data Visualization

4.1 Installation

# Debian/Ubuntu
sudo apt-get install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana
sudo systemctl enable --now grafana-server

Grafana listens on http://localhost:3000 by default. The initial credentials are admin / admin; change the password at first login.

4.2 Adding the Prometheus Data Source

# Add the data source through the API
curl -s -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }'

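4.3 Importing Community Dashboards

The Grafana community publishes a large number of ready-made dashboards. A few that are essential for operations work:

# Node Exporter Full (ID: 1860): full host monitoring overview
# The import API needs the dashboard JSON itself, so download it from grafana.com first
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download -o node-exporter-full.json
jq '{dashboard: ., overwrite: true, inputs: [{name: "DS_PROMETHEUS", type: "datasource", pluginId: "prometheus", value: "Prometheus"}]}' node-exporter-full.json \
  | curl -s -X POST http://admin:admin@localhost:3000/api/dashboards/import \
      -H "Content-Type: application/json" \
      -d @-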

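| Dashboard ID | Name | Purpose |
|---|---|---|
| 1860 | Node Exporter Full | Full host monitoring overview |
| 3662 | Prometheus 2.0 Overview | Monitoring Prometheus itself |
| 12006 | Kafka Exporter | Kafka monitoring |
| 763 | Redis Dashboard | Redis monitoring |
| 12740 | Kubernetes Pods | Kubernetes Pod monitoring |

Confirm on grafana.com that each ID still matches the named dashboard before importing.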

4.4 Example JSON for a Custom Dashboard Panel

{
  "panels": [
    {
      "title": "CPU Utilization",
      "type": "timeseries",
      "targets": [
        {
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{instance}}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "min": 0,
          "max": 100,
          "thresholds": {
            "steps": [
              { "value": null, "color": "green" },
              { "value": 70, "color": "yellow" },
              { "value": 90, "color": "red" }
            ]
          }
        }
      }
    }
  ]
}

5. Writing Alerting Rules

5.1 Installing Alertmanager

AM_VERSION="0.27.0"
wget https://github.com/prometheus/alertmanager/releases/download/v${AM_VERSION}/alertmanager-${AM_VERSION}.linux-amd64.tar.gz
tar xzf alertmanager-${AM_VERSION}.linux-amd64.tar.gz
sudo mv alertmanager-${AM_VERSION}.linux-amd64/{alertmanager,amtool} /usr/local/bin/
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager

Alertmanager configuration /etc/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: 'your-password'
  smtp_require_tls: true

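# Alert routing
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'instance']
  group_wait: 30s          # wait before the first notification for a new group (allows batching)
  group_interval: 5m       # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h      # wait before re-sending a notification that is still firing
  routes:
    # Critical alerts take the emergency channel
    - match:
        severity: critical
      receiver: 'critical-receiver'
      group_wait: 10s
      repeat_interval: 1h
    # Warning alerts
    - match:
        severity: warning
      receiver: 'warning-receiver'
      repeat_interval: 4h

# Inhibition: while a critical alert fires, mute warning alerts with the same alertname and instance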
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

# Receivers
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops-team@example.com'

  - name: 'critical-receiver'
    email_configs:
      - to: 'ops-emergency@example.com'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/ops/send'
        send_resolved: true

  - name: 'warning-receiver'
    email_configs:
      - to: 'ops-team@example.com'
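
To see how the routing tree above picks a receiver, here is a deliberately simplified Python model. It only covers exact-match routes where the first matching child wins and the parent's receiver is the fallback; it ignores `continue`, regex matchers, grouping, and timing, all of which the real Alertmanager handles. The data structure and function names are assumptions for illustration.

```python
# The route tree from alertmanager.yml, reduced to the fields this sketch uses.
ROUTE_TREE = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-receiver"},
        {"match": {"severity": "warning"}, "receiver": "warning-receiver"},
    ],
}


def matches(route, labels):
    """True when every label in the route's match block equals the alert's label."""
    return all(labels.get(key) == value for key, value in route.get("match", {}).items())


def pick_receiver(route, labels):
    """Depth-first walk: the first matching child wins, else the current node's receiver."""
    for child in route.get("routes", []):
        if matches(child, labels):
            return pick_receiver(child, labels)
    return route["receiver"]


if __name__ == "__main__":
    for alert in (
        {"alertname": "InstanceDown", "severity": "critical"},
        {"alertname": "HighCpuUsage", "severity": "warning"},
        {"alertname": "Heartbeat", "severity": "info"},
    ):
        print(alert["alertname"], "->", pick_receiver(ROUTE_TREE, alert))
```

An alert whose labels match no child route falls through to the top-level receiver, which is why the root route must always name one.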

5.2 Alerting Rule Files

Create /etc/prometheus/rules/node_alerts.yml

groups:
  - name: node_alerts
    rules:
      # High CPU utilization
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU utilization (instance {{ $labels.instance }})"
          description: "CPU utilization has been above 85% for 5 minutes; current value {{ $value | printf \"%.1f\" }}%."

      # High memory utilization
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory utilization (instance {{ $labels.instance }})"
          description: "Memory utilization is above 90%; current value {{ $value | printf \"%.1f\" }}%. Only {{ with printf \"node_memory_MemAvailable_bytes{instance='%s'}\" $labels.instance | query }}{{ . | first | value | humanize1024 }}{{ end }} of memory remains available."

      # Low disk space
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space ({{ $labels.instance }}:{{ $labels.mountpoint }})"
          description: "Disk {{ $labels.mountpoint }} is {{ $value | printf \"%.1f\" }}% full."

      # Disk predicted to fill within 24 hours
      - alert: DiskWillFillIn24h
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Disk predicted to fill within 24 hours ({{ $labels.instance }}:{{ $labels.mountpoint }})"

      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance down ({{ $labels.instance }})"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      # System reboot
      - alert: SystemReboot
        expr: changes(node_boot_time_seconds[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "System rebooted ({{ $labels.instance }})"

  - name: application_alerts
    rules:
      # HTTP 5xx error rate
      - alert: HighHttp5xxRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate above 5%"
          description: "Current 5xx error rate: {{ $value | humanizePercentage }}."

      # High request latency
      - alert: HighRequestLatency
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 request latency above 2 seconds"
          description: "Current P99 latency: {{ $value | printf \"%.2f\" }} seconds."

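5.3 Validating Rules and Hot Reloading

# Validate rule syntax
promtool check rules /etc/prometheus/rules/*.yml

# Hot-reload the configuration without restarting Prometheus (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Show the current alert state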
curl -s http://localhost:9090/api/v1/alerts | jq .

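6. Notification Channels

6.1 Email Notifications

Email is already configured in Alertmanager (see 5.1); just make sure the SMTP settings are correct.

Send a test alert:

# Fire a test alert with amtool
amtool alert add alertname=TestAlert severity=warning instance=localhost:9090 \
  --alertmanager.url=http://localhost:9093 \
  --annotation=summary="Test alert" \
  --annotation=description="This is a test alert"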

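6.2 DingTalk Notifications

Use prometheus-webhook-dingtalk as the middleware:

# Docker deployment
# Note: --ding.profile is the 1.x-style flag; 2.x releases read their targets from a config file such as the one shown further down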
docker run -d --name dingtalk \
  -p 8060:8060 \
  timonwong/prometheus-webhook-dingtalk:latest \
  --ding.profile="ops=https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"

Custom DingTalk configuration /etc/webhook-dingtalk/config.yml

templates:
  - /etc/webhook-dingtalk/template.tmpl
targets:
  ops:
    url: https://oapi.dingtalk.com/robot/send?access_token=YOUR_ACCESS_TOKEN
    message:
      title: '{{ template "ding.link.title" . }}'
      text: '{{ template "ding.link.content" . }}'

Custom message template template.tmpl

{{ define "ding.link.title" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}{{ end }}

{{ define "ding.link.content" }}
## {{ .CommonAnnotations.summary }}

**Alert details:**
{{ range .Alerts }}
- **Instance**: {{ .Labels.instance }}
- **Severity**: {{ .Labels.severity }}
- **Description**: {{ .Annotations.description }}
- **Started at**: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ if .EndsAt }}- **Resolved at**: {{ .EndsAt.Format "2006-01-02 15:04:05" }}{{ end }}
{{ end }}

> Alert raised by Prometheus + Alertmanager
{{ end }}

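6.3 Feishu Notifications

Send alerts to a Feishu group through a webhook:

# Use an Alertmanager webhook adapter
# First create a custom bot in the Feishu group and copy its webhook URL
# https://open.feishu.cn/document/client-docs/bot-v3/add-custom-bot

Python adapter script feishu_webhook.py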

#!/usr/bin/env python3
"""Alertmanager → Feishu webhook adapter"""

from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

FEISHU_WEBHOOK = "https://open.feishu.cn/open-apis/bot/v2/hook/YOUR_TOKEN"

@app.route('/feishu/send', methods=['POST'])
def send_to_feishu():
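    # Alertmanager posts JSON; fall back to an empty payload if the body is missing or invalid
    data = request.get_json(silent=True) or {}
    alerts = data.get('alerts', [])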
    
    for alert in alerts:
        status = alert.get('status', 'unknown')
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})
        
        # Build the Feishu message card
        color = "red" if status == "firing" else "green"
        icon = "🔥" if status == "firing" else "✅"
        
        card = {
            "msg_type": "interactive",
            "card": {
                "header": {
                    "title": {
                        "tag": "plain_text",
                        "content": f"{icon} [{status.upper()}] {labels.get('alertname', 'Unknown Alert')}"
                    },
                    "template": color
                },
                "elements": [
                    {
                        "tag": "div",
                        "fields": [
                            {"is_short": True, "text": {"tag": "lark_md", "content": f"**Instance:** {labels.get('instance', '-')}"}},
                            {"is_short": True, "text": {"tag": "lark_md", "content": f"**Severity:** {labels.get('severity', '-')}"}},
                            {"is_short": False, "text": {"tag": "lark_md", "content": f"**Description:** {annotations.get('description', '-')}"}},
                            {"is_short": False, "text": {"tag": "lark_md", "content": f"**Time:** {alert.get('startsAt', '-')}"}}
                        ]
                    }
                ]
            }
        }
        
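        # Forward the card; raise on HTTP errors so failures show up in the logs
        resp = requests.post(FEISHU_WEBHOOK, json=card, timeout=5)
        resp.raise_for_status()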
    
    return jsonify({"status": "ok"})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001)

Add the webhook in Alertmanager:

receivers:
  - name: 'feishu-receiver'
    webhook_configs:
      - url: 'http://localhost:5001/feishu/send'
        send_resolved: true

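6.4 WeCom (WeChat Work) Notifications

# Use the prometheus-webhook-wechat adapter (alternatively, Alertmanager can send to WeCom natively via wechat_configs)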
docker run -d --name wechat-webhook \
  -p 5002:5002 \
  -e WECOM_CORP_ID=YOUR_CORP_ID \
  -e WECOM_SECRET=YOUR_SECRET \
  -e WECOM_AGENT_ID=YOUR_AGENT_ID \
  prometheus-webhook-wechat

7. Common Monitoring Queries Quick Reference

7.1 Host-Level Metrics

# CPU utilization
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Share of CPU time per mode
avg by(mode) (rate(node_cpu_seconds_total[5m])) * 100

# Memory utilization
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Swap utilization
(1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) * 100

# System load (1/5/15 minutes)
node_load1
node_load5
node_load15

# Disk I/O utilization (fraction of time the device was busy, 0 to 1)
rate(node_disk_io_time_seconds_total[5m])

# Disk read/write throughput (MiB/s)
rate(node_disk_read_bytes_total[5m]) / 1024 / 1024
rate(node_disk_written_bytes_total[5m]) / 1024 / 1024

# Network bandwidth (Mbit/s)
rate(node_network_receive_bytes_total{device="eth0"}[5m]) * 8 / 1024 / 1024
rate(node_network_transmit_bytes_total{device="eth0"}[5m]) * 8 / 1024 / 1024

# TCP connection states
node_netstat_Tcp_CurrEstab
node_tcp_connection_states

# Open file descriptors
node_filefd_allocated

# Context switch rate
rate(node_context_switches_total[5m])

7.2 Application-Level Metrics (HTTP Service Example)

# QPS (requests per second)
sum(rate(http_requests_total[5m]))

# Breakdown by status code
sum by(status) (rate(http_requests_total[5m]))

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# P50/P95/P99 latency
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
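
The percentile queries above are estimates, not exact values, because a histogram only stores cumulative counts per bucket. The Python sketch below shows the underlying idea, linear interpolation inside the bucket that contains the target rank; it is a simplified model with made-up bucket data, and the function name `estimate_quantile` is an assumption, not a Prometheus API.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with an upper bound of float('inf'), like the `le` label.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # 100 requests: 50 finished within 0.1s, 90 within 0.5s, 99 within 1s.
    latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
    for q in (0.50, 0.95, 0.99):
        print(q, estimate_quantile(q, latency_buckets))
```

Because of the even-spread assumption, accuracy depends on bucket boundaries: an alert threshold such as 2 seconds is only meaningful if a bucket boundary sits near 2 seconds.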

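7.3 Database Metrics (MySQL / Redis)

# MySQL slow queries per second
rate(mysql_global_status_slow_queries[5m])

# MySQL connection utilization (connected threads / max_connections)
mysql_global_status_threads_connected / mysql_global_variables_max_connections

# Redis cache hit ratio over the last 5 minutes
rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))

# Redis memory utilization (only meaningful when maxmemory is set)
redis_memory_used_bytes / redis_memory_max_bytes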

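8. Log Monitoring: Loki

8.1 Why Loki?

| Feature | Loki | ELK |
|---|---|---|
| Indexing | Labels only | Full-text index |
| Storage cost | Low (roughly 10x less) | Higher |
| Query language | LogQL | KQL / Lucene |
| Best suited for | Label-based retrieval | Complex full-text search |
| Prometheus integration | Native | Needs extra configuration |

Loki's core idea is to treat logs the way Prometheus treats metrics: index only the metadata (labels), never the log content itself.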
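
The trade-off behind label-only indexing can be illustrated with a toy model. The Python sketch below is not how Loki is implemented; it only shows the principle that selecting streams by label is an index lookup, while filtering on content is a brute-force scan over the streams that matched. The class name, labels, and log lines are all invented for the example.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy model: label sets are indexed; log lines are stored without any index."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines
        self.streams = defaultdict(list)

    def push(self, labels, line):
        """Append a line to the stream identified by its label set."""
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        """Select streams by label, then scan their lines for a substring."""
        wanted = set(selector.items())
        results = []
        for label_set, lines in self.streams.items():
            if wanted <= label_set:  # cheap: decided from labels alone
                # expensive: every line in the matching streams is scanned
                results.extend(line for line in lines if contains in line)
        return results


if __name__ == "__main__":
    store = TinyLogStore()
    store.push({"job": "myapp", "level": "error"}, "connection refused to db-1")
    store.push({"job": "myapp", "level": "info"}, "request served in 12ms")
    store.push({"job": "nginx", "level": "error"}, "upstream timed out")
    print(store.query({"job": "myapp"}, contains="refused"))
```

This is also why label design matters: a narrow label selector keeps the scanned volume small, while high-cardinality labels (such as a request ID) would explode the number of streams.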

8.2 Deploying with Docker Compose

# docker-compose-loki.yml
version: '3.8'
services:
  loki:
    image: grafana/loki:3.0.0
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/config.yml
      - loki_data:/loki
    command: -config.file=/etc/loki/config.yml
    restart: unless-stopped

  promtail:
    image: grafana/promtail:3.0.0
    container_name: promtail
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
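    command: -config.file=/etc/promtail/config.yml -config.expand-env=true  # expand-env lets the config reference environment variables such as HOSTNAME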
    restart: unless-stopped

volumes:
  loki_data:

Loki configuration loki-config.yml

auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 30d          # keep logs for 30 days
  max_query_length: 721h
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: filesystem

Promtail configuration promtail-config.yml

server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # System logs
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          host: ${HOSTNAME}
          __path__: /var/log/syslog

  # Application logs
  - job_name: app-logs
    static_configs:
      - targets: [localhost]
        labels:
          job: myapp
          host: ${HOSTNAME}
          __path__: /var/log/myapp/*.log
    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            msg: message
            trace_id: trace_id
            time: time        # extracted so the timestamp stage below has a source
      - labels:
          level:
      - timestamp:
          source: time
          format: RFC3339Nano

  # Docker container logs
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*.log
    pipeline_stages:
      - json:
          expressions:
            log: log
            stream: stream
            container_name: attrs.container_name
      - labels:
          stream:
          container_name:
      - output:
          source: log

8.3 LogQL Query Syntax

# Basics: select logs by label
{job="myapp"}

# Line filters
{job="myapp"} |= "error"
{job="myapp"} |~ "timeout|connection refused"
{job="myapp"} !~ "health_check"

# Parse JSON logs and filter by level
{job="myapp"} | json | level="error"

# Extract fields with the pattern parser
{job="myapp"} | pattern `<ip> - - [<_>] "<method> <path> <_>" <status> <size>`

# Count: error log lines per minute
count_over_time({job="myapp"} |= "error" [1m])

# Count: distribution by status code over the past hour
sum by(status) (count_over_time({job="myapp"} | json | __error__="" [1h]))

# Top 10 error messages (the json parser exposes the "message" key as the label message)
topk(10,
  sum by(message) (count_over_time({job="myapp"} | json | level="error" [1h]))
)

# Rate: error log lines per second
rate({job="myapp"} |= "error" [5m])

8.4 Configuring the Loki Data Source in Grafana

# Add it through the API
curl -s -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Loki",
    "type": "loki",
    "url": "http://localhost:3100",
    "access": "proxy"
  }'

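In Grafana's Explore view you can type LogQL queries directly and correlate them with Prometheus metrics, for example by clicking a point in time on a metric graph to jump to the logs for the same time range.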


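9. Monitoring Best Practices

9.1 Layered Monitoring Strategy

┌──────────────────────────────────────────────────────────────────┐
│  L1: Infrastructure     (CPU / memory / disk / network)          │  ← node_exporter
├──────────────────────────────────────────────────────────────────┤
│  L2: Platform services  (DB / cache / message queue / K8s)       │  ← component exporters
├──────────────────────────────────────────────────────────────────┤
│  L3: Application        (QPS / latency / errors / business KPIs) │  ← application instrumentation
├──────────────────────────────────────────────────────────────────┤
│  L4: Business metrics   (orders / conversion / active users)     │  ← custom metrics
└──────────────────────────────────────────────────────────────────┘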

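9.2 Alert Severity Levels and Response

| Level | Meaning | Response time | Notification channel |
|---|---|---|---|
| P0 Critical | Service unavailable, users affected | Within 5 minutes | Phone call + Feishu/DingTalk |
| P1 Warning | Degraded performance, may escalate | Within 30 minutes | Feishu/DingTalk + email |
| P2 Info | Needs attention, no impact yet | Next business day | Email / ticket |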

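9.3 Avoiding Alert Fatigue

# Traits of a good alerting rule:
# 1. An explicit `for` duration (avoids false alarms from momentary spikes)
- alert: HighCpuUsage
  expr: ... > 85
  for: 5m          # ← fires only after the condition has held for 5 minutes

# 2. Use inhibit_rules to suppress lower-severity alerts
# 3. Set group_by and group_wait sensibly to batch alerts
# 4. Use repeat_interval to control how often notifications repeat
# 5. Review alerting rules regularly and delete "cry wolf" alerts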
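
The effect of the `for` clause in the first point is easiest to see as a small state machine. The Python sketch below is a simplified model of one alert series, assuming a fixed evaluation interval; the function name and the input sequence are hypothetical.

```python
def alert_states(condition_by_eval, for_seconds, eval_interval=15):
    """Return the alert state after each rule evaluation: inactive, pending, or firing."""
    states = []
    active_since = None  # time at which the condition most recently became true
    for index, condition_true in enumerate(condition_by_eval):
        now = index * eval_interval
        if not condition_true:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # A brief spike, a recovery, then a sustained problem (evaluated every 15s, for: 30s).
    observed = [True, True, False, True, True, True]
    print(alert_states(observed, for_seconds=30))
```

Only firing alerts are sent to Alertmanager, so a spike that clears before the `for` duration elapses never produces a notification.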

9.4 Capacity Planning

# Disk capacity forecast (linear regression): free space expected 30 days from now
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[7d], 30*24*3600)

# Traffic trend
predict_linear(rate(node_network_receive_bytes_total[1h])[7d:1h], 30*24*3600)

# Prometheus storage estimate
# Roughly 2 bytes per sample; samples per day = active_series × (86400 / scrape_interval)
prometheus_tsdb_head_series          # current number of active time series
rate(prometheus_tsdb_head_chunks_created_total[1h])  # chunks created per second, averaged over 1h
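
The storage rule of thumb above is simple arithmetic, worked through here in Python. The figure of 2 bytes per sample comes from the estimate in this section; the series count, scrape interval, and retention in the example are made-up inputs, and real usage also depends on churn and index overhead.

```python
def estimate_tsdb_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough on-disk size: series × samples per day × bytes per sample × days retained."""
    samples_per_day = active_series * (86400 / scrape_interval_s)
    return samples_per_day * bytes_per_sample * retention_days


if __name__ == "__main__":
    # Hypothetical deployment: 100k active series, 15s scrape interval, 30d retention.
    size = estimate_tsdb_bytes(active_series=100_000, scrape_interval_s=15, retention_days=30)
    print(f"{size / 1e9:.2f} GB")  # prints "34.56 GB"
```

With these inputs the estimate stays below the 50GB size limit set in the systemd unit, which leaves headroom for growth in the number of series.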

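9.5 Prometheus High Availability

For production, run two Prometheus instances for redundancy:

# Both Prometheus instances scrape the same targets
# An Alertmanager cluster deduplicates the resulting notifications
# Clustering is configured with command-line flags, not in alertmanager.yml
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-2:9094

Long-term storage options:

  • Thanos: a high-availability extension for Prometheus with global queries and object storage

  • VictoriaMetrics: a high-performance time-series database compatible with the Prometheus protocols

  • Cortex: a horizontally scalable Prometheus backend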

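9.6 Monitoring the Monitoring Stack

The monitoring system itself also needs to be monitored:

# Prometheus itself
prometheus_config_last_reload_successful                    # whether the last config reload succeeded
prometheus_tsdb_compactions_failed_total                    # failed compactions
prometheus_rule_evaluation_duration_seconds                 # rule evaluation duration

# Exporter health
up                                                          # whether the target is reachable

# Alertmanager
alertmanager_notifications_total                            # notifications sent
alertmanager_notifications_failed_total                     # notifications that failed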

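10. Troubleshooting Common Problems

Q1: Prometheus scrapes time out

# Check target status
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down") | {instance: .labels.instance, lastError}'

# Raise the timeout
# Increase scrape_timeout in prometheus.yml (default 10s); it must not exceed scrape_interval
scrape_timeout: 15s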

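Q2: An alert does not fire

# Check that the rule files are valid
promtool check rules /etc/prometheus/rules/*.yml

# In the Prometheus web UI, open Status → Rules to see the loaded rules
# Open Alerts to see the current alert state

# Test routing with amtool
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml alertname=HighCpuUsage severity=warning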

Q3: Grafana panels show no data

# Verify that Prometheus answers queries
curl -s "http://localhost:9090/api/v1/query?query=up"

# Run the same query at an explicit timestamp
curl -s "http://localhost:9090/api/v1/query?query=up&time=$(date +%s)"

# Test the data source through the Grafana proxy API
curl -s "http://admin:admin@localhost:3000/api/datasources/proxy/1/api/v1/query?query=up"

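Appendix: One-Step Deployment Script

The components above can be bundled into a single deployment script:

#!/bin/bash
# monitor-stack-deploy.sh - deploy the monitoring stack in one step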

set -euo pipefail

PROM_VERSION="2.52.0"
NODE_EXP_VERSION="1.8.0"
AM_VERSION="0.27.0"
# Grafana is installed from the apt repository (latest stable), so no version is pinned here

echo "=== 1. Install Prometheus ==="
wget -q https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz
tar xzf prometheus-${PROM_VERSION}.linux-amd64.tar.gz
sudo mv prometheus-${PROM_VERSION}.linux-amd64/{prometheus,promtool} /usr/local/bin/
sudo mkdir -p /etc/prometheus/rules /var/lib/prometheus

echo "=== 2. Install Node Exporter ==="
wget -q https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXP_VERSION}/node_exporter-${NODE_EXP_VERSION}.linux-amd64.tar.gz
tar xzf node_exporter-${NODE_EXP_VERSION}.linux-amd64.tar.gz
sudo mv node_exporter-${NODE_EXP_VERSION}.linux-amd64/node_exporter /usr/local/bin/

echo "=== 3. Install Alertmanager ==="
wget -q https://github.com/prometheus/alertmanager/releases/download/v${AM_VERSION}/alertmanager-${AM_VERSION}.linux-amd64.tar.gz
tar xzf alertmanager-${AM_VERSION}.linux-amd64.tar.gz
sudo mv alertmanager-${AM_VERSION}.linux-amd64/{alertmanager,amtool} /usr/local/bin/
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager

echo "=== 4. Install Grafana ==="
sudo apt-get install -y apt-transport-https
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana

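echo "=== 5. Start services ==="
# Assumes the service users, systemd unit files, and configuration files from
# sections 2 to 5 are already in place; this script only installs the binaries.
# An alertmanager.service unit is not shown in this article and must be written
# along the same lines as the Prometheus one.
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus node_exporter alertmanager grafana-server

echo "=== Deployment complete ==="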
echo "Prometheus:    http://localhost:9090"
echo "Alertmanager:  http://localhost:9093"
echo "Grafana:       http://localhost:3000 (admin/admin)"
echo "Node Exporter: http://localhost:9100"

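Summary

| Component | Role | Default port |
|---|---|---|
| Prometheus | Metrics collection and storage | 9090 |
| node_exporter | Host metrics collection | 9100 |
| Alertmanager | Alert management and dispatch | 9093 |
| Grafana | Visualization and dashboards | 3000 |
| Loki | Log aggregation and storage | 3100 |
| Promtail | Log collection agent | 9080 |

Monitoring is not a one-off project but a process of continuous iteration. Some recommendations:

  1. Monitor first, optimize second: make the system visible before tuning anything

  2. Keep alerts few and meaningful: one actionable alert beats ten noisy ones

  3. Review regularly: after every incident, review monitoring coverage and add the metrics that were missing

  4. Document: write down the alert runbooks and what each metric means


This article was written against Prometheus 2.52, Grafana 11, and Loki 3.0. If your versions differ, consult the official documentation.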