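Preface
"You can't improve what you can't measure." (attributed to Peter Drucker)
In production, discovering a server outage after 5 minutes versus receiving an alert within 5 seconds makes a world of difference to the business. A solid monitoring and alerting stack is the operations team's eyes and ears. This article builds a modern monitoring stack from scratch on Prometheus + Grafana + Loki, covering the full workflow: metrics collection, visualization, alert notification, and log monitoring.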
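1. Monitoring overview: the three pillars
Modern observability is built on three pillars: metrics, logs, and traces.
The three complement each other:
Metrics tell you what happened (CPU spiked to 95%)
Logs tell you why it happened (the OOM killer terminated the process)
Traces tell you where it happened (which microservice call timed out)
This article focuses on the first two: metrics monitoring (the Prometheus ecosystem) and log monitoring (Loki).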
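2. Prometheus: the time-series engine
2.1 Architecture overview
                   ┌──────────────┐
                   │ Alertmanager │
                   └──────┬───────┘
                          │ alerts pushed
┌───────────────┐   ┌─────┴──────┐   ┌──────────────┐
│ node_exporter │──▶│ Prometheus │◀──│ app exporter │
└───────────────┘   └─────┬──────┘   └──────────────┘
                          │
                   ┌──────┴───────┐
                   │   Grafana    │
                   └──────────────┘
Prometheus uses a pull model: it scrapes metrics from its targets on a schedule instead of waiting for them to push data. Because Prometheus itself holds the target list, the model pairs naturally with service discovery, turns a failed scrape into an immediate health signal (the up metric), and keeps the overall architecture simple.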
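To make the pull model concrete, the sketch below shows roughly what a scrape target looks like: a small HTTP server that renders current values in the Prometheus text exposition format at /metrics. It is a minimal illustration using only the Python standard library. The metric names (demo_load1, demo_build_info), the helper names, and port 9200 are made up for this example; a real service would normally use an official client library instead.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render (name, labels, value) tuples in the Prometheus text exposition format."""
    lines = []
    for name, labels, value in samples:
        if labels:
            # Labels are rendered as key="value" pairs inside braces
            body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Read current values at scrape time (hypothetical demo metrics, Unix only)."""
    load1, _, _ = os.getloadavg()
    return [
        ("demo_load1", {}, load1),
        ("demo_build_info", {"version": "0.1"}, 1),
    ]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus issues a plain HTTP GET against /metrics on every scrape
        if self.path != "/metrics":
            self.send_error(404)
            return
        payload = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=9200):
    """Block and answer scrapes until interrupted."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() and listing localhost:9200 as a target under scrape_configs should be enough for Prometheus to start pulling these values. Note that the target does nothing between scrapes: values are read only when Prometheus asks for them.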
2.2 Installation
Option 1: binary install (recommended for production)
# Download Prometheus
PROM_VERSION="2.52.0"
wget https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz
tar xzf prometheus-${PROM_VERSION}.linux-amd64.tar.gz
sudo mv prometheus-${PROM_VERSION}.linux-amd64/{prometheus,promtool} /usr/local/bin/
sudo mkdir -p /etc/prometheus/rules /var/lib/prometheus
# Create a dedicated system user
sudo useradd --system --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
Configuration file /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s      # global scrape interval
  evaluation_interval: 15s  # rule evaluation interval
  scrape_timeout: 10s       # scrape timeout (must not exceed scrape_interval)
# Alerting rule files
rule_files:
  - /etc/prometheus/rules/*.yml
# Alertmanager address
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
# Scrape targets
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets:
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
          - '192.168.1.12:9100'
        labels:
          env: production
          region: cn-east
  # File-based service discovery (add targets dynamically)
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
        refresh_interval: 30s
Create the systemd service:
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring
After=network-online.target
Wants=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
# Note: --web.enable-admin-api exposes TSDB delete and snapshot endpoints; keep it only if access to port 9090 is restricted
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --web.enable-lifecycle \
  --web.enable-admin-api
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Reload systemd and start the service:
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
Option 2: Docker Compose (quick setup)
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.52.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
volumes:
  prometheus_data:
2.3 PromQL in practice
PromQL is the Prometheus query language, and fluency in it is the key to using Prometheus effectively.
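Most PromQL queries on counters are built on rate(). As a rough mental model (a simplified sketch, not Prometheus's actual implementation, which also extrapolates to the window boundaries), rate() sums the increases between consecutive counter samples inside the window, treats any drop as a counter reset, and divides by the elapsed time. The function name simple_rate and the sample data are illustrative only.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplified: no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # Counter reset: the process restarted from zero,
            # so everything counted since the restart is new increase
            increase += value
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# A counter scraped every 15s that resets between t=30 and t=45
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
per_second = simple_rate(window)
```

The reset handling is why rate() should be applied to raw counters before any aggregation: summing counters first would hide individual resets.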
Basic queries:
# CPU utilization (averaged across all cores)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk utilization
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
# Inbound network rate (MiB/s)
rate(node_network_receive_bytes_total{device="eth0"}[5m]) / 1024 / 1024
Advanced techniques:
# Increase: total bytes written to disk over the past hour
increase(node_disk_written_bytes_total[1h])
# Prediction: extrapolate the last 6 hours to estimate available disk space 24 hours from now
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600)
# Ranking: top 5 instances by CPU utilization
topk(5, 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
# Comparison: the value at the same time yesterday
node_load1 offset 1d
# Aggregation: HTTP request rate summed by environment
sum by(env) (rate(http_requests_total[5m]))
Common function quick reference:
3. node_exporter: host metrics collection
3.1 Installation and configuration
# Download and install
NODE_VERSION="1.8.0"
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_VERSION}/node_exporter-${NODE_VERSION}.linux-amd64.tar.gz
tar xzf node_exporter-${NODE_VERSION}.linux-amd64.tar.gz
sudo mv node_exporter-${NODE_VERSION}.linux-amd64/node_exporter /usr/local/bin/
# Create the systemd service
sudo useradd --system --no-create-home --shell /bin/false node_exporter
cat <<'EOF' | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.tcpstat \
  --collector.textfile.directory=/var/lib/node_exporter/textfile \
  --web.listen-address=:9100
Restart=always
[Install]
WantedBy=multi-user.target
EOF
# Directory read by the textfile collector
sudo mkdir -p /var/lib/node_exporter/textfile
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
3.2 Key collectors
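node_exporter enables most collectors by default; the following are worth attention:
Custom metric example (textfile collector):
# Script that regenerates the custom metric file
# Assumes node_exporter runs with --collector.textfile.directory=/var/lib/node_exporter/textfile
cat <<'EOF' | sudo tee /usr/local/bin/custom_metrics.sh > /dev/null
#!/bin/bash
# Number of packages with pending upgrades (all upgradable packages, not only security updates)
OUT=/var/lib/node_exporter/textfile/security_updates.prom
UPDATES=$(apt list --upgradable 2>/dev/null | grep -c upgradable)
# Write to a temp file and rename it, so node_exporter never reads a half-written file
echo "node_pending_security_updates ${UPDATES}" > "${OUT}.tmp" && mv "${OUT}.tmp" "${OUT}"
EOF
sudo chmod +x /usr/local/bin/custom_metrics.sh
# Run hourly via cron
echo "0 * * * * root /usr/local/bin/custom_metrics.sh" | sudo tee /etc/cron.d/custom_metrics
3.3 Common host metrics quick reference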
# CPU
node_cpu_seconds_total                 # CPU time, by mode
node_load1 / node_load5 / node_load15  # system load averages
# Memory
node_memory_MemTotal_bytes             # total memory
node_memory_MemAvailable_bytes         # available memory
node_memory_Buffers_bytes              # buffers
node_memory_Cached_bytes               # page cache
# Disk
node_filesystem_size_bytes             # filesystem size
node_filesystem_avail_bytes            # available space
node_disk_io_time_seconds_total        # time spent doing disk I/O
node_disk_read_bytes_total             # bytes read
# Network
node_network_receive_bytes_total       # bytes received
node_network_transmit_bytes_total      # bytes transmitted
node_network_receive_errs_total        # receive errors
# System
node_time_seconds                      # current timestamp
node_boot_time_seconds                 # boot timestamp
node_filefd_allocated                  # allocated file descriptors
4. Grafana: data visualization
4.1 Installation
# Debian/Ubuntu
sudo apt-get install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana
sudo systemctl enable --now grafana-server
Grafana listens on http://localhost:3000 by default. The initial login is admin / admin; change the password right after the first login.
4.2 Adding the Prometheus data source
# Add the data source through the API
curl -s -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }'
4.3 Importing community dashboards
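The Grafana community publishes many ready-made dashboards. One that is essential for operations work:
# Node Exporter Full (ID: 1860): full host monitoring overview
# The import API expects the complete dashboard JSON, so download it from grafana.com first
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download -o node-exporter-full.json
jq '{dashboard: ., overwrite: true, inputs: [{name: "DS_PROMETHEUS", type: "datasource", pluginId: "prometheus", value: "Prometheus"}]}' node-exporter-full.json \
  | curl -s -X POST http://admin:admin@localhost:3000/api/dashboards/import \
      -H "Content-Type: application/json" \
      -d @-
4.4 Custom dashboard panel JSON example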
{
  "panels": [
    {
      "title": "CPU utilization",
      "type": "timeseries",
      "targets": [
        {
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{instance}}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "min": 0,
          "max": 100,
          "thresholds": {
            "steps": [
              { "value": null, "color": "green" },
              { "value": 70, "color": "yellow" },
              { "value": 90, "color": "red" }
            ]
          }
        }
      }
    }
  ]
}
5. Writing alerting rules
5.1 Installing Alertmanager
AM_VERSION="0.27.0"
wget https://github.com/prometheus/alertmanager/releases/download/v${AM_VERSION}/alertmanager-${AM_VERSION}.linux-amd64.tar.gz
tar xzf alertmanager-${AM_VERSION}.linux-amd64.tar.gz
sudo mv alertmanager-${AM_VERSION}.linux-amd64/{alertmanager,amtool} /usr/local/bin/
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
Alertmanager configuration /etc/alertmanager/alertmanager.yml:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: 'your-password'
  smtp_require_tls: true
# Alert routing
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'instance']
  group_wait: 30s       # wait before the first notification for a new group (lets alerts batch up)
  group_interval: 5m    # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h   # wait before re-sending a notification for alerts that are still firing
  routes:
    # Critical alerts go to the emergency channel
    # (matchers replaces the deprecated match syntax)
    - matchers:
        - severity="critical"
      receiver: 'critical-receiver'
      group_wait: 10s
      repeat_interval: 1h
    # Warning alerts
    - matchers:
        - severity="warning"
      receiver: 'warning-receiver'
      repeat_interval: 4h
# Inhibition: a firing critical alert suppresses warnings with the same alertname and instance
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']
# Receivers
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops-team@example.com'
  - name: 'critical-receiver'
    email_configs:
      - to: 'ops-emergency@example.com'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/ops/send'
        send_resolved: true
  - name: 'warning-receiver'
    email_configs:
      - to: 'ops-team@example.com'
5.2 Alerting rule files
Create /etc/prometheus/rules/node_alerts.yml:
groups:
  - name: node_alerts
    rules:
      # High CPU utilization
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU utilization (instance {{ $labels.instance }})"
          description: "CPU utilization has been above 85% for 5 minutes; current value {{ $value | printf \"%.1f\" }}%."
      # High memory utilization
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory utilization (instance {{ $labels.instance }})"
          description: "Memory utilization is above 90%; current value {{ $value | printf \"%.1f\" }}%. Available memory: {{ with printf \"node_memory_MemAvailable_bytes{instance='%s'}\" $labels.instance | query }}{{ . | first | value | humanize1024 }}{{ end }}."
      # Low disk space
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space ({{ $labels.instance }}:{{ $labels.mountpoint }})"
          description: "Disk {{ $labels.mountpoint }} is {{ $value | printf \"%.1f\" }}% full."
      # Disk predicted to fill within 24 hours
      - alert: DiskWillFillIn24h
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Disk predicted to fill within 24 hours ({{ $labels.instance }}:{{ $labels.mountpoint }})"
      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance down ({{ $labels.instance }})"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      # System reboot
      - alert: SystemReboot
        expr: changes(node_boot_time_seconds[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "System rebooted ({{ $labels.instance }})"
  - name: application_alerts
    rules:
      # HTTP 5xx error ratio
      - alert: HighHttp5xxRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error ratio above 5%"
          # The expression yields a ratio (0-1), so format it with humanizePercentage
          description: "Current 5xx error ratio: {{ $value | humanizePercentage }}."
      # High request latency
      - alert: HighRequestLatency
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 request latency above 2 seconds"
          description: "Current P99 latency: {{ $value | printf \"%.2f\" }} seconds."
5.3 Rule validation and hot reload
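# Validate rule syntax
promtool check rules /etc/prometheus/rules/*.yml
# Hot-reload the configuration without restarting Prometheus (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# Show current alert state
curl -s http://localhost:9090/api/v1/alerts | jq .
6. Notification channels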
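6.1 Email notifications
Email is already configured in Alertmanager (see 5.1); just make sure the SMTP settings are correct.
Send a test alert:
# Inject a test alert with amtool
amtool alert add alertname=TestAlert severity=warning instance=localhost:9090 \
  --alertmanager.url=http://localhost:9093 \
  --annotation=summary="Test alert" \
  --annotation=description="This is a test alert"
6.2 DingTalk notifications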
Use prometheus-webhook-dingtalk as the middleware:
# Docker deployment
docker run -d --name dingtalk \
  -p 8060:8060 \
  timonwong/prometheus-webhook-dingtalk:latest \
  --ding.profile="ops=https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"
Custom DingTalk configuration /etc/webhook-dingtalk/config.yml:
templates:
  - /etc/webhook-dingtalk/template.tmpl
targets:
  ops:
    url: https://oapi.dingtalk.com/robot/send?access_token=YOUR_ACCESS_TOKEN
    message:
      title: '{{ template "ding.link.title" . }}'
      text: '{{ template "ding.link.content" . }}'
Custom message template template.tmpl:
{{ define "ding.link.title" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}{{ end }}
{{ define "ding.link.content" }}
## {{ .CommonAnnotations.summary }}
**Alert details:**
{{ range .Alerts }}
- **Instance**: {{ .Labels.instance }}
- **Severity**: {{ .Labels.severity }}
- **Description**: {{ .Annotations.description }}
- **Started at**: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ if eq .Status "resolved" }}- **Resolved at**: {{ .EndsAt.Format "2006-01-02 15:04:05" }}{{ end }}
{{ end }}
> Alert raised by Prometheus + Alertmanager
{{ end }}
6.3 Feishu notifications
Send alerts to a Feishu group through a webhook:
# Use an Alertmanager webhook adapter
# First create a custom bot in the Feishu group and copy its webhook URL
# https://open.feishu.cn/document/client-docs/bot-v3/add-custom-bot
Python adapter script feishu_webhook.py:
#!/usr/bin/env python3
"""Alertmanager -> Feishu webhook adapter"""
from flask import Flask, request, jsonify
import requests
app = Flask(__name__)
FEISHU_WEBHOOK = "https://open.feishu.cn/open-apis/bot/v2/hook/YOUR_TOKEN"
@app.route('/feishu/send', methods=['POST'])
def send_to_feishu():
    # Alertmanager posts a JSON body containing an "alerts" list
    data = request.get_json(silent=True) or {}
    alerts = data.get('alerts', [])
    for alert in alerts:
        status = alert.get('status', 'unknown')
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})
        # Build the Feishu message card
        color = "red" if status == "firing" else "green"
        icon = "🔥" if status == "firing" else "✅"
        card = {
            "msg_type": "interactive",
            "card": {
                "header": {
                    "title": {
                        "tag": "plain_text",
                        "content": f"{icon} [{status.upper()}] {labels.get('alertname', 'Unknown Alert')}"
                    },
                    "template": color
                },
                "elements": [
                    {
                        "tag": "div",
                        "fields": [
                            {"is_short": True, "text": {"tag": "lark_md", "content": f"**Instance:** {labels.get('instance', '-')}"}},
                            {"is_short": True, "text": {"tag": "lark_md", "content": f"**Severity:** {labels.get('severity', '-')}"}},
                            {"is_short": False, "text": {"tag": "lark_md", "content": f"**Description:** {annotations.get('description', '-')}"}},
                            {"is_short": False, "text": {"tag": "lark_md", "content": f"**Time:** {alert.get('startsAt', '-')}"}}
                        ]
                    }
                ]
            }
        }
        # One card per alert
        requests.post(FEISHU_WEBHOOK, json=card, timeout=5)
    return jsonify({"status": "ok"})
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001)
Add the webhook to Alertmanager:
receivers:
  - name: 'feishu-receiver'
    webhook_configs:
      - url: 'http://localhost:5001/feishu/send'
        send_resolved: true
6.4 WeCom (Enterprise WeChat) notifications
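# Use a prometheus-webhook-wechat adapter (image name as given here; point it at the image you actually deploy)
# Alertmanager also ships a built-in wechat_configs receiver type
docker run -d --name wechat-webhook \
  -p 5002:5002 \
  -e WECOM_CORP_ID=YOUR_CORP_ID \
  -e WECOM_SECRET=YOUR_SECRET \
  -e WECOM_AGENT_ID=YOUR_AGENT_ID \
  prometheus-webhook-wechat
7. Common monitoring metrics quick reference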
7.1 Host-level metrics
# CPU utilization
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Share of CPU time by mode
avg by(mode) (rate(node_cpu_seconds_total[5m])) * 100
# Memory utilization
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Swap utilization (NaN on hosts without swap)
(1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) * 100
# System load (1/5/15 minutes)
node_load1
node_load5
node_load15
# Disk I/O utilization (fraction of time the device was busy, 0 to 1)
rate(node_disk_io_time_seconds_total[5m])
# Disk read/write throughput (MiB/s)
rate(node_disk_read_bytes_total[5m]) / 1024 / 1024
rate(node_disk_written_bytes_total[5m]) / 1024 / 1024
# Network bandwidth (Mbit/s)
rate(node_network_receive_bytes_total{device="eth0"}[5m]) * 8 / 1e6
rate(node_network_transmit_bytes_total{device="eth0"}[5m]) * 8 / 1e6
# TCP connection state (node_tcp_connection_states requires --collector.tcpstat)
node_netstat_Tcp_CurrEstab
node_tcp_connection_states
# Allocated file descriptors
node_filefd_allocated
# Context switches per second
rate(node_context_switches_total[5m])
7.2 Application-level metrics (HTTP service example)
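The latency percentiles used for HTTP services rely on histogram_quantile(), which estimates a quantile from cumulative bucket counts rather than from raw observations. The sketch below is a simplified re-implementation for intuition only, assuming classic buckets with cumulative counts and a +Inf bucket; the function name mirrors the PromQL function, but the code and bucket data are illustrative, not Prometheus source.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    The bucket list must include the +Inf bucket (math.inf).
    """
    buckets = sorted(buckets)
    if len(buckets) < 2 or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# 100 requests: 50 under 0.1s, 90 under 0.5s, 98 under 1s, 2 slower than 1s
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
p99 = histogram_quantile(0.99, latency_buckets)
```

Because of the linear interpolation, the result is only as precise as the bucket boundaries, and a quantile that lands in the +Inf bucket is capped at the largest finite bound. Choosing bucket boundaries near the latency thresholds you alert on keeps the estimate useful.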
# QPS (requests per second)
sum(rate(http_requests_total[5m]))
# Breakdown by status code
sum by(status) (rate(http_requests_total[5m]))
# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# P50/P95/P99 latency
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
7.3 Database metrics (MySQL / Redis)
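# MySQL slow queries per second
rate(mysql_global_status_slow_queries[5m])
# MySQL connection utilization (connected threads / max_connections)
mysql_global_status_threads_connected / mysql_global_variables_max_connections
# Redis hit ratio over the last 5 minutes (raw counters would give the lifetime ratio instead)
rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
# Redis memory utilization (only meaningful when maxmemory is set)
redis_memory_used_bytes / redis_memory_max_bytes
8. Log monitoring: Loki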
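8.1 Why Loki?
Loki's core idea is to treat logs the way Prometheus treats metrics: index only the metadata (labels), not the log content itself. The index stays small, and content filtering happens at query time over the streams the labels select.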
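The trade-off behind "index labels, not content" can be illustrated with a toy in-memory store. This is only a conceptual sketch, not how Loki is implemented: the class name TinyLogStore and its methods are invented for this example. Streams are keyed by their label set, a query first narrows down streams by labels, and only then scans the selected lines for a substring.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy log store: labels are indexed, log content is not."""

    def __init__(self):
        # One stream per unique label set: frozenset of (key, value) -> [(ts, line)]
        self._streams = defaultdict(list)

    def push(self, labels, ts, line):
        self._streams[frozenset(labels.items())].append((ts, line))

    def query(self, selector, contains=None):
        wanted = set(selector.items())
        results = []
        for stream_labels, entries in self._streams.items():
            # Step 1: stream selection uses only the labels (the "index")
            if not wanted <= stream_labels:
                continue
            # Step 2: content filtering is a brute-force scan of the selected streams
            for ts, line in entries:
                if contains is None or contains in line:
                    results.append((ts, line))
        return sorted(results)


store = TinyLogStore()
store.push({"job": "myapp", "level": "error"}, 1, "connection refused")
store.push({"job": "myapp", "level": "info"}, 2, "request ok")
store.push({"job": "nginx"}, 3, "GET /health error")
errors = store.query({"job": "myapp"}, contains="refused")
```

The sketch also hints at why label cardinality matters: every distinct label combination creates a separate stream, so values such as user IDs or trace IDs are better kept in the log line than promoted to labels.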
8.2 Docker Compose deployment
# docker-compose-loki.yml
version: '3.8'
services:
  loki:
    image: grafana/loki:3.0.0
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/config.yml
      - loki_data:/loki
    command: -config.file=/etc/loki/config.yml
    restart: unless-stopped
  promtail:
    image: grafana/promtail:3.0.0
    container_name: promtail
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    # -config.expand-env=true is needed for ${HOSTNAME} in the Promtail config to be expanded
    command: -config.file=/etc/promtail/config.yml -config.expand-env=true
    restart: unless-stopped
volumes:
  loki_data:
Loki configuration loki-config.yml:
auth_enabled: false
server:
  http_listen_port: 3100
common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
limits_config:
  retention_period: 30d   # keep logs for 30 days
  max_query_length: 721h
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: filesystem
Promtail configuration promtail-config.yml:
server:
  http_listen_port: 9080
positions:
  # /tmp is lost when the container is recreated; use a mounted path to keep read offsets
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  # System logs
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          host: ${HOSTNAME}   # expanded only when Promtail runs with -config.expand-env=true
          __path__: /var/log/syslog
  # Application logs
  - job_name: app-logs
    static_configs:
      - targets: [localhost]
        labels:
          job: myapp
          host: ${HOSTNAME}
          __path__: /var/log/myapp/*.log
    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            msg: message
            trace_id: trace_id
            time: time   # extracted so the timestamp stage below has a source
      - labels:
          level:
      - timestamp:
          source: time
          format: RFC3339Nano
  # Docker container logs
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*.log
    pipeline_stages:
      - json:
          expressions:
            log: log
            stream: stream
            container_name: attrs.container_name
      - labels:
          stream:
          container_name:
      - output:
          source: log
8.3 LogQL query syntax
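# Basics: select logs by label
{job="myapp"}
# Line filters: contains, regex match, regex exclude
{job="myapp"} |= "error"
{job="myapp"} |~ "timeout|connection refused"
{job="myapp"} !~ "health_check"
# Parse JSON logs and filter by level
{job="myapp"} | json | level="error"
# Pattern parser: extract fields from access-log style lines
{job="myapp"} | pattern `<ip> - - [<_>] "<method> <path> <_>" <status> <size>`
# Count: error log lines per minute
count_over_time({job="myapp"} |= "error" [1m])
# Count: log lines per status code over the past hour (assumes a status field in the JSON)
sum by(status) (count_over_time({job="myapp"} | json | __error__="" [1h]))
# Top 10 error messages (the json parser exposes the JSON key "message" as the label message)
topk(10,
  sum by(message) (count_over_time({job="myapp"} | json | level="error" [1h]))
)
# Rate: error log lines per second
rate({job="myapp"} |= "error" [5m])
8.4 Configuring the Loki data source in Grafana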
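# Add the data source through the API
curl -s -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Loki",
    "type": "loki",
    "url": "http://localhost:3100",
    "access": "proxy"
  }'
In Grafana's Explore view you can run LogQL queries directly and correlate them with Prometheus metrics (for example, clicking a point in time on a metrics graph jumps to the logs for the same time range).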
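The same LogQL queries can also be sent to Loki's HTTP API directly, which is handy for scripts and for checking what Grafana would receive. The sketch below only builds the request URL for the /loki/api/v1/query_range endpoint, which takes start and end as Unix timestamps in nanoseconds. The helper name loki_query_range_url, the base URL, and the timestamps are assumptions for illustration.

```python
from urllib.parse import urlencode


def loki_query_range_url(base_url, logql, start_s, end_s, limit=100):
    """Build a query_range URL for a LogQL query.

    start_s / end_s are Unix timestamps in seconds; Loki expects nanoseconds.
    """
    params = {
        "query": logql,
        "start": str(int(start_s) * 1_000_000_000),
        "end": str(int(end_s) * 1_000_000_000),
        "limit": str(limit),
        "direction": "backward",  # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# One hour of error lines from the myapp job (timestamps are arbitrary examples)
url = loki_query_range_url(
    "http://localhost:3100/",
    '{job="myapp"} |= "error"',
    1700000000,
    1700003600,
    limit=50,
)
```

The resulting URL can be fetched with curl or any HTTP client; the response is JSON with the matching streams and their log lines.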
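9. Monitoring best practices
9.1 Layered monitoring strategy
┌───────────────────────────────────────────────────────────────────┐
│ L1: Infrastructure (CPU, memory, disk, network)                   │ ← node_exporter
├───────────────────────────────────────────────────────────────────┤
│ L2: Platform services (DB, cache, message queue, K8s)             │ ← component exporters
├───────────────────────────────────────────────────────────────────┤
│ L3: Application services (QPS, latency, errors, business metrics) │ ← application instrumentation
├───────────────────────────────────────────────────────────────────┤
│ L4: Business metrics (orders, conversion rate, active users)      │ ← custom metrics
└───────────────────────────────────────────────────────────────────┘
9.2 Alert severity levels and response
9.3 Avoiding alert fatigue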
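# Traits of a good alerting setup:
# 1. An explicit for duration (avoids false alarms from momentary spikes)
- alert: HighCpuUsage
  expr: ... > 85
  for: 5m   # fires only after the condition has held for 5 minutes
# 2. inhibit_rules suppress lower-severity alerts
# 3. Sensible group_by and group_wait settings aggregate related alerts
# 4. repeat_interval limits how often notifications repeat
# 5. Regular reviews of the rules remove "cry wolf" alerts
9.4 Capacity planning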
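For Prometheus's own disk usage, the estimate in this section (roughly 2 bytes per sample, with samples per day equal to active series times 86400 divided by the scrape interval) can be worked through numerically. The sketch below is a back-of-the-envelope calculation only; the function name and the example figures (100,000 series, 15s interval, 30 days) are assumptions, and real usage also depends on churn, compression, and WAL overhead.

```python
def estimate_tsdb_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough TSDB size: retention x ingested samples per second x bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86400
    return samples_per_second * bytes_per_sample * retention_seconds


# 100k active series scraped every 15s and kept for 30 days
needed = estimate_tsdb_bytes(active_series=100_000, scrape_interval_s=15, retention_days=30)
needed_gb = needed / 1e9  # about 34.56 GB under these assumptions
```

Plugging in the live value of prometheus_tsdb_head_series for active_series gives a quick sanity check against a configured retention size, with some headroom added on top.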
# Disk capacity forecast (linear regression): available bytes 30 days from now
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[7d], 30*24*3600)
# Traffic trend
predict_linear(rate(node_network_receive_bytes_total[1h])[7d:1h], 30*24*3600)
# Prometheus storage estimate
# Roughly 2 bytes per sample; samples per day = active_series × (86400 / scrape_interval)
prometheus_tsdb_head_series                          # current number of active time series
rate(prometheus_tsdb_head_chunks_created_total[1h])  # chunks created per second, averaged over 1 hour
9.5 Prometheus high availability
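For production, run two Prometheus instances for redundancy:
# Both Prometheus instances scrape the same targets and send alerts to every Alertmanager
# The Alertmanager cluster deduplicates the resulting notifications
# Clustering is configured with command-line flags, not in alertmanager.yml
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-2:9094
Long-term storage options:
Thanos: a high-availability extension for Prometheus with global querying and object storage support
VictoriaMetrics: a high-performance time-series database compatible with the Prometheus protocols
Cortex: a horizontally scalable Prometheus backend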
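9.6 Monitoring the monitoring stack
The monitoring system itself needs to be monitored:
# Prometheus itself
prometheus_config_last_reload_successful     # whether the last config reload succeeded
prometheus_tsdb_compactions_failed_total     # failed compactions
prometheus_rule_evaluation_duration_seconds  # rule evaluation duration
# Exporter health
up                                           # whether the target is reachable
# Alertmanager
alertmanager_notifications_total             # notifications sent
alertmanager_notifications_failed_total      # notifications that failed
10. Troubleshooting common problems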
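Q1: Prometheus scrapes time out
# Check target status
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down") | {instance: .labels.instance, lastError}'
# Raise the timeout
# In prometheus.yml, increase scrape_timeout (default 10s); it must not exceed scrape_interval
scrape_timeout: 15s
Q2: Alerts do not fire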
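One common reason an alert seems not to fire is the for clause: the expression must be true at every rule evaluation for the whole duration before the alert moves from pending to firing, and a single false evaluation resets the clock. The sketch below models that state machine in simplified form; the function name alert_states and the sample data are illustrative, not Prometheus code.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state at each rule evaluation.

    condition_by_eval: list of booleans, one per evaluation (True = expr matched).
    """
    states = []
    active_since = None  # time at which the condition most recently became true
    for i, matched in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not matched:
            # Any false evaluation resets the pending timer
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Evaluated every 60s with for: 2m; a one-evaluation dip restarts the wait
timeline = alert_states(
    [True, True, True, False, True, True, True],
    eval_interval_s=60,
    for_s=120,
)
```

If an alert stays in pending on the Alerts page, a flapping expression or a for duration longer than the typical incident is a likely cause. Only firing alerts are sent to Alertmanager.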
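# Check rule syntax
promtool check rules /etc/prometheus/rules/*.yml
# In the Prometheus web UI, open Status → Rules to confirm the rules are loaded
# Open Alerting → Alerts to see the current alert state
# Test routing with amtool
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml alertname=HighCpuUsage severity=warning
Q3: Grafana panels show no data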
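# Verify that Prometheus is reachable and returns data
curl -s "http://localhost:9090/api/v1/query?query=up"
# Run the query at an explicit timestamp
curl -s "http://localhost:9090/api/v1/query?query=up&time=$(date +%s)"
# Test the data source through the Grafana API proxy
curl -s "http://admin:admin@localhost:3000/api/datasources/proxy/1/api/v1/query?query=up"
Appendix: one-shot deployment script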
The components above can be bundled into a single deployment script:
#!/bin/bash
# monitor-stack-deploy.sh: deploy the monitoring stack in one step
set -euo pipefail
PROM_VERSION="2.52.0"
NODE_EXP_VERSION="1.8.0"
AM_VERSION="0.27.0"
echo "=== 1. Install Prometheus ==="
wget -q https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz
tar xzf prometheus-${PROM_VERSION}.linux-amd64.tar.gz
sudo mv prometheus-${PROM_VERSION}.linux-amd64/{prometheus,promtool} /usr/local/bin/
sudo mkdir -p /etc/prometheus/rules /var/lib/prometheus
echo "=== 2. Install Node Exporter ==="
wget -q https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXP_VERSION}/node_exporter-${NODE_EXP_VERSION}.linux-amd64.tar.gz
tar xzf node_exporter-${NODE_EXP_VERSION}.linux-amd64.tar.gz
sudo mv node_exporter-${NODE_EXP_VERSION}.linux-amd64/node_exporter /usr/local/bin/
echo "=== 3. Install Alertmanager ==="
wget -q https://github.com/prometheus/alertmanager/releases/download/v${AM_VERSION}/alertmanager-${AM_VERSION}.linux-amd64.tar.gz
tar xzf alertmanager-${AM_VERSION}.linux-amd64.tar.gz
sudo mv alertmanager-${AM_VERSION}.linux-amd64/{alertmanager,amtool} /usr/local/bin/
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
echo "=== 4. Install Grafana ==="
sudo apt-get install -y apt-transport-https
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana
echo "=== 5. Start services ==="
# Assumes the service users, config files, and systemd units for prometheus, node_exporter,
# and alertmanager from the earlier sections are already in place
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus node_exporter alertmanager grafana-server
echo "=== Deployment complete ==="
echo "Prometheus: http://localhost:9090"
echo "Alertmanager: http://localhost:9093"
echo "Grafana: http://localhost:3000 (admin/admin)"
echo "Node Exporter: http://localhost:9100"
Summary
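Monitoring is not a one-off project but a process of continuous iteration. Some recommendations:
Monitor first, then optimize: make the system visible before tuning it
Keep alerts few and meaningful: one actionable alert beats ten noisy ones
Review regularly: after every incident, review monitoring coverage and add the missing metrics
Document: write down alert runbooks and the meaning of each metric
This article is based on Prometheus 2.52, Grafana 11, and Loki 3.0. If your versions differ, consult the official documentation.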