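Observability is not a tool you buy; it is an engineering culture. This article builds a production-grade observability platform from scratch, covering the three pillars, an SLO framework, Prometheus + Grafana monitoring, distributed tracing, a troubleshooting SOP, chaos engineering, and an on-call handbook. All configurations are written to be reused directly once placeholder credentials and hostnames are replaced.
1. The Three Pillars of Observability
1.1 Overview of the Three Pillars
The three pillars of observability are Metrics, Logs, and Traces. They do not exist in isolation: the real value comes from correlating them.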
┌─────────────────────────────────────────────────┐
│             Observability Platform              │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐ │
│  │  Metrics   │  │    Logs    │  │   Traces   │ │
│  │ Prometheus │  │    Loki    │  │   Jaeger   │ │
│  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘ │
│        │               │               │        │
│        └───────────────┼───────────────┘        │
│                        │                        │
│     Correlation layer (Exemplar / TraceID)      │
│                        │                        │
│              Grafana unified view               │
└─────────────────────────────────────────────────┘
1.2 Correlating the Data
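Method 1: Exemplar correlation (Metrics → Traces)
Prometheus exemplars let you attach a TraceID to an individual metric sample:
// The application exposes a histogram with exemplars.
// histogram is assumed to be a prometheus.ExemplarObserver (section 4.1 shows the type assertion).
histogram.ObserveWithExemplar(
    duration,
    prometheus.Labels{"traceID": span.SpanContext().TraceID().String()},
)
Method 2: Propagating the TraceID (Logs ↔ Traces)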
Inject the TraceID into every log line so you can jump between logs and traces in both directions:
// Go example: extract the trace ID from the context and write it to the log
span := trace.SpanFromContext(ctx)
logger.Info("request processed",
    zap.String("trace_id", span.SpanContext().TraceID().String()),
    zap.String("span_id", span.SpanContext().SpanID().String()),
)
Method 3: Shared labels (Metrics ↔ Logs)
Tag metrics and logs with the same business labels:
# Shared labels
labels:
  service: order-service
  env: production
  region: cn-east-1
  version: v1.2.3
1.3 Three Pillars Quick Comparison
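2. Defining SLOs and SLIs
2.1 Choosing SLIs
An SLI (Service Level Indicator) is a concrete measurement of service quality. The selection rule: a good SLI is one that users can actually perceive.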
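The latency SLI in the template in this section relies on PromQL's `histogram_quantile`, which estimates a quantile from cumulative bucket counts. The sketch below is a simplified Python re-implementation of that idea (linear interpolation inside the bucket that contains the target rank). The bucket values are made up for illustration, and the function ignores edge cases that Prometheus handles, such as negative bucket bounds.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by upper
    bound; the last upper bound must be math.inf (the "+Inf" bucket).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls into the +Inf bucket: the best available
                # answer is the highest finite bucket boundary.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical bucket data: (le, cumulative request count)
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
p99 = histogram_quantile(0.99, buckets)
print(f"p99 = {p99:.3f}s, within 500ms target: {p99 <= 0.5}")
```

Because the estimate can never be finer than the bucket layout, it helps to place a bucket boundary exactly at the SLO threshold (0.5 s here).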
# SLI definition template
slis:
  - name: availability
    description: "Request success rate"
    metric: |
      sum(rate(http_requests_total{code!~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    target: 0.999  # 99.9%
  - name: latency
    description: "P99 latency"
    metric: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
      )
    target: 0.5  # 500ms
  - name: correctness
    description: "Data processing correctness rate"
    metric: |
      sum(rate(orders_processed_total{status="success"}[5m]))
      /
      sum(rate(orders_processed_total[5m]))
    target: 0.9999  # 99.99%
2.2 Setting SLOs
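An SLO (Service Level Objective) is an SLI plus a target value. A tiered approach to setting targets (downtime figures assume an average calendar month of about 730.5 hours):
SLO tiers:
├── 99%    → 7.31 hours of downtime allowed per month (internal tools)
├── 99.9%  → 43.8 minutes of downtime allowed per month (general business services)
├── 99.95% → 21.9 minutes of downtime allowed per month (core business services)
└── 99.99% → 4.38 minutes of downtime allowed per month (payments / trading)
2.3 Error Budget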
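The error budget is the complement of the SLO target applied to a time window, and the downtime figures in the tier list in 2.2 follow directly from it. The sketch below reproduces them, assuming an average month of 365.25 / 12 days (that window length is an assumption; it is what makes 99.9% come out as 43.8 minutes rather than the 43.2 minutes of a flat 30-day window).

```python
AVERAGE_MONTH_DAYS = 365.25 / 12  # assumption: average calendar month


def allowed_downtime_minutes(slo_target, window_days=AVERAGE_MONTH_DAYS):
    """Minutes of full downtime that fit into the error budget."""
    return (1 - slo_target) * window_days * 24 * 60


for target in (0.99, 0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target * 100:g}% -> {minutes:.2f} min/month ({minutes / 60:.2f} h)")

# The same SLO over a flat 30-day window gives a slightly smaller budget.
print(f"99.9% over 30 days -> {allowed_downtime_minutes(0.999, 30):.1f} min")
```

Whichever window you pick, use the same one in dashboards, alerts, and reports, otherwise the remaining budget will not agree between them.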
# Error budget calculation script
def calculate_error_budget(slo_target, time_window_days, total_requests):
    """Compute the error budget for an SLO target over a time window."""
    allowed_failure_rate = 1 - slo_target
    # round() rather than int(): 1 - slo_target can land slightly below the
    # exact value in floating point, and truncation would then lose a request.
    allowed_failures = round(total_requests * allowed_failure_rate)
    return {
        "slo_target": f"{slo_target * 100:g}%",
        "time_window": f"{time_window_days} days",
        "total_requests": total_requests,
        "allowed_failures": allowed_failures,
        "allowed_downtime_minutes": round(
            time_window_days * 24 * 60 * allowed_failure_rate, 2
        ),
    }
# Example: 99.9% SLO over a 30-day window
print(calculate_error_budget(0.999, 30, 100_000_000))
# The result includes 'allowed_failures': 100000 and 'allowed_downtime_minutes': 43.2
2.4 Burn Rate Alerts
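Burn rate is the speed at which the error budget is being consumed, relative to the speed that would use it up exactly at the end of the SLO window. A burn rate of 2 means the budget is being spent twice as fast, so a 30-day budget would be gone in 15 days.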
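The arithmetic behind burn-rate thresholds such as 14.4x and 6x can be sketched as follows. The function names are illustrative, and the 30-day SLO period is an assumption that matches the examples in this article.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1 - slo_target)


def budget_consumed(rate, window_hours, slo_period_days=30):
    """Fraction of the whole error budget spent if `rate` lasts `window_hours`."""
    return rate * window_hours / (slo_period_days * 24)


def hours_to_exhaustion(rate, slo_period_days=30):
    """Hours until the full budget is gone at a constant burn rate."""
    return slo_period_days * 24 / rate


# A 1.44% error rate against a 99.9% SLO is a 14.4x burn.
print(f"burn rate: {burn_rate(0.0144, 0.999):.1f}x")
print(f"14.4x for 1h spends {budget_consumed(14.4, 1):.0%} of the budget")
print(f"6x for 6h spends {budget_consumed(6, 6):.0%} of the budget")
print(f"at 2x the budget lasts {hours_to_exhaustion(2) / 24:.0f} days")
```

Read this way, a threshold is a statement about how much budget you are willing to lose before someone is notified: 14.4x over one hour corresponds to 2% of a 30-day budget, and 6x over six hours to 5%.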
# Prometheus alerting rules: burn-rate alerts at two severities
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 14.4x over 1 hour (about 2% of a 30-day budget spent in 1 hour)
      - alert: SLOBurnRateHigh
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{code!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning fast (14.4x)"
      # Slow burn: 6x over 6 hours (about 5% of a 30-day budget spent in 6 hours)
      - alert: SLOBurnRateMedium
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{code!~"5.."}[6h]))
              /
              sum(rate(http_requests_total[6h]))
            )
          ) > (6 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning slowly (6x)"
3. Prometheus + Grafana Monitoring Stack
3.1 Full Deployment (Docker Compose)
# docker-compose.monitoring.yml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=50GB'
      - '--web.enable-lifecycle'
      - '--web.enable-remote-write-receiver'
      - '--enable-feature=exemplar-storage'
    restart: unless-stopped
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      # Placeholder password: change it before exposing Grafana anywhere
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    depends_on:
      - prometheus
    restart: unless-stopped
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped
  loki:
    image: grafana/loki:2.9.6
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped
  tempo:
    image: grafana/tempo:2.4.1
    container_name: tempo
    ports:
      - "3200:3200"  # Tempo HTTP
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
    volumes:
      - ./tempo/tempo-config.yml:/etc/tempo/config.yaml
      - tempo_data:/var/tempo
    command: -config.file=/etc/tempo/config.yaml
    restart: unless-stopped
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    restart: unless-stopped
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    restart: unless-stopped
volumes:
  prometheus_data:
  grafana_data:
  loki_data:
  tempo_data:
3.2 Prometheus Configuration
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    cluster: production
    region: cn-east-1
# Alerting rule files
rule_files:
  - "rules/*.yml"
# Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
# Scrape targets
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Node Exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'server-01'
  # cAdvisor (container metrics)
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
  # Application services via service discovery
  # (only works when Prometheus can reach a Kubernetes API server)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
  # Black-box probing
  # (expects a blackbox-exporter service, which the Compose file above does not define)
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
          - https://www.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
3.3 Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:465'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: 'password'  # placeholder
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    # Listed first with continue: true. Routes are evaluated in order and the
    # first match wins, so if this came after the severity routes, SLO alerts
    # (which carry severity critical/warning) would never reach it.
    - match_re:
        alertname: SLOBurn.*
      receiver: 'slo-escalation'
      repeat_interval: 1h
      continue: true
    - match:
        severity: critical
      receiver: 'pager'
      group_wait: 5s
    - match:
        severity: warning
      receiver: 'slack'
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/ops/send'
  - name: 'pager'
    pagerduty_configs:
      - service_key: 'xxx'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/emergency/send'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: 'slo-escalation'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/slo/send'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
3.4 Grafana Data Source Configuration
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus  # referenced by datasourceUid in the Tempo entry
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo
  - name: Loki
    type: loki
    uid: loki  # referenced by datasourceUid in the Tempo entry
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          # Matches the trace_id field written by the logging examples in this
          # article, both as JSON ("trace_id":"<id>") and as key=value (trace_id=<id>)
          matcherRegex: 'trace_id"?[=:]\s*"?(\w+)'
          name: TraceID
          url: "$${__value.raw}"
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        tags: ['service']
      tracesToMetrics:
        datasourceUid: prometheus
      serviceMap:
        datasourceUid: prometheus
3.5 Core Dashboard Template
{
"dashboard": {
"title": "Service Overview - Golden Signals",
"panels": [
{
"title": "Request Rate (QPS)",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\"}[5m])) by (code)",
"legendFormat": "{{code}}"
}
]
},
{
"title": "Request Latency Distribution",
"type": "timeseries",
"targets": [
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[5m])) by (le))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[5m])) by (le))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[5m])) by (le))",
"legendFormat": "p99"
}
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{
"expr": "1 - (sum(rate(http_requests_total{service=~\"$service\",code!~\"5..\"}[5m])) / sum(rate(http_requests_total{service=~\"$service\"}[5m])))",
"legendFormat": "Error Rate"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 0.001, "color": "yellow"},
{"value": 0.01, "color": "red"}
]
}
}
}
},
{
"title": "SLO Error Budget Remaining",
"type": "gauge",
"targets": [
{
"expr": "1 - ((1 - (sum(rate(http_requests_total{service=~\"$service\",code!~\"5..\"}[30d])) / sum(rate(http_requests_total{service=~\"$service\"}[30d])))) / (1 - 0.999))",
"legendFormat": "Budget Remaining"
}
]
}
]
}
}
4. Correlating Logs and Metrics
4.1 Exemplar Configuration
Enable exemplar storage in Prometheus:
# Prometheus startup flag (a command-line option, not a prometheus.yml setting)
--enable-feature=exemplar-storage
// Application-side Go example
import (
    "context"

    "github.com/prometheus/client_golang/prometheus"
    "go.opentelemetry.io/otel/trace"
)
var httpDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request duration",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "path", "status"},
)
func recordMetrics(ctx context.Context, method, path, status string, duration float64) {
    span := trace.SpanFromContext(ctx)
    labels := prometheus.Labels{
        "method": method,
        "path":   path,
        "status": status,
    }
    // Attach the TraceID as an exemplar when the request is being traced
    if span.SpanContext().IsValid() {
        httpDuration.With(labels).(prometheus.ExemplarObserver).ObserveWithExemplar(
            duration,
            prometheus.Labels{
                "traceID": span.SpanContext().TraceID().String(),
            },
        )
    } else {
        httpDuration.With(labels).Observe(duration)
    }
}
4.2 Injecting the TraceID into Logs
Python + structlog example:
import structlog
from opentelemetry import trace

def add_trace_info(logger, method_name, event_dict):
    """structlog processor that copies the active trace context into the event."""
    span = trace.get_current_span()
    ctx = span.get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, '032x')
        event_dict["span_id"] = format(ctx.span_id, '016x')
    return event_dict

structlog.configure(
    processors=[
        add_trace_info,
        structlog.dev.ConsoleRenderer(),
    ]
)
logger = structlog.get_logger()
logger.info("order_created", order_id="ORD-001", amount=99.9)
# When called inside an active span, the output includes trace_id and span_id
Java + Logback example:
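<!-- logback-spring.xml -->
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <customFields>{"service":"order-service","env":"prod"}</customFields>
      <!-- traceId and spanId stored in the MDC are included automatically -->
    </encoder>
  </appender>
  <!-- Attach the appender to the root logger; otherwise it is never used -->
  <root level="INFO">
    <appender-ref ref="JSON" />
  </root>
</configuration>
4.3 Deriving Metrics from Loki Logs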
# LogQL metric queries are evaluated at query time, so loki/loki-config.yml
# needs no dedicated switch for them.
# Extracting metrics from logs with LogQL:
# 1. Rate of error log lines
# rate({service="order-service"} |= "ERROR" [5m])
# 2. Numeric values taken from a log field (for example response time)
# quantile_over_time(0.99, {service="order-service"} | json | unwrap response_time [5m])
# 3. Request rate grouped by status code
# sum by (status) (rate({service="order-service"} | json | __error__="" [5m]))
A log-to-metric query configured in Grafana:
# LogQL: P99 latency extracted from logs
quantile_over_time(
  0.99,
  {namespace="production", service="order-service"}
    | json
    | unwrap duration_ms
  [5m]
) by (endpoint)
5. Rolling Out Distributed Tracing
5.1 End-to-End OpenTelemetry Integration
Instrumenting a Go application:
package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    // OTLP gRPC exporter → Tempo
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("tempo:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("order-service"),
            semconv.ServiceVersionKey.String("v1.2.3"),
            semconv.DeploymentEnvironmentKey.String("production"),
        )),
        sdktrace.WithSampler(sdktrace.ParentBased(
            sdktrace.TraceIDRatioBased(0.1), // 10% sampling
        )),
    )
    otel.SetTracerProvider(tp)
    // The caller should call tp.Shutdown(ctx) on exit to flush pending spans.
    return tp, nil
}
Instrumenting a Python application:
from flask import Flask, request
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

app = Flask(__name__)

resource = Resource.create({
    "service.name": "payment-service",
    "service.version": "v2.0.0",
    "deployment.environment": "production",
})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="tempo:4317", insecure=True)
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Usage
tracer = trace.get_tracer("payment-service")

@app.route("/pay", methods=["POST"])
def process_payment():
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", request.json["order_id"])
        span.set_attribute("payment.amount", request.json["amount"])
        # Business logic...
        return {"status": "success"}
5.2 Tempo Configuration
# tempo/tempo-config.yml
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    zipkin:
      endpoint: 0.0.0.0:9411
ingester:
  max_block_duration: 5m
compactor:
  compaction:
    block_retention: 720h  # 30 days
# Note: the processors configured below only produce metrics once they are
# also enabled through Tempo's overrides.
metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: production
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces
  processor:
    service_graphs:
      dimensions: [namespace, service]
    span_metrics:
      dimensions: [namespace, service, endpoint]
storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal
5.3 Sampling Strategies
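Of the strategies covered in this section, adaptive sampling reduces to a single formula, rate = min(1.0, target QPS / current QPS). The sketch below implements that formula together with a deterministic keep/drop decision derived from the trace ID, so every service that sees the same trace reaches the same verdict. The target of 100 QPS and the use of the low 64 bits of the trace ID are illustrative assumptions, not the behaviour of any particular SDK.

```python
def adaptive_sampling_rate(current_qps, target_qps=100.0):
    """Sampling probability that keeps roughly target_qps traces per second."""
    if current_qps <= 0:
        return 1.0
    return min(1.0, target_qps / current_qps)


def should_sample(trace_id_hex, rate):
    """Deterministic decision: the same trace ID always gets the same answer."""
    # Treat the low 64 bits of the 128-bit trace ID as a uniform random number.
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < int(rate * (1 << 64))


for qps in (50, 1000, 10000):
    print(f"{qps:>6} QPS -> sample {adaptive_sampling_rate(qps):.0%}")

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(should_sample(trace_id, adaptive_sampling_rate(1000)))
```

Head-based decisions like this one are cheap, but they are made before the outcome of the request is known, which is why error and slow traces are usually protected with tail-based policies as well.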
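# Sampling strategy matrix
sampling_strategies:
  # 1. Head-based sampling: decided at the entry point of the trace
  head_based:
    description: "Simple and cheap, but may miss anomalous traces"
    config:
      type: probabilistic
      param: 0.1  # 10% sampling rate
  # 2. Tail-based sampling: decided after the complete trace is collected
  tail_based:
    description: "Keeps anomalous traces, but needs more resources"
    config:
      policies:
        - name: errors
          type: status_code
          status_codes: [ERROR]
        - name: slow
          type: latency
          threshold_ms: 5000
        - name: probabilistic
          type: probabilistic
          percentage: 5
  # 3. Adaptive sampling
  adaptive:
    description: "Adjusts the sampling rate dynamically based on traffic"
    formula: |
      sampling rate = min(1.0, target QPS / current QPS)
      With a target of 100 QPS:
      QPS < 100:   100% sampling
      QPS = 1000:  10% sampling
      QPS = 10000: 1% sampling
OpenTelemetry Collector sampling configuration: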
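# otel-collector-config.yml
# The receivers, batch processor, and exporter are spelled out here because the
# pipeline at the bottom references them; the Tempo endpoint matches the rest
# of this article.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 3000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]
6. Troubleshooting SOP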
6.1 Standard Troubleshooting Flow
The five-step method:
1. Identify the symptom (What)
   → Alert content, user reports, monitoring charts
2. Narrow the scope (Where)
   → Which service? Which instance? Which endpoint?
3. Find the root cause (Why)
   → Logs, traces, resource metrics, change records
4. Fix the problem (How)
   → Roll back, scale out, restart, correct the configuration
5. Review and learn (Learn)
   → Root cause, impact, remediation, prevention
6.2 5-Why Analysis
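Problem: users report that placing an order fails
Why 1: Why does ordering fail?
  → The order service returns HTTP 500
Why 2: Why does the order service return 500?
  → Database connections time out
Why 3: Why do database connections time out?
  → The connection pool is exhausted (200/200)
Why 4: Why is the connection pool exhausted?
  → Slow queries pile up and connections are not released in time
Why 5: Why are there slow queries?
  → A feature released yesterday is missing an index and triggers full table scans
Root cause: missing index in the new feature's SQL → slow queries → pool exhaustion → service unavailable
Fix: add the index + tune connection pool timeouts + introduce a SQL review step
6.3 Fault Tree Analysis (FTA)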
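Service unavailable
├── Application layer
│   ├── Process crash → OOM / panic / segfault
│   ├── Configuration error → environment variables / config center
│   └── Code defect → bug / deadlock / infinite loop
├── Middleware layer
│   ├── Database → connection pool / slow queries / replication lag
│   ├── Cache → cache penetration / cache avalanche / Redis down
│   └── Message queue → backlog / dead consumers / unbalanced partitions
├── Network layer
│   ├── DNS resolution failure
│   ├── Load balancer failure
│   └── Network partition / saturated bandwidth
└── Infrastructure layer
    ├── Server → CPU / memory / disk
    ├── Container → Pod Evicted / node NotReady
    └── Cloud services → SLB / RDS / OSS outage
6.4 Command Quick Reference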
# === System resources ===
# CPU usage
top -bn1 | head -20
mpstat -P ALL 1 5
# Memory usage
free -h
grep -E "MemTotal|MemFree|Buffers|Cached" /proc/meminfo
# Disk
df -h
iotop -oP
# Network connections
ss -tunlp | head -20
ss -s  # connection summary
# === Containers ===
# Pod status
kubectl get pods -n production -o wide
kubectl describe pod <pod-name> -n production
# Pod logs
kubectl logs <pod-name> -n production --tail=100
kubectl logs <pod-name> -n production --previous  # logs from the previous (crashed) container
# Shell into the container
kubectl exec -it <pod-name> -n production -- /bin/sh
# === JVM ===
# Thread dump
jstack <pid> > thread_dump.txt
# Heap analysis (jmap -heap is JDK 8 only; it was removed from jmap in later JDKs)
jmap -heap <pid>
jmap -dump:format=b,file=heap.hprof <pid>
# GC statistics
jstat -gcutil <pid> 1000 10
# === Go ===
# pprof profiles (URLs quoted so the shell does not interpret "?")
go tool pprof "http://localhost:6060/debug/pprof/goroutine"
go tool pprof "http://localhost:6060/debug/pprof/heap"
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
7. Troubleshooting Common Failure Scenarios
7.1 Slow Service Responses
# Step 1: confirm the latency distribution
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="order-service"}[5m])) by (le))'
# Step 2: look at traces
# In Grafana Tempo, search for traces that took longer than 3s
# Step 3: check dependencies (P99 latency per downstream gRPC service)
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(grpc_client_handling_seconds_bucket{service="order-service"}[5m])) by (grpc_service, le))'
# Step 4: check for resource bottlenecks
# CPU
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=rate(process_cpu_seconds_total{service="order-service"}[5m])'
# GC pauses: seconds spent in GC pauses per second
# (the _count series would only show how often GC runs)
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=rate(go_gc_duration_seconds_sum{service="order-service"}[5m])'
Checklist:
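- Check the P99 latency trend
- Search for slow traces (> 3s) and locate the bottleneck span
- Check downstream latency (DB, Redis, RPC)
- Check GC pause time
- Check CPU / memory usage
- Check for large queries or slow SQL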
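The curl commands above return JSON from the Prometheus HTTP API. A small parser like the sketch below turns an instant-query response into label/value pairs and flags series that exceed the 500 ms latency target used earlier in this article. The sample response is hand-written for illustration; in practice the text would come from the HTTP call.

```python
import json


def parse_instant_vector(response_text):
    """Return [(labels, value)] from a Prometheus instant-query response."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error', 'unknown error')}")
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def latency_breaches(samples, threshold_seconds=0.5):
    """Series whose value exceeds the threshold (NaN never compares greater)."""
    return [(labels, v) for labels, v in samples if v > threshold_seconds]


# Hand-written sample response for a P99 latency query.
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {}, "value": [1700000000.0, "0.73"]}]}}
"""

samples = parse_instant_vector(SAMPLE_RESPONSE)
for labels, value in latency_breaches(samples):
    print(f"P99 {value:.2f}s exceeds target, labels={labels}")
```

Values arrive as strings because PromQL results may be `NaN` or `+Inf`, which plain JSON numbers cannot represent; `float()` handles both.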
7.2 Error Rate Spike
# Step 1: confirm the distribution of error types
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{service="order-service",code=~"5.."}[5m])) by (code, path)'
# Step 2: read the error logs
# LogQL query:
# {service="order-service"} |= "ERROR" | json | line_format "{{.error}}"
# Step 3: check recent changes
kubectl rollout history deployment/order-service -n production
# Step 4: roll back quickly (if a new version is the cause)
kubectl rollout undo deployment/order-service -n production
Checklist:
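- Confirm the error code distribution (500/502/503/504)
- Search the error logs and extract the key error messages
- Check for recent code releases or configuration changes
- Check whether downstream dependencies are healthy
- Check for a traffic surge
- Roll back if necessary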
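For the first checklist item, a sketch like the following turns per-code error rates into shares and attaches the first thing to check for the dominant code. The hints restate the quick-reference table in section 9.2; the input numbers are invented for illustration.

```python
# First checks per status code, taken from the quick-reference table (9.2).
HINTS = {
    "502": "upstream service down: check Pod status",
    "503": "service not ready or circuit breaker open: check health checks",
    "504": "slow downstream: check traces and slow queries",
}


def error_code_shares(rates_by_code):
    """Share of each 5xx code among all 5xx responses, largest first."""
    errors = {c: r for c, r in rates_by_code.items() if c.startswith("5")}
    total = sum(errors.values())
    if total == 0:
        return {}
    ranked = sorted(errors.items(), key=lambda item: item[1], reverse=True)
    return {code: rate / total for code, rate in ranked}


# Hypothetical requests per second by status code.
rates = {"200": 950.0, "502": 30.0, "504": 10.0, "500": 10.0}
shares = error_code_shares(rates)
dominant = next(iter(shares))
print({code: f"{share:.0%}" for code, share in shares.items()})
print(f"dominant: {dominant} -> {HINTS.get(dominant, 'read the application logs')}")
```

A single dominant gateway code (502/503/504) usually points at a dependency or at readiness, while a spread of plain 500s more often points at the application itself.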
7.3 Memory Leak
# Step 1: confirm the growth trend (bytes per second over the last hour)
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=deriv(process_resident_memory_bytes{service="order-service"}[1h])'
# Step 2: compare memory usage across instances (MiB)
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=process_resident_memory_bytes{service="order-service"} / 1024 / 1024'
# Step 3: Go applications - analyze the heap
curl http://localhost:6060/debug/pprof/heap > heap.prof
go tool pprof -http=:8081 heap.prof
# Step 4: Java applications - heap dump
jmap -dump:format=b,file=/tmp/heap.hprof <pid>
# Analyze with MAT or VisualVM
# Step 5: check for container OOM events
kubectl get events -n production --field-selector reason=OOMKilling
Checklist:
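- Observe the memory curve (steady climb vs. normal fluctuation)
- Compare the memory baseline with and without traffic
- Go: pprof heap, look at top alloc_space
- Java: jmap dump + MAT Dominator Tree
- Check for large cached objects without expiry
- Check that connections / handles are closed properly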
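The first checklist item, telling a steady climb from normal fluctuation, can be approximated with a least-squares trend over memory samples. The sketch below is one way to do that; the thresholds (5 MiB per hour, R² of 0.8) and the sample data are illustrative assumptions to tune per service.

```python
def linear_trend(samples):
    """Least-squares slope and R^2 for (timestamp_seconds, value) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    var_v = sum((v - mean_v) ** 2 for _, v in samples)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    slope = cov / var_t
    # R^2 near 1 means the points sit close to a straight line.
    r_squared = 0.0 if var_v == 0 else (cov * cov) / (var_t * var_v)
    return slope, r_squared


def looks_like_leak(samples, min_bytes_per_hour, min_r_squared=0.8):
    """True when memory rises both fast enough and consistently enough."""
    slope, r_squared = linear_trend(samples)
    return slope * 3600 >= min_bytes_per_hour and r_squared >= min_r_squared


MIB = 1024 * 1024
# Hypothetical hourly RSS samples: steady growth vs. a GC-style sawtooth.
steady = [(h * 3600, (500 + 10 * h) * MIB) for h in range(6)]
sawtooth = [(h * 3600, (500 + 20 * (h % 2)) * MIB) for h in range(6)]
print(looks_like_leak(steady, 5 * MIB), looks_like_leak(sawtooth, 5 * MIB))
```

Requiring a high R² in addition to a positive slope is what filters out sawtooth patterns, where memory is regularly reclaimed.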
7.4 Connection Pool Exhaustion
# Step 1: confirm the state of the pool
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=db_connections_open{service="order-service"}'
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=db_connections_max{service="order-service"}'
# Step 2: check connection wait time
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(db_connection_wait_seconds_bucket[5m])) by (le))'
# Step 3: check for slow queries
# PostgreSQL (run in psql):
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 seconds'
ORDER BY duration DESC;
# Step 4: check for connection leaks
# Look for connections that were never returned, on the application side
# Go: sql.DBStats
# Java: HikariCP metrics
Remediation:
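# Connection pool configuration: recommended starting values
database:
  pool:
    max_open_conns: 50        # maximum open connections
    max_idle_conns: 10        # maximum idle connections
    conn_max_lifetime: 300s   # maximum lifetime of a connection
    conn_max_idle_time: 60s   # maximum time a connection may stay idle
    acquire_timeout: 5s       # timeout for acquiring a connection
8. Getting Started with Chaos Engineering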
8.1 Deploying Chaos Mesh
# Install Chaos Mesh
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-testing \
  --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock
# Verify the installation
kubectl get pods -n chaos-testing
8.2 Common Fault Injection Scenarios
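Scenario 1: kill a random Pod (simulates an instance crash)
# chaos/pod-kill.yaml
# Chaos Mesh 2.x expresses recurring experiments with the Schedule kind;
# the inline scheduler.cron field belongs to the 1.x API.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-order-service
  namespace: chaos-testing
spec:
  schedule: '@every 30m'  # kill one random Pod every 30 minutes
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces: [production]
      labelSelectors:
        app: order-service
Scenario 2: inject network latency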
# chaos/network-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-db
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: order-service
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "75"
  direction: to
  target:
    selector:
      namespaces: [production]
      labelSelectors:
        app: mysql
    mode: all
  duration: "10m"
Scenario 3: CPU pressure
# chaos/stress-cpu.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: order-service
  stressors:
    cpu:
      workers: 4
      load: 80
  duration: "5m"
Scenario 4: disk fill (as written, the manifest below applies memory pressure)
# chaos/disk-fill.yaml
# Note: despite the file and resource names, the "memory" stressor below
# allocates memory; it does not fill the disk.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: disk-fill
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: log-service
  stressors:
    memory:
      workers: 1
      size: "2GiB"
  duration: "10m"
8.3 LitmusChaos in Brief
# Install LitmusChaos
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm
helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --create-namespace
# Create a ChaosEngine experiment
kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=order-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
EOF
8.4 Chaos Experiment Checklist
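# Chaos engineering experiment checklist
experiments:
  - name: "Basic fault injection"
    tests:
      - "Kill a random Pod → verify automatic recovery in < 30s"
      - "Take down a single node → verify the service is unaffected"
      - "DNS failure → verify the degradation strategy"
  - name: "Network faults"
    tests:
      - "Downstream latency +200ms → verify timeouts and circuit breaking"
      - "10% packet loss → verify the retry mechanism"
      - "Network partition → verify split-brain handling"
  - name: "Resource pressure"
    tests:
      - "CPU at 80% for 5min → verify HPA scaling"
      - "Memory at 90% → verify OOM protection"
      - "Disk full → verify log rotation"
  - name: "Dependency faults"
    tests:
      - "Database primary/replica failover → verify read/write splitting"
      - "Redis primary down → verify Sentinel failover"
      - "MQ consumption paused → verify backlog alerts"
9. On-Call Handbook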
9.1 On-Call Checklist
# Daily inspection checklist (every day at 09:00 / 18:00)
daily_checklist:
  - name: "Cluster health"
    commands:
      - kubectl get nodes
      - kubectl top nodes
    expected: "All nodes Ready, CPU < 80%, Memory < 85%"
  - name: "Core service status"
    commands:
      - kubectl get pods -n production | grep -v Running
    expected: "No abnormal Pods"
  - name: "Alerts"
    commands:
      - curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'
    expected: "No active Critical alerts"
  - name: "SLO error budget"
    commands:
      - "Check the Grafana SLO dashboard"
    expected: "Error budget remaining > 50%"
  - name: "Certificate expiry"
    commands:
      - kubectl get certificate -A
      - echo | openssl s_client -connect api.example.com:443 2>/dev/null | openssl x509 -noout -dates
    expected: "All certificates expire in > 30 days"
  - name: "Backups"
    commands:
      - "Check Velero backup status"
      - kubectl get backup -n velero
    expected: "Latest backup succeeded and is < 24h old"
9.2 Common Problems Quick Reference
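┌────────────────────┬────────────────────────────┬─────────────────────────────────┐
│ Symptom            │ Likely cause               │ Action                          │
├────────────────────┼────────────────────────────┼─────────────────────────────────┤
│ 502 Bad Gateway    │ Upstream service is down   │ Check Pod status, restart       │
│ 503 Unavailable    │ Not ready / breaker open   │ Check health checks / breaker   │
│ 504 Timeout        │ Slow downstream response   │ Check traces and slow queries   │
│ Pod CrashLoop      │ Application fails to start │ kubectl logs --previous         │
│ Pod Pending        │ No resources / scheduling  │ kubectl describe pod            │
│ Pod Evicted        │ Disk / memory pressure     │ Free disk space, add nodes      │
│ OOMKilled          │ Memory limit exceeded      │ Tune limits, look for leaks     │
│ Disk usage > 90%   │ Log / temp file buildup    │ Clean logs, check rotation      │
│ CPU steadily > 90% │ Busy loop / traffic spike  │ Check code / HPA config         │
│ Pool exhausted     │ Slow queries / conn leaks  │ Tune SQL / resize the pool      │
│ Message backlog    │ Slow consumers             │ Scale consumers / find blocks   │
│ Cert expiry alert  │ Auto-renewal failed        │ Renew manually / check cert-mgr │
└────────────────────┴────────────────────────────┴─────────────────────────────────┘
9.3 Incident Response Process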
Incident severity levels:
P0 - Critical (core business completely unavailable)
  Response time: 5 minutes
  Procedure:
    1. Open an incident channel / notify by phone immediately
    2. Confirm the blast radius and number of affected users
    3. If the root cause is not found within 15 minutes → roll back
    4. Share a progress update every 30 minutes
    5. Complete the postmortem within 24h of the fix
P1 - Severe (core functionality degraded, some users affected)
  Response time: 15 minutes
  Procedure:
    1. The on-call engineer starts investigating
    2. Locate the root cause within 30 minutes
    3. Complete the fix within 1 hour
    4. Complete the postmortem within 48h
P2 - Moderate (non-core functionality broken)
  Response time: 1 hour
  Procedure:
    1. Create a ticket
    2. Handle it on the next working day
    3. Complete the fix within one week
P3 - Minor (experience improvements)
  Response time: next working day
  Procedure:
    1. Record it in the backlog
    2. Schedule it into an iteration
9.4 Incident Response Checklist
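# P0 incident response checklist
emergency_checklist:
  - step: "1. Confirm the incident"
    actions:
      - "Check the monitoring wallboard to confirm the anomaly"
      - "Check whether a matching alert fired"
      - "Confirm the blast radius (which services, how many users)"
  - step: "2. Notify the right people"
    actions:
      - "Open the incident channel (Feishu/DingTalk)"
      - "Notify the service owner"
      - "Notify business stakeholders (if needed)"
  - step: "3. Stop the bleeding"
    actions:
      - "New release → roll back"
      - "Traffic surge → rate limit / scale out"
      - "Downstream failure → circuit break and degrade"
      - "Configuration change → revert the configuration"
  - step: "4. Investigate the root cause"
    actions:
      - "Read the error logs"
      - "Inspect traces"
      - "Check resource metrics"
      - "Check recent change records"
  - step: "5. Verify the fix"
    actions:
      - "Confirm the fix has taken effect"
      - "Watch the metrics return to normal"
      - "Confirm there are no follow-up alerts"
  - step: "6. Postmortem"
    actions:
      - "Record the incident timeline"
      - "Analyze the root cause"
      - "Define improvement actions"
      - "Update the SOP / monitoring and alerts"
10. Complete Observability Platform Deployment Template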
10.1 One-Shot Deployment Script
#!/bin/bash
# deploy-observability.sh - deploy the observability platform in one run
set -euo pipefail
NAMESPACE="monitoring"
HELM_RELEASE="observability"
# Placeholder default: set GRAFANA_ADMIN_PASSWORD before any real deployment
GRAFANA_ADMIN_PASSWORD="${GRAFANA_ADMIN_PASSWORD:-admin123}"
echo "=== Creating namespace ==="
kubectl create namespace "${NAMESPACE}" --dry-run=client -o yaml | kubectl apply -f -
echo "=== Adding Helm repositories ==="
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
# The two repositories below are used later in this script
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
echo "=== Deploying kube-prometheus-stack (Prometheus + Grafana + Alertmanager) ==="
# The dot inside the grafana.ini key must be escaped for --set
helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace "${NAMESPACE}" \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.retentionSize=50GB \
  --set "prometheus.prometheusSpec.enableFeatures[0]=exemplar-storage" \
  --set grafana.adminPassword="${GRAFANA_ADMIN_PASSWORD}" \
  --set 'grafana.grafana\.ini.server.root_url=https://grafana.example.com' \
  --set alertmanager.config.global.resolve_timeout=5m \
  --wait
echo "=== Deploying Loki ==="
# Value names differ between versions of the grafana/loki chart; check them
# against the chart version you install
helm upgrade --install loki grafana/loki \
  --namespace "${NAMESPACE}" \
  --set mode=single \
  --set persistence.enabled=true \
  --set persistence.size=100Gi \
  --wait
echo "=== Deploying Promtail (log collection) ==="
helm upgrade --install promtail grafana/promtail \
  --namespace "${NAMESPACE}" \
  --set "config.clients[0].url=http://loki:3100/loki/api/v1/push" \
  --wait
echo "=== Deploying Tempo (tracing) ==="
helm upgrade --install tempo grafana/tempo \
  --namespace "${NAMESPACE}" \
  --set persistence.enabled=true \
  --set persistence.size=50Gi \
  --wait
echo "=== Deploying OpenTelemetry Collector ==="
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
  --namespace "${NAMESPACE}" \
  --set mode=deployment \
  --set config.exporters.otlp.endpoint="tempo:4317" \
  --set config.exporters.otlp.tls.insecure=true \
  --wait
echo "=== Deploying Chaos Mesh ==="
helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-testing \
  --create-namespace \
  --wait
echo "=== Importing Grafana dashboard ==="
# Golden Signals dashboard. Built from the file rather than inlined into a
# heredoc, because multi-line JSON would break the YAML block indentation.
kubectl create configmap grafana-dashboard-golden-signals \
  --namespace "${NAMESPACE}" \
  --from-file=golden-signals.json=dashboards/golden-signals.json \
  --dry-run=client -o yaml \
  | kubectl label --local -f - grafana_dashboard=1 -o yaml \
  | kubectl apply -f -
echo "=== Deployment complete ==="
echo "Grafana: https://grafana.example.com"
echo "Prometheus: https://prometheus.example.com"
echo "Alertmgr: https://alertmanager.example.com"
10.2 Helm Values Templates
# values-kube-prometheus-stack.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "50GB"
    enableFeatures:
      - exemplar-storage
    resources:
      requests:
        cpu: "500m"
        memory: "2Gi"
      limits:
        cpu: "2"
        memory: "8Gi"
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 200Gi
grafana:
  adminPassword: "admin123"  # placeholder: change before deploying
  persistence:
    enabled: true
    size: 20Gi
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"
  "grafana.ini":
    server:
      root_url: "https://grafana.example.com"
    auth.ldap:
      enabled: false
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
    receivers:
      - name: 'default'
        webhook_configs:
          - url: 'http://notification-service:8080/webhook'

# values-loki.yaml
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  rulerConfig:
    alertmanager_url: http://kube-prometheus-alertmanager:9093

# values-tempo.yaml
tempo:
  storage:
    trace:
      backend: local
      local:
        path: /var/tempo/traces
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"
        http:
          endpoint: "0.0.0.0:4318"
    zipkin:
      endpoint: "0.0.0.0:9411"
10.3 Ports and Access Summary
Observability platform services:
┌─────────────────┬────────┬──────────────────────────────┐
│ Service         │ Port   │ Purpose                      │
├─────────────────┼────────┼──────────────────────────────┤
│ Grafana         │ 3000   │ Dashboards and visualization │
│ Prometheus      │ 9090   │ Metrics storage and querying │
│ Alertmanager    │ 9093   │ Alert management             │
│ Loki            │ 3100   │ Log aggregation              │
│ Tempo           │ 3200   │ Trace storage                │
│ OTLP gRPC       │ 4317   │ OpenTelemetry data ingestion │
│ OTLP HTTP       │ 4318   │ OpenTelemetry data ingestion │
│ Node Exporter   │ 9100   │ Host metrics                 │
│ cAdvisor        │ 8080   │ Container metrics            │
│ Promtail        │ 9080   │ Log collection               │
│ Chaos Mesh      │ 2333   │ Chaos engineering dashboard  │
└─────────────────┴────────┴──────────────────────────────┘
Appendix: Directory Layout of Key Configuration Files
monitoring/
├── docker-compose.monitoring.yml
├── deploy-observability.sh
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       ├── slo_alerts.yml
│       ├── node_alerts.yml
│       └── app_alerts.yml
├── grafana/
│   ├── provisioning/
│   │   ├── datasources/
│   │   │   └── datasources.yml
│   │   └── dashboards/
│   │       └── dashboards.yml
│   └── dashboards/
│       ├── golden-signals.json
│       ├── node-exporter.json
│       └── slo-overview.json
├── alertmanager/
│   └── alertmanager.yml
├── loki/
│   └── loki-config.yml
├── tempo/
│   └── tempo-config.yml
├── otel-collector/
│   └── otel-collector-config.yml
└── chaos/
    ├── pod-kill.yaml
    ├── network-delay.yaml
    ├── stress-cpu.yaml
    └── disk-fill.yaml
Summary: observability is not a project, it is a journey. Start with Metrics, fill in Logs and Traces step by step, build an SLO framework, and finally use chaos engineering to verify that your observability really covers your failure scenarios. Remember: if your alerts did not fire before the failure happened, your monitoring is not yet good enough.