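Observability is not a tool you buy; it is an engineering culture. This article builds a production-grade observability platform from scratch, covering the three pillars, an SLO framework, Prometheus + Grafana monitoring, distributed tracing, a troubleshooting SOP, chaos engineering, and an on-call runbook. The configurations are meant as reusable starting points: replace placeholder credentials, hostnames, and image versions before using them in production.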


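1. The Three Pillars of Observability

1.1 Overview of the Three Pillars

Observability rests on three pillars: Metrics, Logs, and Traces. They do not exist in isolation; their real value comes from correlating them.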

┌─────────────────────────────────────────────────┐
│             Observability Platform              │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐ │
│  │  Metrics   │  │    Logs    │  │   Traces   │ │
│  │ Prometheus │  │    Loki    │  │   Jaeger   │ │
│  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘ │
│        │               │               │        │
│        └───────────────┼───────────────┘        │
│                        │                        │
│     Correlation layer (Exemplar / TraceID)      │
│                        │                        │
│              Grafana unified view               │
└─────────────────────────────────────────────────┘

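1.2 Correlation Methods

Method 1: Exemplar correlation (Metrics → Traces)

Prometheus exemplars let you attach a TraceID to individual metric samples:

// The application records a histogram observation with an exemplar attached.
// ObserveWithExemplar is defined on prometheus.ExemplarObserver, which the
// Histogram and Observer interfaces do not expose, hence the type assertion.
histogram.(prometheus.ExemplarObserver).ObserveWithExemplar(
    duration,
    prometheus.Labels{"traceID": span.SpanContext().TraceID().String()},
)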

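Method 2: Propagating the TraceID (Logs ↔ Traces)

Inject the TraceID into every log line so you can jump between logs and traces in both directions:

// Go example: extract the trace ID from the context and write it to the log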
span := trace.SpanFromContext(ctx)
logger.Info("request processed",
    zap.String("trace_id", span.SpanContext().TraceID().String()),
    zap.String("span_id", span.SpanContext().SpanID().String()),
)

Method 3: Shared labels (Metrics ↔ Logs)

Attach the same business labels to both metrics and logs:

# Shared labels
labels:
  service: order-service
  env: production
  region: cn-east-1
  version: v1.2.3

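1.3 Quick Comparison of the Three Pillars

┌────────────────┬─────────────────────┬─────────────────────────────────────┬────────────────────────┐
│ Dimension      │ Metrics             │ Logs                                │ Traces                 │
├────────────────┼─────────────────────┼─────────────────────────────────────┼────────────────────────┤
│ Data type      │ Numeric time series │ Unstructured / semi-structured text │ Request call chains    │
│ Storage cost   │ -                   │ -                                   │ -                      │
│ Query language │ PromQL              │ LogQL / Lucene                      │ TraceQL                │
│ Typical tools  │ Prometheus          │ Loki / ELK                          │ Jaeger / Tempo         │
│ Use case       │ Trend alerting      │ Detailed investigation              │ Call-path localization │
└────────────────┴─────────────────────┴─────────────────────────────────────┴────────────────────────┘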


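2. Defining SLOs and SLIs

2.1 Choosing SLIs

An SLI (Service Level Indicator) is a concrete measure of service quality. The selection rule: a good SLI measures something users can actually perceive.

# SLI definition template
slis:
  - name: availability
    description: "Request success rate"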
    metric: |
      sum(rate(http_requests_total{code!~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    target: 0.999  # 99.9%

  - name: latency
    description: "P99 latency"
    metric: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
      )
    target: 0.5     # 500ms

  - name: correctness
    description: "Data processing correctness rate"
    metric: |
      sum(rate(orders_processed_total{status="success"}[5m]))
      /
      sum(rate(orders_processed_total[5m]))
    target: 0.9999  # 99.99%
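
The latency SLI above relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The following is a simplified sketch of that idea, not Prometheus's actual implementation; the bucket values are made-up illustration data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the +Inf bucket (upper_bound == math.inf).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # fall back to the highest bound seen so far.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for http_request_duration_seconds buckets.
example_buckets = [(0.1, 50.0), (0.25, 80.0), (0.5, 95.0), (1.0, 100.0), (math.inf, 100.0)]
p99 = histogram_quantile(0.99, example_buckets)
print(f"estimated P99 = {p99:.3f}s, meets 0.5s target: {p99 <= 0.5}")
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the SLO threshold; placing a bucket boundary exactly at the target (0.5s here) is a common choice.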

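2.2 Setting SLOs

SLO (Service Level Objective) = SLI + target value. A typical ladder of targets:

SLO ladder:
├── 99%   → 7.31 hours of allowed downtime per month (internal tools)
├── 99.9% → 43.8 minutes of allowed downtime per month (general business services)
├── 99.95%→ 21.9 minutes of allowed downtime per month (core business services)
└── 99.99%→ 4.38 minutes of allowed downtime per month (payments / trading)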
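
The downtime figures in the ladder can be reproduced with a few lines of arithmetic. This sketch assumes an average month of 365.25 / 12 days (about 30.44 days), which is the month length that matches the numbers above; a fixed 30-day window gives slightly smaller values.

```python
# Assumption: an "average month" of 365.25 / 12 days, about 30.44 days.
AVG_MONTH_DAYS = 365.25 / 12


def allowed_downtime_minutes(slo_target, days=AVG_MONTH_DAYS):
    """Minutes of full downtime that an availability target tolerates over `days`."""
    return (1 - slo_target) * days * 24 * 60


for target in (0.99, 0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} -> {minutes:.2f} min ({minutes / 60:.2f} h)")
```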

2.3 Error Budget

# Error budget calculation script
def calculate_error_budget(slo_target, time_window_days, total_requests):
    """Compute the error budget for an SLO target over a time window."""
    allowed_failure_rate = 1 - slo_target
    # round() rather than int(): 1 - 0.9999 is slightly below 0.0001 in
    # floating point, so truncation would drop one allowed failure.
    allowed_failures = round(total_requests * allowed_failure_rate)

    return {
        "slo_target": f"{slo_target * 100:g}%",
        "time_window": f"{time_window_days} days",
        "total_requests": total_requests,
        "allowed_failures": allowed_failures,
        "allowed_downtime_minutes": round(time_window_days * 24 * 60 * allowed_failure_rate, 2),
    }

# Example: 99.9% SLO over a 30-day window
print(calculate_error_budget(0.999, 30, 100_000_000))
# The result includes 'allowed_failures': 100000 and 'allowed_downtime_minutes': 43.2

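2.4 Burn Rate Alerts

Burn rate is the speed at which the error budget is being consumed. A burn rate of 1 uses up the budget exactly at the end of the SLO window; a burn rate of 2 consumes it twice as fast, exhausting it halfway through the window.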
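
To make the relationship between burn rate, alert window, and budget consumption concrete, here is a small worked sketch. The function names are illustrative, and the 30-day SLO period is an assumption that matches the 99.9% / 30-day example used elsewhere in this article.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1 - slo_target)


def budget_consumed(rate, window_hours, period_days=30):
    """Fraction of the period's error budget used if `rate` holds for `window_hours`."""
    return rate * window_hours / (period_days * 24)


def hours_to_exhaustion(rate, period_days=30):
    """Hours until the whole budget is gone at a constant burn rate."""
    return period_days * 24 / rate


# A 1.44% error rate against a 99.9% SLO is a burn rate of about 14.4.
print(round(burn_rate(0.0144, 0.999), 2))
# Sustained for 1 hour, that consumes about 2% of a 30-day budget.
print(round(budget_consumed(14.4, 1), 4))
# A 6x burn rate sustained for 6 hours consumes about 5%.
print(round(budget_consumed(6, 6), 4))
```

At a constant 14.4x burn rate the full 30-day budget lasts 50 hours, which is why that threshold is usually treated as page-worthy.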

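# Prometheus alerting rules: burn-rate alerts on a 1h and a 6h window
groups:
  - name: slo_alerts
    rules:
      # Fast burn: a 14.4x burn rate sustained for 1 hour consumes 2% of the 30-day budget (14.4 × 1h / 720h)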
      - alert: SLOBurnRateHigh
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{code!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning fast (14.4x)"

      # Slow burn: a 6x burn rate sustained for 6 hours consumes 5% of the 30-day budget (6 × 6h / 720h)
      - alert: SLOBurnRateMedium
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{code!~"5.."}[6h]))
              /
              sum(rate(http_requests_total[6h]))
            )
          ) > (6 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning slowly (6x)"

3. Prometheus + Grafana Monitoring Stack

3.1 Full Deployment (Docker Compose)

# docker-compose.monitoring.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=50GB'
      - '--web.enable-lifecycle'
      - '--web.enable-remote-write-receiver'
      - '--enable-feature=exemplar-storage'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    depends_on:
      - prometheus
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped

  loki:
    image: grafana/loki:2.9.6
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  tempo:
    image: grafana/tempo:2.4.1
    container_name: tempo
    ports:
      - "3200:3200"   # Tempo HTTP
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./tempo/tempo-config.yml:/etc/tempo/config.yaml
      - tempo_data:/var/tempo
    command: -config.file=/etc/tempo/config.yaml
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
  tempo_data:

3.2 Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    cluster: production
    region: cn-east-1

# Alerting rule files
rule_files:
  - "rules/*.yml"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Scrape targets
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'server-01'

  # cAdvisor (container metrics)
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

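  # Application services (via service discovery; this job only works when
  # Prometheus can reach a Kubernetes API server, so drop it for a pure Docker Compose setup)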
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

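  # Blackbox probing (assumes a blackbox exporter reachable at blackbox-exporter:9115,
  # which the Compose file in 3.1 does not define)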
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
          - https://www.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

3.3 Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:465'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
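  routes:
    # Child routes are evaluated in order and the first match wins unless
    # `continue: true` is set. The SLO route comes first so that SLOBurn*
    # alerts (which also carry a severity label) actually reach it, and it
    # continues so they are still routed by severity afterwards.
    - match_re:
        alertname: SLOBurn.*
      receiver: 'slo-escalation'
      repeat_interval: 1h
      continue: true
    - match:
        severity: critical
      receiver: 'pager'
      group_wait: 5s
    - match:
        severity: warning
      receiver: 'slack'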

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/ops/send'

  - name: 'pager'
    pagerduty_configs:
      - service_key: 'xxx'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/emergency/send'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: 'slo-escalation'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/slo/send'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
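
Route ordering matters in Alertmanager: child routes are checked top to bottom, and an alert stops at the first route whose matchers all match unless that route sets `continue: true`. The toy model below illustrates only this first-match behaviour (it ignores `continue`, nested routes, and grouping); the route lists are simplified stand-ins for the configuration above.

```python
import re


def pick_receiver(labels, routes, default="default"):
    """Return the receiver of the first route whose matchers all match the alert labels."""
    for route in routes:
        exact = route.get("match", {})
        regex = route.get("match_re", {})
        exact_ok = all(labels.get(k) == v for k, v in exact.items())
        # Alertmanager regex matchers are anchored, so use fullmatch.
        regex_ok = all(re.fullmatch(p, labels.get(k, "")) for k, p in regex.items())
        if exact_ok and regex_ok:
            return route["receiver"]
    return default


severity_first = [
    {"match": {"severity": "critical"}, "receiver": "pager"},
    {"match": {"severity": "warning"}, "receiver": "slack"},
    {"match_re": {"alertname": "SLOBurn.*"}, "receiver": "slo-escalation"},
]
slo_first = [severity_first[2], severity_first[0], severity_first[1]]

alert = {"alertname": "SLOBurnRateHigh", "severity": "critical"}
print(pick_receiver(alert, severity_first))  # severity route shadows the SLO route
print(pick_receiver(alert, slo_first))       # SLO route is reached
```

With the severity routes listed first, an SLO burn alert labelled `severity: critical` is delivered to `pager` and never reaches `slo-escalation`; listing the SLO route first reverses that.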

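3.4 Grafana Data Source Configuration

# grafana/provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    # Explicit uid so the Tempo data source below can reference it
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo

  - name: Loki
    type: loki
    # Explicit uid so tracesToLogs below can reference it
    uid: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          # Matches both trace_id (as written by the logging examples in this
          # article) and traceID, in key=value or JSON form
          matcherRegex: "(?:trace_id|traceID)\\W+(\\w+)"
          name: TraceID
          url: "$${__value.raw}"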
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        tags: ['service']
      tracesToMetrics:
        datasourceUid: prometheus
      serviceMap:
        datasourceUid: prometheus

3.5 Core Dashboard Template

{
  "dashboard": {
    "title": "Service Overview - Golden Signals",
    "panels": [
      {
        "title": "Request Rate (QPS)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=~\"$service\"}[5m])) by (code)",
            "legendFormat": "{{code}}"
          }
        ]
      },
      {
        "title": "Request Latency Distribution",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[5m])) by (le))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[5m])) by (le))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[5m])) by (le))",
            "legendFormat": "p99"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "1 - (sum(rate(http_requests_total{service=~\"$service\",code!~\"5..\"}[5m])) / sum(rate(http_requests_total{service=~\"$service\"}[5m])))",
            "legendFormat": "Error Rate"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 0.001, "color": "yellow"},
                {"value": 0.01, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "SLO Error Budget Remaining",
        "type": "gauge",
        "targets": [
          {
            "expr": "1 - ((1 - (sum(rate(http_requests_total{service=~\"$service\",code!~\"5..\"}[30d])) / sum(rate(http_requests_total{service=~\"$service\"}[30d])))) / (1 - 0.999))",
            "legendFormat": "Budget Remaining"
          }
        ]
      }
    ]
  }
}
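
The last panel's expression computes the remaining error budget as 1 minus the ratio of the observed error rate to the allowed error rate. A minimal sketch of the same arithmetic, using made-up request counts, shows how to read the gauge: 1.0 means an untouched budget, 0 means fully spent, and negative values mean the SLO has been violated.

```python
def error_budget_remaining(good_requests, total_requests, slo_target):
    """Mirror of the dashboard expression: 1 - observed_error_rate / allowed_error_rate."""
    if total_requests == 0:
        return 1.0  # no traffic, no budget spent
    error_rate = 1 - good_requests / total_requests
    return 1 - error_rate / (1 - slo_target)


# Hypothetical 30-day totals for a 99.9% SLO.
half_spent = error_budget_remaining(99_950_000, 100_000_000, 0.999)
overspent = error_budget_remaining(99_800_000, 100_000_000, 0.999)
print(f"half spent: {half_spent:.2f}, overspent: {overspent:.2f}")
```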

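4. Correlating Logs and Metrics

4.1 Exemplar Configuration

Enable exemplar storage in Prometheus:

# Prometheus command-line flag (a startup argument, not a prometheus.yml setting)
--enable-feature=exemplar-storage

// Application-side Go example
import (
    "context"

    "github.com/prometheus/client_golang/prometheus"
    "go.opentelemetry.io/otel/trace"
)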

var httpDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request duration",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "path", "status"},
)

func recordMetrics(ctx context.Context, method, path, status string, duration float64) {
    span := trace.SpanFromContext(ctx)
    labels := prometheus.Labels{
        "method": method,
        "path":   path,
        "status": status,
    }
    
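    // Attach the TraceID as an exemplar. Exemplars are only exposed in the
    // OpenMetrics format, so serve /metrics with
    // promhttp.HandlerOpts{EnableOpenMetrics: true}.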
    if span.SpanContext().IsValid() {
        httpDuration.With(labels).(prometheus.ExemplarObserver).ObserveWithExemplar(
            duration,
            prometheus.Labels{
                "traceID": span.SpanContext().TraceID().String(),
            },
        )
    } else {
        httpDuration.With(labels).Observe(duration)
    }
}

4.2 Injecting the TraceID into Logs

Python + structlog example:

import structlog
from opentelemetry import trace

def add_trace_info(logger, method_name, event_dict):
    span = trace.get_current_span()
    ctx = span.get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, '032x')
        event_dict["span_id"] = format(ctx.span_id, '016x')
    return event_dict

structlog.configure(
    processors=[
        add_trace_info,
        structlog.dev.ConsoleRenderer(),
    ]
)

logger = structlog.get_logger()
logger.info("order_created", order_id="ORD-001", amount=99.9)
# When called inside an active span, the output includes trace_id and span_id

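Java + Logback example:

<!-- logback-spring.xml -->
<configuration>
    <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
            <customFields>{"service":"order-service","env":"prod"}</customFields>
            <!-- MDC entries are included automatically, so traceId and spanId
                 appear once the tracing instrumentation puts them into the MDC -->
        </encoder>
    </appender>

    <!-- Attach the appender; without a reference it is never used -->
    <root level="INFO">
        <appender-ref ref="JSON"/>
    </root>
</configuration>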

4.3 Turning Loki Logs into Metrics

# LogQL metric queries need no dedicated switch in loki/loki-config.yml;
# they can be run directly from Grafana or the Loki API.

# Extracting metrics from logs with LogQL
# 1. Rate of error log lines
# rate({service="order-service"} |= "ERROR" [5m])

# 2. Extract a numeric value (such as response time)
# quantile_over_time(0.99, {service="order-service"} | json | unwrap response_time [5m])

# 3. Count by status code (the json parser exposes status as a label)
# sum by (status) (rate({service="order-service"} | json | __error__="" [5m]))

Configuring a Log → Metric query in Grafana:

# LogQL: extract P99 latency from logs
quantile_over_time(
  0.99,
  {namespace="production", service="order-service"}
  | json
  | unwrap duration_ms
  [5m]
) by (endpoint)

5. Rolling Out Distributed Tracing

5.1 End-to-End OpenTelemetry Integration

Instrumenting a Go application:

package main

import (
    "context"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    // OTLP gRPC exporter → Tempo
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("tempo:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("order-service"),
            semconv.ServiceVersionKey.String("v1.2.3"),
            semconv.DeploymentEnvironmentKey.String("production"),
        )),
        sdktrace.WithSampler(sdktrace.ParentBased(
            sdktrace.TraceIDRatioBased(0.1), // 10% sampling
        )),
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

Instrumenting a Python application:

from flask import Flask, request
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "payment-service",
    "service.version": "v2.0.0",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="tempo:4317", insecure=True)
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Flask application that the route below is registered on
app = Flask(__name__)

# Usage
tracer = trace.get_tracer("payment-service")

@app.route("/pay", methods=["POST"])
def process_payment():
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", request.json["order_id"])
        span.set_attribute("payment.amount", request.json["amount"])
        # Business logic...
        return {"status": "success"}

5.2 Tempo Configuration

# tempo/tempo-config.yml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    zipkin:
      endpoint: 0.0.0.0:9411

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 720h  # 30 days

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: production
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces
  processor:
    service_graphs:
      dimensions: [namespace, service]
    span_metrics:
      dimensions: [namespace, service, endpoint]

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

# The metrics-generator processors configured above only run once they are
# enabled through overrides
overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]

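5.3 Sampling Strategies

# Sampling strategy matrix
sampling_strategies:
  # 1. Head-based sampling: decided at the entry point of the trace
  head_based:
    description: "Simple and cheap, but may miss traces that turn out to be anomalous"
    config:
      type: probabilistic
      param: 0.1  # 10% sampling rate

  # 2. Tail-based sampling: decided after the whole trace has been collected
  tail_based:
    description: "Keeps anomalous traces, but needs more resources"
    config:
      policies:
        - name: errors
          type: status_code
          status_codes: [ERROR]
        - name: slow
          type: latency
          threshold_ms: 5000
        - name: probabilistic
          type: probabilistic
          percentage: 5

  # 3. Adaptive sampling
  adaptive:
    description: "Adjusts the sampling rate dynamically based on traffic"
    formula: |
      sampling_rate = min(1.0, target_QPS / current_QPS)
      QPS < 100:   100% sampling
      QPS = 1000:  10% sampling
      QPS = 10000: 1% sampling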
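
The adaptive formula can be written as a one-line function. In this sketch the target of 100 sampled traces per second is inferred from the three example rows above rather than stated by the source, so treat it as an assumption to tune.

```python
def adaptive_sample_rate(current_qps, target_qps=100.0):
    """Sampling probability that keeps roughly `target_qps` traces per second.

    target_qps=100 is inferred from the examples above, not a recommended value.
    """
    if current_qps <= 0:
        return 1.0  # no traffic: sample everything
    return min(1.0, target_qps / current_qps)


for qps in (50, 1000, 10000):
    print(f"QPS={qps}: sample {adaptive_sample_rate(qps):.0%}")
```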

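OpenTelemetry Collector sampling configuration:

# otel-collector-config.yml
# The tail_sampling processor ships with the Collector "contrib" distribution.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 3000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
  batch: {}

exporters:
  # Forward sampled traces to Tempo's OTLP gRPC endpoint
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]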
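
Conceptually, the three policies above combine as a logical OR: a trace is kept if it contains an error, or is slower than the threshold, or wins the probabilistic draw. The sketch below models only that decision; the trace dictionary shape and the function name are hypothetical, and the real processor additionally buffers spans for `decision_wait` before deciding.

```python
import random


def keep_trace(trace, latency_threshold_ms=3000, sampling_percentage=5, rng=random.random):
    """Decide whether to keep a completed trace.

    trace: dict with 'has_error' (bool) and 'duration_ms' (number); a made-up shape.
    rng: injectable random source returning a float in [0, 1), for testability.
    """
    if trace["has_error"]:
        return True  # errors policy
    if trace["duration_ms"] > latency_threshold_ms:
        return True  # slow-traces policy
    return rng() * 100 < sampling_percentage  # probabilistic policy


healthy = {"has_error": False, "duration_ms": 120}
print(keep_trace({"has_error": True, "duration_ms": 120}))
print(keep_trace({"has_error": False, "duration_ms": 4500}))
print(keep_trace(healthy, rng=lambda: 0.99))
```

Injecting `rng` keeps the decision deterministic in tests while defaulting to `random.random` in normal use.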

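6. Troubleshooting SOP

6.1 Standard Troubleshooting Workflow

Five steps of troubleshooting:
┌────────────────────────────────────────────────────────┐
│ 1. Identify the symptom (What)                         │
│    → Alert content, user reports, dashboards           │
├────────────────────────────────────────────────────────┤
│ 2. Narrow the scope (Where)                            │
│    → Which service? Which instance? Which endpoint?    │
├────────────────────────────────────────────────────────┤
│ 3. Find the root cause (Why)                           │
│    → Logs, traces, resource metrics, change history    │
├────────────────────────────────────────────────────────┤
│ 4. Fix the problem (How)                               │
│    → Roll back, scale out, restart, fix configuration  │
├────────────────────────────────────────────────────────┤
│ 5. Review and learn (Learn)                            │
│    → Root cause, impact, remediation, prevention       │
└────────────────────────────────────────────────────────┘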

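6.2 5-Why Analysis

Problem: users report that placing an order fails

Why 1: Why does placing an order fail?
→ The order service returns a 500 error

Why 2: Why does the order service return 500?
→ Database connections time out

Why 3: Why do database connections time out?
→ The connection pool is exhausted (200/200)

Why 4: Why is the connection pool exhausted?
→ Slow queries pile up and connections are not released in time

Why 5: Why are there slow queries?
→ A feature released yesterday lacks an index and triggers full table scans

Root cause: new feature's SQL lacks an index → slow queries → connection pool exhausted → service unavailable
Fix: add the index + tune connection pool timeouts + introduce a SQL review process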

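6.3 Fault Tree Analysis (FTA)

Service unavailable
├── Application layer
│   ├── Process crash → OOM / Panic / Segfault
│   ├── Misconfiguration → environment variables / config center
│   └── Code defect → bug / deadlock / infinite loop
├── Middleware layer
│   ├── Database → connection pool / slow queries / replication lag
│   ├── Cache → cache penetration / cache avalanche / Redis outage
│   └── Message queue → backlog / consumer down / uneven partitions
├── Network layer
│   ├── DNS resolution failure
│   ├── Load balancer failure
│   └── Network partition / bandwidth saturation
└── Infrastructure layer
    ├── Server → CPU / memory / disk
    ├── Container → Pod Evicted / node NotReady
    └── Cloud services → SLB / RDS / OSS failure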

6.4 Troubleshooting Command Cheat Sheet

# === System resources ===
# CPU usage
top -bn1 | head -20
mpstat -P ALL 1 5

# Memory usage
free -h
grep -E "MemTotal|MemFree|Buffers|Cached" /proc/meminfo

# Disk
df -h
iotop -oP

# Network connections
ss -tunlp | head -20
ss -s  # connection summary

# === Container troubleshooting ===
# Pod status
kubectl get pods -n production -o wide
kubectl describe pod <pod-name> -n production

# Pod logs
kubectl logs <pod-name> -n production --tail=100
kubectl logs <pod-name> -n production --previous  # logs from the previous (crashed) container

# Open a shell inside the container
kubectl exec -it <pod-name> -n production -- /bin/sh

# === JVM troubleshooting ===
# Thread dump
jstack <pid> > thread_dump.txt

# Heap analysis
jmap -heap <pid>  # JDK 8 only; on JDK 9+ use: jhsdb jmap --heap --pid <pid>
jmap -dump:format=b,file=heap.hprof <pid>

# GC statistics
jstat -gcutil <pid> 1000 10

# === Go troubleshooting ===
# pprof profiling (quote URLs so the shell does not interpret "?")
go tool pprof "http://localhost:6060/debug/pprof/goroutine"
go tool pprof "http://localhost:6060/debug/pprof/heap"
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

7. Troubleshooting Common Failure Scenarios

7.1 Slow Service Responses

# Step 1: Confirm the latency distribution
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="order-service"}[5m])) by (le))'

# Step 2: Look at traces
# In Grafana, search Tempo for traces that took longer than 3s

# Step 3: Check dependencies
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(grpc_client_handling_seconds_bucket{service="order-service"}[5m])) by (grpc_service, le)'

# Step 4: Check for resource bottlenecks
# CPU
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=rate(process_cpu_seconds_total{service="order-service"}[5m])'

# GC pauses: seconds spent in GC pauses per second (the _count series would only give GC frequency)
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=rate(go_gc_duration_seconds_sum{service="order-service"}[5m])'

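Checklist:

  • Check the P99 latency trend

  • Search for slow traces (> 3s) and locate the bottleneck span

  • Check downstream latency (DB, Redis, RPC)

  • Check GC pause time

  • Check CPU / memory usage

  • Check for large queries or slow SQL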

7.2 Error Rate Spike

# Step 1: Confirm the distribution of error types
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{service="order-service",code=~"5.."}[5m])) by (code, path)'

# Step 2: Inspect error logs
# LogQL query
# {service="order-service"} |= "ERROR" | json | line_format "{{.error}}"

# Step 3: Check recent changes
kubectl rollout history deployment/order-service -n production

# Step 4: Roll back quickly (if a new version caused the problem)
kubectl rollout undo deployment/order-service -n production

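Checklist:

  • Confirm the error code distribution (500/502/503/504)

  • Search the error logs and extract the key error messages

  • Check for recent code releases or configuration changes

  • Check whether downstream dependencies are healthy

  • Check for a traffic surge

  • Roll back if necessary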

7.3 Memory Leak

# Step 1: Confirm the memory growth trend
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=process_resident_memory_bytes{service="order-service"}'

# Step 2: Compare memory usage across instances (in MiB)
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=process_resident_memory_bytes{service="order-service"} / 1024 / 1024'

# Step 3: Go applications - analyze the heap
curl http://localhost:6060/debug/pprof/heap > heap.prof
go tool pprof -http=:8081 heap.prof

# Step 4: Java applications - heap dump
jmap -dump:format=b,file=/tmp/heap.hprof <pid>
# Analyze with MAT or VisualVM

# Step 5: Check for container OOM events
kubectl get events -n production --field-selector reason=OOMKilling

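Checklist:

  • Watch the memory growth curve (steady climb vs. normal fluctuation)

  • Compare the memory baseline with and without traffic

  • Go: pprof heap analysis, top by alloc_space

  • Java: jmap dump + MAT Dominator Tree analysis

  • Check for large object caches without expiry

  • Check that connections / handles are closed properly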
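
One way to tell a steady climb from normal fluctuation is to fit a straight line through the memory samples and look at its slope, which is similar in spirit to PromQL's `deriv()`. The sketch below uses a plain least-squares fit; the 10 MiB/hour threshold and the sample data are illustrative assumptions, not recommended values.

```python
def growth_slope(samples):
    """Least-squares slope of (timestamp_seconds, value) pairs, in units per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


def looks_like_leak(samples, min_bytes_per_hour=10 * 1024 * 1024):
    """True when resident memory grows at least `min_bytes_per_hour` on average."""
    return growth_slope(samples) * 3600 >= min_bytes_per_hour


# Hypothetical samples every 10 minutes over two hours.
climbing = [(i * 600, 500e6 + i * 5e6) for i in range(12)]
fluctuating = [(i * 600, 500e6 + (5e6 if i % 2 else -5e6)) for i in range(12)]
print(looks_like_leak(climbing), looks_like_leak(fluctuating))
```

A longer observation window makes the slope less sensitive to short-lived spikes such as cache warm-up after a deployment.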

7.4 Connection Pool Exhaustion

# Step 1: Confirm the connection pool state
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=db_connections_open{service="order-service"}'

curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=db_connections_max{service="order-service"}'

# Step 2: Check connection wait time
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(db_connection_wait_seconds_bucket[5m])) by (le))'

# Step 3: Check for slow queries
# PostgreSQL (run in psql)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 seconds'
ORDER BY duration DESC;

# Step 4: Check for connection leaks
# Look for connections the application has not returned to the pool
# Go: sql.DBStats
# Java: HikariCP metrics

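Remediation:

# Example connection pool settings (starting values; tune them to your database and workload)
database:
  pool:
    max_open_conns: 50          # maximum open connections
    max_idle_conns: 10          # maximum idle connections
    conn_max_lifetime: 300s     # maximum lifetime of a connection
    conn_max_idle_time: 60s     # maximum time a connection may stay idle
    acquire_timeout: 5s         # timeout for acquiring a connection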

8. Getting Started with Chaos Engineering

8.1 Deploying Chaos Mesh

# Install Chaos Mesh
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-testing \
  --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock

# Verify the installation
kubectl get pods -n chaos-testing

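8.2 Common Fault Injection Scenarios

Scenario 1: Randomly killing a Pod (simulating an instance crash)

# chaos/pod-kill.yaml
# Chaos Mesh 2.x removed the inline `scheduler` field; recurring experiments
# are expressed with a Schedule resource that wraps the PodChaos spec.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-order-service
  namespace: chaos-testing
spec:
  schedule: '@every 30m'  # kill one random Pod every 30 minutes
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces: [production]
      labelSelectors:
        app: order-service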

Scenario 2: Injecting network latency

# chaos/network-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-db
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: order-service
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "75"
  direction: to
  target:
    selector:
      namespaces: [production]
      labelSelectors:
        app: mysql
    mode: all
  duration: "10m"

Scenario 3: CPU stress

# chaos/stress-cpu.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: order-service
  stressors:
    cpu:
      workers: 4
      load: 80
  duration: "5m"

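Scenario 4: Memory pressure

# chaos/stress-memory.yaml
# StressChaos provides only cpu and memory stressors, so this manifest applies
# memory pressure rather than filling a disk. Disk-fill testing needs a
# different tool, for example the LitmusChaos disk-fill experiment.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: log-service
  stressors:
    memory:
      workers: 1
      size: "2GiB"
  duration: "10m"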

8.3 A Brief Introduction to LitmusChaos

# Install LitmusChaos
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm
helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --create-namespace

# Create a ChaosEngine experiment
kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
  namespace: production
spec:
  engineState: 'active'
  appinfo:
    appns: production
    applabel: app=order-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
EOF

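8.4 Chaos Experiment Checklist

# Chaos engineering experiment checklist
experiments:
  - name: "Basic fault injection"
    tests:
      - "Kill a random Pod → verify automatic recovery in < 30s"
      - "Take down a single node → verify the service is unaffected"
      - "DNS failure → verify the degradation strategy"

  - name: "Network faults"
    tests:
      - "Downstream latency +200ms → verify timeouts and circuit breaking"
      - "10% packet loss → verify the retry mechanism"
      - "Network partition → verify split-brain handling"

  - name: "Resource pressure"
    tests:
      - "CPU at 80% for 5 min → verify HPA scaling"
      - "Memory at 90% → verify OOM protection"
      - "Disk full → verify log rotation"

  - name: "Dependency failures"
    tests:
      - "Database primary/replica switchover → verify read/write splitting"
      - "Redis primary down → verify Sentinel failover"
      - "MQ consumption paused → verify backlog alerting"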

9. On-Call Runbook

9.1 On-Call Checklist

# Daily inspection checklist (every day at 09:00 / 18:00)
daily_checklist:
  - name: "Cluster health"
    commands:
      - kubectl get nodes
      - kubectl top nodes
    expected: "All nodes Ready, CPU < 80%, Memory < 85%"

  - name: "Core service status"
    commands:
      - kubectl get pods -n production | grep -v Running
    expected: "No abnormal Pods"

  - name: "Alert check"
    commands:
      - curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'
    expected: "No active Critical alerts"

  - name: "SLO error budget"
    commands:
      - "Check the Grafana SLO dashboard"
    expected: "More than 50% of the error budget remaining"

  - name: "Certificate expiry check"
    commands:
      - kubectl get certificate -A
      - echo | openssl s_client -connect api.example.com:443 2>/dev/null | openssl x509 -noout -dates
    expected: "All certificates more than 30 days from expiry"

  - name: "Backup status"
    commands:
      - "Check Velero backup status"
      - kubectl get backup -n velero
    expected: "Latest backup succeeded and is less than 24h old"

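9.2 Common Problems Quick Reference

┌───────────────────────────┬───────────────────────────────────┬───────────────────────────────────────┐
│ Symptom                   │ Possible cause                    │ Action                                │
├───────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│ 502 Bad Gateway           │ Upstream service down             │ Check Pod status, restart             │
│ 503 Unavailable           │ Service not ready / circuit open  │ Check health checks / circuit breaker │
│ 504 Timeout               │ Slow downstream response          │ Check traces and slow queries         │
│ Pod CrashLoop             │ Application fails to start        │ kubectl logs --previous               │
│ Pod Pending               │ No resources / scheduling failure │ kubectl describe pod                  │
│ Pod Evicted               │ Disk / memory pressure            │ Free disk space, add nodes            │
│ OOMKilled                 │ Memory limit exceeded             │ Adjust limits, look for leaks         │
│ Disk usage > 90%          │ Log / temp file buildup           │ Clean up logs, check rotation         │
│ CPU sustained > 90%       │ Infinite loop / traffic spike     │ Check code / HPA config               │
│ Connection pool exhausted │ Slow queries / connection leak    │ Optimize SQL / resize pool            │
│ Message backlog           │ Slow consumers                    │ Scale consumers / find blockage       │
│ Certificate expiry alert  │ Auto-renewal failed               │ Renew manually / check cert-manager   │
└───────────────────────────┴───────────────────────────────────┴───────────────────────────────────────┘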

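9.3 Incident Response Process

Incident severity levels:
┌────────────────────────────────────────────────────────────────┐
│ P0 - Critical (core business completely unavailable)           │
│ Response time: 5 minutes                                       │
│ Procedure:                                                     │
│   1. Open an incident channel / notify by phone immediately    │
│   2. Confirm impact scope and number of affected users         │
│   3. Root cause not found within 15 minutes → roll back        │
│   4. Post a progress update every 30 minutes                   │
│   5. Complete the postmortem within 24h of the fix             │
├────────────────────────────────────────────────────────────────┤
│ P1 - Major (core functionality degraded, some users affected)  │
│ Response time: 15 minutes                                      │
│ Procedure:                                                     │
│   1. On-call engineer starts investigating                     │
│   2. Identify the root cause within 30 minutes                 │
│   3. Complete the fix within 1 hour                            │
│   4. Complete the postmortem within 48h                        │
├────────────────────────────────────────────────────────────────┤
│ P2 - Moderate (non-core functionality affected)                │
│ Response time: 1 hour                                          │
│ Procedure:                                                     │
│   1. Open a ticket to track it                                 │
│   2. Handle on the next business day                           │
│   3. Complete the fix within one week                          │
├────────────────────────────────────────────────────────────────┤
│ P3 - Minor (UX improvements)                                   │
│ Response time: next business day                               │
│ Procedure:                                                     │
│   1. Add to the backlog                                        │
│   2. Schedule into an iteration                                │
└────────────────────────────────────────────────────────────────┘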
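
The response windows and the P0 rollback rule are simple enough to encode, for example in an on-call bot or a paging script. The sketch below assumes two hypothetical helpers, `response_minutes` and `should_rollback`; the numbers come straight from the levels above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers encoding the severity levels above.

# Print the first-response window for a severity level (minutes where numeric).
response_minutes() {
  case "$1" in
    P0) echo 5 ;;
    P1) echo 15 ;;
    P2) echo 60 ;;
    P3) echo "next-business-day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# P0 rule: roll back once 15 minutes have passed without a root cause.
# Args: severity, minutes elapsed, "yes"/"no" for whether the root cause is known.
should_rollback() {
  local severity="$1" minutes_elapsed="$2" root_cause_found="$3"
  [ "$severity" = "P0" ] \
    && [ "$root_cause_found" != "yes" ] \
    && [ "$minutes_elapsed" -ge 15 ]
}

# Usage: should_rollback P0 20 no && echo "trigger rollback"
```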

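9.4 Incident Response Checklist

# P0 incident response checklist
emergency_checklist:
  - step: "1. Confirm the incident"
    actions:
      - "Check the monitoring dashboards to confirm the anomaly"
      - "Check whether a matching alert has fired"
      - "Confirm the scope of impact (which services, how many users)"

  - step: "2. Notify stakeholders"
    actions:
      - "Open an incident channel (Feishu/DingTalk)"
      - "Notify the service owner"
      - "Notify business stakeholders (if needed)"

  - step: "3. Stop the bleeding"
    actions:
      - "New release → roll back"
      - "Traffic spike → rate limit / scale out"
      - "Downstream failure → circuit-break and degrade"
      - "Configuration change → revert the configuration"

  - step: "4. Find the root cause"
    actions:
      - "Review error logs"
      - "Review distributed traces"
      - "Check resource metrics"
      - "Check recent change records"

  - step: "5. Fix and verify"
    actions:
      - "Confirm the fix has taken effect"
      - "Watch monitoring metrics return to normal"
      - "Confirm no follow-up alerts fire"

  - step: "6. Postmortem"
    actions:
      - "Record the incident timeline"
      - "Analyze the root cause"
      - "Define improvement actions"
      - "Update SOPs / monitoring and alerting"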
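
Step 6 asks for an incident timeline, which is much easier to reconstruct if entries are written down while the incident is still in progress. A minimal sketch, assuming a hypothetical `log_event` helper and a plain-text file named by `TIMELINE_FILE`:

```shell
#!/usr/bin/env bash
# Hypothetical helper for step 6: append timestamped entries to a timeline file.
# TIMELINE_FILE and log_event are illustrative names, not part of any tool.
TIMELINE_FILE="${TIMELINE_FILE:-incident-timeline.log}"

log_event() {
  # $1 = checklist step (e.g. "mitigate"); remaining args = free-text note
  local step="$1"; shift
  printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$step" "$*" >> "$TIMELINE_FILE"
}

# Usage:
#   log_event confirm  "5xx rate above threshold on order-service"
#   log_event mitigate "rolled back to the previous release"
```

UTC timestamps avoid ambiguity when responders sit in different time zones.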

10. Complete Observability Platform Deployment Template

10.1 One-Command Deployment Script

#!/bin/bash
# deploy-observability.sh - deploy the observability platform with a single command

set -euo pipefail

NAMESPACE="monitoring"
HELM_RELEASE="observability"

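echo "=== Creating namespace ==="
kubectl create namespace "${NAMESPACE}" --dry-run=client -o yaml | kubectl apply -f -

echo "=== Adding Helm repositories ==="
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
# Needed by the OpenTelemetry Collector and Chaos Mesh releases installed later in this script
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update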

echo "=== Deploying kube-prometheus-stack (Prometheus + Grafana + Alertmanager) ==="
# GRAFANA_ADMIN_PASSWORD must be exported beforehand; avoid hard-coding a password in the script.
# Keys that contain dots (grafana.ini) are escaped with a backslash so Helm does not split them.
helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace "${NAMESPACE}" \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.retentionSize=50GB \
  --set 'prometheus.prometheusSpec.enableFeatures[0]=exemplar-storage' \
  --set grafana.adminPassword="${GRAFANA_ADMIN_PASSWORD:?export GRAFANA_ADMIN_PASSWORD first}" \
  --set 'grafana.grafana\.ini.server.root_url=https://grafana.example.com' \
  --set alertmanager.config.global.resolve_timeout=5m \
  --wait

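echo "=== Deploying Loki ==="
# Value names vary between grafana/loki chart versions (recent releases use
# deploymentMode=SingleBinary and singleBinary.persistence.*); confirm with
# `helm show values grafana/loki` for the version you install.
helm upgrade --install loki grafana/loki \
  --namespace "${NAMESPACE}" \
  --set mode=single \
  --set persistence.enabled=true \
  --set persistence.size=100Gi \
  --wait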

echo "=== Deploying Promtail (log collection) ==="
helm upgrade --install promtail grafana/promtail \
  --namespace "${NAMESPACE}" \
  --set 'config.clients[0].url=http://loki:3100/loki/api/v1/push' \
  --wait

echo "=== Deploying Tempo (distributed tracing) ==="
helm upgrade --install tempo grafana/tempo \
  --namespace ${NAMESPACE} \
  --set persistence.enabled=true \
  --set persistence.size=50Gi \
  --wait

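echo "=== Deploying OpenTelemetry Collector ==="
# The otlp exporter takes TLS settings under tls.*; a top-level "insecure" key is rejected by current collectors.
# Recent chart versions may also require image.repository to be set explicitly.
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
  --namespace "${NAMESPACE}" \
  --set mode=deployment \
  --set config.exporters.otlp.endpoint="tempo:4317" \
  --set config.exporters.otlp.tls.insecure=true \
  --wait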

echo "=== Deploying Chaos Mesh ==="
helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-testing \
  --create-namespace \
  --wait

echo "=== Importing Grafana dashboards ==="
# Golden Signals dashboard
# sed indents every line of the JSON by four spaces so the whole file stays inside the YAML block scalar.
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-golden-signals
  namespace: ${NAMESPACE}
  labels:
    grafana_dashboard: "1"
data:
  golden-signals.json: |
$(sed 's/^/    /' dashboards/golden-signals.json)
EOF

echo "=== Deployment complete ==="
echo "Grafana:    https://grafana.example.com"
echo "Prometheus: https://prometheus.example.com"
echo "Alertmgr:   https://alertmanager.example.com"
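
`--wait` covers each Helm release at install time, but a separate check is handy for re-verifying the stack later. The sketch below is one possible approach: `wait_for_rollouts` and the `KUBECTL` override are hypothetical names, and only Deployments are checked (StatefulSets such as Prometheus would need the same loop over `statefulsets`).

```shell
#!/usr/bin/env bash
# Sketch of a post-deploy check. KUBECTL can be overridden, e.g. for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Wait until every Deployment in the namespace has finished rolling out.
# Args: namespace, optional timeout (default 300s).
wait_for_rollouts() {
  local ns="$1" timeout="${2:-300s}" name
  for name in $("$KUBECTL" get deployments -n "$ns" -o name); do
    "$KUBECTL" rollout status "$name" -n "$ns" --timeout="$timeout" || return 1
  done
}

# Usage: wait_for_rollouts monitoring 300s
```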

10.2 Helm Values Templates

# values-kube-prometheus-stack.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "50GB"
    enableFeatures:
      - exemplar-storage
    resources:
      requests:
        cpu: "500m"
        memory: "2Gi"
      limits:
        cpu: "2"
        memory: "8Gi"
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 200Gi

grafana:
  # Placeholder only: change it before use, or reference a Secret via admin.existingSecret
  adminPassword: "admin123"
  persistence:
    enabled: true
    size: 20Gi
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"
  "grafana.ini":
    server:
      root_url: "https://grafana.example.com"
    auth.ldap:
      enabled: false

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
    receivers:
      - name: 'default'
        webhook_configs:
          - url: 'http://notification-service:8080/webhook'

# values-loki.yaml
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  rulerConfig:
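    # Assumed Service name; the real name depends on the Helm release name (check with kubectl get svc -n monitoring)
    alertmanager_url: http://kube-prometheus-alertmanager:9093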

# values-tempo.yaml
tempo:
  storage:
    trace:
      backend: local
      local:
        path: /var/tempo/traces
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"
        http:
          endpoint: "0.0.0.0:4318"
    zipkin:
      endpoint: "0.0.0.0:9411"
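
Once the values live in files, the long `--set` chains in the deployment script can be replaced with `-f`. A minimal sketch, assuming a hypothetical wrapper `install_with_values` and the file names used in the comments above:

```shell
#!/usr/bin/env bash
# Sketch: install a release from a values file instead of --set flags.
# HELM and NAMESPACE are overridable; install_with_values is a hypothetical name.
HELM="${HELM:-helm}"
NAMESPACE="${NAMESPACE:-monitoring}"

install_with_values() {
  local release="$1" chart="$2" values_file="$3"
  # Fail early with a clear message instead of a Helm error
  if [ ! -f "$values_file" ]; then
    echo "values file not found: $values_file" >&2
    return 1
  fi
  "$HELM" upgrade --install "$release" "$chart" \
    --namespace "$NAMESPACE" \
    -f "$values_file" \
    --wait
}

# Usage (release and chart names as in the deployment script):
#   install_with_values kube-prometheus prometheus-community/kube-prometheus-stack values-kube-prometheus-stack.yaml
#   install_with_values loki grafana/loki values-loki.yaml
#   install_with_values tempo grafana/tempo values-tempo.yaml
```

Values files are also easier to review and version than flags embedded in a script.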

10.3 Ports and Access Summary

Observability platform services:
┌───────────────────┬──────────┬──────────────────────────────┐
│ Service           │ Port     │ Purpose                      │
├───────────────────┼──────────┼──────────────────────────────┤
│ Grafana           │ 3000     │ Visualization dashboards     │
│ Prometheus        │ 9090     │ Metrics storage and querying │
│ Alertmanager      │ 9093     │ Alert management             │
│ Loki              │ 3100     │ Log aggregation              │
│ Tempo             │ 3200     │ Trace storage                │
│ OTLP gRPC         │ 4317     │ OpenTelemetry data ingestion │
│ OTLP HTTP         │ 4318     │ OpenTelemetry data ingestion │
│ Node Exporter     │ 9100     │ Host metrics                 │
│ cAdvisor          │ 8080     │ Container metrics            │
│ Promtail          │ 9080     │ Log collection               │
│ Chaos Mesh        │ 2333     │ Chaos engineering dashboard  │
└───────────────────┴──────────┴──────────────────────────────┘
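
Several of these services expose a readiness or health endpoint on the listed port, which makes a quick smoke test possible after a port-forward. The sketch below assumes each service is reachable on localhost at its table port; `health_url` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the health-check URL for a service, assuming it is reachable on the
# port listed in the table (for example via kubectl port-forward).
# Args: service name, optional host (default localhost).
health_url() {
  local host="${2:-localhost}"
  case "$1" in
    grafana)      echo "http://$host:3000/api/health" ;;
    prometheus)   echo "http://$host:9090/-/ready" ;;
    alertmanager) echo "http://$host:9093/-/ready" ;;
    loki)         echo "http://$host:3100/ready" ;;
    tempo)        echo "http://$host:3200/ready" ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Usage: curl -fsS "$(health_url prometheus)"
```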

Appendix: Directory Layout of Key Configuration Files

monitoring/
├── docker-compose.monitoring.yml
├── deploy-observability.sh
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       ├── slo_alerts.yml
│       ├── node_alerts.yml
│       └── app_alerts.yml
├── grafana/
│   ├── provisioning/
│   │   ├── datasources/
│   │   │   └── datasources.yml
│   │   └── dashboards/
│   │       └── dashboards.yml
│   └── dashboards/
│       ├── golden-signals.json
│       ├── node-exporter.json
│       └── slo-overview.json
├── alertmanager/
│   └── alertmanager.yml
├── loki/
│   └── loki-config.yml
├── tempo/
│   └── tempo-config.yml
├── otel-collector/
│   └── otel-collector-config.yml
└── chaos/
    ├── pod-kill.yaml
    ├── network-delay.yaml
    ├── stress-cpu.yaml
    └── disk-fill.yaml
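
When starting from an empty repository, the directories above can be created in one go. A small sketch, with `scaffold_monitoring` as a hypothetical helper name; it creates directories only and leaves the files to be filled in from the earlier sections.

```shell
#!/usr/bin/env bash
# Create the directory skeleton shown above under the given root (default: monitoring).
scaffold_monitoring() {
  local root="${1:-monitoring}"
  mkdir -p \
    "$root/prometheus/rules" \
    "$root/grafana/provisioning/datasources" \
    "$root/grafana/provisioning/dashboards" \
    "$root/grafana/dashboards" \
    "$root/alertmanager" \
    "$root/loki" \
    "$root/tempo" \
    "$root/otel-collector" \
    "$root/chaos"
}

# Usage: scaffold_monitoring ./monitoring
```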

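Summary: Observability is not a project but a journey. Start with Metrics, fill in Logs and Traces step by step, build an SLO framework, and finally use chaos engineering to verify that your observability really covers your failure scenarios. Remember: if your alerts do not fire before a failure happens, your monitoring is not yet good enough.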