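Preface
In a microservice architecture, a single user request may pass through a dozen or more services, plus message queues and databases, so tracking down a failure can feel like looking for a needle in a haystack. Distributed tracing and application performance management (APM) exist to solve exactly this problem.
This article goes from principles to practice. It covers three mainstream options (OpenTelemetry, Jaeger, and SkyWalking) and gives integration examples for Java and Node.js along with recommendations for production deployment.
1. Distributed Tracing Basics
1.1 How Distributed Tracing Works
The core idea of distributed tracing is to instrument every service a request passes through, record the call relationships and timings, and then stitch those fragments into one complete call chain.
Workflow:
1. When a request reaches the entry service, a globally unique Trace ID is generated.
2. Each service that handles the request creates a Span that records the operation name, start and end time, and status.
3. Spans are linked through the Parent-Span ID and form a tree.
4. All Spans are reported to backend storage, where they are assembled into a complete Trace.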
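The assembly step at the end of the workflow can be sketched as follows. This is a minimal illustration, not how any particular backend implements it; `SpanRecord`, `TraceNode`, `buildTrace`, and `renderTrace` are hypothetical names, and spans whose parent never arrives are simply left out of the tree.

```typescript
// Hypothetical minimal span record; the fields mirror the concepts above, not any SDK type.
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string; // undefined for the root span
  operationName: string;
  startMs: number;
  endMs: number;
}

interface TraceNode {
  span: SpanRecord;
  children: TraceNode[];
}

// Rebuild the call tree of one trace from an unordered list of reported spans.
function buildTrace(spans: SpanRecord[], traceId: string): TraceNode | undefined {
  const own = spans.filter((s) => s.traceId === traceId);
  const nodes = new Map<string, TraceNode>();
  for (const span of own) {
    nodes.set(span.spanId, { span, children: [] });
  }
  let root: TraceNode | undefined;
  for (const span of own) {
    const node = nodes.get(span.spanId);
    if (node === undefined) continue;
    if (span.parentSpanId === undefined) {
      root = node; // a span without a parent is the root
      continue;
    }
    const parentNode = nodes.get(span.parentSpanId);
    if (parentNode !== undefined) parentNode.children.push(node);
  }
  // Order siblings by start time so the tree reads chronologically.
  for (const span of own) {
    nodes.get(span.spanId)?.children.sort((a, b) => a.span.startMs - b.span.startMs);
  }
  return root;
}

// Render the tree as indented text, one span per line.
function renderTrace(node: TraceNode, depth = 0): string[] {
  const { operationName, startMs, endMs } = node.span;
  const lines = [`${"  ".repeat(depth)}${operationName} (${startMs}-${endMs}ms)`];
  for (const child of node.children) {
    lines.push(...renderTrace(child, depth + 1));
  }
  return lines;
}
```

Spans can arrive in any order and from any service, which is why the sketch indexes them by `spanId` first and only then links children to parents.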
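1.2 Core Concepts: Span and Trace
Trace: the end-to-end record of one complete request, made up of multiple Spans.
Span: one logical unit of work, containing:
traceId: globally unique trace ID
spanId: ID of the current span
parentSpanId: ID of the parent span (empty for the root Span)
operationName: name of the operation (for example, GET /api/users)
startTime / endTime: start and end timestamps
status: OK / ERROR / UNSET
attributes: additional information as key-value pairs
events: point-in-time events (such as a recorded exception)
links: associations with other Spans
A typical Trace structure: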
Trace ID: abc123
├── Span A: API Gateway (0-200ms)
│   ├── Span B: User Service (10-150ms)
│   │   └── Span C: MySQL Query (20-80ms)
│   └── Span D: Order Service (30-180ms)
│       ├── Span E: Redis Cache (40-50ms)
│       └── Span F: RabbitMQ Publish (60-70ms)
1.3 OpenTracing and OpenTelemetry
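OpenTelemetry (OTel) is the current industry standard. It was formed by merging OpenTracing and OpenCensus, is maintained under the CNCF, and is supported by the major cloud vendors and most APM tools.
2. OpenTelemetry
2.1 Architecture Overview
The OpenTelemetry architecture has three layers: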
┌──────────────────────────────────────────┐
│  Application layer (SDK)                 │
│  Trace API   → Trace SDK   → Exporter    │
│  Metrics API → Metrics SDK → Exporter    │
│  Logs API    → Logs SDK    → Exporter    │
├──────────────────────────────────────────┤
│  Collector (optional)                    │
│  Receiver → Processor → Exporter         │
├──────────────────────────────────────────┤
│  Backend storage / analysis              │
│  Jaeger / Tempo / Prometheus / Loki      │
└──────────────────────────────────────────┘
API: defines the interfaces; application code depends only on the API
SDK: the implementation of the API, providing sampling, batching, context propagation, and similar capabilities
Collector: a standalone process that receives, processes, and forwards telemetry data
Exporter: sends data to a backend (Jaeger, Zipkin, Prometheus, and so on)
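To make the Receiver → Processor → Exporter flow concrete, here is a small sketch of a size-based batching stage. The real Collector is a separate binary configured through YAML; `SpanData`, `Exporter`, and `BatchProcessor` below are hypothetical types invented for illustration only.

```typescript
// Hypothetical types for illustration; not the Collector's actual interfaces.
type SpanData = { traceId: string; name: string };
type Exporter = (batch: SpanData[]) => void;

// Buffers incoming spans and hands them to the exporter once a batch is full.
class BatchProcessor {
  private buffer: SpanData[] = [];
  private readonly maxBatchSize: number;
  private readonly exporter: Exporter;

  constructor(maxBatchSize: number, exporter: Exporter) {
    this.maxBatchSize = maxBatchSize;
    this.exporter = exporter;
  }

  // Called by the receiver for every span it accepts.
  onSpan(span: SpanData): void {
    this.buffer.push(span);
    if (this.buffer.length >= this.maxBatchSize) {
      this.flush();
    }
  }

  // Export whatever is buffered; a real batch stage would also call this on a timer.
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.exporter(batch);
  }
}
```

Batching trades a little latency for far fewer network calls to the backend, which is why a size limit is usually paired with a timeout.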
2.2 Collector Configuration
The OpenTelemetry Collector ships in two distributions:
Core: the core components only
Contrib: adds community-contributed components (recommended)
Docker deployment:
docker run -d --name otel-collector \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 8888:8888 \
  -p 8889:8889 \
  -v $(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml \
  otel/opentelemetry-collector-contrib:latest
Configuration file otel-collector-config.yaml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Jaeger and Zipkin formats can be received as well
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  zipkin:
    endpoint: 0.0.0.0:9411
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  # Memory limit, to guard against OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  # Attach environment information
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: team
        value: platform
        action: upsert
exporters:
  # Export to Jaeger
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # Export to Prometheus
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel
  # Debug output
  debug:
    verbosity: basic
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp, jaeger, zipkin]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheus]
2.3 SDK Integration (Java Example)
Maven dependencies:
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-api</artifactId>
  <version>1.40.0</version>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-sdk</artifactId>
  <version>1.40.0</version>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-exporter-otlp</artifactId>
  <version>1.40.0</version>
</dependency>
Initializing the SDK manually:
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
public class TelemetrySetup {
    public static OpenTelemetry initOpenTelemetry(String endpoint) {
        // Export spans to the Collector over OTLP/gRPC
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
            .setEndpoint(endpoint)
            .build();
        // Batch spans before export and describe this service as a Resource
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .setResource(Resource.getDefault().merge(
                Resource.builder()
                    .put("service.name", "my-service")
                    .put("service.version", "1.0.0")
                    .build()))
            .build();
        OpenTelemetrySdk sdk = OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .build();
        // Flush pending spans when the JVM shuts down
        Runtime.getRuntime().addShutdownHook(new Thread(tracerProvider::close));
        return sdk;
    }
}
2.4 Automatic Instrumentation (Java Agent)
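OpenTelemetry provides a Java Agent that adds automatic instrumentation with no code changes:
# Download the agent
curl -LO https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
# Start the application
# Agent 2.x defaults to OTLP over http/protobuf (port 4318); set the protocol to grpc to use port 4317
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-service \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.traces.sampler=parentbased_traceidratio \
  -Dotel.traces.sampler.arg=0.1 \
  -jar my-app.jar
Frameworks supported by automatic instrumentation (partial list):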
Spring Boot / Spring MVC / Spring WebFlux
gRPC / Apache HttpClient / OkHttp / Netty
JDBC / MyBatis / Hibernate / JPA
Kafka / RabbitMQ / RocketMQ
Redis / Lettuce / Jedis / MongoDB / Elasticsearch
Log4j / Logback / SLF4J
3. Jaeger
3.1 Installation and Deployment
One-step deployment with Docker Compose (All-in-One):
version: "3.9"
services:
  jaeger:
    image: jaegertracing/all-in-one:1.58
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
      SPAN_STORAGE_TYPE: "elasticsearch"
      ES_SERVER_URLS: "http://elasticsearch:9200"
      ES_INDEX_PREFIX: "jaeger"
      ES_TAGS_AS_FIELDS_ALL: "true"
    ports:
      - "16686:16686" # UI
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "14250:14250" # gRPC
      - "14268:14268" # HTTP Thrift
    depends_on:
      - elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.14.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    volumes:
      - es-data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
volumes:
  es-data:
Production: deploy the components separately
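# 1. Deploy an Elasticsearch cluster (3 or more nodes recommended)
# 2. Deploy the Jaeger components separately
# Each entry in --es.server-urls needs its own scheme.
# Index retention is not a collector flag; remove old indices with jaeger-es-index-cleaner or an ILM policy (see 10.4).
SPAN_STORAGE_TYPE=elasticsearch jaeger-collector \
  --es.server-urls=http://es1:9200,http://es2:9200,http://es3:9200 \
  --es.max-span-age=168h0m0s \
  --collector.otlp.enabled=true
SPAN_STORAGE_TYPE=elasticsearch jaeger-query \
  --es.server-urls=http://es1:9200,http://es2:9200,http://es3:9200
# The ingester is only needed when Kafka sits between the collector and storage
SPAN_STORAGE_TYPE=elasticsearch jaeger-ingester \
  --kafka.consumer.brokers=kafka:9092 \
  --kafka.consumer.topic=jaeger-spans \
  --es.server-urls=http://es1:9200,http://es2:9200,http://es3:9200
3.2 Architecture Components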
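Jaeger's production architecture consists of the following components:
Application → Agent → Collector → Storage → Query → UI
                         ↑                          ↓
         (optional: Kafka as a buffer layer)   user access
3.3 Sampling Strategies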
Strategy file sampling-strategies.json:
{
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.1
  },
  "service_strategies": [
    {
      "service": "payment-service",
      "type": "ratelimiting",
      "param": 100
    },
    {
      "service": "gateway",
      "type": "probabilistic",
      "param": 0.5,
      "operation_strategies": [
        {
          "operation": "/health",
          "type": "probabilistic",
          "param": 0.01
        }
      ]
    }
  ]
}
Sampling types:
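probabilistic: random sampling at a fixed ratio (for example, 0.1 = 10%)
ratelimiting: caps the number of traces sampled per second (for example, 100 = at most 100 traces per second)
remote: the client fetches its sampling strategy from the backend, so rates can be adjusted centrally without redeploying services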
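The two local sampling types can be sketched as below. This is an illustration of the ideas, not Jaeger's actual implementation: `probabilisticSample` and `RateLimitingSampler` are hypothetical names, the trace ID is assumed to be a hex string, and the rate limiter is a plain token bucket driven by a caller-supplied clock.

```typescript
// Probabilistic: derive the decision from the trace ID so every service
// that sees the same trace reaches the same decision.
// Assumes a hex trace ID; its last 8 hex digits are mapped to [0, 1).
function probabilisticSample(traceId: string, ratio: number): boolean {
  const tail = parseInt(traceId.slice(-8), 16);
  return tail / 0x100000000 < ratio;
}

// Rate limiting: a token bucket that allows at most maxPerSecond traces per second.
class RateLimitingSampler {
  private readonly maxPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(maxPerSecond: number, nowMs: number) {
    this.maxPerSecond = maxPerSecond;
    this.tokens = maxPerSecond; // start with a full bucket
    this.lastRefillMs = nowMs;
  }

  shouldSample(nowMs: number): boolean {
    // Refill in proportion to elapsed time, never above the bucket size.
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.maxPerSecond, this.tokens + elapsedSec * this.maxPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Probabilistic sampling keeps a fixed share of traffic, so its volume grows with load; rate limiting keeps a fixed volume, which suits services such as `payment-service` where a predictable number of traces per second matters more than a ratio.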
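3.4 Using the UI
Open http://jaeger-host:16686. Core features:
Trace search: filter by service name, operation name, time range, tags, and duration
Trace detail: flame graph and Gantt-style timeline views that show the call chain and timings at a glance
Compare: select two Traces and diff them to pinpoint a performance regression
Dependency graph: automatically draws the call topology between services
Monitor: RED metrics derived from Trace data
3.5 Jaeger vs. Zipkin
Recommendation: prefer Jaeger for new projects; teams with existing Zipkin infrastructure can keep using it.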
4. SkyWalking
4.1 Installation and Deployment
Docker Compose deployment:
version: "3.9"
services:
  oap:
    image: apache/skywalking-oap-server:9.7.0
    environment:
      SW_STORAGE: elasticsearch
      SW_STORAGE_ES_CLUSTER_NODES: elasticsearch:9200
      SW_HEALTH_CHECKER: default
      SW_TELEMETRY: prometheus
    ports:
      - "11800:11800" # gRPC (agent reporting)
      - "12800:12800" # HTTP query API
      - "1234:1234"   # OAP self-observability metrics (Prometheus)
    depends_on:
      - elasticsearch
  ui:
    image: apache/skywalking-ui:9.7.0
    environment:
      SW_OAP_ADDRESS: http://oap:12800
    ports:
      - "8080:8080"
    depends_on:
      - oap
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.14.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    volumes:
      - sw-es-data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
volumes:
  sw-es-data:
4.2 Java Agent Configuration
# Download the agent
wget https://dlcdn.apache.org/skywalking/java-agent/9.1.0/apache-skywalking-java-agent-9.1.0.tgz
tar -xzf apache-skywalking-java-agent-9.1.0.tgz
# Start the application
# trace.ignore_path is read by the optional apm-trace-ignore-plugin, which has to be copied from optional-plugins/ into plugins/
java -javaagent:/path/to/skywalking-agent/skywalking-agent.jar \
  -Dskywalking.agent.service_name=my-service \
  -Dskywalking.collector.backend_service=oap-host:11800 \
  -Dskywalking.agent.sample_n_per_3_secs=10 \
  -Dskywalking.agent.ignore_suffix=".jpg,.css,.js,.png" \
  -Dskywalking.trace.ignore_path="/health,/metrics,/actuator/**" \
  -jar my-app.jar
Configuration file agent/config/agent.config (the recommended approach):
# Service name
agent.service_name=${SW_AGENT_NAME:my-service}
# Backend address
collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:oap-host:11800}
# Sampling: at most N traces every 3 seconds
agent.sample_n_per_3_secs=${SW_AGENT_SAMPLE_N_PER_3_SECS:10}
# Path suffixes to ignore
agent.ignore_suffix=${SW_IGNORE_SUFFIX:.jpg,.css,.js,.png,.ico}
# How deep to follow the cause chain when an exception is recorded
agent.cause_exception_depth=${SW_AGENT_CAUSE_EXCEPTION_DEPTH:5}
# Cross-thread propagation
agent.cross_thread_propagation=${SW_AGENT_CROSS_THREAD:true}
4.3 Alarm Rules
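Configure the rules in config/alarm-settings.yml on the OAP server. The example below uses the metrics-name / op / threshold syntax; newer OAP releases express rules as MQE expressions instead, so check the alarm documentation for the version you deploy.
rules:
  # Service response time alarm
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 2000
    period: 10
    count: 3
    silence-period: 5
    message: "Service {name} response time exceeded 2 seconds (triggered 3 times within the last 10 minutes)"
  # Service availability alarm
  service_sla_rule:
    metrics-name: service_sla
    op: "<"
    threshold: 8000 # 80% (the unit is percent x 100)
    period: 10
    count: 2
    silence-period: 5
    message: "Service {name} availability dropped below 80%"
  # Endpoint response time alarm
  endpoint_resp_time_rule:
    metrics-name: endpoint_avg
    op: ">"
    threshold: 3000
    period: 10
    count: 3
    silence-period: 5
    message: "Endpoint {name} average response time exceeded 3 seconds"
  # Database access alarm
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: "Database {name} response time exceeded 1 second"
# Webhook callbacks (can be wired to WeCom, DingTalk, or Feishu)
webhooks:
  - url: http://alertmanager:9095/api/v1/alerts
receivers:
  - platform-team@company.com
4.4 Custom Instrumentation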
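import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.skywalking.apm.toolkit.trace.ActiveSpan;
import org.apache.skywalking.apm.toolkit.trace.RunnableWrapper;
import org.apache.skywalking.apm.toolkit.trace.Tag;
import org.apache.skywalking.apm.toolkit.trace.Trace;
public class OrderService {
    // Thread pool used by asyncProcess; size chosen arbitrarily for the example
    private final ExecutorService executor = Executors.newFixedThreadPool(4);
    @Trace(operationName = "createOrder")
    @Tag(key = "orderId", value = "arg[0]")
    public Order createOrder(String orderId, OrderRequest request) {
        // Add custom tags
        ActiveSpan.tag("userId", request.getUserId());
        ActiveSpan.tag("amount", String.valueOf(request.getAmount()));
        // Record an event log
        ActiveSpan.info("Start creating order");
        try {
            // processOrder is the application's own business method (not shown)
            Order order = processOrder(orderId, request);
            ActiveSpan.tag("orderStatus", order.getStatus().name());
            return order;
        } catch (Exception e) {
            ActiveSpan.error(e); // record the exception
            ActiveSpan.tag("error", e.getMessage());
            throw e;
        }
    }
    // Cross-thread tracing
    public void asyncProcess(Order order) {
        Runnable task = RunnableWrapper.of(() -> {
            // The Trace context is carried into this thread automatically
            ActiveSpan.tag("async", "true");
            doAsyncWork(order);
        });
        executor.submit(task);
    }
}
5. Application Performance Metrics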
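5.1 The RED Method
The RED method targets request-driven services and tracks three signals per service: Rate (requests per second), Errors (the number or ratio of failed requests), and Duration (the latency distribution of requests).
Prometheus query examples:
# Request rate (QPS)
sum(rate(http_server_requests_seconds_count{service="order-service"}[5m]))
# Error rate
sum(rate(http_server_requests_seconds_count{status=~"5..", service="order-service"}[5m]))
/
sum(rate(http_server_requests_seconds_count{service="order-service"}[5m]))
# P99 latency (aggregate the buckets by le before taking the quantile)
histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket{service="order-service"}[5m])))
5.2 The USE Method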
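The USE method targets infrastructure resources and checks three things for each resource: Utilization, Saturation, and Errors.
# CPU utilization
mpstat -P ALL 1
# Memory saturation (swap activity)
vmstat 1
# Disk utilization
iostat -xz 1
# Network errors
netstat -i
5.3 The Four Golden Signals (Google SRE)
The four golden signals described in the Google SRE book are latency, traffic, errors, and saturation.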
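Practical guidance:
Use the RED method at the service level
Use the USE method at the infrastructure level
Watch the four golden signals at the overall architecture level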
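As a rough illustration of what the RED queries compute, the sketch below derives the three signals from raw request records. `RequestRecord`, `RedSummary`, and `computeRed` are hypothetical names; treating any 5xx status as an error and using a nearest-rank P99 are assumptions of this example (Prometheus instead estimates quantiles from histogram buckets).

```typescript
// One finished request, as a service might log it (hypothetical shape).
interface RequestRecord {
  durationMs: number;
  statusCode: number;
}

interface RedSummary {
  ratePerSec: number; // Rate
  errorRatio: number; // Errors
  p99Ms: number;      // Duration
}

// Summarize all requests observed during a window of windowSec seconds.
function computeRed(records: RequestRecord[], windowSec: number): RedSummary {
  if (records.length === 0) {
    return { ratePerSec: 0, errorRatio: 0, p99Ms: 0 };
  }
  const errors = records.filter((r) => r.statusCode >= 500).length;
  const sorted = records.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile: the smallest value with at least 99% of samples at or below it.
  const rank = Math.ceil((99 * sorted.length) / 100);
  return {
    ratePerSec: records.length / windowSec,
    errorRatio: errors / records.length,
    p99Ms: sorted[rank - 1],
  };
}
```

The same three numbers per service are usually enough to tell whether a problem is load (Rate), correctness (Errors), or slowness (Duration) before opening any individual Trace.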
6. Java Application Integration
6.1 Spring Boot + OpenTelemetry
Option 1: Java Agent (recommended, no code changes)
# docker-compose.yml
services:
  app:
    image: my-spring-boot-app:latest
    environment:
      JAVA_TOOL_OPTIONS: "-javaagent:/app/otel-agent.jar"
      OTEL_SERVICE_NAME: order-service
      # Agent 2.x defaults to http/protobuf (port 4318); select grpc to match port 4317
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_TRACES_SAMPLER: parentbased_traceidratio
      OTEL_TRACES_SAMPLER_ARG: "0.1"
      OTEL_LOGS_EXPORTER: none
      OTEL_METRICS_EXPORTER: prometheus
    volumes:
      - ./opentelemetry-javaagent.jar:/app/otel-agent.jar
Option 2: Spring Boot Starter (fine-grained control)
<!-- pom.xml -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>io.opentelemetry</groupId>
      <artifactId>opentelemetry-bom</artifactId>
      <version>1.40.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
<dependencies>
  <dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
    <version>2.6.0-alpha</version>
  </dependency>
</dependencies>
# application.yml
otel:
  service:
    name: order-service
  exporter:
    otlp:
      # The starter defaults to http/protobuf (port 4318); select grpc to match port 4317
      protocol: grpc
      endpoint: http://otel-collector:4317
  traces:
    sampler: parentbased_traceidratio
    sampler-arg: 0.1
  instrumentation:
    spring-webmvc:
      enabled: true
    jdbc:
      enabled: true
    logback-appender:
      enabled: true
6.2 SkyWalking Agent + Spring Boot
# Option 1: command-line arguments
java -javaagent:/opt/skywalking/agent/skywalking-agent.jar \
  -Dskywalking.agent.service_name=order-service \
  -Dskywalking.collector.backend_service=skywalking-oap:11800 \
  -jar order-service.jar
# Option 2: environment variables (Docker/K8s), shown here as Dockerfile ENV lines
ENV SW_AGENT_NAME=order-service
ENV SW_AGENT_COLLECTOR_BACKEND_SERVICES=skywalking-oap:11800
ENV SW_AGENT_SAMPLE_N_PER_3_SECS=10
ENV SW_LOGGING_LEVEL=INFO
Kubernetes deployment (using an init container):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  # A Deployment needs a selector that matches the pod template labels
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      initContainers:
        - name: skywalking-agent
          image: apache/skywalking-java-agent:9.1.0
          # The agent image keeps the agent under /skywalking/agent; copy its contents into the shared volume
          command: ['sh', '-c', 'cp -r /skywalking/agent/. /skywalking-agent/']
          volumeMounts:
            - name: sw-agent
              mountPath: /skywalking-agent
      containers:
        - name: app
          image: order-service:latest
          env:
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/skywalking-agent/skywalking-agent.jar"
            - name: SW_AGENT_NAME
              value: "order-service"
            - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
              value: "skywalking-oap:11800"
          volumeMounts:
            - name: sw-agent
              mountPath: /skywalking-agent
      volumes:
        - name: sw-agent
          emptyDir: {}
7. Node.js Application Integration
7.1 OpenTelemetry SDK Initialization
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/sdk-metrics \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/exporter-metrics-otlp-grpc \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions
Create tracing.ts (it must be loaded before any application code):
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';
// Paths that should not produce spans
const IGNORED_PATHS = ['/health', '/metrics', '/favicon.ico'];
const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: process.env.OTEL_SERVICE_NAME || 'node-service',
    [ATTR_SERVICE_VERSION]: '1.0.0',
    'deployment.environment': process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // HTTP auto-instrumentation: skip health checks and similar noise
      // (ignoreIncomingRequestHook replaces the removed ignoreIncomingPaths option)
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingRequestHook: (req) => IGNORED_PATHS.includes(req.url ?? ''),
      },
      // Express auto-instrumentation
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
      // Redis auto-instrumentation
      '@opentelemetry/instrumentation-redis': {
        enabled: true,
      },
      // MySQL auto-instrumentation
      '@opentelemetry/instrumentation-mysql2': {
        enabled: true,
      },
    }),
  ],
});
sdk.start();
// Flush pending telemetry before the process exits
process.on('SIGTERM', () => sdk.shutdown());
How to start:
# Preload with --require (after compiling tracing.ts to tracing.js)
node --require ./tracing.js app.js
# Or import it at the very top of the entry file
import './tracing';
import express from 'express';
7.2 Custom Spans
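import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service', '1.0.0');
// Option 1: create a Span manually
async function createOrder(orderData: OrderRequest): Promise<Order> {
  return tracer.startActiveSpan('createOrder', async (span) => {
    try {
      span.setAttribute('order.userId', orderData.userId);
      span.setAttribute('order.amount', orderData.amount);
      // Add an event
      span.addEvent('validate_order', { 'order.items': orderData.items.length });
      const order = await processOrder(orderData);
      span.setAttribute('order.id', order.id);
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      const err = error as Error; // caught values are typed unknown in strict mode
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: err.message,
      });
      span.recordException(err);
      throw error;
    } finally {
      span.end();
    }
  });
}
// Option 2: a decorator. @opentelemetry/auto-instrumentations-node does not export one,
// so this local helper is a sketch; it needs "experimentalDecorators": true in tsconfig.json.
function Trace(spanName?: string) {
  return (_target: object, key: string, descriptor: PropertyDescriptor): void => {
    const original = descriptor.value as (...args: unknown[]) => Promise<unknown>;
    descriptor.value = function (this: unknown, ...args: unknown[]) {
      return tracer.startActiveSpan(spanName ?? key, async (span) => {
        try {
          return await original.apply(this, args);
        } catch (error) {
          span.recordException(error as Error);
          span.setStatus({ code: SpanStatusCode.ERROR });
          throw error;
        } finally {
          span.end();
        }
      });
    };
  };
}
class OrderService {
  @Trace('processPayment')
  async processPayment(orderId: string, amount: number): Promise<PaymentResult> {
    // A Span is created automatically; the method name is used when no span name is given
    const result = await this.paymentGateway.charge(orderId, amount);
    return result;
  }
}
// Option 3: propagate context across asynchronous operations
async function asyncWork() {
  const currentSpan = trace.getActiveSpan();
  currentSpan?.setAttribute('async', true);
  await Promise.all([
    tracer.startActiveSpan('task-1', async (span) => {
      try {
        await doTask1();
      } finally {
        span.end(); // end the span even when the task throws
      }
    }),
    tracer.startActiveSpan('task-2', async (span) => {
      try {
        await doTask2();
      } finally {
        span.end();
      }
    }),
  ]);
}
7.3 Supported Frameworks for Automatic Instrumentation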
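OpenTelemetry automatic instrumentation for Node.js supports:
8. Analyzing Trace Data
8.1 Latency Analysis
Flame graph / Gantt chart analysis:
When viewing Trace details in the Jaeger UI, look for:
Longest Span: the operation that takes the most time
Serial calls: calls that could run in parallel but execute one after another
Repeated calls: the same service called many times (the N+1 problem)
Wait time: gaps between Spans (possibly queuing or GC)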
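Two of the checks above, repeated calls and wait time, can also be automated once spans are exported. The sketch below assumes a simplified span shape; `TimedSpan`, `findRepeatedCalls`, and `uncoveredMs` are hypothetical names, and what counts as "too many" repeats is left to the caller.

```typescript
// Simplified span shape assumed for this example.
interface TimedSpan {
  spanId: string;
  parentSpanId?: string;
  operationName: string;
  startMs: number;
  endMs: number;
}

// Count how often each operation runs under the same parent.
// A high count (for example, the same SELECT dozens of times) hints at an N+1 pattern.
function findRepeatedCalls(spans: TimedSpan[], minRepeats: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const span of spans) {
    if (span.parentSpanId === undefined) continue;
    const key = `${span.parentSpanId}:${span.operationName}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const repeated = new Map<string, number>();
  counts.forEach((count, key) => {
    if (count >= minRepeats) repeated.set(key, count);
  });
  return repeated;
}

// Time inside a parent span that no child span covers: the parent's own work plus waiting.
function uncoveredMs(parentSpan: TimedSpan, children: TimedSpan[]): number {
  const sorted = [...children].sort((a, b) => a.startMs - b.startMs);
  let covered = 0;
  let cursor = parentSpan.startMs;
  for (const child of sorted) {
    const from = Math.max(child.startMs, cursor); // skip time already counted
    const to = Math.min(child.endMs, parentSpan.endMs);
    if (to > from) {
      covered += to - from;
      cursor = to;
    }
  }
  return parentSpan.endMs - parentSpan.startMs - covered;
}
```

For the example Trace in section 1.2, the API Gateway span (0-200ms) with children at 10-150ms and 30-180ms has 30 ms that no child accounts for.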
# Query slow requests through the jaeger-query API
curl -s "http://jaeger:16686/api/traces?service=order-service&minDuration=2s&limit=20" \
  | jq '.data[] | {traceID, spans: [.spans[] | {operationName, duration}]}'
P99 latency analysis with PromQL:
# P99 latency per service (metric and label names follow section 9)
histogram_quantile(0.99,
  sum by (le, service_name) (
    rate(otel_traces_spanmetrics_latency_bucket[5m])
  )
)
# Top 10 operations by P99 latency
topk(10,
  histogram_quantile(0.99,
    sum by (le, operation) (
      rate(otel_traces_spanmetrics_latency_bucket[5m])
    )
  )
)
8.2 Locating Errors
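# Query Traces that contain errors
curl -s "http://jaeger:16686/api/traces?service=order-service&tags=%7B%22error%22%3A%22true%22%7D&limit=20"
Error analysis checklist:
Check the Span's status field to confirm it is ERROR
Check events for an exception stack trace
Look at the HTTP status code (4xx/5xx) in attributes
Compare a normal Trace with a failing one and look at the differences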
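When a failure bubbles up, every ancestor Span is often marked ERROR as well, so the first checklist item matches many spans. One way to narrow it down, sketched here under the assumption that spans are available as plain records, is to keep only ERROR spans that have no ERROR child. `StatusSpan` and `findErrorOrigins` are hypothetical names.

```typescript
// Simplified span shape assumed for this example.
interface StatusSpan {
  spanId: string;
  parentSpanId?: string;
  operationName: string;
  status: "OK" | "ERROR" | "UNSET";
}

// Return the ERROR spans none of whose children are ERROR:
// the deepest points in the call tree where the failure was recorded.
function findErrorOrigins(spans: StatusSpan[]): StatusSpan[] {
  const parentsWithFailedChild = new Set<string>();
  for (const span of spans) {
    if (span.status === "ERROR" && span.parentSpanId !== undefined) {
      parentsWithFailedChild.add(span.parentSpanId);
    }
  }
  return spans.filter(
    (span) => span.status === "ERROR" && !parentsWithFailedChild.has(span.spanId),
  );
}
```

The spans returned are good starting points for reading `events` and `attributes`, since their ancestors usually only repeat the same failure.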
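8.3 Dependency Topology
Both Jaeger and SkyWalking can generate a service dependency graph automatically:
# Jaeger dependency graph API
curl -s "http://jaeger:16686/api/dependencies?endTs=$(date +%s)000&lookback=86400000"
# SkyWalking topology: the OAP query interface is GraphQL, served on /graphql
curl -s -X POST "http://skywalking:12800/graphql" -H 'Content-Type: application/json' -d '{"query":"query($d: Duration!) { getGlobalTopology(duration: $d) { nodes { id name } calls { source target } } }","variables":{"d":{"start":"2024-01-01 0000","end":"2024-01-02 0000","step":"MINUTE"}}}'
Points to analyze: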
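Fan-out: how many downstream services a service depends on
Call frequency: which dependencies receive the most calls
Error propagation: the node where errors start to spread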
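Fan-out and call frequency can be computed directly from dependency edges. The sketch below assumes each edge carries `parent`, `child`, and `callCount` fields, modeled on the kind of data a dependency API returns; `DependencyLink`, `fanOut`, and `busiestLinks` are names chosen for this example.

```typescript
// One edge of the service dependency graph (assumed shape).
interface DependencyLink {
  parent: string;
  child: string;
  callCount: number;
}

// Fan-out: number of distinct downstream services per calling service.
function fanOut(links: DependencyLink[]): Map<string, number> {
  const downstream = new Map<string, Set<string>>();
  for (const link of links) {
    const children = downstream.get(link.parent) ?? new Set<string>();
    children.add(link.child);
    downstream.set(link.parent, children);
  }
  const result = new Map<string, number>();
  downstream.forEach((children, service) => result.set(service, children.size));
  return result;
}

// Call frequency: the edges with the highest call counts, busiest first.
function busiestLinks(links: DependencyLink[], limit: number): DependencyLink[] {
  return [...links].sort((a, b) => b.callCount - a.callCount).slice(0, limit);
}
```

Services with a high fan-out are the ones most exposed to downstream failures, and the busiest edges are where caching or batching tends to pay off first.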
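8.4 Identifying Performance Bottlenecks
A systematic troubleshooting procedure:
1. Find a slow Trace → open its flame graph
2. Identify the slowest Span → decide whether it is bound by CPU, IO, network, or locks
3. Inspect that Span's attributes:
   - db.statement → slow SQL query
   - http.url → slow external API
   - net.peer.name → network latency
4. Correlate with infrastructure metrics:
   - High CPU → compute-bound, optimize the algorithm
   - High IO → database or disk bottleneck
   - Network latency → cross-region or cross-datacenter calls
5. Compare with historical data → confirm whether the problem is new or long-standing
9. Monitoring and Alerting Integration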
9.1 Correlating with Prometheus Metrics
The OpenTelemetry Collector can turn Trace data into Prometheus metrics. Current Collector releases do this with the spanmetrics connector, which acts as an exporter of the traces pipeline and as a receiver of the metrics pipeline:
# otel-collector-config.yaml
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s]
    dimensions:
      - name: http.method
      - name: http.status_code
      - name: db.system
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel
    const_labels:
      environment: production
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
Key metrics generated (the exact names depend on the Collector version and the exporter namespace):
# Histogram of request latency
otel_traces_spanmetrics_latency_bucket
# Total number of requests
otel_traces_spanmetrics_calls_total
# Number of errors
otel_traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}
9.2 Grafana Dashboard
RED method dashboard JSON (import it into Grafana):
{
"dashboard": {
"title": "Service RED Metrics (from Traces)",
"panels": [
{
"title": "Request Rate (QPS)",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(otel_traces_spanmetrics_calls_total{service_name=\"$service\"}[5m])) by (operation)",
"legendFormat": "{{operation}}"
}
]
},
{
"title": "Error Rate",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(otel_traces_spanmetrics_calls_total{service_name=\"$service\", status_code=\"STATUS_CODE_ERROR\"}[5m])) / sum(rate(otel_traces_spanmetrics_calls_total{service_name=\"$service\"}[5m]))",
"legendFormat": "Error %"
}
]
},
{
"title": "P99 Latency",
"type": "timeseries",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(otel_traces_spanmetrics_latency_bucket{service_name=\"$service\"}[5m])) by (le, operation))",
"legendFormat": "{{operation}}"
}
]
}
]
}
}
9.3 Alerting Rules
Prometheus alerting rules (evaluated by Prometheus, with notifications routed through Alertmanager):
# alert-rules.yml
groups:
  - name: tracing-alerts
    rules:
      # P99 latency above the threshold
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(otel_traces_spanmetrics_latency_bucket[5m])) by (le, service_name)
          ) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service_name }} P99 latency above 3 seconds"
          description: "Current P99 latency: {{ $value }}s"
      # Error rate above the threshold
      - alert: HighErrorRate
        expr: |
          sum(rate(otel_traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])) by (service_name)
          /
          sum(rate(otel_traces_spanmetrics_calls_total[5m])) by (service_name)
          > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service_name }} error rate above 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}"
      # Sudden traffic increase
      - alert: TrafficSpike
        expr: |
          sum(rate(otel_traces_spanmetrics_calls_total[5m])) by (service_name)
          > 3 * sum(rate(otel_traces_spanmetrics_calls_total[1h] offset 1d)) by (service_name)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service_name }} request volume is growing abnormally"
10. Best Practices
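10.1 Sampling Strategy
Sampling is something every production deployment has to address: without it, storage and network costs explode.
OpenTelemetry tail-based sampling configuration (on the Collector):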
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep Traces that contain errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep slow requests
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 3000
      # Per-service sampling
      - name: payment-service
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service]
      # Fallback policy
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
10.2 Context Propagation
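Keys to propagating Trace context across services:
# HTTP headers (W3C Trace Context standard)
traceparent: 00-<trace-id>-<span-id>-<trace-flags>
tracestate: vendor1=value1,vendor2=value2
# gRPC metadata
grpc-trace-bin: <binary format>
# Message queues
# Kafka: header "traceparent"
# RabbitMQ: message properties
Notes:
Make sure every service uses the same propagation format (W3C Trace Context is recommended)
Asynchronous work (message queues, scheduled jobs) needs the context propagated by hand wherever automatic instrumentation does not cover it
Use OTLP as the common export protocol across languages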
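To show what a service actually reads from and writes to the `traceparent` header, here is a small parser and formatter for the version 00 layout of W3C Trace Context. It is a sketch for illustration (`TraceParent`, `parseTraceParent`, and `formatTraceParent` are names chosen here); in practice the OpenTelemetry SDK's propagator does this for you.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters (the caller's span ID)
  sampled: boolean; // lowest bit of trace-flags
}

// version-traceid-parentid-flags, all lowercase hex
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse a traceparent header; returns undefined when the value is invalid.
function parseTraceParent(header: string): TraceParent | undefined {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return undefined;
  const [, version, traceId, parentId, flags] = match;
  if (version === "ff") return undefined; // version ff is forbidden by the spec
  // All-zero IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return undefined;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Build the header value to send downstream, using this service's own span ID.
function formatTraceParent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}
```

A service keeps the incoming `traceId`, records the incoming `parentId` as its Span's parent, and forwards a header that carries the same trace ID with its own span ID.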
10.3 Controlling Performance Overhead
# Java Agent tuning properties
otel.javaagent.exclude-classes=com.example.NoisyClass
# Disable a noisy low-level HTTP client instrumentation
otel.instrumentation.http-url-connection.enabled=false
# Sanitize SQL statements
otel.instrumentation.common.db-statement-sanitizer.enabled=true
# Batch span processor tuning
# Delay between batch exports (ms)
otel.bsp.schedule.delay=5000
# Queue size
otel.bsp.max.queue.size=2048
# Maximum batch size
otel.bsp.max.export.batch.size=512
Benchmark reference:
Java Agent: roughly 1-3% extra CPU and 10-30 MB of memory
Node.js SDK: roughly 2-5% extra CPU
Collector: a single instance can handle 10,000+ spans/s
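10.4 Production Deployment Recommendations
Recommended architecture:
Application cluster → Agent (per host) → Kafka (buffer) → Collector cluster → ES/Cassandra cluster → Jaeger Query cluster
Key points:
Buffer in front of storage: put Kafka between the applications and storage to decouple them and smooth out traffic spikes
ES index management: set up ILM (Index Lifecycle Management) so expired data is cleaned up automatically
Dedicated cluster: do not let the tracing system share infrastructure with business systems
Monitor the tracing system itself: use a separate Prometheus to watch the health of the Collector and storage
Security considerations:
Encrypt Collector traffic with TLS
Do not record sensitive information in Spans (passwords, tokens, national ID numbers)
Configure redaction rules for resource.attributes
Capacity planning:
Every 1,000 QPS produces roughly 5,000-20,000 spans/s (depending on service depth)
Each Span is about 1-2 KB; 7 days of retention takes roughly 100-500 GB, a range that presumes sampling, since 5,000 spans/s at 1 KB unsampled already adds up to about 3 TB over 7 days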
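The capacity figures above are easy to sanity-check with a few lines of arithmetic. The helper below is a rough sketch: `estimateStorageGb` is a hypothetical name, it uses decimal gigabytes, and it ignores index overhead, replicas, and compression, all of which shift the real number.

```typescript
// Estimate raw span storage in GB (10^9 bytes) for a retention window.
function estimateStorageGb(
  spansPerSec: number,
  spanBytes: number,
  retentionDays: number,
  sampleRatio: number, // fraction of spans kept, between 0 and 1
): number {
  const secondsPerDay = 86400;
  const bytes = spansPerSec * sampleRatio * spanBytes * secondsPerDay * retentionDays;
  return bytes / 1e9;
}

// 5,000 spans/s at 1 KB each, kept for 7 days:
const unsampledGb = estimateStorageGb(5000, 1000, 7, 1);    // 3024 GB, about 3 TB
const sampledGb = estimateStorageGb(5000, 1000, 7, 0.1);    // about 302 GB at 10% sampling
```

With 10% sampling the low end of the span volume lands inside the 100-500 GB range quoted above; higher span rates or larger spans need a lower sampling ratio or more storage.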
# ES index lifecycle management (ILM)
curl -X PUT "http://elasticsearch:9200/_ilm/policy/jaeger-retention" -H 'Content-Type: application/json' -d '{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "50gb",
"max_age": "1d"
}
}
},
"warm": {
"min_age": "3d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 }
}
},
"delete": {
"min_age": "7d",
"actions": { "delete": {} }
}
}
}
}'
Appendix: Quick Selection Guide
References
OpenTelemetry documentation: https://opentelemetry.io/docs/
Jaeger documentation: https://www.jaegertracing.io/docs/
SkyWalking documentation: https://skywalking.apache.org/docs/
Google SRE Book, Monitoring: https://sre.google/sre-book/practical-alerting/
W3C Trace Context specification: https://www.w3.org/TR/trace-context/