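This is part 12 of the "Server Operations Notes" series, focusing on hands-on experience across the full lifecycle of a Java application, from deployment to operations. It covers JDK management, JVM tuning, GC analysis, Spring Boot deployment, Arthas diagnostics, performance analysis, monitoring and alerting, log management, and troubleshooting of common failures.
I. JDK Installation and Management
1.1 Choosing a JDK Distribution
1.2 Installing Temurin JDK
Linux (manual install):
# Download JDK 21 LTS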
wget https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.3%2B7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.3_7.tar.gz
# Extract to /usr/local/java
sudo mkdir -p /usr/local/java
sudo tar -xzf OpenJDK21U-jdk_x64_linux_hotspot_21.0.3_7.tar.gz -C /usr/local/java
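# Configure environment variables (tee runs as root; a plain shell redirect into /etc/profile.d would not)
sudo tee -a /etc/profile.d/java.sh > /dev/null << 'EOF'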
export JAVA_HOME=/usr/local/java/jdk-21.0.3+7
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib
EOF
source /etc/profile.d/java.sh
java -version
Managing multiple versions with SDKMAN:
# Install SDKMAN
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
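# List available JDKs
sdk list java
# Install specific versions
sdk install java 21.0.3-tem
sdk install java 17.0.11-tem
# Switch versions (use: current shell only; default: all new shells)
sdk use java 17.0.11-tem
sdk default java 21.0.3-tem
# Verify
java -version
1.3 Side-by-Side Versions (update-alternatives)
# Register multiple JDKs (the trailing number is the priority)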
sudo update-alternatives --install /usr/bin/java java /usr/local/java/jdk-17/bin/java 1
sudo update-alternatives --install /usr/bin/java java /usr/local/java/jdk-21/bin/java 2
sudo update-alternatives --install /usr/bin/javac javac /usr/local/java/jdk-17/bin/javac 1
sudo update-alternatives --install /usr/bin/javac javac /usr/local/java/jdk-21/bin/javac 2
# Select interactively
sudo update-alternatives --config java
II. JVM Memory Model
2.1 Overview of Memory Regions
┌──────────────────────────────────────────────────┐
│                JVM Memory Layout                 │
├──────────────────────┬───────────────────────────┤
│ Thread-private       │ Shared across threads     │
├──────────────────────┼───────────────────────────┤
│ Program counter      │ Heap                      │
│ (PC Register)        │ ┌───────────────────────┐ │
├──────────────────────┤ │ Young generation      │ │
│ VM stack             │ │ Eden │ S0 │ S1        │ │
│ (VM Stack)           │ ├───────────────────────┤ │
├──────────────────────┤ │ Old generation        │ │
│ Native method stack  │ └───────────────────────┘ │
│ (Native)             ├───────────────────────────┤
│                      │ Method area / Metaspace   │
│                      │ (Metaspace)               │
└──────────────────────┴───────────────────────────┘
2.2 Region Details
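Heap: the main storage area for object instances and arrays:
# Show the default heap sizes
java -XX:+PrintFlagsFinal -version 2>&1 | grep -E "HeapSize|InitialHeap"
# InitialHeapSize defaults to about 1/64 of physical memory
# MaxHeapSize defaults to about 1/4 of physical memory
Metaspace: replaces PermGen and stores class metadata:
# JDK 8+ uses Metaspace, which lives in native memory: it is not bounded by the heap, only by available native memory
-XX:MetaspaceSize=256m
-XX:MaxMetaspaceSize=512m
Thread stacks: method calls and local variables:
# Default stack size (Linux x64)
-XX:ThreadStackSize=1024 # the value is in KB; the 64-bit default is 1024 KB (same as -Xss1m)
# With many threads (e.g. 1000+), the stack size can be reduced
-Xss512k
2.3 Object Lifecycle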
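// 1. Objects are allocated in Eden
Object obj = new Object();
// 2. Objects that survive a Minor GC move to a Survivor space
// 3. After surviving enough Minor GCs (MaxTenuringThreshold, 15 by default) they are promoted to the old generation
// 4. Major GC / Full GC reclaims the old generation
// Large objects can go straight to the old generation (avoids copying them back and forth between Eden and Survivor)
// Note: PretenureSizeThreshold is only honored by the Serial collector; G1 places "humongous" objects (at least half a region) in the old generation by itself
-XX:PretenureSizeThreshold=4m // objects larger than 4 MB are allocated directly in the old generation
III. JVM Tuning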
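3.1 Core Parameter Cheat Sheet
# === Memory ===
-Xms4g # initial heap size (set it equal to -Xmx to avoid resizing at runtime)
-Xmx4g # maximum heap size
-Xmn1g # young generation size (typically 1/3 to 1/2 of the heap; omit it with G1, where a fixed size overrides pause-time-driven sizing)
-XX:MetaspaceSize=256m # initial Metaspace threshold
-XX:MaxMetaspaceSize=512m # Metaspace upper limit
-Xss512k # thread stack size
# === GC ===
-XX:+UseG1GC # use the G1 collector
-XX:MaxGCPauseMillis=200 # G1 pause-time goal in milliseconds
-XX:G1HeapRegionSize=16m # G1 region size (1 to 32 MB, power of two)
# === GC logging ===
-Xlog:gc*:file=/var/log/app/gc.log:time,uptime,level,tags:filecount=10,filesize=50m
# === OOM handling ===
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/app/heapdump.hprof
-XX:OnOutOfMemoryError="kill -9 %p"
3.2 Tuning the G1 Collector
G1 has been the default collector since JDK 9 and suits large heaps (6 GB+):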
java \
-Xms8g -Xmx8g \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=200 \
-XX:G1HeapRegionSize=16m \
-XX:InitiatingHeapOccupancyPercent=45 \
-XX:G1ReservePercent=15 \
-XX:ConcGCThreads=4 \
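-XX:ParallelGCThreads=8 \
-XX:+UnlockExperimentalVMOptions \
-XX:G1NewSizePercent=30 \
-XX:G1MaxNewSizePercent=50 \
-jar app.jar
Key parameters:
MaxGCPauseMillis: pause-time goal; a soft target, not a guarantee
InitiatingHeapOccupancyPercent: heap occupancy at which a concurrent marking cycle starts
G1ReservePercent: share of the heap kept free to lower the risk of promotion failure
ConcGCThreads / ParallelGCThreads: threads used for concurrent phases / stop-the-world phases
G1NewSizePercent / G1MaxNewSizePercent: lower / upper bound of the young generation as a percentage of the heap; both are experimental flags, so -XX:+UnlockExperimentalVMOptions must precede them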
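3.3 ZGC: an Ultra-Low-Latency Collector
Production-ready since JDK 15, with pause times typically below 1 ms:
# -XX:+ZGenerational: generational ZGC (JDK 21+)
# -XX:SoftMaxHeapSize: soft heap limit that ZGC tries to stay below
# -XX:ConcGCThreads: number of concurrent GC threads
java \
-Xms16g -Xmx16g \
-XX:+UseZGC \
-XX:+ZGenerational \
-XX:SoftMaxHeapSize=12g \
-XX:ConcGCThreads=4 \
-jar app.jar
ZGC vs. G1:
Choose ZGC: heap > 8 GB, latency-sensitive (P99 < 10 ms), slightly lower throughput is acceptable
Choose G1: heap of 4 to 16 GB, moderate latency requirements, throughput matters
Choose Serial/Parallel: small heaps (< 2 GB), batch jobs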
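After applying the selection rules above, it helps to confirm which collector the JVM actually ended up with, since container limits or conflicting flags can change the outcome. The following is a minimal sketch using the standard `java.lang.management` API; the class name `ActiveGc` is illustrative only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Reports the garbage collectors the running JVM actually selected. */
final class ActiveGc {

    /** Returns the names exposed by the GC MXBeans, e.g. "G1 Young Generation". */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(bean.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // One line per collector: name, collections so far, accumulated collection time
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    bean.getName(), bean.getCollectionCount(), bean.getCollectionTime());
        }
    }
}
```

Running it with the same flags as the application (for example `java -XX:+UseZGC ActiveGc.java`) shows whether the intended collector is in effect.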
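3.4 Tuning Workflow in Practice
1. Set a goal → P99 latency < 200 ms? Throughput > 95%?
↓
2. Establish a baseline → run a load test with JMeter/k6
↓
3. Monitor GC → enable GC logging and analyze it with GCEasy
↓
4. Identify the bottleneck → frequent Full GC? Long pauses? Memory leak?
↓
5. Adjust parameters → change one parameter at a time and compare results
↓
6. Verify and check for regressions → load test + canary rollout in production
IV. GC Log Analysis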
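4.1 GC Log Configuration
Unified logging framework (JDK 9+):
# Basic configuration
-Xlog:gc*:file=/var/log/app/gc.log:time,uptime,level,tags
# Detailed configuration (with rotation)
-Xlog:gc*=debug:file=/var/log/app/gc.log:time,uptime,level,tags:filecount=10,filesize=50m
# Log to stdout (container environments); the output field is the bare word stdout
-Xlog:gc*:stdout:time,uptime,level,tags
JDK 8 configuration: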
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/app/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
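-XX:GCLogFileSize=50m
4.2 Online Analysis with GCEasy
# Upload a GC log to GCEasy (https://gceasy.io); the API takes the raw log as the request body
curl -X POST \
--data-binary @/var/log/app/gc.log \
--header "Content-Type:text" \
"https://api.gceasy.io/analyzeGC?apiKey=YOUR_API_KEY"
# Preview key events locally ("Pause Full" is the JDK 9+ wording; the other patterns match JDK 8 logs)
grep -E "Pause Full|Full GC|Allocation Failure|concurrent mode failure" /var/log/app/gc.log | tail -20
# Count GC pauses and total pause time (JDK 9+ pause lines end in a duration such as 3.456ms)
awk '/Pause/ && $NF ~ /ms$/ {count++; sum+=$NF} END {print "GC pauses:", count+0, "total pause time:", sum+0 "ms"}' /var/log/app/gc.log
4.3 Interpreting Key Metrics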
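The figures usually read off a GC log (number of pauses, number of Full GCs, total and maximum pause time, and throughput as the share of time not spent in pauses) can be derived with a short program. The sketch below assumes JDK 9+ unified-logging lines in which each completed pause ends with a duration such as `3.456ms`; the class name `GcLogSummary` and the default log path are illustrative, and it uses a record, so it needs Java 16 or later.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Summarizes stop-the-world pauses found in JDK 9+ unified GC log lines. */
final class GcLogSummary {

    // Matches completed pause lines that end with a duration, e.g.
    // "[1.234s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 24M->4M(256M) 3.456ms"
    private static final Pattern PAUSE =
            Pattern.compile("Pause (\\w+).*?(\\d+\\.\\d+)ms\\s*$");

    record Summary(int pauses, int fullGcs, double totalPauseMs, double maxPauseMs) {
        /** Share of wall-clock time not spent in GC pauses, in percent. */
        double throughputPercent(double wallClockMs) {
            return 100.0 * (1.0 - totalPauseMs / wallClockMs);
        }
    }

    static Summary summarize(List<String> lines) {
        int pauses = 0;
        int fullGcs = 0;
        double total = 0.0;
        double max = 0.0;
        for (String line : lines) {
            Matcher m = PAUSE.matcher(line);
            if (!m.find()) {
                continue; // concurrent phases, "start" lines and non-GC lines are ignored
            }
            double ms = Double.parseDouble(m.group(2));
            pauses++;
            total += ms;
            max = Math.max(max, ms);
            if ("Full".equals(m.group(1))) {
                fullGcs++;
            }
        }
        return new Summary(pauses, fullGcs, total, max);
    }

    public static void main(String[] args) throws Exception {
        // The default path is an assumption; pass the real GC log as the first argument.
        Path log = Path.of(args.length > 0 ? args[0] : "/var/log/app/gc.log");
        System.out.println(summarize(Files.readAllLines(log)));
    }
}
```

As a rough reading guide under these assumptions: a non-zero `fullGcs` count or a `maxPauseMs` far above the configured pause goal is the first thing to investigate, and `throughputPercent` can be compared against the target set at the start of the tuning workflow.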
V. Spring Boot Application Deployment
5.1 Building an Executable JAR
<!-- pom.xml -->
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
<configuration>
<executable>true</executable>
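<!-- jvmArguments only applies to mvn spring-boot:run; the packaged JAR takes its JVM options from the java command line -->
<jvmArguments>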
-Xms512m -Xmx512m
-XX:+UseG1GC
-XX:+HeapDumpOnOutOfMemoryError
</jvmArguments>
</configuration>
</plugin>
mvn clean package -DskipTests
java -jar target/app.jar --spring.profiles.active=prod
5.2 Deploying as a systemd Service
# /etc/systemd/system/myapp.service
[Unit]
Description=My Spring Boot Application
After=network.target mysql.service
Wants=mysql.service
[Service]
User=appuser
Group=appuser
Type=simple
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/java \
-Xms2g -Xmx2g \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=200 \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/var/log/myapp/heapdump.hprof \
-Xlog:gc*:file=/var/log/myapp/gc.log:time,uptime,level,tags:filecount=5,filesize=50m \
-Dspring.profiles.active=prod \
-Dserver.port=8080 \
-jar /opt/myapp/app.jar
ExecStop=/bin/kill -TERM $MAINPID
Restart=on-failure
RestartSec=10
SuccessExitStatus=143
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ReadWritePaths=/var/log/myapp /opt/myapp/data
PrivateTmp=true
# Resource limits
LimitNOFILE=65535
LimitNPROC=4096
MemoryMax=3G
CPUQuota=200%
# Environment file
EnvironmentFile=/opt/myapp/app.env
[Install]
WantedBy=multi-user.target
# Deployment steps
sudo systemctl daemon-reload
sudo systemctl enable myapp
sudo systemctl start myapp
sudo systemctl status myapp
journalctl -u myapp -f # follow the live log
5.3 Docker Deployment
# Multi-stage build
FROM eclipse-temurin:21-jdk-alpine AS builder
WORKDIR /app
COPY target/app.jar app.jar
RUN java -Djarmode=layertools -jar app.jar extract
FROM eclipse-temurin:21-jre-alpine
RUN addgroup -g 1001 appgroup && \
adduser -u 1001 -G appgroup -s /bin/sh -D appuser
WORKDIR /app
COPY --from=builder /app/dependencies/ ./
COPY --from=builder /app/spring-boot-loader/ ./
COPY --from=builder /app/snapshot-dependencies/ ./
COPY --from=builder /app/application/ ./
# Health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD wget -qO- http://localhost:8080/actuator/health || exit 1
USER appuser
EXPOSE 8080
# JarLauncher lives in org.springframework.boot.loader.launch since Spring Boot 3.2; earlier versions use org.springframework.boot.loader.JarLauncher
ENTRYPOINT ["java", \
"-XX:+UseZGC", \
"-XX:+ZGenerational", \
"-XX:MaxRAMPercentage=75.0", \
"-XX:InitialRAMPercentage=75.0", \
"-XX:+HeapDumpOnOutOfMemoryError", \
"-XX:HeapDumpPath=/tmp/heapdump.hprof", \
"-Xlog:gc*:stdout:time,uptime,level,tags", \
"org.springframework.boot.loader.launch.JarLauncher"]
# Build and run
docker build -t myapp:1.0.0 .
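# JAVA_TOOL_OPTIONS is read by the JVM itself; an exec-form ENTRYPOINT never expands a JAVA_OPTS variable
docker run -d \
--name myapp \
-p 8080:8080 \
-e SPRING_PROFILES_ACTIVE=prod \
-e JAVA_TOOL_OPTIONS="-Xms1g -Xmx1g" \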
-v /var/log/myapp:/tmp/logs \
--memory=2g --cpus=2 \
--restart=unless-stopped \
myapp:1.0.0
5.4 Orchestrating with Docker Compose
# docker-compose.yml
version: '3.8'
services:
  myapp:
    build: .
    image: myapp:1.0.0
    ports:
      - "8080:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=prod
      - SPRING_DATASOURCE_URL=jdbc:mysql://db:3306/myapp
      # Spring Boot does not resolve *_FILE variables by itself; the application (or a configtree import) has to read this path
      - SPRING_DATASOURCE_PASSWORD_FILE=/run/secrets/db_password
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2.0'
        reservations:
          memory: 1G
          cpus: '0.5'
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:8080/actuator/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    restart: unless-stopped
    secrets:
      - db_password
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "5"
  db:
    image: mysql:8.0
    environment:
      # The mysql image refuses to start without a root password setting
      - MYSQL_ROOT_PASSWORD_FILE=/run/secrets/db_password
      - MYSQL_DATABASE=myapp
    volumes:
      - db_data:/var/lib/mysql
    secrets:
      - db_password
secrets:
  db_password:
    file: ./secrets/db_password.txt
volumes:
  db_data:
VI. Arthas Diagnostic Tool
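6.1 Installation and Startup
# Download the launcher
curl -O https://arthas.aliyun.com/arthas-boot.jar
# Start (auto-detects Java processes)
java -jar arthas-boot.jar
# Attach to a specific PID
java -jar arthas-boot.jar 12345
# Inside a Docker container (the JVM is usually PID 1)
docker exec -it myapp java -jar /opt/arthas/arthas-boot.jar 1
6.2 Common Diagnostic Commands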
Basic information:
# Show JVM information
jvm
# Show system properties
sysprop
# Show environment variables
sysenv
# Open the dashboard (CPU / memory / threads)
dashboard
Method monitoring:
# Monitor method calls (statistics every 5 seconds)
monitor -c 5 com.example.service.UserService getUserById
# Sample output:
# timestamp class method total success fail avg-rt(ms)
# 2024-01-15 UserService getUserById 120 118 2 15.3
# Trace the call path inside a method (find slow calls)
trace com.example.service.OrderService createOrder
# Show the call stack that leads to a method
stack com.example.service.OrderService createOrder
# Observe arguments and return value
watch com.example.service.UserService getUserById '{params, returnObj}' -x 2
Thread diagnostics:
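# List all threads
thread
# Show the N busiest threads (by CPU usage)
thread -n 5
# Find the thread that is blocking others
thread -b
# Export a thread snapshot
thread > /tmp/thread_dump.txt
# Show the stack of a specific thread
thread 42
Classes and decompilation: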
# Search for loaded classes
sc com.example.service.*
# Show class details
sc -d com.example.service.UserService
# Decompile to source
jad com.example.service.UserService
# Search for methods
sm com.example.service.UserService
Hot swap (use with caution, emergency fixes only):
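# 1. Decompile to confirm the code currently running
jad --source-only com.example.service.UserService > /tmp/UserService.java
# 2. Edit the source, then compile it (-c takes the class loader hash reported by sc -d)
mc -c 329a6288 /tmp/UserService.java -d /tmp/classes
# 3. Hot swap
retransform /tmp/classes/com/example/service/UserService.class
# 4. Revert: delete all retransform entries, then re-trigger retransformation of the class
retransform --deleteAll
retransform --classPattern com.example.service.UserService
OGNL expressions: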
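# Look up a Spring bean. getBean() is an instance method, so a static OGNL call on ApplicationContext cannot work;
# vmtool (Arthas 3.5+) fetches the live ApplicationContext instance and evaluates an OGNL expression on it
vmtool --action getInstances --className org.springframework.context.ApplicationContext --express 'instances[0].getBean("userService")'
# Call a static method
ognl '@java.lang.Runtime@getRuntime().availableProcessors()'
# Inspect collection contents
vmtool --action getInstances --className org.springframework.context.ApplicationContext --express 'instances[0].getBean("cacheManager").getCacheNames().toArray()' -x 2
6.3 Arthas Tunnel: Remote Diagnostics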
# Start the tunnel server
java -jar arthas-tunnel-server.jar --port 7777
# Start the agent on the target machine
java -jar arthas-boot.jar --tunnel-server 'ws://tunnel-server:7777/ws' --agent-id myapp-01
# Open in a browser
# http://tunnel-server:7777
VII. Performance Analysis
7.1 JFR (Java Flight Recorder)
Built into the JDK with very low overhead (typically < 2% with the default settings), which makes it the first choice for production:
# Enable JFR at startup (-XX:StartFlightRecording is sufficient; -XX:+FlightRecorder is not needed on JDK 11+)
java -XX:StartFlightRecording=name=myapp,duration=60s,filename=/tmp/recording.jfr \
-jar app.jar
# Enable JFR at runtime
jcmd <pid> JFR.start name=myapp duration=60s filename=/tmp/recording.jfr
jcmd <pid> JFR.check
jcmd <pid> JFR.stop name=myapp
# Continuous recording (rolling window)
jcmd <pid> JFR.start name=continuous \
settings=profile \
dumponexit=true \
maxage=1h \
maxsize=100m \
filename=/tmp/continuous.jfr
# Analyze the JFR file
# Open the .jfr file in JDK Mission Control (JMC) for visual analysis
jmc
7.2 async-profiler: Non-Intrusive Sampling
# Install
wget https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-linux-x64.tar.gz
tar -xzf async-profiler-3.0-linux-x64.tar.gz
cd async-profiler-3.0-linux-x64
# CPU sampling (30 seconds); since 3.0 the launcher is bin/asprof, which replaced profiler.sh
./bin/asprof -d 30 -f /tmp/cpu_profile.html <pid>
# Memory allocation profiling
./bin/asprof -d 30 -e alloc -f /tmp/alloc_profile.html <pid>
# Lock contention profiling
./bin/asprof -d 30 -e lock -f /tmp/lock_profile.html <pid>
# Wall-clock profiling (includes waiting time, suited to I/O-bound workloads)
./bin/asprof -d 30 -e wall -f /tmp/wall_profile.html <pid>
# Several events in one recording (requires JFR output)
./bin/asprof -d 30 -e cpu,alloc,lock -o jfr -f /tmp/profile.jfr <pid>
# Collapsed-stack output (compatible with Brendan Gregg's FlameGraph tooling)
./bin/asprof -d 30 -o collapsed -f /tmp/stacks.txt <pid>
7.3 Deadlock Detection
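# Option 1: jstack
jstack -l <pid> | grep -A 20 "Found.*deadlock"
# Option 2: jcmd
jcmd <pid> Thread.print -l | grep -A 20 "deadlock"
# Option 3: Arthas
thread -b # finds the thread that is blocking others
# Option 4: in-code detection (preventive)
// Proactively detect deadlocks from application code (log is the class's SLF4J logger)
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
long[] deadlockedThreads = threadMXBean.findDeadlockedThreads(); // null when no deadlock exists
if (deadlockedThreads != null) {
    ThreadInfo[] threadInfos = threadMXBean.getThreadInfo(deadlockedThreads, true, true);
    for (ThreadInfo info : threadInfos) {
        log.error("Deadlock detected: {}", info);
    }
}
7.4 Heap Dump Analysis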
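# Trigger a heap dump manually
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
jcmd <pid> GC.heap_dump /tmp/heap.hprof
# Analysis tools
# Eclipse MAT (Memory Analyzer Tool): the most capable option
# https://www.eclipse.org/mat/
# Open the .hprof file → Leak Suspects Report → Dominator Tree
# Quick command-line look: jhat (the old built-in web analyzer) was removed in JDK 9;
# without opening the dump, a class histogram of the live process gives a first impression
jcmd <pid> GC.class_histogram | head -30
VIII. Monitoring and Alerting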
8.1 Spring Boot Actuator
# application.yml
management:
  endpoints:
    web:
      exposure:
        # env, threaddump and heapdump expose sensitive data: restrict access to them in production
        include: health,info,metrics,prometheus,env,logfile,threaddump,heapdump
      base-path: /actuator
  endpoint:
    health:
      show-details: always
      show-components: always
    prometheus:
      enabled: true
  metrics:
    tags:
      application: ${spring.application.name}
    distribution:
      percentiles-histogram:
        http.server.requests: true
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99
      slo:
        http.server.requests: 100ms, 200ms, 500ms
<!-- pom.xml dependencies -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
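</dependency>
8.2 Custom Business Metrics
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.concurrent.TimeUnit;
import org.springframework.stereotype.Component;
@Component
public class BusinessMetrics {
    private final Counter orderCounter;
    private final Timer orderProcessingTimer;
    private final Gauge activeConnections;
    // ConnectionPool stands for the application's own pool type that exposes getActiveCount()
    public BusinessMetrics(MeterRegistry registry, ConnectionPool connectionPool) {
        // Counter: total number of orders
        this.orderCounter = Counter.builder("business.orders.total")
                .description("Total number of orders")
                .tag("type", "all")
                .register(registry);
        // Timer: order processing time
        this.orderProcessingTimer = Timer.builder("business.order.processing.time")
                .description("Order processing time")
                .publishPercentiles(0.5, 0.95, 0.99)
                .publishPercentileHistogram()
                .register(registry);
        // Gauge: number of active connections, sampled from the pool on every scrape
        this.activeConnections = Gauge.builder("business.connections.active", connectionPool, ConnectionPool::getActiveCount)
                .description("Number of active connections")
                .register(registry);
    }
    public void recordOrder() {
        orderCounter.increment();
    }
    public void recordProcessingTime(long durationMs) {
        orderProcessingTimer.record(durationMs, TimeUnit.MILLISECONDS);
    }
}
8.3 Prometheus Configuration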
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'spring-boot-app'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
        labels:
          env: production
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
8.4 Grafana Dashboard
Recommended community dashboards:
JVM (Micrometer): Dashboard ID 4701
Spring Boot Statistics: Dashboard ID 12900
JVM (Actuator): Dashboard ID 9598
# Import a dashboard
curl -X POST http://grafana:3000/api/dashboards/import \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"dashboard": {"id": 4701},
"overwrite": true,
"inputs": [{"name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus"}]
}'
8.5 Alerting Rules
# alerting-rules.yml
groups:
  - name: java_app_alerts
    rules:
      # Heap usage > 85% (summed over the heap memory pools of each instance)
      - alert: HighHeapUsage
        expr: sum by (instance) (jvm_memory_used_bytes{area="heap"}) / sum by (instance) (jvm_memory_max_bytes{area="heap"}) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap usage > 85% on {{ $labels.instance }}"
          description: "Heap usage is {{ $value | humanizePercentage }}"
      # Full GC too frequent
      - alert: FrequentFullGC
        expr: increase(jvm_gc_pause_seconds_count{action=~".*major.*|.*Full.*"}[5m]) > 3
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Frequent Full GC on {{ $labels.instance }}"
      # HTTP request P99 latency > 1s (buckets are aggregated per instance before the quantile is taken)
      - alert: HighP99Latency
        expr: histogram_quantile(0.99, sum by (le, instance) (rate(http_server_requests_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency > 1s on {{ $labels.instance }}"
      # Application instance down
      - alert: AppDown
        expr: up{job="spring-boot-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "App instance {{ $labels.instance }} is down"
IX. Log Management
9.1 Logback Configuration
<!-- logback-spring.xml -->
<configuration>
<springProperty scope="context" name="APP_NAME" source="spring.application.name"/>
<property name="LOG_PATH" value="/var/log/${APP_NAME}"/>
<!-- Console output -->
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
<encoder>
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
</appender>
<!-- File output (rolls daily and whenever a file reaches 100 MB) -->
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>${LOG_PATH}/app.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
<fileNamePattern>${LOG_PATH}/app.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
<maxFileSize>100MB</maxFileSize>
<maxHistory>30</maxHistory>
<totalSizeCap>5GB</totalSizeCap>
</rollingPolicy>
<encoder>
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
</appender>
<!-- JSON format (for ELK) -->
<appender name="JSON_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>${LOG_PATH}/app.json</file>
<rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
<fileNamePattern>${LOG_PATH}/app.json.%d{yyyy-MM-dd}.%i.gz</fileNamePattern>
<maxFileSize>100MB</maxFileSize>
<maxHistory>15</maxHistory>
</rollingPolicy>
<encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
<providers>
<timestamp/>
<logLevel/>
<loggerName/>
<threadName/>
<message/>
<mdc/>
<stackTrace/>
</providers>
</encoder>
</appender>
<!-- Asynchronous writes for better throughput (with neverBlock=true, events are dropped when the queue is full) -->
<appender name="ASYNC_FILE" class="ch.qos.logback.classic.AsyncAppender">
<queueSize>1024</queueSize>
<discardingThreshold>0</discardingThreshold>
<neverBlock>true</neverBlock>
<appender-ref ref="JSON_FILE"/>
</appender>
<root level="INFO">
<appender-ref ref="CONSOLE"/>
<appender-ref ref="FILE"/>
<appender-ref ref="ASYNC_FILE"/>
</root>
<!-- Per-package log levels -->
<logger name="com.example.mapper" level="DEBUG"/>
<logger name="org.springframework.web" level="INFO"/>
<logger name="com.zaxxer.hikari" level="WARN"/>
</configuration>
9.2 ELK Log Platform
Filebeat input configuration:
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/myapp/app.json
    json.keys_under_root: true
    json.overwrite_keys: true
    fields:
      app: myapp
      env: production
    fields_under_root: true
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  indices:
    - index: "myapp-%{+yyyy.MM.dd}"
setup.template:
  name: myapp
  pattern: "myapp-*"
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
Logstash pipeline:
# logstash.conf
input {
  beats {
    port => 5044
  }
}
filter {
  if [app] == "myapp" {
    # Stack traces need no multiline handling here: the JSON encoder already emits each one inside a single event
    # (the old multiline filter plugin is no longer bundled with Logstash)
    # Extract slow-query durations into an integer field
    if [message] =~ /slow query/ {
      grok {
        match => { "message" => "slow query.*?%{NUMBER:slow_query_ms:int}ms" }
      }
    }
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
  }
}
9.3 Log Analysis in Practice
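# Follow ERROR lines in real time
tail -f /var/log/myapp/app.log | grep "ERROR"
# Count exception types in ERROR lines from the last hour (GNU date; timestamps as in the Logback pattern above)
SINCE=$(date -d '1 hour ago' '+%Y-%m-%d %H:%M:%S')
awk -v since="$SINCE" '/ERROR/ && ($1 " " $2) >= since' /var/log/myapp/app.log | \
grep -oE '[A-Za-z0-9_.]+(Exception|Error)' | sort | uniq -c | sort -rn | head -20
# Find the log lines of one request trace
grep "traceId=abc123" /var/log/myapp/app.log
# Analyze the access log with GoAccess
goaccess /var/log/nginx/access.log --log-format=COMBINED -o /var/www/html/report.html
X. Troubleshooting Common Problems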
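10.1 OOM (OutOfMemoryError)
Troubleshooting steps:
# 1. Identify the OOM type
grep "OutOfMemoryError" /var/log/myapp/app.log
# Java heap space → the heap is exhausted
# Metaspace → too many classes loaded
# Direct buffer memory → direct (off-heap) memory is exhausted
# GC overhead limit exceeded → almost all time is spent in GC while very little memory is reclaimed
# 2. Get a heap dump
# If HeapDumpOnOutOfMemoryError is configured, analyze the generated .hprof file directly
# Otherwise trigger one manually
jcmd <pid> GC.heap_dump /tmp/heap.hprof
# 3. Analyze the heap dump
# Eclipse MAT → Leak Suspects Report → Dominator Tree
# Look for the objects retaining the most memory and their reference chains
# 4. Common causes
# - Collections that grow without bound (List/Map never cleaned up)
# - Caches without an upper limit
# - Database queries that return large result sets without pagination
# - ThreadLocal values never removed (with thread pools)
# - Class loading leaks (dynamic proxies / reflection)
Prevention: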
// 1. Bound the cache (size limit plus expiry) instead of letting it grow without limit
Cache<String, Object> cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofMinutes(30))
        .build();
// 2. Paginate queries
@Query("SELECT u FROM User u WHERE u.status = :status")
Page<User> findByStatus(@Param("status") String status, Pageable pageable);
// 3. Stream large files instead of loading them into memory
try (Stream<String> lines = Files.lines(Paths.get("large.txt"))) {
    lines.filter(line -> line.contains("ERROR"))
         .forEach(this::process);
}
10.2 CPU Spikes
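# 1. Find the Java process using the most CPU
top -c -p $(pgrep -f java | tr '\n' ',' | sed 's/,$//')
# 2. Find the thread using the most CPU
top -Hp <pid> # per-thread top
# 3. Convert the thread ID to hexadecimal
printf "0x%x\n" <tid>
# 4. Look it up in the thread dump
jstack <pid> | grep -A 30 "nid=0x<hex_tid>"
# 5. Shortcut: use Arthas
thread -n 5 # shows the 5 busiest threads directly
# 6. Locate hot methods with async-profiler (bin/asprof since 3.0; profiler.sh in 2.x)
./bin/asprof -d 30 -f /tmp/cpu.html <pid>
Common causes:
Catastrophic regex backtracking (ReDoS)
Infinite loops (a while condition that never becomes false)
Frequent Full GC (GC threads consuming CPU)
Serializing/deserializing large volumes of JSON
Encryption/decryption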
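As an in-process complement to the `top -Hp` / `jstack` steps above, per-thread CPU time is also available through `ThreadMXBean`. The following is a minimal sketch; the class name `BusyThreads` and the record `ThreadCpu` are illustrative, and the values are cumulative CPU time since each thread started, not a momentary percentage, so two samples taken a few seconds apart are needed to see who is burning CPU right now.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Lists the threads of the current JVM that have consumed the most CPU time. */
final class BusyThreads {

    record ThreadCpu(long id, String name, long cpuNanos) {}

    static List<ThreadCpu> top(int n) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<ThreadCpu> result = new ArrayList<>();
        // CPU time measurement is optional for a JVM; return an empty list when unavailable
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return result;
        }
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id);    // -1 if the thread has already died
            ThreadInfo info = mx.getThreadInfo(id); // null if the thread has already died
            if (cpu >= 0 && info != null) {
                result.add(new ThreadCpu(id, info.getThreadName(), cpu));
            }
        }
        result.sort(Comparator.comparingLong(ThreadCpu::cpuNanos).reversed());
        return new ArrayList<>(result.subList(0, Math.min(n, result.size())));
    }

    public static void main(String[] args) {
        for (ThreadCpu t : top(5)) {
            System.out.printf("%-40s id=%d cpu=%d ms%n", t.name(), t.id(), t.cpuNanos() / 1_000_000);
        }
    }
}
```

The thread names returned here are the same ones that appear in a `jstack` dump, which makes it straightforward to jump from a busy thread to its stack trace.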
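10.3 Frequent Full GC
# 1. Watch the GC log ("Pause Full" on JDK 9+, "Full GC" on JDK 8)
tail -f /var/log/myapp/gc.log | grep -E "Pause Full|Full GC"
# 2. Analyze the GC log (e.g. with GCEasy, gceasy.io)
# Look at:
# - whether old-generation usage stays high after a Full GC (memory leak)
# - the heap usage trend before each Full GC
# 3. Inspect the objects that survive (forces a full GC first)
jmap -histo:live <pid> | head -30
# 4. Common causes and fixes
# a. Memory leak → analyze a heap dump
# b. Old generation too small → increase -Xmx or adjust the young-generation ratio
# c. Large objects allocated directly in the old generation → raise PretenureSizeThreshold (Serial) or G1HeapRegionSize (G1 humongous objects)
# d. Survivor spaces too small → objects are promoted prematurely
# e. Metaspace exhausted → increase MetaspaceSize / MaxMetaspaceSize
# 5. Emergency handling
# Trigger a GC manually to check whether memory can be reclaimed at all
jcmd <pid> GC.run
# If Full GCs continue and the application cannot recover, restart it (last resort)
systemctl restart myapp
10.4 Thread Pool Exhaustion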
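# 1. Count threads by state
jstack <pid> | grep "java.lang.Thread.State" | sort | uniq -c
# 2. List pool threads (Arthas)
thread | grep -E "pool|Pool"
# 3. Review the thread pool configuration
# Search the code for ThreadPoolExecutor or @Async
// Guard against thread pool exhaustion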
@Configuration
public class ThreadPoolConfig {
@Bean("taskExecutor")
public ThreadPoolTaskExecutor taskExecutor() {
ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(10);
executor.setMaxPoolSize(50);
executor.setQueueCapacity(200); // bounded queue of a reasonable size
executor.setKeepAliveSeconds(60);
executor.setThreadNamePrefix("task-");
executor.setRejectedExecutionHandler(
new ThreadPoolExecutor.CallerRunsPolicy() // run rejected tasks in the caller's thread so none are dropped
);
executor.setWaitForTasksToCompleteOnShutdown(true);
executor.setAwaitTerminationSeconds(30);
executor.initialize();
return executor;
}
}
10.5 Connection Pool Leaks
# HikariCP connection pool monitoring
# Check the Actuator metrics
curl -s http://localhost:8080/actuator/metrics/hikaricp.connections.active
curl -s http://localhost:8080/actuator/metrics/hikaricp.connections.pending
curl -s http://localhost:8080/actuator/metrics/hikaricp.connections.timeout
# Look for pool exhaustion and leak warnings in the log
grep -E "Connection is not available|Apparent connection leak detected" /var/log/myapp/app.log
# HikariCP configuration
spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5
      idle-timeout: 300000
      max-lifetime: 1800000
      connection-timeout: 10000
      leak-detection-threshold: 30000 # log a warning when a connection is held for more than 30 s
XI. Operations Command Cheat Sheet
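11.1 JVM Diagnostic Commands
# List Java processes
jps -lv
# Heap summary (jmap -heap was removed in JDK 9; use jhsdb, or jcmd <pid> GC.heap_info)
jhsdb jmap --heap --pid <pid>
# Heap histogram (object statistics)
jmap -histo <pid> | head -30
jmap -histo:live <pid> | head -30 # forces a full GC first, then counts live objects
# Heap dump
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
jcmd <pid> GC.heap_dump /tmp/heap.hprof
# Thread stacks
jstack <pid>
jstack -l <pid> # include lock information
jhsdb jstack --pid <pid> # for an unresponsive process (replaces jstack -F on JDK 9+)
# JVM statistics
jstat -gc <pid> 1000 10 # print GC statistics every second, 10 times
jstat -gcutil <pid> 1000 # utilization percentages, every second
# Class loading statistics
jstat -class <pid>
# JIT compilation statistics
jstat -compiler <pid>
# All-in-one diagnostics (recommended)
jcmd <pid> help # list all available commands
jcmd <pid> VM.version
jcmd <pid> VM.flags
jcmd <pid> VM.system_properties
jcmd <pid> Thread.print
jcmd <pid> GC.heap_info
jcmd <pid> GC.heap_dump /tmp/heap.hprof
jcmd <pid> JFR.start duration=60s filename=/tmp/recording.jfr
11.2 System-Level Diagnostics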
# Per-process resource usage
pidstat -p <pid> 1 # CPU
pidstat -r -p <pid> 1 # memory
pidstat -d -p <pid> 1 # I/O
# Number of files the process has open
ls /proc/<pid>/fd | wc -l
lsof -p <pid> | wc -l
# Process memory map (totals on the last line)
pmap -x <pid> | tail -1
# System memory
free -h
cat /proc/meminfo
# CPU information
lscpu
nproc
# Network connections
ss -tlnp | grep <port>
netstat -anp | grep <port>
# Process limits
cat /proc/<pid>/limits
# Trace system calls with strace (for I/O problems)
strace -f -e trace=network -p <pid>
11.3 Quick Triage Checklist
#!/bin/bash
# === One-shot health check script ===
PID=$(pgrep -f "myapp" | head -1)
if [ -z "$PID" ]; then echo "myapp is not running" >&2; exit 1; fi
echo "=== PID: $PID ==="
echo "--- JVM Version ---"
jcmd $PID VM.version 2>/dev/null
echo "--- Uptime ---"
jcmd $PID VM.uptime 2>/dev/null
echo "--- Heap ---"
jcmd $PID GC.heap_info 2>/dev/null
echo "--- Thread Count ---"
jstack $PID 2>/dev/null | grep "java.lang.Thread.State" | wc -l
echo "--- File Descriptors ---"
ls /proc/$PID/fd 2>/dev/null | wc -l
echo "--- Top CPU Threads ---"
top -Hp $PID -b -n 1 | head -8
echo "--- GC Stats ---"
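jstat -gcutil $PID 2>/dev/null
XII. Best Practices Summary
12.1 Deployment Checklist
□ Confirm the JDK version (prefer LTS releases: 17/21)
□ Configure JVM parameters (heap size, GC, OOM heap dump)
□ Enable GC logging
□ Configure Actuator endpoints (health check, Prometheus)
□ Configure log levels and rotation
□ Expose a health check endpoint (Kubernetes/Docker)
□ Raise the file descriptor limit (ulimit -n 65535)
□ Set the time zone (-Duser.timezone=Asia/Shanghai)
□ Set the DNS cache TTL (networkaddress.cache.ttl=60)
□ Harden security (run as non-root, read-only filesystem)
12.2 Tuning Principles
1. Measure first, then optimize: tuning without data is guesswork
2. Change one parameter at a time: otherwise there is no way to tell which change helped
3. Fix architectural problems first: JVM tuning cannot compensate for design flaws
4. Keep headroom: if heap usage stays above 80% for long periods, it is time to scale up
5. Automate everything: GC logs, heap dumps, and alerts should all be automatic
6. Document: record the reason, the parameters, and the effect of every tuning change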
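The headroom rule (principle 4) can be checked from inside the application with the standard `MemoryMXBean`. This is a minimal sketch only: the class name `HeapHeadroom` is illustrative, the 0.80 threshold is taken from the principle above, and a single reading means little on its own, so in practice the ratio should be sampled over time (ideally right after a GC) or read from the monitoring stack instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Compares current heap usage against a headroom threshold. */
final class HeapHeadroom {

    /** Returns used/max as a ratio, or -1.0 when the maximum is undefined (reported as -1). */
    static double usageRatio(long usedBytes, long maxBytes) {
        return maxBytes > 0 ? (double) usedBytes / maxBytes : -1.0;
    }

    /** True when the ratio is known and above the threshold. */
    static boolean exceeds(double ratio, double threshold) {
        return ratio >= 0 && ratio > threshold;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        double ratio = usageRatio(heap.getUsed(), heap.getMax());
        System.out.printf("heap used=%d MB, max=%d MB, ratio=%.2f, above 80%%: %b%n",
                heap.getUsed() >> 20, heap.getMax() >> 20, ratio, exceeds(ratio, 0.80));
    }
}
```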