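This is part 15 of the server operations series, focusing on automated operations with Ansible. It moves from the basic architecture to hands-on scenarios and covers the operation patterns used most often in day-to-day work; the examples are written so they can be reused directly.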


1. Ansible Basics

1.1 Architecture

Ansible is an agentless automation tool that runs tasks on managed hosts over SSH. Its core architecture:

Control Node
    │
    ├── Inventory (host list)
    ├── Playbook
    ├── Modules
    └── Plugins
          │ SSH
          ▼
    Managed Nodes ← no agent required

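Core components:

| Component | Role |
| --- | --- |
| Control node | The machine that runs Ansible; Linux is recommended |
| Inventory | Defines the hosts and groups to manage |
| Playbook | Task orchestration file written in YAML |
| Module | The smallest unit that performs a concrete operation (for example copy, yum) |
| Plugin | Extends Ansible itself (connection, callback, filter plugins, and so on) |
| Facts | Information about the managed hosts, collected automatically |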

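1.2 Installation and Configuration

Option 1: install with pip (recommended)

# Install Python 3 and pip
sudo apt update && sudo apt install -y python3 python3-pip  # Debian/Ubuntu
sudo yum install -y python3 python3-pip                     # CentOS/RHEL

# Install Ansible
pip3 install ansible

# Verify the installation
ansible --version

Option 2: system package manager

# Ubuntu (the PPA is Ubuntu-specific; on Debian, install from the distribution repositories)
sudo apt-add-repository ppa:ansible/ansible
sudo apt update && sudo apt install -y ansible

# CentOS/RHEL
sudo yum install -y epel-release
sudo yum install -y ansible

Ansible looks for its configuration file in the following order and uses the first one it finds; settings from the other files are not merged in:

  1. The file named by the ANSIBLE_CONFIG environment variable

  2. ./ansible.cfg (current directory)

  3. ~/.ansible.cfg (user's home directory)

  4. /etc/ansible/ansible.cfg (global configuration)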
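
The lookup order can be sketched in a few lines of Python. This is only an illustrative model of the search described in the list, not Ansible's own code; the function name `find_ansible_cfg` and the injectable `exists` callable are assumptions made so the logic can be exercised without touching the disk.

```python
import os
from pathlib import Path


def find_ansible_cfg(env, cwd, home, exists):
    """Return the first config file found in the documented search order, or None."""
    candidates = []
    if env.get("ANSIBLE_CONFIG"):
        candidates.append(Path(env["ANSIBLE_CONFIG"]))
    candidates += [
        Path(cwd) / "ansible.cfg",          # current directory
        Path(home) / ".ansible.cfg",        # user's home directory
        Path("/etc/ansible/ansible.cfg"),   # global configuration
    ]
    for path in candidates:
        if exists(path):
            return path  # first match wins; later files are ignored
    return None


if __name__ == "__main__":
    print(find_ansible_cfg(os.environ, Path.cwd(), Path.home(), Path.exists))
```

On a real control node, `ansible --version` prints the config file that was actually selected, which is the quickest way to confirm the result.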

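Common settings in ansible.cfg (an inline comment in this file must start with ";", so the comments below sit on their own lines):

[defaults]
inventory = ./inventory/hosts
remote_user = deploy
private_key_file = ~/.ssh/id_ed25519
# Disabling host key checking is convenient in a lab but weakens SSH security
host_key_checking = False
timeout = 30
# Number of parallel processes
forks = 20
log_path = ./ansible.log
# Do not write .retry files
retry_files_enabled = False
# More readable output format
stdout_callback = yaml

[privilege_escalation]
become = True
become_method = sudo
become_user = root
become_ask_pass = False

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no
# Fewer SSH operations per task, which improves performance
pipelining = True
control_path_dir = ~/.ansible/cp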

1.3 Managing the Inventory

Static inventory file (INI format):

# inventory/hosts

# Web 服务器组
[web]
web01 ansible_host=192.168.1.101
web02 ansible_host=192.168.1.102
web03 ansible_host=192.168.1.103

# 数据库服务器组
[db]
db01 ansible_host=192.168.1.201 ansible_port=22022
db02 ansible_host=192.168.1.202 ansible_port=22022

# 缓存服务器
[cache]
redis01 ansible_host=192.168.1.211

# 生产环境 = web + db + cache
[production:children]
web
db
cache

# 全局变量
[all:vars]
ansible_user=deploy
ansible_ssh_private_key_file=~/.ssh/id_ed25519
ansible_python_interpreter=/usr/bin/python3

# 组变量
[web:vars]
http_port=80
nginx_version=1.24.0

[db:vars]
mysql_port=3306

Inventory in YAML format:

# inventory/hosts.yml
all:
  vars:
    ansible_user: deploy
    ansible_python_interpreter: /usr/bin/python3
  children:
    web:
      hosts:
        web01:
          ansible_host: 192.168.1.101
        web02:
          ansible_host: 192.168.1.102
      vars:
        http_port: 80
    db:
      hosts:
        db01:
          ansible_host: 192.168.1.201
          ansible_port: 22022
    production:
      children:
        web:
        db:

Validate the inventory:

# List all hosts
ansible-inventory --list -i inventory/hosts.yml

# Show the group hierarchy as a tree
ansible-inventory --graph -i inventory/hosts.yml

# Test connectivity
ansible all -i inventory/hosts.yml -m ping

2. Ad-hoc Commands

Ad-hoc commands are suited to quick one-off operations that do not justify writing a playbook.

2.1 Basic Syntax

ansible <host-pattern> -m <module> -a '<module-args>' [options]

2.2 Common Module Examples

# Test connectivity to all hosts
ansible all -m ping

# Show system information for all hosts (gather facts)
ansible all -m setup -a 'filter=ansible_distribution*'

# Run commands
ansible web -m command -a 'uptime'
ansible web -m shell -a 'df -h | grep /dev/sda'

# Copy a file
ansible web -m copy -a 'src=./app.conf dest=/etc/app.conf mode=0644 backup=yes'

# Install a package
ansible web -m yum -a 'name=nginx state=present'
ansible web -m apt -a 'name=nginx state=present update_cache=yes'

# Manage a service
ansible web -m service -a 'name=nginx state=started enabled=yes'

# Create a user
ansible all -m user -a 'name=deploy shell=/bin/bash groups=wheel append=yes'

# File operations
ansible web -m file -a 'path=/data/app state=directory mode=0755 owner=deploy'
ansible web -m file -a 'path=/tmp/test.log state=touch mode=0644'

# Download a file
ansible web -m get_url -a 'url=https://example.com/app.tar.gz dest=/tmp/ mode=0644'

2.3 Controlling Parallelism

# Run on up to 10 hosts at a time
ansible all -m shell -a 'yum update -y' --forks 10

# Run one host at a time (serially)
ansible web -m shell -a 'systemctl restart nginx' --forks 1

# Limit to specific hosts
ansible web -m shell -a 'hostname' --limit web01,web02

# Read the host list from a file
ansible all -m shell -a 'uptime' --limit @host_list.txt

# Ad-hoc commands have no failure-percentage option; to stop once more than
# 25% of hosts fail, set max_fail_percentage on a play (see section 9.2)

2.4 Practical Ad-hoc Recipes

# Check disk usage in bulk and print filesystems above 80%
ansible all -m shell -a 'df -h | awk "NR>1 && int(\$5)>80 {print \$0}"' -o

# Force a time sync in bulk
ansible all -m shell -a 'chronyc makestep' --become

# Find large files in bulk
ansible all -m shell -a 'find /var/log -type f -size +100M -exec ls -lh {} \;' -o

# Clean up Docker resources in bulk
ansible all -m shell -a 'docker system prune -f' --become

3. Writing Playbooks

3.1 Basic Structure

---
# deploy-nginx.yml
- name: 安装并配置 Nginx
  hosts: web
  become: yes
  vars:
    nginx_port: 80
    server_name: example.com

  tasks:
    - name: 安装 Nginx
      yum:
        name: nginx
        state: present

    - name: 复制 Nginx 配置
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        owner: root
        group: root
        mode: '0644'
      notify: Restart Nginx

    - name: 启动 Nginx 并设置开机自启
      service:
        name: nginx
        state: started
        enabled: yes

  handlers:
    - name: Restart Nginx
      service:
        name: nginx
        state: restarted

Run it:

ansible-playbook deploy-nginx.yml -i inventory/hosts.yml
ansible-playbook deploy-nginx.yml -i inventory/hosts.yml --check   # dry run
ansible-playbook deploy-nginx.yml -i inventory/hosts.yml --diff    # show the changes made

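3.2 Variables

Variables can be defined in several ways. The sources below are numbered from lower to higher precedence; extra vars passed with -e override all of them: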

---
# 1. Play 级别 vars
- hosts: web
  vars:
    app_version: "3.2.1"
    app_port: 8080

  # 2. vars_prompt(交互式输入)
  vars_prompt:
    - name: db_password
      prompt: "请输入数据库密码"
      private: yes

  # 3. vars_files(外部变量文件)
  vars_files:
    - vars/common.yml
    - "vars/{{ ansible_distribution }}.yml"

  tasks:
    # 4. register(捕获任务输出作为变量)
    - name: 获取磁盘信息
      shell: df -h /
      register: disk_info

    - name: 打印磁盘信息
      debug:
        msg: "{{ disk_info.stdout_lines }}"

    # 5. set_fact(动态设置变量)
    - name: 计算内存阈值
      set_fact:
        memory_threshold_mb: "{{ (ansible_memtotal_mb * 0.8) | int }}"

    # 6. 使用变量
    - name: 输出版本信息
      debug:
        msg: "部署 {{ app_version }} 到端口 {{ app_port }}"
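
To make the layering concrete, here is a small Python model of how a later source overrides an earlier one. It is a simplification that covers only the sources shown in this section plus extra vars, following Ansible's documented ordering (play `vars`, then `vars_prompt`, then `vars_files`, then `set_fact`/registered variables, then `-e`); the labels and the `resolve` helper are illustrative names, not Ansible APIs.

```python
# Lowest precedence first; later sources override earlier ones.
PRECEDENCE = [
    "play vars",
    "vars_prompt",
    "vars_files",
    "set_fact / register",
    "extra vars (-e)",
]


def resolve(sources):
    """Merge variable sources in precedence order and record which source won."""
    values, origin = {}, {}
    for source in PRECEDENCE:
        for key, value in sources.get(source, {}).items():
            values[key] = value
            origin[key] = source
    return values, origin


sources = {
    "play vars": {"app_version": "3.2.1", "app_port": 8080},
    "vars_files": {"app_port": 9090},
    "extra vars (-e)": {"app_version": "3.3.0"},
}
values, origin = resolve(sources)
print(values)
print(origin)
```

The full precedence list in the Ansible documentation has more than twenty levels (role defaults, inventory variables, and so on), so treat this as a reading aid for the five sources above only.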

External variables file vars/common.yml:

---
app_name: myapp
app_user: appuser
app_group: appgroup
log_dir: /var/log/{{ app_name }}
data_dir: /data/{{ app_name }}

3.3 Conditionals (when)

tasks:
  # 基本条件
  - name: 仅在 CentOS 上安装 EPEL
    yum:
      name: epel-release
      state: present
    when: ansible_distribution == "CentOS"

  # 多条件
  - name: 仅在 CentOS 7 或 RHEL 7 上执行
    shell: some_command
    when:
      - ansible_distribution in ["CentOS", "RedHat"]
      - ansible_distribution_major_version == "7"

  # 基于变量的条件
  - name: 仅在主节点上执行
    shell: init_master.sh
    when: is_master | default(false) | bool

  # 基于 register 的条件
  - name: 检查服务是否运行
    shell: systemctl is-active nginx
    register: nginx_status
    ignore_errors: yes

  - name: 如果 Nginx 未运行则启动
    service:
      name: nginx
      state: started
    when: nginx_status.rc != 0

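  # Negated condition
  - name: Check whether the config file exists
    stat:
      path: /etc/app/config.yml
    register: config_file

  - name: Create the file if it does not exist
    file:
      path: /etc/app/config.yml
      state: touch
    when: not config_file.stat.exists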

3.4 Loops

tasks:
  # 简单列表循环
  - name: 安装多个软件包
    yum:
      name: "{{ item }}"
      state: present
    loop:
      - nginx
      - redis
      - mysql-server
      - python3

  # 更推荐的写法(直接传列表给 name 参数)
  - name: 安装多个软件包(推荐写法)
    yum:
      name:
        - nginx
        - redis
        - mysql-server
      state: present

  # 字典循环
  - name: 创建多个用户
    user:
      name: "{{ item.name }}"
      groups: "{{ item.groups }}"
      shell: "{{ item.shell }}"
    loop:
      - { name: "dev01", groups: "developers", shell: "/bin/bash" }
      - { name: "ops01", groups: "operations", shell: "/bin/zsh" }
      - { name: "test01", groups: "testers", shell: "/bin/bash" }

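  # Nested loop (every user paired with every parent directory)
  - name: Create a per-user directory under several parent directories
    file:
      path: "{{ item.1 }}/{{ item.0 }}"
      state: directory
      owner: "{{ item.0 }}"
      recurse: yes
    loop: "{{ ['user1', 'user2'] | product(['/data', '/logs']) | list }}"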

  # 使用 loop_control
  - name: 部署多个虚拟主机
    template:
      src: vhost.conf.j2
      dest: "/etc/nginx/conf.d/{{ item.name }}.conf"
    loop: "{{ virtual_hosts }}"
    loop_control:
      label: "{{ item.name }}"     # 精简输出
      pause: 2                      # 每次循环暂停 2 秒

  # 使用 until 重试
  - name: 等待服务就绪
    uri:
      url: "http://localhost:{{ app_port }}/health"
      status_code: 200
    register: health_check
    until: health_check.status == 200
    retries: 30
    delay: 5
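
The nested loop above relies on the `product` filter, which behaves like Python's `itertools.product`. The short sketch below shows which pairs the loop iterates over; the variable names are only for illustration.

```python
from itertools import product

users = ["user1", "user2"]
dirs = ["/data", "/logs"]

# Each pair plays the role of `item`: pair[0] is item.0 and pair[1] is item.1.
pairs = list(product(users, dirs))
for user, directory in pairs:
    print(f"owner={user} path={directory}")
```

The first list varies slowest, so all directories are visited for `user1` before the loop moves on to `user2`.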

3.5 Handlers

---
- name: 配置管理
  hosts: web
  become: yes

  tasks:
    - name: 更新 Nginx 配置
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify:
        - Validate Nginx Config
        - Reload Nginx

    - name: 更新应用配置
      template:
        src: app.conf.j2
        dest: /etc/app/config.yml
      notify: Restart App

  handlers:
    - name: Validate Nginx Config
      command: nginx -t
      listen: "Validate and Reload Nginx"

    - name: Reload Nginx
      service:
        name: nginx
        state: reloaded
      listen: "Validate and Reload Nginx"

    - name: Restart App
      service:
        name: myapp
        state: restarted

    # 使用 flush_handlers 在中间触发
    # - meta: flush_handlers
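
Two properties of handlers are easy to miss: a handler runs at most once per flush no matter how many tasks notified it, and handlers run in the order they are defined under `handlers:`, not in the order they were notified. The Python model below illustrates just those two rules; `flush_handlers` here is a hypothetical helper, and the model assumes every notifying task reported a change.

```python
def flush_handlers(handler_order, notifications):
    """Return the handlers that would run: each once, in definition order."""
    notified = set(notifications)
    return [name for name in handler_order if name in notified]


handler_order = ["Validate Nginx Config", "Reload Nginx", "Restart App"]

# "Reload Nginx" is notified twice, and the notify order differs from the
# definition order.
notifications = ["Reload Nginx", "Restart App", "Reload Nginx", "Validate Nginx Config"]

executed = flush_handlers(handler_order, notifications)
print(executed)
```

Because of the ordering rule, defining the validation handler before the reload handler is what guarantees that `nginx -t` runs first.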

3.6 Tags

---
- name: 系统初始化
  hosts: all
  become: yes

  tasks:
    - name: 设置时区
      timezone:
        name: Asia/Shanghai
      tags: [timezone, init]

    - name: 配置 NTP
      template:
        src: chrony.conf.j2
        dest: /etc/chrony.conf
      notify: Restart Chrony
      tags: [ntp, init]

    - name: 配置 SSH
      template:
        src: sshd_config.j2
        dest: /etc/ssh/sshd_config
      notify: Restart SSH
      tags: [ssh, security]

    - name: 配置防火墙
      firewalld:
        port: "{{ item }}/tcp"
        permanent: yes
        state: enabled
      loop: [22, 80, 443]
      tags: [firewall, security]

  handlers:
    - name: Restart Chrony
      service: { name: chronyd, state: restarted }
    - name: Restart SSH
      service: { name: sshd, state: restarted }

Select tasks by tag on the command line:

# Run only tasks with specific tags
ansible-playbook site.yml --tags "security"
ansible-playbook site.yml --tags "ntp,ssh"

# Exclude specific tags
ansible-playbook site.yml --skip-tags "firewall"

# List all tags
ansible-playbook site.yml --list-tags
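
The selection logic behind `--tags` and `--skip-tags` can be approximated as follows. This is a simplified sketch: the special tags `always` and `never` are deliberately left out, and the `selected` function and task table are illustrative names mirroring the playbook above.

```python
def selected(task_tags, only=None, skip=None):
    """Return True if a task with `task_tags` would run under --tags/--skip-tags."""
    tags = set(task_tags)
    if only and not tags & set(only):
        return False  # --tags given, but the task carries none of them
    if skip and tags & set(skip):
        return False  # the task carries a skipped tag
    return True


tasks = {
    "timezone": ["timezone", "init"],
    "ntp": ["ntp", "init"],
    "ssh": ["ssh", "security"],
    "firewall": ["firewall", "security"],
}

print([name for name, tags in tasks.items() if selected(tags, only=["security"])])
print([name for name, tags in tasks.items() if selected(tags, skip=["firewall"])])
```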

4. Organizing with Roles

4.1 Directory Layout

roles/
└── nginx/
    ├── defaults/
    │   └── main.yml          # default variables (lowest precedence)
    ├── vars/
    │   └── main.yml          # role variables (high precedence)
    ├── tasks/
    │   └── main.yml          # task list
    ├── handlers/
    │   └── main.yml          # handler definitions
    ├── templates/
    │   └── nginx.conf.j2     # Jinja2 templates
    ├── files/
    │   └── ssl.crt           # static files
    ├── meta/
    │   └── main.yml          # role metadata and dependencies
    ├── tests/
    │   ├── inventory
    │   └── test.yml
    └── README.md

4.2 Role Example: Nginx

roles/nginx/defaults/main.yml

---
nginx_worker_processes: "{{ ansible_processor_vcpus }}"
nginx_worker_connections: 1024
nginx_keepalive_timeout: 65
nginx_client_max_body_size: 50m
nginx_server_name: localhost
nginx_ssl_enabled: false
nginx_listen_port: 80

roles/nginx/tasks/main.yml

---
- name: 安装 Nginx
  package:
    name: nginx
    state: present

- name: 确保配置目录存在
  file:
    path: "{{ item }}"
    state: directory
    mode: '0755'
  loop:
    - /etc/nginx/conf.d
    - /etc/nginx/ssl

- name: 部署主配置文件
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    owner: root
    group: root
    mode: '0644'
    validate: "nginx -t -c %s"
  notify: Reload Nginx

- name: 部署虚拟主机配置
  template:
    src: vhost.conf.j2
    dest: "/etc/nginx/conf.d/{{ item.server_name }}.conf"
  loop: "{{ nginx_vhosts | default([]) }}"
  notify: Reload Nginx

- name: 启动并启用 Nginx
  service:
    name: nginx
    state: started
    enabled: yes

- name: 配置防火墙放行
  firewalld:
    port: "{{ nginx_listen_port }}/tcp"
    permanent: yes
    state: enabled
  notify: Reload Firewalld
  when: ansible_os_family == "RedHat"

roles/nginx/handlers/main.yml

---
- name: Reload Nginx
  service:
    name: nginx
    state: reloaded

- name: Reload Firewalld
  service:
    name: firewalld
    state: reloaded

roles/nginx/templates/nginx.conf.j2

# Managed by Ansible — DO NOT EDIT MANUALLY
user nginx;
worker_processes {{ nginx_worker_processes }};
error_log /var/log/nginx/error.log warn;
pid /run/nginx.pid;

events {
    worker_connections {{ nginx_worker_connections }};
    use epoll;
    multi_accept on;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout {{ nginx_keepalive_timeout }};
    client_max_body_size {{ nginx_client_max_body_size }};

    gzip on;
    gzip_types text/plain text/css application/json application/javascript;
    gzip_min_length 1000;

    include /etc/nginx/conf.d/*.conf;
}

4.3 Using Roles

---
# site.yml — 主入口 Playbook
- name: 配置 Web 服务器
  hosts: web
  become: yes
  roles:
    - role: common          # 基础配置
    - role: nginx           # Nginx
    - role: app             # 应用部署

# Passing variables and tags to a role
- name: 数据库服务器
  hosts: db
  become: yes
  roles:
    - role: common
    - role: mysql
      vars:
        mysql_root_password: "{{ vault_mysql_root_password }}"
      tags: [db, mysql]

4.4 Ansible Galaxy

# Install roles from Galaxy
ansible-galaxy install geerlingguy.nginx
ansible-galaxy install geerlingguy.mysql -p roles/

# Install in bulk from a requirements file
cat > requirements.yml << 'EOF'
---
roles:
  - name: geerlingguy.nginx
    version: "3.1.0"
  - name: geerlingguy.mysql
    version: "4.0.0"
  - src: https://github.com/company/ansible-role-app.git
    name: company.app
    version: main
EOF

ansible-galaxy install -r requirements.yml

# Scaffold a custom role
ansible-galaxy init roles/myapp

5. Common Modules in Detail

5.1 file module

# 创建目录
- name: 创建应用目录
  file:
    path: /opt/myapp/{{ item }}
    state: directory
    owner: appuser
    group: appgroup
    mode: '0755'
  loop: ['bin', 'conf', 'logs', 'data']

# 创建符号链接
- name: 链接到当前版本
  file:
    src: /opt/myapp/releases/{{ app_version }}
    dest: /opt/myapp/current
    state: link
    owner: appuser

# 删除文件
- name: 清理临时文件
  file:
    path: "{{ item }}"
    state: absent
  loop:
    - /tmp/app_build.tar.gz
    - /tmp/app_build/

5.2 copy module

# 复制文件
- name: 部署配置文件
  copy:
    src: files/app.conf
    dest: /etc/myapp/app.conf
    owner: root
    group: root
    mode: '0644'
    backup: yes          # 覆盖前备份

# 内联内容
- name: 创建 MOTD
  copy:
    content: |
      ====================================
        Server: {{ inventory_hostname }}
        Environment: {{ env | default('production') }}
        Managed by Ansible
      ====================================
    dest: /etc/motd
    mode: '0644'

5.3 template module

- name: 部署应用配置
  template:
    src: templates/app.conf.j2
    dest: /etc/myapp/app.conf
    owner: appuser
    group: appgroup
    mode: '0640'
    validate: "/usr/bin/myapp check-config %s"   # 部署前验证
    backup: yes

5.4 service module

- name: 管理服务
  service:
    name: "{{ item.name }}"
    state: "{{ item.state }}"
    enabled: "{{ item.enabled }}"
  loop:
    - { name: nginx, state: started, enabled: true }
    - { name: redis, state: started, enabled: true }
    - { name: firewalld, state: stopped, enabled: false }

5.5 yum/apt modules

# CentOS/RHEL
- name: 安装软件包
  yum:
    name:
      - nginx
      - redis
      - git
      - vim
    state: present
    enablerepo: epel

# 安装本地 RPM
- name: 安装本地包
  yum:
    name: /tmp/app-1.0.rpm
    state: present

# Ubuntu/Debian
- name: 安装软件包
  apt:
    name:
      - nginx
      - redis-server
      - git
    state: present
    update_cache: yes
    cache_valid_time: 3600

# 添加 APT 仓库
- name: 添加 Docker GPG Key
  apt_key:
    url: https://download.docker.com/linux/ubuntu/gpg
    state: present

- name: 添加 Docker 仓库
  apt_repository:
    repo: "deb [arch=amd64] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable"
    state: present

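5.6 user module

- name: Create the application user
  user:
    name: appuser
    comment: "Application User"
    shell: /bin/bash
    home: /home/appuser
    create_home: yes
    system: yes
    # app_user_groups is an optional list defined elsewhere (hypothetical name); omitted when unset
    groups: "{{ app_user_groups | default(omit) }}"
    append: yes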

# 授权 sudo
- name: 配置 sudo 权限
  copy:
    content: "appuser ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart myapp\n"
    dest: /etc/sudoers.d/appuser
    mode: '0440'
    validate: "visudo -cf %s"

5.7 lineinfile module

# 修改配置文件中的单行
- name: 设置 SSH 端口
  lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^#?Port '
    line: 'Port 22022'
    state: present
  notify: Restart SSH

# 确保某行存在
- name: 添加 hosts 记录
  lineinfile:
    path: /etc/hosts
    line: "{{ item.ip }} {{ item.hostname }}"
    state: present
  loop:
    - { ip: "192.168.1.10", hostname: "app-server" }
    - { ip: "192.168.1.20", hostname: "db-server" }

# 删除匹配的行
- name: 移除危险配置
  lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^PermitRootLogin'
    state: absent

5.8 blockinfile module

- name: 添加自定义配置块
  blockinfile:
    path: /etc/sysctl.conf
    block: |
      # Ansible managed — network optimization
      net.core.somaxconn = 65535
      net.ipv4.tcp_max_syn_backlog = 65535
      net.ipv4.ip_local_port_range = 1024 65535
      net.ipv4.tcp_tw_reuse = 1
    marker: "# {mark} ANSIBLE MANAGED BLOCK - Network"
    state: present
  notify: Reload Sysctl

6. Jinja2 Templates

6.1 Basic Syntax

{# This is a comment #}

{# Variable output #}
Hostname: {{ inventory_hostname }}
IP address: {{ ansible_default_ipv4.address }}
CPU cores: {{ ansible_processor_vcpus }}

{# Conditionals #}
{% if ansible_memtotal_mb > 8192 %}
worker_connections 4096;
{% elif ansible_memtotal_mb > 4096 %}
worker_connections 2048;
{% else %}
worker_connections 1024;
{% endif %}

{# Loops #}
{% for user in users %}
{{ user.name }}:{{ user.uid }}:{{ user.shell }}
{% endfor %}

6.2 Common Filters

{# 字符串过滤器 #}
{{ app_name | upper }}                          {# MYAPP #}
{{ app_name | lower }}                          {# myapp #}
{{ app_name | capitalize }}                     {# Myapp #}
{{ app_name | replace('old', 'new') }}          {# 替换 #}
{{ path | basename }}                           {# 文件名 #}
{{ path | dirname }}                            {# 目录名 #}

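{# Numeric filters #}
{{ value | int }}                               {# to integer #}
{{ value | float }}                             {# to float #}
{{ (ansible_memtotal_mb * 0.8) | round(0) | int }}{# compute, then round; the parentheses are required because filters bind tighter than * #}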

{# 列表过滤器 #}
{{ list | unique }}                             {# 去重 #}
{{ list | sort }}                               {# 排序 #}
{{ list | length }}                             {# 长度 #}
{{ list | join(', ') }}                         {# 连接 #}
{{ list | first }}                              {# 第一个 #}
{{ list | last }}                               {# 最后一个 #}
{{ list | default(['item1']) }}                 {# 默认值 #}
{{ list | map('upper') | list }}               {# 映射 #}
{{ list | select('match', '^app') | list }}    {# 过滤 #}

{# 字典过滤器 #}
{{ dict | dict2items }}                         {# 转列表 #}
{{ items | items2dict }}                        {# 转字典 #}
{{ dict | combine(other_dict) }}               {# 合并字典 #}
{{ dict.keys() | list }}                        {# 获取键 #}

{# JSON/YAML #}
{{ config | to_json }}                          {# 转 JSON #}
{{ config | to_nice_json(indent=2) }}           {# 美化 JSON #}
{{ config | to_yaml }}                          {# 转 YAML #}

{# 哈希/加密 #}
{{ password | password_hash('sha512') }}        {# 生成密码哈希 #}
{{ content | hash('md5') }}                     {# MD5 #}
{{ content | b64encode }}                       {# Base64 编码 #}
{{ content | b64decode }}                       {# Base64 解码 #}
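
One pitfall with numeric filters deserves a worked example: Jinja2 filters bind more tightly than arithmetic operators, so an expression such as `ansible_memtotal_mb * 0.8 | round(0) | int` applies the filters to `0.8` alone. The Python arithmetic below mirrors both readings; the memory figure is a made-up value used only for illustration.

```python
mem_total_mb = 7821  # hypothetical value of ansible_memtotal_mb

# Equivalent of `mem * 0.8 | round(0) | int`: the filters apply to 0.8 only,
# which rounds to 1, so the memory value comes back unchanged.
filters_bind_first = mem_total_mb * int(round(0.8, 0))

# Equivalent of `(mem * 0.8) | round(0) | int`: the product is rounded.
parenthesized = int(round(mem_total_mb * 0.8, 0))

print(filters_bind_first, parenthesized)
```

Wrapping the arithmetic in parentheses before the first `|` is the safe habit whenever a filter follows a calculation.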

6.3 Advanced Template Examples

{# 动态生成 Nginx upstream 配置 #}
upstream {{ app_name }}_backend {
    least_conn;
{% for host in groups['web'] %}
    server {{ hostvars[host]['ansible_host'] }}:{{ app_port }} weight={{ hostvars[host].weight | default(1) }} max_fails=3 fail_timeout=30s;
{% endfor %}
}

{# 根据主机名生成不同配置 #}
{% set host_num = inventory_hostname | regex_replace('^web(\\d+)$', '\\1') | int %}
{% if host_num % 2 == 0 %}
    {# 偶数节点作为备用 #}
    server_role: standby
{% else %}
    server_role: primary
{% endif %}

{# 条件合并字典 #}
{% set default_config = {'workers': 4, 'max_conn': 1000} %}
{% set final_config = default_config | combine(custom_config | default({})) %}
workers: {{ final_config.workers }}
max_conn: {{ final_config.max_conn }}
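
The hostname-based role selection can be checked outside a template with Python's `re` module, which is what `regex_replace` is built on. The sketch below mirrors the template logic; `server_role` is an illustrative helper, and the fallback to 0 imitates the default of Jinja2's `int` filter for values it cannot convert.

```python
import re


def server_role(hostname):
    """Mirror the template logic: even-numbered web hosts become standby."""
    # Same idea as regex_replace with a backreference to the digit group.
    digits = re.sub(r"^web(\d+)$", r"\1", hostname)
    # Jinja2's int filter returns 0 when the value is not a number.
    host_num = int(digits) if digits.isdigit() else 0
    return "standby" if host_num % 2 == 0 else "primary"


for name in ["web01", "web02", "web03", "db01"]:
    print(name, server_role(name))
```

Note the side effect for hostnames that do not match the pattern: they convert to 0 and therefore land in the standby branch, so the template is best applied only to hosts in the `web` group.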

7. Advanced Inventory

7.1 host_vars and group_vars

inventory/
├── hosts.yml
├── host_vars/
│   ├── web01.yml          # variables for web01 only
│   ├── web02.yml          # variables for web02 only
│   └── db01.yml           # variables for db01 only
└── group_vars/
    ├── all.yml             # variables shared by all hosts
    ├── web.yml             # variables for the web group
    ├── db.yml              # variables for the db group
    └── production.yml      # variables for the production group

inventory/group_vars/web.yml

---
nginx_worker_processes: 4
app_port: 8080
app_env: production
deploy_user: deploy

inventory/host_vars/web01.yml

---
nginx_worker_processes: 8    # 覆盖组变量
is_primary: true
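
The effect of these two files on each host can be modelled as a dictionary merge in which the more specific scope wins: `all` first, then the group, then the host. This is a minimal sketch of that rule using the values from the files above; `effective_vars` is an illustrative name, and multi-group ordering is not modelled.

```python
def effective_vars(all_vars, group_vars, host_vars):
    """Merge inventory variables; more specific scopes override broader ones."""
    merged = dict(all_vars)
    merged.update(group_vars)
    merged.update(host_vars)
    return merged


group_web = {"nginx_worker_processes": 4, "app_port": 8080, "app_env": "production"}
host_web01 = {"nginx_worker_processes": 8, "is_primary": True}

web01 = effective_vars({}, group_web, host_web01)
web02 = effective_vars({}, group_web, {})
print(web01["nginx_worker_processes"], web02["nginx_worker_processes"])
```

To see the real merged result for a host, `ansible-inventory --host web01 -i inventory/hosts.yml` prints the variables Ansible resolved for it.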

7.2 Dynamic Inventory

A dynamic inventory fetches host information at run time from an external source such as a cloud API, a CMDB, or a script.

Custom dynamic inventory script:

#!/usr/bin/env python3
"""dynamic_inventory.py — 从 API 获取主机清单"""
import json
import sys
import requests

def get_inventory():
    # Fetch the host list from the CMDB/API; fail fast on timeouts and HTTP errors
    resp = requests.get("http://cmdb.internal/api/hosts", timeout=10)
    resp.raise_for_status()
    hosts = resp.json()

    inventory = {
        "_meta": {"hostvars": {}},
        "all": {"children": []}
    }

    groups = {}
    for host in hosts:
        name = host["hostname"]
        group = host["role"]  # web, db, cache 等

        if group not in groups:
            groups[group] = {"hosts": [], "vars": {}}
        groups[group]["hosts"].append(name)

        inventory["_meta"]["hostvars"][name] = {
            "ansible_host": host["ip"],
            "ansible_port": host.get("ssh_port", 22),
            "ansible_user": host.get("ssh_user", "deploy"),
            "env": host.get("env", "production"),
        }

    inventory.update(groups)
    inventory["all"]["children"] = list(groups.keys())
    return inventory

if __name__ == "__main__":
    if len(sys.argv) == 2 and sys.argv[1] == "--list":
        print(json.dumps(get_inventory(), indent=2))
    elif len(sys.argv) == 3 and sys.argv[1] == "--host":
        print(json.dumps({}))
    else:
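        sys.exit(1)

Using the script, and the AWS EC2 inventory plugin as an alternative:

# Use the dynamic inventory script
chmod +x dynamic_inventory.py
ansible -i dynamic_inventory.py all -m ping

# AWS EC2 dynamic inventory plugin (needs the amazon.aws collection and boto3)
ansible-galaxy collection install amazon.aws
pip install boto3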
cat > aws_ec2.yml << 'EOF'
plugin: amazon.aws.aws_ec2
regions:
  - ap-southeast-1
keyed_groups:
  - key: tags.Environment
    prefix: env
  - key: instance_type
    prefix: type
filters:
  tag:Managed: ansible
compose:
  ansible_host: public_ip_address
EOF

ansible -i aws_ec2.yml all -m ping
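
Before pointing Ansible at a custom inventory script, it helps to test the JSON layout without any network access. The sketch below isolates the transformation step of the script above into a pure function and feeds it sample records; the record fields follow the script, while `build_inventory` and the sample data are hypothetical.

```python
import json


def build_inventory(hosts):
    """Turn a list of host records into the dynamic-inventory JSON layout."""
    inventory = {"_meta": {"hostvars": {}}, "all": {"children": []}}
    for host in hosts:
        role = host["role"]
        group = inventory.setdefault(role, {"hosts": [], "vars": {}})
        group["hosts"].append(host["hostname"])
        inventory["_meta"]["hostvars"][host["hostname"]] = {
            "ansible_host": host["ip"],
            "ansible_port": host.get("ssh_port", 22),
        }
        if role not in inventory["all"]["children"]:
            inventory["all"]["children"].append(role)
    return inventory


# Hypothetical records in the shape the script expects from the CMDB.
sample = [
    {"hostname": "web01", "ip": "192.168.1.101", "role": "web"},
    {"hostname": "db01", "ip": "192.168.1.201", "role": "db", "ssh_port": 22022},
]
inv = build_inventory(sample)
print(json.dumps(inv, indent=2))
```

Returning all host variables under `_meta.hostvars` is what lets Ansible skip calling the script once per host with `--host`.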

8. Ansible Vault

8.1 Basic Operations

# Create an encrypted file
ansible-vault create secrets.yml

# Encrypt an existing file
ansible-vault encrypt vars/production/secrets.yml

# Edit an encrypted file
ansible-vault edit secrets.yml

# Decrypt a file
ansible-vault decrypt secrets.yml

# View encrypted content
ansible-vault view secrets.yml

# Change the password
ansible-vault rekey secrets.yml

# Encrypt a single string (for inline use)
ansible-vault encrypt_string 'SuperSecret123' --name 'db_password'
# Output:
# db_password: !vault |
#   $ANSIBLE_VAULT;1.1;AES256
#   66386439653236336...

8.2 Vault in Practice

# group_vars/production/vault.yml (values encrypted inline with encrypt_string)
---
vault_db_password: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  ...
vault_api_key: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  ...

Supplying the vault password at run time:

# Prompt for the password
ansible-playbook site.yml --ask-vault-pass

# Use a password file (recommended for CI/CD)
ansible-playbook site.yml --vault-password-file ~/.vault_pass

# Multiple vault IDs (different keys for different sensitivity levels)
ansible-vault encrypt_string --vault-id prod@prompt 'secret' --name 'api_key'
ansible-vault encrypt_string --vault-id dev@dev_pass.txt 'devsecret' --name 'dev_key'

ansible-playbook site.yml --vault-id prod@prompt --vault-id dev@dev_pass.txt

8.3 CI/CD Integration

# GitLab CI example
deploy:
  stage: deploy
  script:
    - echo "$VAULT_PASSWORD" > ~/.vault_pass
    - chmod 600 ~/.vault_pass
    - ansible-playbook -i inventory/production site.yml
      --vault-password-file ~/.vault_pass
      --limit "$TARGET_HOSTS"
  variables:
    ANSIBLE_HOST_KEY_CHECKING: "False"
  only:
    - main

Jenkins Pipeline shell step:

# Use Credentials Binding in the Jenkins Pipeline:
# store the vault password in Jenkins Credentials and pass it in as an environment variable
echo "${ANSIBLE_VAULT_PASS}" > .vault_pass
chmod 600 .vault_pass
ansible-playbook -i inventory site.yml --vault-password-file .vault_pass
rm -f .vault_pass
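
The write, use, delete pattern in the shell step leaves the password on disk if the playbook run is interrupted before `rm`. One way to make the cleanup unconditional is a context manager, sketched below in Python. The helper name `vault_password_file` and the sample password are illustrative, and the command is only printed here; a real wrapper would hand it to `subprocess.run`.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def vault_password_file(password):
    """Write the password to a private temp file and always delete it afterwards."""
    # mkstemp creates the file readable and writable only by the current user.
    fd, path = tempfile.mkstemp(prefix="vault_pass_")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(password)
        yield path
    finally:
        os.remove(path)


with vault_password_file("example-only") as path:
    command = ["ansible-playbook", "-i", "inventory", "site.yml",
               "--vault-password-file", path]
    print(" ".join(command))
```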

9. Hands-on Scenarios

9.1 Server Initialization Playbook

---
# playbooks/init-server.yml
- name: 系统初始化
  hosts: all
  become: yes
  gather_facts: yes

  vars:
    timezone: Asia/Shanghai
    ssh_port: 22022
    swap_size_mb: 2048
    sysctl_params:
      net.core.somaxconn: 65535
      net.ipv4.tcp_max_syn_backlog: 65535
      net.ipv4.ip_local_port_range: "1024 65535"
      vm.swappiness: 10
      fs.file-max: 655350

  tasks:
    - name: 设置时区
      timezone:
        name: "{{ timezone }}"

    - name: 安装基础软件包
      package:
        name:
          - vim
          - git
          - curl
          - wget
          - htop
          - iotop
          - net-tools
          - lsof
          - strace
          - chrony
          - bash-completion
        state: present

    - name: 配置 NTP
      template:
        src: chrony.conf.j2
        dest: /etc/chrony.conf
      notify: Restart Chrony

    - name: 设置系统参数
      sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        state: present
        reload: yes
        sysctl_file: /etc/sysctl.d/99-ansible.conf
      loop: "{{ sysctl_params | dict2items }}"

    - name: 配置文件描述符限制
      pam_limits:
        domain: '*'
        limit_type: "{{ item.type }}"
        limit_item: nofile
        value: "{{ item.value }}"
      loop:
        - { type: soft, value: 655350 }
        - { type: hard, value: 655350 }

    - name: 配置 SSH
      lineinfile:
        path: /etc/ssh/sshd_config
        regexp: "{{ item.regexp }}"
        line: "{{ item.line }}"
      loop:
        - { regexp: '^#?Port ', line: "Port {{ ssh_port }}" }
        - { regexp: '^#?PermitRootLogin', line: 'PermitRootLogin no' }
        - { regexp: '^#?PasswordAuthentication', line: 'PasswordAuthentication no' }
        - { regexp: '^#?UseDNS', line: 'UseDNS no' }
      notify: Restart SSH

    - name: Create a swap file
      block:
        - name: Check for existing swap
          command: swapon --show
          register: swap_check
          changed_when: false

        # shell is required here: the command module does not interpret "&&"
        - name: Create and enable swap
          shell: >
            dd if=/dev/zero of=/swapfile bs=1M count={{ swap_size_mb }}
            && chmod 600 /swapfile
            && mkswap /swapfile
            && swapon /swapfile
          when: swap_check.stdout == ""

        - name: Add the swap file to fstab
          lineinfile:
            path: /etc/fstab
            line: "/swapfile swap swap defaults 0 0"
          when: swap_check.stdout == ""

  handlers:
    - name: Restart Chrony
      service: { name: chronyd, state: restarted }
    - name: Restart SSH
      service: { name: sshd, state: restarted }

9.2 Batch Application Deployment

---
# playbooks/deploy-app.yml
- name: Deploy the application
  hosts: web
  become: yes
  serial: "30%"     # rolling update, 30% of hosts per batch
  max_fail_percentage: 10   # abort if more than 10% of a batch fails

  vars:
    app_name: myapp
    app_version: "3.2.1"
    app_repo: "registry.example.com/{{ app_name }}"
    deploy_dir: /opt/{{ app_name }}
    release_dir: "{{ deploy_dir }}/releases/{{ app_version }}"
    current_link: "{{ deploy_dir }}/current"

  pre_tasks:
    # No run_once here: every host in the batch must be removed, not just the first
    - name: Remove the host from the load balancer
      uri:
        url: "http://lb.internal/api/remove"
        method: POST
        body: '{"host": "{{ inventory_hostname }}"}'
        body_format: json
      delegate_to: localhost

  tasks:
    - name: 创建目录结构
      file:
        path: "{{ item }}"
        state: directory
        owner: deploy
        mode: '0755'
      loop:
        - "{{ release_dir }}"
        - "{{ deploy_dir }}/shared/config"
        - "{{ deploy_dir }}/shared/logs"

    - name: 拉取应用镜像
      command: "docker pull {{ app_repo }}:{{ app_version }}"

    - name: 部署配置文件
      template:
        src: templates/app.conf.j2
        dest: "{{ deploy_dir }}/shared/config/app.conf"
        owner: deploy
        mode: '0640'
      notify: Restart App

    - name: 停止旧容器
      docker_container:
        name: "{{ app_name }}"
        state: stopped
      ignore_errors: yes

    - name: 启动新容器
      docker_container:
        name: "{{ app_name }}"
        image: "{{ app_repo }}:{{ app_version }}"
        state: started
        restart_policy: unless-stopped
        ports:
          - "{{ app_port }}:8080"
        volumes:
          - "{{ deploy_dir }}/shared/config:/app/config:ro"
          - "{{ deploy_dir }}/shared/logs:/app/log"
        env:
          APP_ENV: "{{ app_env }}"
          DB_HOST: "{{ db_host }}"

    - name: 等待健康检查通过
      uri:
        url: "http://localhost:{{ app_port }}/health"
        status_code: 200
      register: health
      until: health.status == 200
      retries: 30
      delay: 5

    - name: 更新软链接
      file:
        src: "{{ release_dir }}"
        dest: "{{ current_link }}"
        state: link
        owner: deploy

  post_tasks:
    - name: 重新加入负载均衡
      uri:
        url: "http://lb.internal/api/add"
        method: POST
        body: '{"host": "{{ inventory_hostname }}"}'
        body_format: json
      delegate_to: localhost

  handlers:
    - name: Restart App
      docker_container:
        name: "{{ app_name }}"
        state: started
        restart: yes
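
To get a feel for what `serial: "30%"` means for a given group size, the batching can be approximated in Python. This is a simplified model: rounding the percentage down with a minimum of one host per batch is my understanding of Ansible's behaviour and worth verifying against your version, and `batch_size` and `batches` are illustrative names.

```python
def batch_size(serial, host_count):
    """Convert a serial value such as "30%" or 2 into a batch size (minimum 1)."""
    if isinstance(serial, str) and serial.endswith("%"):
        return max(1, int(serial[:-1]) * host_count // 100)
    return max(1, int(serial))


def batches(hosts, serial):
    """Split the host list into consecutive batches of the computed size."""
    size = batch_size(serial, len(hosts))
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


hosts = [f"web{n:02d}" for n in range(1, 11)]
for group in batches(hosts, "30%"):
    print(group)
```

With only three web hosts, 30% rounds down to zero and is raised to one, so the play degrades to a strictly host-by-host rollout.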

9.3 Security Hardening Playbook

---
# playbooks/hardening.yml
- name: 服务器安全加固
  hosts: all
  become: yes

  tasks:
    - name: 禁用不需要的服务
      service:
        name: "{{ item }}"
        state: stopped
        enabled: no
      loop:
        - cups
        - avahi-daemon
        - postfix
      ignore_errors: yes

    - name: 配置密码策略
      lineinfile:
        path: /etc/login.defs
        regexp: "{{ item.regexp }}"
        line: "{{ item.line }}"
      loop:
        - { regexp: '^PASS_MAX_DAYS', line: 'PASS_MAX_DAYS 90' }
        - { regexp: '^PASS_MIN_DAYS', line: 'PASS_MIN_DAYS 7' }
        - { regexp: '^PASS_MIN_LEN', line: 'PASS_MIN_LEN 12' }

    - name: 锁定 root 账户
      user:
        name: root
        password_lock: yes

    - name: 配置 fail2ban
      block:
        - name: 安装 fail2ban
          package:
            name: fail2ban
            state: present

        - name: 配置 fail2ban
          copy:
            content: |
              [DEFAULT]
              bantime = 3600
              findtime = 600
              maxretry = 5

              [sshd]
              enabled = true
              port = {{ ssh_port | default(22) }}
              logpath = /var/log/secure
            dest: /etc/fail2ban/jail.local
          notify: Restart fail2ban

        - name: 启动 fail2ban
          service:
            name: fail2ban
            state: started
            enabled: yes

    - name: 配置审计规则
      copy:
        content: |
          -w /etc/passwd -p wa -k identity
          -w /etc/shadow -p wa -k identity
          -w /etc/sudoers -p wa -k sudoers
          -w /var/log/ -p wa -k logs
        dest: /etc/audit/rules.d/ansible.rules
      notify: Restart auditd

  handlers:
    - name: Restart fail2ban
      service: { name: fail2ban, state: restarted }
    - name: Restart auditd
      service: { name: auditd, state: restarted }

9.4 Configuration Management

---
# playbooks/config-management.yml
- name: 配置文件管理
  hosts: all
  become: yes

  vars:
    config_files:
      - src: sshd_config.j2
        dest: /etc/ssh/sshd_config
        mode: '0600'
        notify: Restart SSH
      - src: sysctl.conf.j2
        dest: /etc/sysctl.d/99-custom.conf
        mode: '0644'
        notify: Reload Sysctl
      - src: logrotate.conf.j2
        dest: /etc/logrotate.d/custom
        mode: '0644'

  tasks:
    - name: 批量部署配置文件
      template:
        src: "templates/{{ item.src }}"
        dest: "{{ item.dest }}"
        owner: root
        group: root
        mode: "{{ item.mode }}"
        validate: "{{ item.validate | default(omit) }}"
      loop: "{{ config_files }}"
      notify: "{{ item.notify | default(omit) }}"

    # Runs only for config_files entries that define the optional check_cmd key
    - name: Check configuration syntax
      command: "{{ item.check_cmd }}"
      loop: "{{ config_files }}"
      when: item.check_cmd is defined
      changed_when: false

  handlers:
    - name: Restart SSH
      service: { name: sshd, state: restarted }
    - name: Reload Sysctl
      command: sysctl --system
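
The `validate` option used above follows a write-check-replace pattern: the rendered file is written to a temporary path, a command is run against it with `%s` standing in for that path, and the live file is replaced only if the command succeeds. The following is a minimal Python sketch of that idea, not Ansible's actual implementation; the function name `install_validated` and the demo validator are assumptions made for illustration.

```python
import os
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path


def install_validated(content: str, dest: Path, validate: str) -> bool:
    """Write `content` to `dest` only if the `validate` command accepts it.

    `validate` must contain `%s`, which is replaced by the path of a temporary file.
    Returns True when `dest` was replaced, False when validation failed.
    """
    if "%s" not in validate:
        raise ValueError("validate must contain %s")
    # Create the candidate next to the destination so the final move stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest.parent)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # Substitute the temporary path into each argument of the validation command.
    cmd = [part.replace("%s", tmp_path) for part in shlex.split(validate)]
    if subprocess.run(cmd).returncode != 0:
        os.unlink(tmp_path)  # rejected: the live file is left untouched
        return False
    os.replace(tmp_path, dest)  # accepted: move the candidate into place
    return True


if __name__ == "__main__":
    # Hypothetical validator: exits 0 only if the file starts with "Port".
    checker = "import sys; sys.exit(0 if open(sys.argv[1]).read().startswith('Port') else 1)"
    validate = shlex.join([sys.executable, "-c", checker, "%s"])
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "sshd_config"
        print(install_validated("Port 22022\n", target, validate))  # accepted
        print(install_validated("garbage\n", target, validate))     # rejected
        print(target.read_text())                                   # still the accepted content
```

The point of the design is that a bad template can never reach the destination path, which matters most for files such as `sshd_config` where a broken configuration can lock you out of the host.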

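10. Best Practices

10.1 Idempotency

Idempotency means that running the same task repeatedly leaves the system in the same end state as running it once. It is a core design principle of Ansible.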

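# ❌ Wrong: not idempotent, every run appends another copy of the line
- name: Add a configuration line
  shell: echo "option=value" >> /etc/app.conf

# ✅ Right: idempotent, the line is added only if it is missing
- name: Add a configuration line
  lineinfile:
    path: /etc/app.conf
    line: "option=value"
    state: present

# ❌ Wrong: runs on every execution
- name: Initialize the database
  shell: /opt/app/bin/init-db.sh

# ✅ Right: check first, then run only when needed
- name: Check whether the database is already initialized
  command: /opt/app/bin/check-db.sh
  register: db_check
  changed_when: false
  failed_when: false

- name: Initialize the database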
  command: /opt/app/bin/init-db.sh
  when: db_check.rc != 0
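
The difference between the `echo >>` and `lineinfile` versions comes down to checking the current state before acting and reporting whether anything changed. The following is a minimal Python sketch of that check-then-act logic, a simplification rather than how `lineinfile` is actually implemented; `ensure_line` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Append `line` to `path` only if it is missing; return True when the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, nothing to do
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "app.conf"  # hypothetical file standing in for /etc/app.conf
        print(ensure_line(conf, "option=value"))  # first run changes the file
        print(ensure_line(conf, "option=value"))  # second run is a no-op
        print(conf.read_text().count("option=value"))
```

The returned flag plays the role of Ansible's `changed` status: a second run that reports no change is the practical test of whether a task is idempotent.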

10.2 Error Handling

tasks:
  # ignore_errors: ignore the failure and keep going
  - name: Try to stop a service that may not be running
    service:
      name: myapp
      state: stopped
    ignore_errors: yes

  # block/rescue/always: similar to try/catch/finally
  - name: Deployment flow with error handling
    block:
      - name: Deploy the new version
        command: deploy.sh

      - name: Run the health check
        uri:
          url: "http://localhost/health"
          status_code: 200
        retries: 10
        delay: 5

    rescue:
      - name: Roll back to the previous version
        command: rollback.sh

      - name: Send an alert
        slack:
          token: "{{ slack_token }}"
          msg: "Deployment failed and was rolled back: {{ inventory_hostname }}"
          channel: "#ops-alerts"

    always:
      - name: Clean up build artifacts
        file:
          path: /tmp/build
          state: absent

  # failed_when: custom failure condition (list items are combined with AND)
  - name: Run the migration script
    shell: migrate.sh
    register: migrate_result
    failed_when:
      - migrate_result.rc != 0
      - "'already migrated' not in migrate_result.stderr"

  # changed_when: custom changed condition (a read-only command never counts as a change)
  - name: Record the current configuration checksum
    command: md5sum /etc/app.conf
    register: config_hash
    changed_when: false
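
Since the comment above compares `block`/`rescue`/`always` to try/catch/finally, the control flow can be modeled directly in Python. This is a simplified sketch for intuition only: `run_block` and the task functions are hypothetical names, and real Ansible tracks failure per host rather than with exceptions. The second function mirrors the `failed_when` list, whose conditions all have to be true for the task to fail.

```python
def run_block(block, rescue, always):
    """Run `block` tasks in order; on the first failure run `rescue`; always run `always`."""
    log = []
    try:
        for task in block:
            task(log)
    except Exception as exc:
        log.append(f"failed: {exc}")
        for task in rescue:
            task(log)
    finally:
        for task in always:
            task(log)
    return log


def migration_failed(rc: int, stderr: str) -> bool:
    """Mirror the failed_when list: fail only if every listed condition is true."""
    return rc != 0 and "already migrated" not in stderr


def deploy(log):
    log.append("deploy")


def health_check(log):
    raise RuntimeError("health check returned 503")


def rollback(log):
    log.append("rollback")


def cleanup(log):
    log.append("cleanup")


if __name__ == "__main__":
    print(run_block([deploy, health_check], [rollback], [cleanup]))
    # ['deploy', 'failed: health check returned 503', 'rollback', 'cleanup']
    print(migration_failed(1, "error: already migrated"))  # False
    print(migration_failed(1, "connection refused"))       # True
```

Note how a non-zero exit code alone does not fail the migration task: the "already migrated" message on stderr turns it into a success, which is exactly what the two-item `failed_when` list expresses.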

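10.3 Performance Tuning

# Performance-related settings in ansible.cfg
# (ansible.cfg only accepts ';' for inline comments, so '#' comments sit on their own lines)
[defaults]
# Raise the number of parallel worker processes
forks = 50
# Skip fact gathering for hosts whose facts are already cached
gathering = smart
# Cache facts as JSON files
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
# Keep cached facts for 24 hours
fact_caching_timeout = 86400

[ssh_connection]
# Enable pipelining (the key optimization)
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=600s

# Playbook-level optimization
- name: Optimization example
  hosts: web
  gather_facts: no            # facts are gathered on demand by the setup task below
  strategy: free              # free strategy: hosts do not wait for slower hosts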

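  tasks:
    # Gather only the required fact subsets, and only when needed
    - name: Gather network and hardware facts
      setup:
        gather_subset:
          - network
          - hardware
      when: need_network_facts | default(false)

    # Use async to run a long task in the background
    - name: Install system updates asynchronously
      yum:
        name: "*"
        state: latest
      async: 600
      poll: 0
      register: yum_update

    # ... other tasks ...

    - name: Wait for the system update to finish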
      async_status:
        jid: "{{ yum_update.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 60
      delay: 10
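
The waiting task above polls the background job with `until`, `retries` and `delay`; 60 retries at 10 seconds each gives roughly the same 600-second budget as `async: 600`. The polling loop itself can be sketched in Python as below. This is an approximation for illustration: `wait_until` is a hypothetical helper, and Ansible's exact attempt counting may differ from this sketch.

```python
import time


def wait_until(check, attempts: int, delay: float, sleep=time.sleep):
    """Call `check()` until it returns a truthy result or `attempts` polls are used up.

    Returns (result, attempts_used); raises TimeoutError if the condition is never met.
    """
    for attempt in range(1, attempts + 1):
        result = check()
        if result:
            return result, attempt
        if attempt < attempts:
            sleep(delay)  # wait before the next poll, but not after the last one
    raise TimeoutError(f"condition not met after {attempts} attempts")


if __name__ == "__main__":
    polls = iter([False, False, True])  # hypothetical job that finishes on the third poll
    # The sleep function is replaced with a no-op so the demo returns immediately.
    result, used = wait_until(lambda: next(polls), attempts=60, delay=10, sleep=lambda s: None)
    print(result, used)  # True 3
```

Keeping the polling budget (attempts times delay) at least as large as the async timeout avoids giving up on a job that is still allowed to run.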

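10.4 Debugging Tips

# Verbose output
ansible-playbook site.yml -v       # basic
ansible-playbook site.yml -vvv     # detailed
ansible-playbook site.yml -vvvv    # connection-level debugging

# Check only, change nothing
ansible-playbook site.yml --check --diff

# Confirm each task before it runs
ansible-playbook site.yml --step

# Limit the run to specific hosts
ansible-playbook site.yml --limit web01

# Start at a specific task
ansible-playbook site.yml --start-at-task="Deploy Config"

# List all tasks
ansible-playbook site.yml --list-tasks

# List all hosts
ansible-playbook site.yml --list-hosts

# Profiling (newer ansible-core releases use ANSIBLE_CALLBACKS_ENABLED;
# the older ANSIBLE_CALLBACK_WHITELIST name is deprecated)
ANSIBLE_CALLBACKS_ENABLED=timer,profile_tasks ansible-playbook site.yml

# Debugging tasks inside a playbook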
- name: Print all variables
  debug:
    var: hostvars[inventory_hostname]
    verbosity: 1    # shown only with -v or higher

- name: Print selected information
  debug:
    msg: |
      Hostname: {{ inventory_hostname }}
      IP address: {{ ansible_default_ipv4.address }}
      OS: {{ ansible_distribution }} {{ ansible_distribution_version }}
      Memory: {{ ansible_memtotal_mb }}MB
      CPU: {{ ansible_processor_vcpus }} cores

# Use assert to check preconditions
- name: Verify prerequisites
  assert:
    that:
      - ansible_memtotal_mb >= 4096
      - ansible_distribution in ["CentOS", "Ubuntu", "Debian"]
      - ansible_distribution_major_version | int >= 7
    fail_msg: "Host does not meet the minimum system requirements"
    success_msg: "Prerequisite checks passed"

10.5 Recommended Project Layout

ansible-project/
├── ansible.cfg
├── site.yml                  # main entry point
├── requirements.yml          # Galaxy dependencies
├── inventory/
│   ├── production/
│   │   ├── hosts.yml
│   │   ├── group_vars/
│   │   │   ├── all.yml
│   │   │   ├── web.yml
│   │   │   └── vault.yml     # encrypted variables
│   │   └── host_vars/
│   └── staging/
│       ├── hosts.yml
│       └── group_vars/
├── playbooks/
│   ├── init-server.yml
│   ├── deploy-app.yml
│   └── hardening.yml
├── roles/
│   ├── common/
│   ├── nginx/
│   ├── mysql/
│   └── app/
├── templates/
├── files/
├── vars/
└── scripts/
    └── dynamic_inventory.py
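
The layout lists `scripts/dynamic_inventory.py` without showing what such a script looks like. Below is a minimal sketch of one, assuming the standard script-inventory contract: `--list` prints all groups as JSON with a `_meta.hostvars` section, and `--host <name>` prints the variables of one host. The `HOSTS` table is hypothetical static data reusing addresses from section 1.3; a real script would query a CMDB or cloud API instead.

```python
#!/usr/bin/env python3
import argparse
import json

# Hypothetical static data; replace with a lookup against your real source of truth.
HOSTS = {
    "web01": {"group": "web", "ansible_host": "192.168.1.101"},
    "web02": {"group": "web", "ansible_host": "192.168.1.102"},
    "db01": {"group": "db", "ansible_host": "192.168.1.201", "ansible_port": 22022},
}


def build_inventory(hosts):
    """Build the JSON structure Ansible expects from an inventory script."""
    inventory = {"_meta": {"hostvars": {}}}
    for name, attrs in hosts.items():
        group = inventory.setdefault(attrs["group"], {"hosts": []})
        group["hosts"].append(name)
        # Everything except the group name becomes a host variable.
        inventory["_meta"]["hostvars"][name] = {
            key: value for key, value in attrs.items() if key != "group"
        }
    return inventory


def main(argv=None):
    parser = argparse.ArgumentParser(description="Example dynamic inventory")
    parser.add_argument("--list", action="store_true", help="print the full inventory")
    parser.add_argument("--host", help="print variables for a single host")
    args = parser.parse_args(argv)

    inventory = build_inventory(HOSTS)
    if args.host:
        print(json.dumps(inventory["_meta"]["hostvars"].get(args.host, {})))
    else:
        print(json.dumps(inventory, indent=2))


if __name__ == "__main__":
    main()
```

After making the script executable, it can be checked with `ansible-inventory -i scripts/dynamic_inventory.py --graph`. Returning `_meta.hostvars` in the `--list` output lets Ansible avoid calling the script once per host with `--host`.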

Appendix: Command Quick Reference

Command                                      Purpose
ansible all -m ping                          Test connectivity to all hosts
ansible-playbook site.yml                    Run a playbook
ansible-playbook site.yml --check            Dry-run mode
ansible-playbook site.yml --diff             Show file differences
ansible-playbook site.yml --tags "deploy"    Run only tasks with the given tag
ansible-playbook site.yml --limit web01      Limit the run to specific hosts
ansible-playbook site.yml -v                 Verbose output
ansible-inventory --graph                    Show the inventory as a tree
ansible-galaxy install role_name             Install a Galaxy role
ansible-vault encrypt file.yml               Encrypt a file
ansible-vault edit file.yml                  Edit an encrypted file
ansible-doc module_name                      Show module documentation
ansible all -m setup                         Gather host facts
ansible-console                              Interactive console


💡 Continuously updated: this document will be refined as hands-on operations experience accumulates. Questions and suggestions are welcome.