The 3 AM Call That Changed Everything
It was 3:17 AM when my phone shattered the silence. Our entire e-commerce platform was down. Black Friday weekend. Millions in revenue hanging in the balance. The cause? A single Ansible playbook that was supposed to apply a minor security patch had instead corrupted every database server in our production cluster.
That night cost us $2.3 million in lost sales and taught me that Ansible, despite its reputation for simplicity, can be devastatingly dangerous when wielded carelessly. Over the past five years, I've witnessed seven recurring patterns of Ansible misuse that have brought down infrastructures, corrupted data, and ended careers.
This is the guide I wish I'd had before that fateful Black Friday. These aren't theoretical scenarios; they're real disasters that happen to real companies every day. Learn from our pain.
Mistake 1: The Idempotency Illusion
THE MYTH: "Ansible is idempotent by nature, so I can run playbooks multiple times safely."
THE REALITY: Idempotency is not automatic. It's a property you must design into your playbooks.
The Disaster Scenario
Consider this seemingly innocent playbook that nearly destroyed a financial services company's trading platform:
---
- name: Update trading application configuration
hosts: trading_servers
become: yes
tasks:
- name: Backup current config
shell: cp /opt/trading/config.xml /opt/trading/config.xml.backup
- name: Update trading limits
lineinfile:
path: /opt/trading/config.xml
line: "<limit>{{ new_trading_limit }}</limit>"
insertafter: "<trading_config>"
- name: Restart trading service
systemd:
name: trading-engine
state: restarted
This playbook was run three times during a deployment. Each execution added another <limit> line to the configuration file, creating invalid XML that crashed the trading system during market hours. The financial impact was catastrophic.
The Correct Approach
Here's how to implement true idempotency:
---
- name: Update trading application configuration (SAFE VERSION)
hosts: trading_servers
become: yes
vars:
config_file: /opt/trading/config.xml
backup_dir: /opt/trading/backups
tasks:
- name: Create backup directory
file:
path: "{{ backup_dir }}"
state: directory
mode: '0755'
- name: Create timestamped backup
copy:
src: "{{ config_file }}"
dest: "{{ backup_dir }}/config.xml.{{ ansible_date_time.epoch }}"
remote_src: yes
changed_when: false
- name: Check if trading limit already exists
xml:
path: "{{ config_file }}"
xpath: "/trading_config/limit"
count: yes
register: existing_limits
- name: Remove existing trading limits
xml:
path: "{{ config_file }}"
xpath: "/trading_config/limit"
state: absent
when: existing_limits.count > 0
- name: Set new trading limit
xml:
path: "{{ config_file }}"
xpath: "/trading_config"
add_children:
- limit: "{{ new_trading_limit }}"
pretty_print: yes
notify: restart_trading_service
- name: Validate configuration file
xml:
path: "{{ config_file }}"
xpath: "/trading_config/limit"
content: text
register: config_validation
failed_when: config_validation.matches[0].limit != new_trading_limit
handlers:
- name: restart_trading_service
systemd:
name: trading-engine
state: restarted
listen: "restart_trading_service"Key Idempotency Principles
- Always check current state before making changes
- Use modules designed for idempotency (xml, lineinfile with regexp, etc.); see the sketch after this list
- Validate the desired state after changes
- Use handlers for actions that should only run when changes occur
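To make the second principle concrete, here is a minimal sketch of an idempotent single-line config change using lineinfile with regexp; the file path, setting name, and service name are placeholders rather than the trading example above.
---
- name: Idempotent config change (illustrative sketch)
  hosts: app_servers
  become: yes
  tasks:
    - name: Ensure max_connections is set to the desired value
      lineinfile:
        path: /etc/example/app.conf          # placeholder path
        regexp: '^max_connections='          # an existing line is replaced, so reruns never append duplicates
        line: "max_connections={{ max_connections | default(100) }}"
      notify: restart_example_app
  handlers:
    - name: restart_example_app
      systemd:
        name: example-app                    # placeholder service name
        state: restarted
Run it twice: the second run reports no change, so the handler (and the service restart) never fires unnecessarily.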
Mistake 2: The Privilege Escalation Trap
THE MYTH: "I'll just run everything as root to avoid permission issues."
THE REALITY: Excessive privileges create security vulnerabilities and mask underlying problems.
The Disaster Scenario
A DevOps engineer at a healthcare company wrote this playbook to deploy a web application:
---
- name: Deploy patient portal application
hosts: web_servers
become: yes
become_user: root
tasks:
- name: Download application package
get_url:
url: "{{ app_download_url }}"
dest: /tmp/app.tar.gz
- name: Extract application
unarchive:
src: /tmp/app.tar.gz
dest: /opt/
remote_src: yes
- name: Set permissions
file:
path: /opt/patient-portal
owner: root
group: root
mode: '0777'
recurse: yes
- name: Start application
shell: |
cd /opt/patient-portal
./start.sh &This playbook created a massive security vulnerability. The application ran as root with world-writable permissions, making it trivial for attackers to gain full system access. When the inevitable breach occurred, attackers had immediate root access to servers containing thousands of patient records.
The Secure Approach
---
- name: Deploy patient portal application (SECURE VERSION)
hosts: web_servers
become: yes
vars:
app_user: patient-portal
app_group: patient-portal
app_home: /opt/patient-portal
app_version: "{{ app_version | default('latest') }}"
tasks:
- name: Create application user
user:
name: "{{ app_user }}"
group: "{{ app_group }}"
home: "{{ app_home }}"
shell: /bin/false
system: yes
create_home: no
- name: Create application directory
file:
path: "{{ app_home }}"
state: directory
owner: "{{ app_user }}"
group: "{{ app_group }}"
mode: '0755'
- name: Download application package
get_url:
url: "{{ app_download_url }}"
dest: "/tmp/app-{{ app_version }}.tar.gz"
mode: '0644'
checksum: "{{ app_checksum }}"
become: no
delegate_to: localhost
run_once: true
- name: Copy application package to servers
copy:
src: "/tmp/app-{{ app_version }}.tar.gz"
dest: "/tmp/app-{{ app_version }}.tar.gz"
owner: "{{ app_user }}"
group: "{{ app_group }}"
mode: '0644'
- name: Extract application
unarchive:
src: "/tmp/app-{{ app_version }}.tar.gz"
dest: "{{ app_home }}"
remote_src: yes
owner: "{{ app_user }}"
group: "{{ app_group }}"
creates: "{{ app_home }}/app.py"
- name: Set secure permissions on application files
file:
path: "{{ app_home }}"
owner: "{{ app_user }}"
group: "{{ app_group }}"
mode: '0755'
recurse: yes
- name: Set executable permissions on startup script
file:
path: "{{ app_home }}/start.sh"
mode: '0755'
- name: Create systemd service file
template:
src: patient-portal.service.j2
dest: /etc/systemd/system/patient-portal.service
mode: '0644'
notify: reload_systemd
- name: Enable and start patient portal service
systemd:
name: patient-portal
enabled: yes
state: started
daemon_reload: yes
- name: Clean up temporary files
file:
path: "/tmp/app-{{ app_version }}.tar.gz"
state: absent
handlers:
- name: reload_systemd
systemd:
daemon_reload: yes
The Systemd Service Template
# templates/patient-portal.service.j2
[Unit]
Description=Patient Portal Application
After=network.target
[Service]
Type=simple
User={{ app_user }}
Group={{ app_group }}
WorkingDirectory={{ app_home }}
ExecStart={{ app_home }}/start.sh
Restart=always
RestartSec=10
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=patient-portal
# Security settings
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths={{ app_home }}/logs {{ app_home }}/tmp
PrivateTmp=true
[Install]
WantedBy=multi-user.target
Mistake 3: The Variable Injection Vulnerability
THE MYTH: "User input is automatically sanitized in Ansible."
THE REALITY: Ansible can execute arbitrary commands if variables aren't properly validated.
The Disaster Scenario
A CI/CD pipeline used this playbook to deploy applications based on user input:
---
- name: Deploy user application
hosts: "{{ target_environment }}"
become: yes
tasks:
- name: Create deployment directory
file:
path: "/opt/{{ app_name }}"
state: directory
- name: Deploy application
shell: |
cd /opt/{{ app_name }}
wget {{ app_url }}
{{ deployment_commands }}
An attacker submitted a deployment request with:
app_name: "../../../etc"
deployment_commands: "rm -rf / --no-preserve-root"
The result was complete system destruction across multiple servers.
The Safe Approach
---
- name: Deploy user application (SECURE VERSION)
hosts: "{{ target_environment }}"
become: yes
vars:
allowed_environments:
- development
- staging
- production
app_name_pattern: '^[a-zA-Z0-9][a-zA-Z0-9_-]*[a-zA-Z0-9]$'
max_app_name_length: 50
allowed_deployment_commands:
- start
- stop
- restart
- status
pre_tasks:
- name: Validate target environment
fail:
msg: "Invalid target environment: {{ target_environment }}"
when: target_environment not in allowed_environments
- name: Validate application name format
fail:
msg: "Invalid application name format: {{ app_name }}"
when: (app_name | length > max_app_name_length) or (not (app_name | regex_search(app_name_pattern)))
- name: Validate application URL
fail:
msg: "Invalid application URL: {{ app_url }}"
when: not (app_url | regex_search('^https?://[a-zA-Z0-9.-]+/.*$'))
- name: Validate deployment commands
fail:
msg: "Invalid deployment command: {{ item }}"
when: item not in allowed_deployment_commands
loop: "{{ deployment_commands.split(',') }}"
tasks:
- name: Create secure deployment directory
file:
path: "/opt/deployments/{{ app_name }}"
state: directory
owner: deploy
group: deploy
mode: '0755'
- name: Download application with validation
get_url:
url: "{{ app_url }}"
dest: "/opt/deployments/{{ app_name }}/app.tar.gz"
owner: deploy
group: deploy
mode: '0644'
timeout: 30
validate_certs: yes
- name: Execute validated deployment commands
systemd:
name: "{{ app_name }}"
state: "{{ {'start': 'started', 'stop': 'stopped', 'restart': 'restarted'}[item] }}"
loop: "{{ deployment_commands.split(',') }}"
when: item in ['start', 'stop', 'restart']
become_user: deploy
Input Validation Best Practices
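These rules are easiest to apply consistently when they live in one shared vars file (an example follows this sketch) and are enforced by a generic assert task. A minimal, illustrative version; the rule keys mirror the file below and the surrounding playbook wiring is assumed:
- name: Enforce input validation rules for the application name
  assert:
    that:
      - app_name | length >= input_validation_rules.app_name.min_length
      - app_name | length <= input_validation_rules.app_name.max_length
      - app_name is regex(input_validation_rules.app_name.pattern)
    fail_msg: "app_name '{{ app_name }}' violates the validation rules"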
# vars/validation.yml
input_validation_rules:
app_name:
pattern: '^[a-zA-Z0-9][a-zA-Z0-9_-]*[a-zA-Z0-9]$'
max_length: 50
min_length: 3
version:
pattern: '^v?[0-9]+\.[0-9]+\.[0-9]+$'
environment:
allowed_values:
- dev
- staging
- prod
file_path:
pattern: '^[a-zA-Z0-9/_.-]+$'
forbidden_patterns:
- '\.\.'
- '/etc'
- '/bin'
- '/usr/bin'
Mistake 4: The Secret Exposure Epidemic
THE MYTH: "My secrets are safe in group_vars files."
THE REALITY: Unencrypted secrets in version control are a ticking time bomb.
The Disaster Scenario
This configuration file was committed to a public GitHub repository:
# group_vars/production.yml
database_password: SuperSecret123!
api_key: sk-1234567890abcdef
aws_secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
ssl_private_key: |
-----BEGIN PRIVATE KEY-----
MIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwggSjAgEAAoIBAQC...
-----END PRIVATE KEY-----
Within hours of the commit, automated bots had harvested these credentials. The attackers gained full access to the production database, AWS infrastructure, and SSL certificates. The breach affected 100,000 customer accounts and resulted in $5 million in damages.
The Secure Approach
# group_vars/production.yml (ENCRYPTED VERSION)
database_password: !vault |
$ANSIBLE_VAULT;1.1;AES256
66386439653236336464616464376131643938346438336465376631323430643835316566396438
3962616330366436373536366636613731373135353261630a373939316136306265313534643630
65366434343139303937646366346134633565333332616666636536373938663065303136636261
3464373762643065660a376330363362653537636237643936393539323139396664663864376336
32
api_key: !vault |
$ANSIBLE_VAULT;1.1;AES256
63353361656262353661383539366262356239643239323037623131363264623638353936656665
3764656264623334356435316664656438633035313535610a643861663464666265373138393836
32366139656334666635353231613831643163646533616234313730363539396133303139333232
3662363863333062640a373762616265323863643565646164333236313833316437393966643938
6338
# Use vault file for sensitive configurations
vault_ssl_cert: "{{ vault_ssl_certificate }}"
vault_ssl_key: "{{ vault_ssl_private_key }}"
Comprehensive Secrets Management
---
- name: Secure secrets management example
hosts: production
become: yes
vars:
secrets_dir: /etc/app/secrets
tasks:
- name: Create secure secrets directory
file:
path: "{{ secrets_dir }}"
state: directory
owner: root
group: app
mode: '0750'
- name: Deploy database credentials
template:
src: database.conf.j2
dest: "{{ secrets_dir }}/database.conf"
owner: root
group: app
mode: '0640'
vars:
db_host: "{{ vault_database_host }}"
db_user: "{{ vault_database_user }}"
db_password: "{{ vault_database_password }}"
no_log: true
- name: Deploy API configuration
template:
src: api.conf.j2
dest: "{{ secrets_dir }}/api.conf"
owner: root
group: app
mode: '0640'
vars:
api_endpoint: "{{ vault_api_endpoint }}"
api_key: "{{ vault_api_key }}"
api_secret: "{{ vault_api_secret }}"
no_log: true
- name: Set up SSL certificates
copy:
content: "{{ item.content }}"
dest: "{{ item.dest }}"
owner: root
group: ssl-cert
mode: "{{ item.mode }}"
loop:
- content: "{{ vault_ssl_certificate }}"
dest: "/etc/ssl/certs/{{ inventory_hostname }}.crt"
mode: '0644'
- content: "{{ vault_ssl_private_key }}"
dest: "/etc/ssl/private/{{ inventory_hostname }}.key"
mode: '0600'
no_log: true
- name: Verify secrets are not logged
debug:
msg: "Configuration deployed successfully"
when: ansible_verbosity < 3
Vault Management Script
#!/bin/bash
# scripts/manage_vault.sh
VAULT_FILE="group_vars/vault.yml"
VAULT_PASSWORD_FILE=".vault_password"
case "$1" in
"encrypt")
ansible-vault encrypt "$VAULT_FILE" --vault-password-file "$VAULT_PASSWORD_FILE"
;;
"decrypt")
ansible-vault decrypt "$VAULT_FILE" --vault-password-file "$VAULT_PASSWORD_FILE"
;;
"edit")
ansible-vault edit "$VAULT_FILE" --vault-password-file "$VAULT_PASSWORD_FILE"
;;
"rotate")
ansible-vault rekey "$VAULT_FILE" --vault-password-file "$VAULT_PASSWORD_FILE"
;;
*)
echo "Usage: $0 {encrypt|decrypt|edit|rotate}"
exit 1
;;
esac
Mistake 5: The Rollback Nightmare
THE MYTH: "I can just revert my Git commit if something goes wrong."
THE REALITY: Infrastructure changes aren't automatically reversible like code deployments.
The Disaster Scenario
This playbook was used to upgrade a critical database cluster:
---
- name: Upgrade database cluster
hosts: db_servers
serial: 1
tasks:
- name: Stop database service
systemd:
name: postgresql
state: stopped
- name: Upgrade PostgreSQL
package:
name: postgresql-13
state: latest
- name: Migrate data
shell: |
pg_upgrade \
--old-datadir=/var/lib/postgresql/12/main \
--new-datadir=/var/lib/postgresql/13/main \
--old-bindir=/usr/lib/postgresql/12/bin \
--new-bindir=/usr/lib/postgresql/13/bin
- name: Start new database service
systemd:
name: postgresql
state: started
The upgrade failed halfway through the cluster. The playbook had no rollback mechanism, leaving half the cluster on PostgreSQL 12 and half on PostgreSQL 13, with corrupted replication. The service was down for 18 hours while engineers manually recovered each server.
The Bulletproof Approach
---
- name: Upgrade database cluster with rollback capability
hosts: db_servers
serial: 1
vars:
old_pg_version: "12"
new_pg_version: "13"
backup_dir: "/backup/pg_upgrade_{{ ansible_date_time.epoch }}"
max_downtime_minutes: 30
pre_tasks:
- name: Create backup directory
file:
path: "{{ backup_dir }}"
state: directory
mode: '0700'
- name: Check available disk space
shell: df -m /var/lib/postgresql | awk 'NR==2 {print $4}'
register: available_space
- name: Verify sufficient space for backup
fail:
msg: "Insufficient disk space for backup (need at least 10 GB free)"
when: available_space.stdout | int < 10240
- name: Create pre-upgrade backup
postgresql_db:
name: "{{ item }}"
state: dump
target: "{{ backup_dir }}/{{ item }}_pre_upgrade.sql"
loop: "{{ databases }}"
become_user: postgres
tasks:
- name: Record upgrade start time
set_fact:
upgrade_start_time: "{{ ansible_date_time.epoch }}"
- name: Create rollback script
template:
src: rollback_script.sh.j2
dest: "{{ backup_dir }}/rollback.sh"
mode: '0755'
- name: Stop application services
systemd:
name: "{{ item }}"
state: stopped
loop: "{{ dependent_services }}"
- name: Backup current PostgreSQL data
archive:
path: "/var/lib/postgresql/{{ old_pg_version }}"
dest: "{{ backup_dir }}/postgresql_{{ old_pg_version }}_data.tar.gz"
format: gz
- name: Stop PostgreSQL service
systemd:
name: postgresql
state: stopped
- name: Install new PostgreSQL version
package:
name: "postgresql-{{ new_pg_version }}"
state: present
- name: Initialize new PostgreSQL cluster
command: |
/usr/lib/postgresql/{{ new_pg_version }}/bin/initdb \
-D /var/lib/postgresql/{{ new_pg_version }}/main \
--auth-local=peer --auth-host=md5
become_user: postgres
args:
creates: "/var/lib/postgresql/{{ new_pg_version }}/main/PG_VERSION"
- name: Run pg_upgrade with compatibility check
command: |
/usr/lib/postgresql/{{ new_pg_version }}/bin/pg_upgrade \
--old-datadir=/var/lib/postgresql/{{ old_pg_version }}/main \
--new-datadir=/var/lib/postgresql/{{ new_pg_version }}/main \
--old-bindir=/usr/lib/postgresql/{{ old_pg_version }}/bin \
--new-bindir=/usr/lib/postgresql/{{ new_pg_version }}/bin \
--check
become_user: postgres
register: pg_upgrade_check
- name: Fail if compatibility check fails
fail:
msg: "PostgreSQL upgrade compatibility check failed: {{ pg_upgrade_check.stderr }}"
when: pg_upgrade_check.rc != 0
- name: Perform actual upgrade
command: |
/usr/lib/postgresql/{{ new_pg_version }}/bin/pg_upgrade \
--old-datadir=/var/lib/postgresql/{{ old_pg_version }}/main \
--new-datadir=/var/lib/postgresql/{{ new_pg_version }}/main \
--old-bindir=/usr/lib/postgresql/{{ old_pg_version }}/bin \
--new-bindir=/usr/lib/postgresql/{{ new_pg_version }}/bin
become_user: postgres
register: pg_upgrade_result
- name: Check if upgrade exceeded maximum downtime
set_fact:
upgrade_duration: "{{ (ansible_date_time.epoch | int - upgrade_start_time | int) / 60 }}"
- name: Trigger rollback if upgrade took too long
include_tasks: rollback_tasks.yml
when: upgrade_duration | int > max_downtime_minutes
- name: Update PostgreSQL configuration
template:
src: postgresql.conf.j2
dest: "/etc/postgresql/{{ new_pg_version }}/main/postgresql.conf"
backup: yes
notify: restart_postgresql
- name: Start PostgreSQL service
systemd:
name: postgresql
state: started
enabled: yes
- name: Verify database connectivity
postgresql_ping:
db: postgres
become_user: postgres
register: db_ping
retries: 5
delay: 10
- name: Verify all databases are accessible
postgresql_query:
db: "{{ item }}"
query: "SELECT version();"
loop: "{{ databases }}"
become_user: postgres
- name: Start dependent services
systemd:
name: "{{ item }}"
state: started
loop: "{{ dependent_services }}"
- name: Run post-upgrade statistics update
postgresql_query:
db: "{{ item }}"
query: "ANALYZE;"
loop: "{{ databases }}"
become_user: postgres
rescue:
- name: Execute emergency rollback
include_tasks: rollback_tasks.yml
handlers:
- name: restart_postgresql
systemd:
name: postgresql
state: restarted
Rollback Tasks
# tasks/rollback_tasks.yml
---
- name: Stop PostgreSQL service for rollback
systemd:
name: postgresql
state: stopped
- name: Remove failed upgrade data
file:
path: "/var/lib/postgresql/{{ new_pg_version }}"
state: absent
- name: Restore original data from backup
unarchive:
src: "{{ backup_dir }}/postgresql_{{ old_pg_version }}_data.tar.gz"
dest: "/var/lib/postgresql/"
remote_src: yes
owner: postgres
group: postgres
- name: Start original PostgreSQL service
systemd:
name: postgresql
state: started
- name: Verify rollback successful
postgresql_ping:
db: postgres
become_user: postgres
- name: Restore databases from backup if needed
postgresql_db:
name: "{{ item }}"
state: restore
target: "{{ backup_dir }}/{{ item }}_pre_upgrade.sql"
loop: "{{ databases }}"
become_user: postgres
when: restore_from_backup | default(false)
- name: Send rollback notification
mail:
to: "{{ ops_team_email }}"
subject: "CRITICAL: Database upgrade rollback executed"
body: |
Database upgrade failed and rollback was executed on {{ inventory_hostname }}.
Rollback completed at: {{ ansible_date_time.iso8601 }}
Backup location: {{ backup_dir }}
Manual verification required before resuming operations.
Mistake 6: The Concurrency Catastrophe
THE MYTH: "Running playbooks in parallel will speed up deployments."
THE REALITY: Uncontrolled concurrency can cause race conditions and resource conflicts.
The Disaster Scenario
This playbook was designed to quickly scale up a web application cluster:
---
- name: Scale up web application cluster
hosts: web_servers
strategy: free
tasks:
- name: Download application update
get_url:
url: "{{ app_update_url }}"
dest: /tmp/app-update.tar.gz
- name: Update shared configuration
lineinfile:
path: /shared/nfs/config/app.conf
line: "version={{ new_version }}"
regexp: "^version="
- name: Deploy application
unarchive:
src: /tmp/app-update.tar.gz
dest: /opt/webapp/
remote_src: yes
- name: Restart application
systemd:
name: webapp
state: restarted
All 50 servers executed simultaneously, creating a race condition on the shared NFS configuration file. Multiple servers corrupted the file while trying to write to it concurrently. The application failed to start on any server, causing a complete service outage.
The Coordinated Approach
---
- name: Scale up web application cluster (SAFE VERSION)
hosts: web_servers
serial: 5 # Process 5 servers at a time
vars:
app_version: "{{ new_version }}"
deployment_lock_file: "/shared/nfs/locks/deployment.lock"
max_concurrent_deployments: 3
pre_tasks:
- name: Check if deployment is already in progress
stat:
path: "{{ deployment_lock_file }}"
register: deployment_lock
delegate_to: "{{ groups['web_servers'][0] }}"
run_once: true
- name: Wait for existing deployment to complete
wait_for:
path: "{{ deployment_lock_file }}"
state: absent
timeout: 1800 # 30 minutes
delegate_to: "{{ groups['web_servers'][0] }}"
run_once: true
when: deployment_lock.stat.exists
- name: Create deployment lock
copy:
content: |
Deployment started: {{ ansible_date_time.iso8601 }}
Initiated by: {{ ansible_user_id }}
Version: {{ app_version }}
Servers: {{ ansible_play_hosts | join(',') }}
dest: "{{ deployment_lock_file }}"
delegate_to: "{{ groups['web_servers'][0] }}"
run_once: true
tasks:
- name: Create server-specific temporary directory
tempfile:
state: directory
prefix: "deploy_{{ inventory_hostname }}_"
register: temp_deploy_dir
- name: Download application update to temporary location
get_url:
url: "{{ app_update_url }}"
dest: "{{ temp_deploy_dir.path }}/app-update.tar.gz"
checksum: "{{ app_update_checksum }}"
timeout: 300
- name: Acquire configuration update lock
lockfile:
path: "/shared/nfs/locks/config_update_{{ inventory_hostname }}.lock"
timeout: 300
register: config_lock
- name: Update shared configuration atomically
block:
- name: Create temporary config file
copy:
src: /shared/nfs/config/app.conf
dest: "{{ temp_deploy_dir.path }}/app.conf.tmp"
remote_src: yes
- name: Update version in temporary config
lineinfile:
path: "{{ temp_deploy_dir.path }}/app.conf.tmp"
line: "version={{ app_version }}"
regexp: "^version="
- name: Validate configuration syntax
command: /opt/webapp/bin/validate-config "{{ temp_deploy_dir.path }}/app.conf.tmp"
register: config_validation
- name: Atomically replace configuration file
copy:
src: "{{ temp_deploy_dir.path }}/app.conf.tmp"
dest: /shared/nfs/config/app.conf
remote_src: yes
backup: yes
always:
- name: Release configuration update lock
file:
path: "/shared/nfs/locks/config_update_{{ inventory_hostname }}.lock"
state: absent
- name: Stop application service
systemd:
name: webapp
state: stopped
- name: Backup current application
archive:
path: /opt/webapp
dest: "{{ temp_deploy_dir.path }}/webapp_backup_{{ ansible_date_time.epoch }}.tar.gz"
format: gz
- name: Deploy new application version
unarchive:
src: "{{ temp_deploy_dir.path }}/app-update.tar.gz"
dest: /opt/webapp/
remote_src: yes
owner: webapp
group: webapp
backup: yes
- name: Validate application deployment
stat:
path: /opt/webapp/bin/webapp
register: app_binary
- name: Fail if application binary missing
fail:
msg: "Application binary not found after deployment"
when: not app_binary.stat.exists
- name: Start application service
systemd:
name: webapp
state: started
- name: Wait for application to be ready
uri:
url: "http://{{ ansible_default_ipv4.address }}:8080/health"
method: GET
status_code: 200
register: health_check
retries: 30
delay: 10
- name: Clean up temporary files
file:
path: "{{ temp_deploy_dir.path }}"
state: absent
post_tasks:
- name: Remove deployment lock
file:
path: "{{ deployment_lock_file }}"
state: absent
delegate_to: "{{ groups['web_servers'][0] }}"
run_once: true
- name: Record successful deployment
lineinfile:
path: /var/log/deployments.log
line: "{{ ansible_date_time.iso8601 }} - {{ inventory_hostname }} - {{ app_version }} - SUCCESS"
create: yes
delegate_to: localhost
rescue:
- name: Rollback on failure
block:
- name: Stop failed application
systemd:
name: webapp
state: stopped
- name: Restore previous version
unarchive:
src: "{{ temp_deploy_dir.path }}/webapp_backup_{{ ansible_date_time.epoch }}.tar.gz"
dest: /opt/
remote_src: yes
- name: Start restored application
systemd:
name: webapp
state: started
- name: Record failed deployment
lineinfile:
path: /var/log/deployments.log
line: "{{ ansible_date_time.iso8601 }} - {{ inventory_hostname }} - {{ app_version }} - FAILED_ROLLBACK"
create: yes
delegate_to: localhost
always:
- name: Clean up on failure
file:
path: "{{ temp_deploy_dir.path }}"
state: absent
- name: Remove deployment lock on failure
file:
path: "{{ deployment_lock_file }}"
state: absent
delegate_to: "{{ groups['web_servers'][0] }}"
Advanced Concurrency Control
# roles/deployment_controller/tasks/main.yml
---
- name: Implement distributed deployment coordination
hosts: localhost
vars:
redis_host: "{{ coordination_redis_host }}"
deployment_id: "{{ deployment_name }}_{{ ansible_date_time.epoch }}"
max_parallel_deployments: 5
tasks:
- name: Register deployment intent
uri:
url: "http://{{ coordination_service }}/api/deployments"
method: POST
body_format: json
body:
deployment_id: "{{ deployment_id }}"
target_hosts: "{{ ansible_play_hosts }}"
max_parallel: "{{ max_parallel_deployments }}"
timeout: 3600
status_code: [200, 201]
register: deployment_registration
- name: Wait for deployment slot
uri:
url: "http://{{ coordination_service }}/api/deployments/{{ deployment_id }}/wait"
method: GET
register: deployment_slot
retries: 60
delay: 30
until: deployment_slot.json.status == "ready"
- name: Execute coordinated deployment
include_tasks: coordinated_deploy.yml
vars:
deployment_token: "{{ deployment_slot.json.token }}"
assigned_batch: "{{ deployment_slot.json.batch_id }}"
Mistake 7: The Monitoring Blind Spot
THE MYTH: "If the playbook completes successfully, everything is fine."
THE REALITY: Ansible success doesn't guarantee application health or performance.
The Disaster Scenario
This playbook was used to deploy a critical payment processing service:
---
- name: Deploy payment processing service
hosts: payment_servers
tasks:
- name: Deploy new payment processor
copy:
src: payment-processor-v2.jar
dest: /opt/payment/payment-processor.jar
- name: Restart payment service
systemd:
name: payment-processor
state: restarted
- name: Verify service is running
systemd:
name: payment-processor
state: started
register: service_status
- name: Report deployment success
debug:
msg: "Payment processor deployed successfully"
when: service_status.status.ActiveState == "active"
The playbook reported success, but the new version had a critical bug that caused payment failures. The service appeared healthy to system monitoring, but was silently dropping 30% of transactions. The issue went undetected for hours, resulting in massive revenue loss and customer complaints.
The Comprehensive Monitoring Approach
---
- name: Deploy payment processing service with comprehensive monitoring
hosts: payment_servers
vars:
health_check_timeout: 300
performance_baseline_file: "/opt/monitoring/payment_baseline.json"
alert_thresholds:
max_response_time: 500 # milliseconds
min_success_rate: 99.5 # percentage
max_error_rate: 0.5 # percentage
pre_tasks:
- name: Capture pre-deployment baseline
uri:
url: "http://{{ ansible_default_ipv4.address }}:8080/metrics"
method: GET
return_content: yes
register: pre_deployment_metrics
ignore_errors: yes
- name: Store baseline metrics
copy:
content: "{{ pre_deployment_metrics.content }}"
dest: "{{ performance_baseline_file }}.pre"
when: pre_deployment_metrics.status == 200
tasks:
- name: Create deployment manifest
template:
src: deployment_manifest.j2
dest: "/opt/payment/deployment_manifest.json"
mode: '0644'
vars:
deployment_time: "{{ ansible_date_time.iso8601 }}"
version: "{{ payment_processor_version }}"
deployed_by: "{{ ansible_user_id }}"
- name: Backup current payment processor
copy:
src: /opt/payment/payment-processor.jar
dest: "/opt/payment/backups/payment-processor-{{ ansible_date_time.epoch }}.jar"
remote_src: yes
- name: Deploy new payment processor
copy:
src: "payment-processor-{{ payment_processor_version }}.jar"
dest: /opt/payment/payment-processor.jar
owner: payment
group: payment
mode: '0755'
backup: yes
notify: restart_payment_service
- name: Wait for service restart
meta: flush_handlers
- name: Verify service process is running
command: pgrep -f payment-processor.jar
register: process_check
retries: 10
delay: 5
until: process_check.rc == 0
- name: Wait for application initialization
wait_for:
port: 8080
host: "{{ ansible_default_ipv4.address }}"
timeout: "{{ health_check_timeout }}"
- name: Perform basic health check
uri:
url: "http://{{ ansible_default_ipv4.address }}:8080/health"
method: GET
status_code: 200
register: basic_health_check
retries: 20
delay: 15
- name: Perform comprehensive application testing
include_tasks: payment_integration_tests.yml
vars:
test_timeout: 180
- name: Monitor post-deployment metrics
uri:
url: "http://{{ ansible_default_ipv4.address }}:8080/metrics"
method: GET
return_content: yes
register: post_deployment_metrics
retries: 5
delay: 30
- name: Analyze performance regression
script: analyze_performance_metrics.py
args:
- "{{ performance_baseline_file }}.pre"
- "{{ post_deployment_metrics.content }}"
register: performance_analysis
delegate_to: localhost
- name: Fail deployment if performance regression detected
fail:
msg: "Deployment failed performance checks: {{ performance_analysis.stdout }}"
when:
- performance_analysis.rc != 0
- not ignore_performance_regression | default(false)
- name: Setup continuous monitoring
template:
src: payment_monitor.py.j2
dest: /opt/monitoring/payment_monitor.py
mode: '0755'
notify: restart_monitoring_service
- name: Configure alerting rules
template:
src: payment_alerts.yml.j2
dest: /etc/prometheus/rules/payment_alerts.yml
mode: '0644'
notify: reload_prometheus
- name: Verify end-to-end transaction flow
include_tasks: e2e_transaction_test.yml
vars:
test_transaction_amount: 1.00
expected_response_time: "{{ alert_thresholds.max_response_time }}"
post_tasks:
- name: Record deployment in audit log
uri:
url: "{{ audit_service_url }}/api/deployments"
method: POST
body_format: json
body:
service: "payment-processor"
version: "{{ payment_processor_version }}"
host: "{{ inventory_hostname }}"
status: "success"
deployment_time: "{{ ansible_date_time.iso8601 }}"
health_checks_passed: "{{ health_check_results | default([]) | length }}"
delegate_to: localhost
- name: Send deployment notification
slack:
token: "{{ slack_token }}"
msg: |
Payment Processor Deployment Successful
Host: {{ inventory_hostname }}
Version: {{ payment_processor_version }}
Health Status: All checks passed
Performance: Within acceptable thresholds
Deployment completed at {{ ansible_date_time.iso8601 }}
channel: "#payments-ops"
delegate_to: localhost
rescue:
- name: Execute emergency rollback
block:
- name: Stop failed service
systemd:
name: payment-processor
state: stopped
- name: Restore previous version
copy:
src: "/opt/payment/backups/payment-processor-{{ ansible_date_time.epoch }}.jar"
dest: /opt/payment/payment-processor.jar
remote_src: yes
- name: Start restored service
systemd:
name: payment-processor
state: started
- name: Verify rollback successful
uri:
url: "http://{{ ansible_default_ipv4.address }}:8080/health"
method: GET
status_code: 200
retries: 10
delay: 10
- name: Send rollback notification
slack:
token: "{{ slack_token }}"
msg: |
CRITICAL: Payment Processor Deployment Failed - Rollback Executed
Host: {{ inventory_hostname }}
Failed Version: {{ payment_processor_version }}
Rollback Status: {{ rollback_status | default('In Progress') }}
Immediate attention required!
channel: "#payments-critical"
delegate_to: localhost
handlers:
- name: restart_payment_service
systemd:
name: payment-processor
state: restarted
daemon_reload: yes
- name: restart_monitoring_service
systemd:
name: payment-monitor
state: restarted
- name: reload_prometheus
systemd:
name: prometheus
state: reloaded
Integration Test Suite
# tasks/payment_integration_tests.yml
---
- name: Execute payment processing integration tests
vars:
test_results: []
block:
- name: Test credit card payment processing
uri:
url: "http://{{ ansible_default_ipv4.address }}:8080/api/payments"
method: POST
body_format: json
body:
type: "credit_card"
amount: 10.00
currency: "USD"
card_number: "4111111111111111" # Test card
expiry: "12/25"
cvv: "123"
status_code: 200
timeout: "{{ alert_thresholds.max_response_time / 1000 }}"
register: cc_test_result
- name: Validate credit card response
set_fact:
test_results: "{{ test_results + ['credit_card_test: PASS'] }}"
when:
- cc_test_result.json.status == "approved"
- cc_test_result.json.transaction_id is defined
- name: Test ACH payment processing
uri:
url: "http://{{ ansible_default_ipv4.address }}:8080/api/payments"
method: POST
body_format: json
body:
type: "ach"
amount: 25.00
currency: "USD"
routing_number: "021000021" # Test routing
account_number: "1234567890" # Test account
status_code: 200
timeout: "{{ alert_thresholds.max_response_time / 1000 }}"
register: ach_test_result
- name: Validate ACH response
set_fact:
test_results: "{{ test_results + ['ach_test: PASS'] }}"
when:
- ach_test_result.json.status == "pending"
- ach_test_result.json.transaction_id is defined
- name: Test payment refund functionality
uri:
url: "http://{{ ansible_default_ipv4.address }}:8080/api/refunds"
method: POST
body_format: json
body:
transaction_id: "{{ cc_test_result.json.transaction_id }}"
amount: 5.00
reason: "integration_test"
status_code: 200
register: refund_test_result
when: cc_test_result.json.transaction_id is defined
- name: Test fraud detection integration
uri:
url: "http://{{ ansible_default_ipv4.address }}:8080/api/payments"
method: POST
body_format: json
body:
type: "credit_card"
amount: 9999.99 # Triggers fraud detection
currency: "USD"
card_number: "4000000000000002" # Test fraud card
expiry: "12/25"
cvv: "123"
status_code: 403
register: fraud_test_result
- name: Validate fraud detection
set_fact:
test_results: "{{ test_results + ['fraud_detection: PASS'] }}"
when: fraud_test_result.json.status == "declined"
- name: Verify all critical tests passed
fail:
msg: "Integration tests failed: {{ test_results }}"
when: test_results | length < 3
- name: Record integration test results
set_fact:
health_check_results: "{{ test_results }}"
Performance Analysis Script
#!/usr/bin/env python3
# scripts/analyze_performance_metrics.py
import json
import sys
from datetime import datetime
def analyze_metrics(baseline_file, current_metrics):
try:
with open(baseline_file, 'r') as f:
baseline = json.loads(f.read())
current = json.loads(current_metrics)
# Define critical performance metrics
critical_metrics = {
'avg_response_time': {'threshold': 0.2, 'direction': 'lower'},
'success_rate': {'threshold': 0.05, 'direction': 'higher'},
'error_rate': {'threshold': 0.01, 'direction': 'lower'},
'throughput': {'threshold': 0.1, 'direction': 'higher'}
}
issues = []
for metric, config in critical_metrics.items():
if metric in baseline and metric in current:
baseline_val = float(baseline[metric])
current_val = float(current[metric])
if config['direction'] == 'lower':
regression = (current_val - baseline_val) / baseline_val
if regression > config['threshold']:
issues.append(f"{metric}: {regression:.2%} increase (threshold: {config['threshold']:.1%})")
else: # direction == 'higher'
regression = (baseline_val - current_val) / baseline_val
if regression > config['threshold']:
issues.append(f"{metric}: {regression:.2%} decrease (threshold: {config['threshold']:.1%})")
if issues:
print(f"Performance regressions detected: {'; '.join(issues)}")
sys.exit(1)
else:
print("Performance analysis passed - no significant regressions detected")
sys.exit(0)
except Exception as e:
print(f"Error analyzing performance metrics: {str(e)}")
sys.exit(2)
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: analyze_performance_metrics.py <baseline_file> <current_metrics_json>")
sys.exit(1)
analyze_metrics(sys.argv[1], sys.argv[2])
The Recovery Playbook: When Everything Goes Wrong
Even with all precautions, disasters still happen. Here's the emergency response playbook I use when systems are down and executives are breathing down my neck:
---
- name: Emergency Infrastructure Recovery
hosts: "{{ target_hosts | default('all') }}"
gather_facts: no
vars:
recovery_mode: "{{ recovery_mode | default('conservative') }}"
max_recovery_time: "{{ max_recovery_time | default(1800) }}" # 30 minutes
pre_tasks:
- name: Record recovery start time
set_fact:
recovery_start: "{{ lookup('pipe', 'date +%s') }}"  # facts are not gathered, so take the timestamp from the control node
- name: Create emergency backup
include_tasks: emergency_backup.yml
when: recovery_mode != "aggressive"
tasks:
- name: Stop all non-essential services
systemd:
name: "{{ item }}"
state: stopped
loop: "{{ non_essential_services }}"
ignore_errors: yes
- name: Check system resources
shell: |
echo "CPU: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)"
echo "Memory: $(free | grep Mem | awk '{printf "%.1f", $3/$2 * 100.0}')"
echo "Disk: $(df -h / | awk 'NR==2{printf "%s", $5}')"
register: resource_check
- name: Apply emergency fixes based on common issues
include_tasks: "emergency_fix_{{ ansible_os_family | lower }}.yml"
- name: Restart critical services in order
systemd:
name: "{{ item }}"
state: restarted
loop: "{{ critical_services_order }}"
register: service_restart
- name: Verify system recovery
include_tasks: system_health_check.yml
- name: Calculate recovery time
set_fact:
recovery_duration: "{{ ((lookup('pipe', 'date +%s') | int) - (recovery_start | int)) / 60 }}"
- name: Send recovery notification
mail:
to: "{{ emergency_contacts }}"
subject: "System Recovery {{ 'COMPLETED' if recovery_successful else 'FAILED' }}"
body: |
Emergency recovery {{ 'completed successfully' if recovery_successful else 'failed' }}
Recovery time: {{ recovery_duration }} minutes
Affected hosts: {{ ansible_play_hosts | join(', ') }}
{{ recovery_summary | default('No additional details') }}
vars:
recovery_successful: "{{ service_restart is succeeded }}"
Lessons from the Trenches
After five years of Ansible disasters and recoveries, here are the non-negotiable principles I follow:
1. Always Test in Production-Like Environments
Your staging environment should be identical to production. Not similar. Identical.
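A lightweight way to enforce this is a pre-flight assertion that fails the run when a host drifts from the pinned baseline; the baseline variable names and values here are assumptions:
- name: Fail fast if the environment drifts from the pinned baseline
  assert:
    that:
      - ansible_distribution == baseline_distribution
      - ansible_distribution_version == baseline_distribution_version
    fail_msg: "{{ inventory_hostname }} runs {{ ansible_distribution }} {{ ansible_distribution_version }}, expected {{ baseline_distribution }} {{ baseline_distribution_version }}"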
2. Implement Circuit Breakers
- name: Circuit breaker pattern
fail:
msg: "Too many failures detected - stopping deployment"
when: failed_deployments | length > max_allowed_failures
3. Use Canary Deployments
Never deploy to all servers simultaneously. Start with one server, validate, then gradually roll out.
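In Ansible terms, that usually means a batched rollout with a hard stop on failure. A sketch of the relevant play keywords; the batch sizes are illustrative:
- name: Roll out in canary batches
  hosts: web_servers
  serial:
    - 1        # canary host first
    - "25%"    # then a quarter of the fleet
    - "100%"   # then the rest
  max_fail_percentage: 0   # abort the remaining batches if any host in a batch fails
  tasks:
    - name: Deploy and validate (real deploy and health-check tasks go here)
      debug:
        msg: "deploy and health-check steps"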
4. Monitor Everything
If you can't measure it, you can't manage it. Monitor not just system metrics, but business metrics too.
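For example, a post-deployment check can assert on a business metric rather than just process state; the metrics endpoint and threshold below are assumptions:
- name: Fetch business metrics after the deployment
  uri:
    url: "http://{{ inventory_hostname }}:8080/metrics/business"   # hypothetical endpoint
    return_content: yes
  register: business_metrics
- name: Fail the rollout if the payment success rate dropped
  assert:
    that:
      - (business_metrics.json.payment_success_rate | float) >= 99.5
    fail_msg: "Payment success rate below threshold - stop the rollout and investigate"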
5. Practice Disaster Recovery
Run chaos engineering exercises. Break things intentionally when you have time to fix them properly.
The Path Forward
Ansible is incredibly powerful, but with great power comes great responsibility. The mistakes I've outlined here have collectively cost companies millions of dollars and countless hours of downtime. Learn from these failures instead of repeating them.
The next time you write an Ansible playbook, ask yourself:
- What happens if this runs twice?
- What permissions does this actually need?
- How do I validate user input?
- Where are my secrets stored?
- Can I roll this back?
- What if multiple servers run this simultaneously?
- How do I know if this actually worked?
Your infrastructure depends on getting these answers right. Your career might too.
Remember: The best Ansible engineers are the ones who assume everything will go wrong and plan accordingly.
Links to my other interesting blogs:
How I Automated My Entire Infrastructure with One Tool (And Saved 20 Hours a Week)
Your Personal AI Assistant: How to Run ChatGPT-Level Models on Ubuntu (Without Paying a Cent)
50 Super Rare Linux Commands that most users have never encountered
The Pentester's Arsenal: 25 Commands and Payloads That Actually Work in 2025
Linux Filesystem Decoded: The Ultimate Directory Cheat Sheet Every Developer Needs
From Disaster to Recovery in Minutes: 50 Essential Cron Jobs
Trigger Azure Functions Like a Pro: Postman Secrets Devs Don't Talk About
This One Trick Connects Logic Apps to Function Apps Like Magic
30 essential Google Cloud Platform (GCP) CLI commands Every User should know
30 essential AWS CLI commands for managing your cloud infrastructure
30 Rare and Advanced Azure CLI commands that are extremely powerful
30 essential Azure Sentinel CLI commands for managing your SIEM environment
50 Super Rare KQL Commands that most users have never encountered