Introduction: Why Monitoring Alone Isn't Enough
During one of my earlier production outages, I realized that simply monitoring a system isn't enough if there's no proper alerting in place. We had all the metrics — CPU usage, memory trends, pod health — but no one was notified when things broke. By the time the issue was discovered, customers had already been impacted.
That's when I shifted my focus from monitoring alone to building a complete alerting and incident management pipeline with Prometheus and Alertmanager. The goal isn't to bombard your team with alerts; it's to configure meaningful, actionable, and timely alerts that enable fast response and minimal downtime.
In this article, I'll walk you through:
- Why alerting is critical in production Kubernetes environments
- How Alertmanager works with Prometheus
- How I set up real-world alert rules for pods, nodes, and services
- Two detailed production scenarios where alerting helped detect issues before they escalated
- Best practices for alert routing, silencing, and escalation
Understanding Prometheus + Alertmanager Architecture
In Kubernetes, Prometheus scrapes metrics from services and nodes. But to take action based on these metrics — like sending Slack alerts or PagerDuty notifications — we need Alertmanager.
Architecture Overview:
[Prometheus] → [Alertmanager] → [Email, Slack, Webhooks, Opsgenie, PagerDuty]
- Prometheus evaluates alerting rules
- When a condition is met, it sends the alert to Alertmanager
- Alertmanager handles deduplication, silencing, grouping, and routing
- Finally, notifications are sent to your preferred communication channels
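The grouping step in that flow is controlled by a handful of timing settings on the Alertmanager route. Here is a minimal sketch of them; the values are illustrative choices of mine, not the ones from the setup described in this article, and the receiver name is borrowed from the configuration shown later.

```yaml
# Fragment of alertmanager.yaml (illustrative values, tune for your own workloads)
route:
  receiver: 'team-email'       # default receiver, defined under `receivers`
  group_by: ['alertname']      # alerts sharing these label values become one notification
  group_wait: 30s              # wait this long to collect alerts before the first notification
  group_interval: 5m           # minimum wait before notifying about new alerts in an existing group
  repeat_interval: 4h          # re-send a still-firing notification after this long
```

Deduplication and silencing need no configuration of their own here: identical alerts are collapsed automatically, and silences are created at runtime.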
Deploying Alertmanager in Kubernetes
Step 1: Install kube-prometheus-stack (Recommended)
The easiest way to deploy Prometheus + Alertmanager in Kubernetes is via Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
This stack includes:
- Prometheus
- Alertmanager
- Node Exporter
- Grafana
- Default alerting rules
Step 2: Configure Alertmanager Receivers
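Create a Secret that holds the Alertmanager configuration, including your Slack webhook or email settings. With kube-prometheus-stack, the Prometheus Operator only loads this Secret if the Alertmanager resource references it, for example through the chart's alertmanager.alertmanagerSpec.configSecret value: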
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-secret
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alerts@example.com'
    route:
      group_by: ['alertname']
      receiver: 'team-email'
    receivers:
      - name: 'team-email'
        email_configs:
          - to: 'ops@example.com'
Apply it:
kubectl apply -f alertmanager-secret.yaml
Scenario 1: Detecting Node Memory Pressure Before It Causes Pod Evictions
The Problem
In a high-traffic environment, a few nodes started to consume memory rapidly during peak hours. No one noticed until Kubernetes started evicting pods due to memory pressure. By then, the impact had reached customers.
The Fix: Proactive Alerting with Prometheus + Alertmanager
Step 1: Alert Rule for Node Memory Usage
groups:
  - name: node-alerts
    rules:
      - alert: HighNodeMemoryUsage
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on node {{ $labels.instance }}"
          description: "Available memory is less than 15% on node {{ $labels.instance }}."
Step 2: Route Alert to Slack Channel
I configured Alertmanager to send this alert to the #k8s-alerts Slack channel via a webhook.
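As a rough sketch, that routing could look like the following in alertmanager.yaml. The receiver name slack-k8s-alerts and the webhook URL are placeholders I made up for illustration, not the values from the original setup. The global SMTP section from the earlier config is omitted for brevity, and the matchers syntax assumes a reasonably recent Alertmanager (0.22 or later).

```yaml
route:
  group_by: ['alertname']
  receiver: 'team-email'                  # default receiver from the earlier config
  routes:
    - matchers:
        - alertname="HighNodeMemoryUsage" # alert name from the rule above
      receiver: 'slack-k8s-alerts'
receivers:
  - name: 'team-email'
    email_configs:
      - to: 'ops@example.com'
  - name: 'slack-k8s-alerts'              # hypothetical receiver name
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/YOURS'  # placeholder webhook URL
        channel: '#k8s-alerts'
        send_resolved: true               # also notify when the alert clears
```

Everything that does not match the child route still falls through to the default email receiver.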
Result
The team received an alert 5 minutes before pod evictions began, allowing us to cordon the node and redistribute workloads without impact.
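One deployment detail is worth spelling out: with kube-prometheus-stack, a rule group like the one in Step 1 is normally delivered to Prometheus as a PrometheusRule resource rather than a plain rules file. Here is a minimal sketch. It assumes the Helm release is named monitoring, as in the install command, and that the chart's default rule selector (which matches on the release label) is unchanged. The resource name node-alerts is my own choice.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts              # hypothetical resource name
  namespace: monitoring
  labels:
    release: monitoring          # assumed to match the chart's default rule selector
spec:
  groups:
    - name: node-alerts
      rules:
        - alert: HighNodeMemoryUsage
          expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on node {{ $labels.instance }}"
            description: "Available memory is less than 15% on node {{ $labels.instance }}."
```

If the label does not match the selector configured on your Prometheus resource, the rule is silently ignored. Check the Rules page in the Prometheus UI after applying it.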
Scenario 2: Detecting Application-Level Errors with Custom Alerts
The Problem
Our backend API had a sudden spike in 5xx responses. However, CPU and memory were normal, and no pod restarts occurred. The only signal was in the metrics exposed by the app — which we had instrumented using Prometheus.
The Fix: Custom Application Alert
We exposed the following metric in the application:
http_requests_total{status="5xx"}
Step 1: Alert Rule for 5xx Rate
- alert: HighHTTPErrorRate
  # 5xx requests as a fraction of all requests, per job
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (job) (rate(http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High 5xx error rate on {{ $labels.job }}"
    description: "More than 5% of requests are failing on {{ $labels.job }}."
Result
We got an alert once the 5xx error ratio had stayed above 5% for 2 minutes, well before customers reported issues. The cause turned out to be a misconfigured external dependency, which we fixed immediately.
Best Practices for Alerting & Incident Management
✔ Set thresholds based on baselines, not assumptions
✔ Group alerts (e.g., all node issues together) to avoid alert fatigue
✔ Use labels (severity, team) to route alerts properly
✔ Silence alerts during maintenance windows
✔ Send alerts to multiple channels (email, Slack, PagerDuty) based on severity
✔ Combine metrics + logs + traces for full incident context
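To make the label-based routing and multi-channel points concrete, here is a hedged sketch of a route tree keyed on the severity label used in the rules above. The receiver names, the PagerDuty key, and the Slack URL are placeholders, and the repeat intervals are illustrative starting points rather than values from my own setup.

```yaml
route:
  receiver: 'team-email'                 # fallback for anything unmatched
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-oncall'       # page a human for critical alerts
      repeat_interval: 1h
    - matchers:
        - severity="warning"
      receiver: 'slack-k8s-alerts'       # warnings go to chat only
      repeat_interval: 4h
receivers:
  - name: 'team-email'
    email_configs:
      - to: 'ops@example.com'
  - name: 'pagerduty-oncall'             # hypothetical receiver name
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_EVENTS_V2_KEY'   # placeholder integration key
  - name: 'slack-k8s-alerts'             # hypothetical receiver name
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/YOURS'  # placeholder webhook URL
        channel: '#k8s-alerts'
```

Routes are evaluated top to bottom and the first match wins by default, so put the most specific or most urgent matchers first.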
Final Thoughts
Alertmanager is a critical part of any production Kubernetes setup. It's not just about sending notifications — it's about creating a system that helps your team act early, respond confidently, and prevent incidents from escalating.
With a well-tuned Prometheus + Alertmanager setup, I've been able to:
- Proactively catch issues before impact
- Route alerts to the right people
- Reduce noise and avoid false positives
How do you handle alerting in your clusters? Do you have thresholds that truly reflect your workloads? Let's share ideas.