Alerting & Incident Management in Kubernetes — Configuring Alerts with Alertmanager

Introduction: Why Alerting Alone Isn't Enough

Bavithran

~4 min read · March 22, 2025

During one of my earlier production outages, I realized that simply monitoring a system isn't enough if there's no proper alerting in place. We had all the metrics — CPU usage, memory trends, pod health — but no one was notified when things broke. By the time the issue was discovered, customers had already been impacted.

That's when I started focusing not just on monitoring but on building a complete alerting and incident management pipeline with Prometheus and Alertmanager. The goal isn't to bombard your team with alerts. It's to configure meaningful, actionable, and timely alerts that enable a fast response and minimal downtime.

In this article, I'll walk you through:

Why alerting is critical in production Kubernetes environments
How Alertmanager works with Prometheus
How I set up real-world alert rules for pods, nodes, and services
Two detailed production scenarios where alerting helped detect issues before they escalated
Best practices for alert routing, silencing, and escalation

Understanding Prometheus + Alertmanager Architecture

In Kubernetes, Prometheus scrapes metrics from services and nodes. But to take action based on these metrics — like sending Slack alerts or PagerDuty notifications — we need Alertmanager.

Architecture Overview:

[Prometheus] → [Alertmanager] → [Email, Slack, Webhooks, Opsgenie, PagerDuty]

Prometheus evaluates alerting rules
When a condition is met, it sends the alert to Alertmanager
Alertmanager handles deduplication, silencing, grouping, and routing
Finally, notifications are sent to your preferred communication channels
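
For a Prometheus instance configured by hand, the hand-off described in the first two steps comes down to a few sections of prometheus.yml. The fragment below is illustrative only: the alertmanager.monitoring.svc:9093 target and the rules path are assumed names, and with kube-prometheus-stack the Prometheus Operator generates the equivalent configuration for you.

```yaml
# prometheus.yml (fragment) -- target and path are hypothetical
global:
  evaluation_interval: 30s        # how often alerting rules are evaluated

rule_files:
  - /etc/prometheus/rules/*.yml   # assumed location of the alert rule files

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager.monitoring.svc:9093   # assumed Alertmanager Service address
```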

Deploying Alertmanager in Kubernetes

Step 1: Install kube-prometheus-stack (Recommended)

The easiest way to deploy Prometheus + Alertmanager in Kubernetes is via Helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace

This stack includes:

Prometheus
Alertmanager
Node Exporter
Grafana
Default alerting rules

Step 2: Configure Alertmanager Receivers

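Create a Secret that holds the Alertmanager configuration. The example below defines an email receiver; SMTP credentials or a Slack webhook URL would live in the same file. The Prometheus Operator reads the configuration from the alertmanager.yaml key of the Secret that the Alertmanager resource references (by default, one named alertmanager-<alertmanager-name>), so make sure the chart points at this Secret, for example through its alertmanager.alertmanagerSpec.configSecret value.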

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-secret
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alerts@example.com'
    route:
      group_by: ['alertname']
      receiver: 'team-email'
    receivers:
    - name: 'team-email'
      email_configs:
      - to: 'ops@example.com'

Apply it:

kubectl apply -f alertmanager-secret.yaml
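
One way to make the operator load this Secret is a small Helm values override. This is a sketch that assumes your chart version exposes the useExistingSecret and configSecret values under alertmanager.alertmanagerSpec (check the chart's values.yaml); the file name is made up.

```yaml
# values-alertmanager.yaml -- sketch; verify the keys against your chart version
alertmanager:
  alertmanagerSpec:
    useExistingSecret: true              # ignore the chart's built-in alertmanager.config
    configSecret: alertmanager-secret    # the Secret created above, in the same namespace
```

It could then be applied with helm upgrade monitoring prometheus-community/kube-prometheus-stack --namespace monitoring -f values-alertmanager.yaml.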

Scenario 1: Detecting Node Memory Pressure Before It Causes Pod Evictions

The Problem

In a high-traffic environment, a few nodes started to consume memory rapidly during peak hours. No one noticed until Kubernetes started evicting pods due to memory pressure. By then, the impact had reached customers.

The Fix: Proactive Alerting with Prometheus + Alertmanager

Step 1: Alert Rule for Node Memory Usage

groups:
- name: node-alerts
  rules:
  - alert: HighNodeMemoryUsage
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on node {{ $labels.instance }}"
      description: "Available memory is less than 15% on node {{ $labels.instance }}."
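
With kube-prometheus-stack, rule groups like this one are usually delivered as a PrometheusRule resource rather than as a file on disk. A minimal sketch, assuming the release is named monitoring (as in the install command) and that the chart's default rule selector matches on the release label; the resource name node-alerts is made up:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts            # hypothetical resource name
  namespace: monitoring
  labels:
    release: monitoring        # assumed to match the chart's default ruleSelector
spec:
  groups:
    - name: node-alerts
      rules:
        - alert: HighNodeMemoryUsage
          expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on node {{ $labels.instance }}"
```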

Step 2: Route Alert to Slack Channel

I configured Alertmanager to send this alert to the #k8s-alerts Slack channel through an incoming webhook.
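
A route and receiver along these lines would do that. This is a sketch, not the exact configuration from the incident: the receiver name and the webhook URL are placeholders, global SMTP settings are omitted, and the matchers syntax assumes Alertmanager 0.22 or newer.

```yaml
route:
  receiver: team-email               # default receiver from the earlier config
  group_by: ['alertname']
  routes:
    - matchers:
        - alertname="HighNodeMemoryUsage"
      receiver: k8s-alerts-slack     # hypothetical receiver name

receivers:
  - name: team-email
    email_configs:
      - to: 'ops@example.com'
  - name: k8s-alerts-slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/YOURS'   # placeholder webhook
        channel: '#k8s-alerts'
        send_resolved: true          # also notify when the alert clears
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
```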

Result

The team received an alert 5 minutes before pod evictions began, allowing us to cordon the node and redistribute workloads without impact.

Scenario 2: Detecting Application-Level Errors with Custom Alerts

The Problem

Our backend API had a sudden spike in 5xx responses. However, CPU and memory were normal, and no pod restarts occurred. The only signal was in the metrics exposed by the app — which we had instrumented using Prometheus.

The Fix: Custom Application Alert

We exposed the following metric in the application:

http_requests_total{status="5xx"}

Step 1: Alert Rule for 5xx Rate

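- alert: HighHTTPErrorRate
  # Share of requests answered with a 5xx status, per job, over the last 5 minutes
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum by (job) (rate(http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High 5xx error rate on {{ $labels.job }}"
    description: "More than 5% of requests are failing on {{ $labels.job }}."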

Result

The alert fired once the 5xx error ratio had stayed above 5% for 2 minutes, well before customers reported issues. The cause turned out to be a misconfigured external dependency, which we fixed immediately.

Best Practices for Alerting & Incident Management

✔ Set thresholds based on baselines, not assumptions
✔ Group alerts (e.g., all node issues together) to avoid alert fatigue
✔ Use labels (severity, team) to route alerts properly
✔ Silence alerts during maintenance windows
✔ Send alerts to multiple channels (email, Slack, PagerDuty) based on severity
✔ Combine metrics + logs + traces for full incident context
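
Several of these practices map directly onto Alertmanager's routing tree. The following is a rough sketch rather than a recommended production config: the receiver names, the PagerDuty routing key, and the timing values are placeholders to adapt, and global SMTP settings are omitted.

```yaml
route:
  receiver: team-email                 # fallback when no child route matches
  group_by: ['alertname']              # batch alerts with the same name, e.g. across nodes
  group_wait: 30s                      # wait briefly so related alerts arrive together
  group_interval: 5m
  repeat_interval: 4h                  # re-notify if the alert is still firing
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall-pagerduty
    - matchers:
        - severity="warning"
      receiver: team-email

# Suppress a warning while a critical alert with the same name and instance is firing
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']

receivers:
  - name: team-email
    email_configs:
      - to: 'ops@example.com'
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_INTEGRATION_KEY'   # placeholder
```

Silences for maintenance windows are created at runtime, through the Alertmanager UI or amtool, rather than in this file.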

Final Thoughts

Alertmanager is a critical part of any production Kubernetes setup. It's not just about sending notifications — it's about creating a system that helps your team act early, respond confidently, and prevent incidents from escalating.

With a well-tuned Prometheus + Alertmanager setup, I've been able to:

Proactively catch issues before impact
Route alerts to the right people
Reduce noise and avoid false positives

How do you handle alerting in your clusters? Do you have thresholds that truly reflect your workloads? Let's share ideas.
