Introduction: Why Monitoring Alone Isn't Enough

During one of my earlier production outages, I realized that simply monitoring a system isn't enough if there's no proper alerting in place. We had all the metrics — CPU usage, memory trends, pod health — but no one was notified when things broke. By the time the issue was discovered, customers had already been impacted.

That's when I started focusing not just on monitoring, but on building a complete alerting and incident management pipeline using Prometheus and Alertmanager. The goal isn't to bombard your team with alerts. It's to configure meaningful, actionable, and timely alerts that enable fast response and minimal downtime.

In this article, I'll walk you through:

  • Why alerting is critical in production Kubernetes environments
  • How Alertmanager works with Prometheus
  • How I set up real-world alert rules for pods, nodes, and services
  • Two detailed production scenarios where alerting helped detect issues before they escalated
  • Best practices for alert routing, silencing, and escalation

Understanding Prometheus + Alertmanager Architecture

In Kubernetes, Prometheus scrapes metrics from services and nodes. But to take action based on these metrics — like sending Slack alerts or PagerDuty notifications — we need Alertmanager.

Architecture Overview:

[Prometheus] → [Alertmanager] → [Email, Slack, Webhooks, Opsgenie, PagerDuty]
  • Prometheus evaluates alerting rules
  • When a condition is met, it sends the alert to Alertmanager
  • Alertmanager handles deduplication, silencing, grouping, and routing
  • Finally, notifications are sent to your preferred communication channels
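
The grouping and deduplication behavior described above is mostly controlled by a handful of settings on the Alertmanager route. The fragment below is a minimal sketch rather than the configuration used in this article; the timing values and the receiver name default-receiver are illustrative assumptions.

```yaml
route:
  receiver: 'default-receiver'          # hypothetical receiver name
  # Alerts that share these label values are batched into one notification
  group_by: ['alertname', 'namespace']
  # How long to wait before sending the first notification for a new group
  group_wait: 30s
  # How long to wait before notifying about new alerts added to an existing group
  group_interval: 5m
  # How long to wait before re-sending a notification for a group that is still firing
  repeat_interval: 4h
receivers:
- name: 'default-receiver'
```

Shorter intervals mean faster notifications but more messages, so these values are usually tuned per team.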

Deploying Alertmanager in Kubernetes

Step 1: Install kube-prometheus-stack (Recommended)

The easiest way to deploy Prometheus + Alertmanager in Kubernetes is via Helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace

This stack includes:

  • Prometheus
  • Alertmanager
  • Node Exporter
  • Grafana
  • Default alerting rules

Step 2: Configure Alertmanager Receivers

Create a Secret that holds the Alertmanager configuration, including your email or Slack webhook settings. This example configures an email receiver:

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-secret
  namespace: monitoring
stringData:
  alertmanager.yaml: |
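    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alerts@example.com'
      # Most SMTP relays on port 587 also require authentication:
      # smtp_auth_username: '<username>'
      # smtp_auth_password: '<password>'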
    route:
      group_by: ['alertname']
      receiver: 'team-email'
    receivers:
    - name: 'team-email'
      email_configs:
      - to: 'ops@example.com'

Apply it:

kubectl apply -f alertmanager-secret.yaml
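
Applying the Secret is not enough on its own: the Alertmanager instance managed by the chart has to be told to read its configuration from it. The fragment below is a sketch of the Helm values that should do this, assuming the value names useExistingSecret and configSecret as they appear in recent versions of the kube-prometheus-stack chart; check the values.yaml of your chart version before relying on them. The file name values.yaml is also an assumption.

```yaml
# values.yaml (hypothetical file name), passed to helm with -f
alertmanager:
  alertmanagerSpec:
    # Tell the chart not to generate its own Alertmanager config
    useExistingSecret: true
    # Name of the Secret created above; it must contain the key alertmanager.yaml
    configSecret: alertmanager-secret
```

These values would be applied by running helm upgrade on the monitoring release with -f values.yaml in the monitoring namespace.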

Scenario 1: Detecting Node Memory Pressure Before It Causes Pod Evictions

The Problem

In a high-traffic environment, a few nodes started to consume memory rapidly during peak hours. No one noticed until Kubernetes started evicting pods due to memory pressure. By then, the impact had reached customers.

The Fix: Proactive Alerting with Prometheus + Alertmanager

Step 1: Alert Rule for Node Memory Usage

groups:
- name: node-alerts
  rules:
  - alert: HighNodeMemoryUsage
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on node {{ $labels.instance }}"
      description: "Available memory is less than 15% on node {{ $labels.instance }}."
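
With kube-prometheus-stack, rules are normally loaded through a PrometheusRule resource rather than a plain rule file. The manifest below is a sketch of how the rule above could be wrapped; the release: monitoring label assumes the chart's default rule selector and the release name used in the Helm command earlier, so adjust it if your installation differs.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts
  namespace: monitoring
  labels:
    # Assumption: the chart's default rule selector matches on the Helm release name
    release: monitoring
spec:
  groups:
  - name: node-alerts
    rules:
    - alert: HighNodeMemoryUsage
      # Fires when less than 15% of node memory is available for 5 minutes
      expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on node {{ $labels.instance }}"
        description: "Available memory is less than 15% on node {{ $labels.instance }}."
```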

Step 2: Route Alert to Slack Channel

I configured Alertmanager to route this alert to the #k8s-alerts Slack channel through an incoming webhook.
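
The routing for this step might look like the fragment below. It is a sketch that extends the email configuration from earlier, not the exact configuration used in production: the receiver name k8s-alerts-slack and the webhook URL are placeholders.

```yaml
route:
  receiver: 'team-email'                # default receiver from the earlier example
  group_by: ['alertname']
  routes:
  # Send only the node memory alert to Slack; everything else falls through to email
  - matchers:
    - 'alertname = "HighNodeMemoryUsage"'
    receiver: 'k8s-alerts-slack'
receivers:
- name: 'team-email'
  email_configs:
  - to: 'ops@example.com'
- name: 'k8s-alerts-slack'              # hypothetical receiver name
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/<your-webhook-path>'  # placeholder
    channel: '#k8s-alerts'
    send_resolved: true                 # also notify when the alert clears
    title: '{{ .CommonAnnotations.summary }}'
    text: '{{ .CommonAnnotations.description }}'
```

The title and text templates reuse the summary and description annotations from the alert rule, so the Slack message carries the node name.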

Result

The team received an alert 5 minutes before pod evictions began, allowing us to cordon the node and redistribute workloads without impact.

Scenario 2: Detecting Application-Level Errors with Custom Alerts

The Problem

Our backend API had a sudden spike in 5xx responses. However, CPU and memory were normal, and no pod restarts occurred. The only signal was in the metrics exposed by the app — which we had instrumented using Prometheus.

The Fix: Custom Application Alert

We exposed the following metric in the application:

http_requests_total{status="5xx"}

Step 1: Alert Rule for 5xx Rate

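- alert: HighHTTPErrorRate
  # Ratio of 5xx requests to all requests per job, so that 0.05 really means 5%.
  # (A bare rate(...) > 0.05 would mean 0.05 errors per second, not 5% of requests.)
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum by (job) (rate(http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High 5xx error rate on {{ $labels.job }}"
    description: "More than 5% of requests are failing for job {{ $labels.job }}."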

Result

We got an alert once 5xx responses had stayed above 5% of requests for 2 minutes, well before customers reported issues. The cause turned out to be a misconfigured external dependency, which we fixed immediately.

Best Practices for Alerting & Incident Management

  • Set thresholds based on baselines, not assumptions
  • Group alerts (e.g., all node issues together) to avoid alert fatigue
  • Use labels (severity, team) to route alerts properly
  • Silence alerts during maintenance windows
  • Send alerts to multiple channels (email, Slack, PagerDuty) based on severity
  • Combine metrics + logs + traces for full incident context
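
Several of these practices can be expressed directly in the Alertmanager configuration. The fragment below is a sketch under stated assumptions: the receiver names, the PagerDuty key, the Slack webhook URL, and the Sunday maintenance window are all placeholders, and the email receiver relies on global SMTP settings like the ones shown earlier.

```yaml
# Recurring maintenance window (times are interpreted as UTC unless a location is set)
time_intervals:
- name: weekly-maintenance              # hypothetical window
  time_intervals:
  - weekdays: ['sunday']
    times:
    - start_time: '02:00'
      end_time: '04:00'

route:
  receiver: 'team-email'                # fallback for anything not matched below
  group_by: ['alertname', 'severity']
  routes:
  # Critical alerts page the on-call engineer, then continue to the next route
  - matchers:
    - 'severity = "critical"'
    receiver: 'oncall-pagerduty'
    continue: true
  # Warnings and criticals go to Slack, muted during the maintenance window
  - matchers:
    - 'severity =~ "warning|critical"'
    receiver: 'k8s-alerts-slack'
    mute_time_intervals: ['weekly-maintenance']

receivers:
- name: 'team-email'
  email_configs:
  - to: 'ops@example.com'
- name: 'oncall-pagerduty'
  pagerduty_configs:
  - routing_key: '<pagerduty-integration-key>'                       # placeholder
- name: 'k8s-alerts-slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/<your-webhook-path>'  # placeholder
    channel: '#k8s-alerts'
```

Because of continue: true, a critical alert reaches both PagerDuty and Slack, while a warning only reaches Slack.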

Final Thoughts

Alertmanager is a critical part of any production Kubernetes setup. It's not just about sending notifications — it's about creating a system that helps your team act early, respond confidently, and prevent incidents from escalating.

With a well-tuned Prometheus + Alertmanager setup, I've been able to:

  • Proactively catch issues before impact
  • Route alerts to the right people
  • Reduce noise and avoid false positives

How do you handle alerting in your clusters? Do you have thresholds that truly reflect your workloads? Let's share ideas.