Prometheus Chaos Edition

```bash
# Pull the chaos edition sidecar
docker pull quay.io/prometheuschaos/chaos-sidecar:latest
docker run -d --name prometheus-chaos \
  --network container:prometheus \
  quay.io/prometheuschaos/chaos-sidecar
```

| Risk | Mitigation |
| --- | --- |
| PCE accidentally runs on production | Use namespace isolation and an explicit `--chaos.enabled=false` flag in prod. |
| Permanent data loss | Run against a replica Prometheus with `--storage.tsdb.retention.time=6h`. |
| Alert fatigue | Notify a separate “chaos channel” during experiments. |
| Controller plane overload | Limit chaos duration (e.g., 5 minutes max). |
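For the data-loss row in particular, here is a minimal sketch of the replica setup. The container name, host port, and config path are illustrative; `--storage.tsdb.retention.time` is the standard Prometheus retention flag:

```bash
# Throwaway replica for chaos experiments only -- never the primary server.
# Name, port mapping, and config path are illustrative.
docker run -d --name prometheus-chaos-replica -p 9091:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=6h
```

Anything the experiments corrupt ages out of the replica within six hours, and the primary server never sees chaos traffic.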

We all love Prometheus. It scrapes metrics, fires alerts, and helps us sleep at night. But here’s a painful truth most engineers realize at 3 AM: your monitoring system can fail, and you won’t know about it until the real outage happens.

Run the chaos proxy (sketched below) between Prometheus and your real exporters. Watch Prometheus log parse errors and mark the target down – then verify your alerts fire correctly.
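To make “verify your alerts fire correctly” concrete, here is a minimal rule you could watch during the experiment. The group name, alert name, and severity label are illustrative; `up == 0` is the standard expression for a failed scrape:

```yaml
groups:
  - name: chaos-verification
    rules:
      - alert: ChaosTargetDown
        expr: up == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Scrape target {{ $labels.instance }} has been down for 1 minute"
```

If a dropped scrape never turns into a notification in your chaos channel, the gap is in Alertmanager routing, not in the exporter.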

Prometheus Chaos Edition turns the old monitoring paradox on its head. Instead of trusting your monitoring blindly, you break it on purpose – gently, repeatedly, and observably.

Create a small proxy that intercepts /metrics endpoints:
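One way to do this is a tiny HTTP server that forwards scrapes to the real exporter and randomly injects latency, corrupted output, or outright failures. This is a minimal sketch, not the PCE implementation: the upstream address (a node_exporter on `localhost:9100`), the listen port (`:9101`), and the failure probabilities are all assumptions for illustration.

```go
// chaosproxy: a minimal sketch of a /metrics interception proxy.
// It forwards scrapes to a real exporter but randomly injects latency,
// corrupts the exposition text, or drops the response entirely, so you
// can watch Prometheus report parse errors and mark the target down.
package main

import (
	"io"
	"log"
	"math/rand"
	"net/http"
	"time"
)

// Assumed upstream exporter; swap in whatever you actually scrape.
const upstream = "http://localhost:9100/metrics"

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		switch n := rand.Float64(); {
		case n < 0.1:
			// ~10% of scrapes: fail outright so the target is reported as down.
			log.Println("chaos: dropping scrape")
			http.Error(w, "chaos: target unavailable", http.StatusServiceUnavailable)
			return
		case n < 0.2:
			// ~10% of scrapes: delay past a typical scrape_timeout.
			log.Println("chaos: delaying scrape by 15s")
			time.Sleep(15 * time.Second)
		case n < 0.3:
			// ~10% of scrapes: return garbage that fails exposition-format parsing.
			log.Println("chaos: corrupting payload")
			io.WriteString(w, "### not valid prometheus exposition ###\n")
			return
		}

		// Otherwise, proxy the scrape through unchanged.
		resp, err := http.Get(upstream)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		w.Header().Set("Content-Type", resp.Header.Get("Content-Type"))
		io.Copy(w, resp.Body)
	})

	log.Println("chaos proxy listening on :9101, forwarding to", upstream)
	log.Fatal(http.ListenAndServe(":9101", nil))
}
```

Point the scrape job at `:9101` instead of the exporter’s own port, and the proxy becomes an invisible layer of controlled failure between Prometheus and reality.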

| Without PCE | With PCE |
| --- | --- |
| You assume Prometheus is always healthy. | You prove it can survive partial failures. |
| Alertmanager might be misconfigured for months. | You test silences, inhibitions, and receivers. |
| A slow scrape delays critical alerts. | You detect latency thresholds before they matter. |
| Grafana dashboards freeze, but no one notices. | You build fallback visualizations. |

What happens when your Prometheus server runs out of memory? What if a metric scrape takes 30 seconds because a target is thrashing? What if your alerting rules become corrupt?
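For that last failure mode, one cheap guard is to validate your rule files before and after every chaos run with `promtool`, Prometheus’s own checker (the path below is illustrative):

```bash
# Fails fast if any alerting or recording rule file is syntactically broken
promtool check rules rules/*.yml
```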