Curriculum 7 posts · ~1.3h total

Observability & SLOs for Trading Systems

If your p99 dashboard says 50µs, your p99 is probably 2ms

Instrumentation and alerting for trading systems: SLO design with PnL-equivalent error budgets, Prometheus cardinality traps, HDR histogram vs. t-digest, distributed tracing on hot paths, and incident response.

What you'll master

  • SLO design with error budgets tied to PnL
  • Prometheus cardinality management
  • HDR histogram for accurate latency percentiles
  • Tail-based distributed tracing for hot paths
  • Alerting hygiene: from 40 pages/week to fewer than 5

Why this matters

The most dangerous systems are the ones that look healthy on dashboards while silently degrading. These seven posts document the observability patterns that caught a $200K undisclosed position, a latency regression worth $50K/day, and a Prometheus OOM that took down all metrics at the worst possible time.

The Curriculum — 7 modules