Curriculum 7 posts · ~1.3h total

Observability & SLOs for Trading Systems

If your p99 dashboard says 50µs, your p99 is probably 2ms

Instrumentation and alerting for trading systems: SLO design with PnL-equivalent error budgets, Prometheus cardinality traps, HDR histogram vs t-digest, distributed tracing for hot paths, and incident response.

What you'll master

SLO design with error budgets tied to PnL
Prometheus cardinality management
HDR histogram for accurate latency percentiles
Tail-based distributed tracing for hot paths
Alerting hygiene: from 40 pages/week down to fewer than 5

Why this matters

The most dangerous systems are the ones that look healthy on dashboards while silently degrading. These seven posts document the observability patterns that caught a $200K undisclosed position, a latency regression worth $50K/day, and a Prometheus OOM that took down all metrics at the worst possible time.

The Curriculum — 7 modules

Part 1 · Jan 2026 · 12 min

Nikhil Padala

SLOs for Systems That Can't Degrade Gracefully: Error-Budget Math When Downtime = Direct PnL Loss

Prometheus at Trading-Firm Scale: Federation, Thanos vs Mimir, and the Cardinality Trap

Latency Histograms Done Right: HDR, t-digest, and Building Sub-Millisecond Dashboards

Distributed Tracing for Trading Hot Paths: Sampling Strategies That Don't Distort the Signal

Alerting Hygiene for a 24/7 Trading Desk: The Page-Tax and How to Pay It Down

Incident Response for Trading Systems: Why You Can't 'Just Roll Back' a Trade

Continuous Performance Benchmarking: Catching the 5% Regression That Costs $50K/Day