# Trading System Metrics That Actually Matter
Fill latency, position drift, market data staleness. The SLOs that prevent losses — not just track uptime.
## 🎯 What You'll Learn
- Identify trading-specific metrics beyond standard SRE
- Define SLOs for fill latency and market data staleness
- Configure Prometheus/Grafana for trading dashboards
- Build alerts that prevent losses, not just track outages
## Beyond Uptime: Trading SLOs
Standard SRE dashboards track uptime, error rates, and latency. For trading, that’s not enough.
```text
Web app SLO: 99.9% availability, p99 < 200ms
Trading SLO: 99.99% availability, p99 < 500µs,
             fill rate > 95%, market data staleness < 1ms
```
If your market data is 5ms stale, you're trading on old prices. That's not an "outage" — but it costs money.
---
## The Four Trading Metrics
| Metric | What It Measures | Why It Matters |
|--------|-----------------|----------------|
| **Fill Latency** | Time from order to execution | Slower = worse prices |
| **Market Data Staleness** | Age of latest price data | Stale data = wrong decisions |
| **Position Drift** | Expected vs. actual positions | Detects execution failures |
| **Quote-to-Trade Ratio** | Orders/trades | Indicates strategy health |
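Quote-to-trade ratio and fill rate both fall out of two running counts. A minimal sketch, assuming a hypothetical `OrderFlowStats` class (not from any library) that your order handler updates on every order sent and every fill received:

```python
from dataclasses import dataclass


@dataclass
class OrderFlowStats:
    """Running order and trade counts (hypothetical helper, not a library API)."""
    orders_sent: int = 0
    trades: int = 0

    def on_order_sent(self) -> None:
        self.orders_sent += 1

    def on_fill(self) -> None:
        self.trades += 1

    def quote_to_trade_ratio(self) -> float:
        # Orders per trade; infinite when nothing has traded yet.
        if self.trades == 0:
            return float("inf")
        return self.orders_sent / self.trades

    def fill_rate(self) -> float:
        # Fraction of orders that traded. Simplifying assumption:
        # at most one fill per order (no partial fills).
        if self.orders_sent == 0:
            return 0.0
        return self.trades / self.orders_sent


stats = OrderFlowStats()
for _ in range(20):
    stats.on_order_sent()
for _ in range(5):
    stats.on_fill()

print(stats.quote_to_trade_ratio())  # 4.0
print(stats.fill_rate())             # 0.25
```

In production these would more likely be two Prometheus counters with the ratio computed at query time; the class only shows the arithmetic.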
---
## The Core Distinction
In trading, performance IS correctness. A web page that loads in 1s vs 100ms annoys users. A trade that executes in 10ms vs 100µs means you got a worse price — or no fill at all.
Your monitoring system needs to treat latency like a business metric, not just a technical one.
---
## Key Metrics Implementation
### 1. Fill Latency
```python
# Prometheus metrics (Python example)
from prometheus_client import Histogram

fill_latency = Histogram(
    'order_fill_latency_seconds',
    'Time from order submission to fill confirmation',
    # 100µs to 100ms. 2ms and 20ms are included so the alert thresholds
    # sit on bucket boundaries instead of being interpolated.
    buckets=[0.0001, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1],
    labelnames=['exchange', 'instrument']
)

# In your order handler: .time() observes the elapsed seconds of the block
with fill_latency.labels(exchange='binance', instrument='BTC-USD').time():
    submit_and_wait_for_fill(order)  # your own blocking submit call
```
### 2. Market Data Staleness
```python
from prometheus_client import Gauge
import time

data_staleness = Gauge(
    'market_data_staleness_seconds',
    'Age of latest market data',
    labelnames=['exchange', 'symbol']
)

# In your market data handler
def on_tick(symbol, exchange_timestamp):
    # exchange_timestamp: Unix epoch seconds (float), stamped by the exchange.
    # The difference is only meaningful if the local clock is synchronized
    # to well below the threshold you alert on.
    staleness = time.time() - exchange_timestamp
    data_staleness.labels(exchange='binance', symbol=symbol).set(staleness)
```
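A gauge that is only set inside `on_tick` has a blind spot: if the feed stops, the gauge keeps its last, fresh-looking value. A minimal sketch of one way to cover that, assuming a hypothetical `FeedWatch` helper that remembers when each symbol last ticked and computes the silence on demand:

```python
import time
from typing import Dict, Optional


class FeedWatch:
    """Tracks local receive time per symbol (hypothetical helper)."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, float] = {}

    def on_tick(self, symbol: str, now: Optional[float] = None) -> None:
        # Record the local monotonic receive time of the latest tick.
        self._last_seen[symbol] = time.monotonic() if now is None else now

    def silence_seconds(self, symbol: str, now: Optional[float] = None) -> float:
        # Time since the last tick; infinite if the symbol never ticked.
        now = time.monotonic() if now is None else now
        last = self._last_seen.get(symbol)
        return float("inf") if last is None else now - last


watch = FeedWatch()
watch.on_tick("BTC-USD", now=100.0)
print(watch.silence_seconds("BTC-USD", now=100.5))  # 0.5
print(watch.silence_seconds("ETH-USD", now=100.5))  # inf
```

Calling `silence_seconds` from a periodic timer, rather than from the tick handler, and exporting the result keeps the metric growing while the feed is down. A quiet symbol and a dead feed look the same here, so illiquid instruments likely need wider thresholds.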
### 3. Position Drift
```python
from prometheus_client import Gauge

position_drift = Gauge(
    'position_drift_absolute',
    'Difference between expected and actual position',
    labelnames=['symbol']
)

# Periodic reconciliation.
# tracked_symbols, internal_position and query_exchange_position
# come from your own trading system.
def reconcile_positions():
    for symbol in tracked_symbols:
        expected = internal_position[symbol]
        actual = query_exchange_position(symbol)
        drift = abs(expected - actual)
        position_drift.labels(symbol=symbol).set(drift)
```
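Because the drift SLO treats any value above zero as a problem, float rounding in position quantities can trigger false alarms. A hedged sketch using `decimal.Decimal`, with hypothetical `expected` and `actual` dictionaries standing in for your internal book and the exchange's report:

```python
from decimal import Decimal
from typing import Dict


def compute_drift(expected: Dict[str, Decimal],
                  actual: Dict[str, Decimal]) -> Dict[str, Decimal]:
    """Signed drift (actual - expected) for every symbol seen on either side."""
    symbols = set(expected) | set(actual)
    zero = Decimal("0")
    return {
        s: actual.get(s, zero) - expected.get(s, zero)
        for s in sorted(symbols)
    }


# Illustrative positions only.
expected = {"BTC-USD": Decimal("0.3"), "ETH-USD": Decimal("2")}
actual = {"BTC-USD": Decimal("0.1") + Decimal("0.2"), "SOL-USD": Decimal("5")}

drift = compute_drift(expected, actual)
print(drift["BTC-USD"] == 0)  # True: 0.1 + 0.2 is exactly 0.3 in Decimal
print(drift["ETH-USD"])       # -2
print(drift["SOL-USD"])       # 5
```

Taking the union of symbols also catches a position the exchange reports that the internal book does not know about, which a loop over tracked symbols alone would miss. The signed value shows which direction to hedge; the gauge can still export its absolute value.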
---
## SLO Definitions
| Metric | Target | Alert Threshold | Business Impact |
|--------|--------|-----------------|-----------------|
| **Fill Latency p99** | < 1ms | > 2ms for 1min | Worse prices |
| **Fill Latency p99.9** | < 10ms | > 20ms for 30s | Strategy disabled |
| **Market Data Staleness** | < 500µs | > 2ms for 10s | Wrong pricing |
| **Position Drift** | 0 | > 0 for 5min | Inventory risk |
The "$10K/day lost" figures in many monitoring articles are illustrative — actual impact depends entirely on your strategy's position sizes and markets.
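Each alert threshold in the table is a value plus a duration: the condition has to hold continuously before it counts. A minimal sketch of that check, assuming samples arrive as `(timestamp_seconds, value)` pairs in time order. The function name is hypothetical, and this only approximates how an alerting system evaluates such a rule:

```python
from typing import Iterable, Tuple


def breached_for(samples: Iterable[Tuple[float, float]],
                 threshold: float,
                 duration: float) -> bool:
    """True if value > threshold held continuously for at least `duration` seconds."""
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of this breach
            if ts - breach_start >= duration:
                return True
        else:
            breach_start = None  # any good sample resets the clock
    return False


# p99 fill latency sampled every 15s (illustrative values, in seconds).
steady_breach = [(0, 0.003), (15, 0.0025), (30, 0.004), (45, 0.003), (60, 0.0035)]
brief_spike = [(0, 0.003), (15, 0.0008), (30, 0.004), (45, 0.003), (60, 0.0009)]

print(breached_for(steady_breach, threshold=0.002, duration=60))  # True
print(breached_for(brief_spike, threshold=0.002, duration=60))    # False
```

The duration trades detection speed against noise: a shorter window reacts faster but pages on isolated spikes.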
---
## Prometheus Queries
### Fill Latency Percentiles
```promql
# P99 fill latency by exchange
histogram_quantile(0.99,
  sum(rate(order_fill_latency_seconds_bucket[5m])) by (le, exchange)
)

# P99.9 for catching tail latency
histogram_quantile(0.999,
  sum(rate(order_fill_latency_seconds_bucket[5m])) by (le, exchange)
)
```
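`histogram_quantile` does not see individual fills, only cumulative bucket counts, and it interpolates linearly inside the bucket that contains the requested rank. A simplified sketch of that interpolation in Python (it skips edge cases the real function handles, and the bucket counts are made up):

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# 990 fills at or below 1ms, 10 more between 1ms and 5ms (illustrative).
buckets = [(0.001, 990), (0.005, 1000), (math.inf, 1000)]

print(f"{estimate_quantile(0.99, buckets):.4f}")   # 0.0010
print(f"{estimate_quantile(0.995, buckets):.4f}")  # 0.0030
```

The ten slow fills in this example could all have taken 1.1ms or all 4.9ms; the estimate is 3ms either way. That is why bucket boundaries should sit close to the thresholds you alert on.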
### Market Data Freshness
```promql
# Alert if any symbol is stale > 2ms
max(market_data_staleness_seconds) by (exchange) > 0.002
```
### Position Reconciliation
```promql
# Any position drift is a problem
sum(position_drift_absolute) > 0
```
---
## Common Misconceptions
**Myth:** "p99 latency is enough."
**Reality:** p99 hides your worst 1%. In trading, one 100ms spike during volatile markets can cause significant losses. Track p99.9 or even p99.99.
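To see how much a p99 can hide, take a synthetic sample (made-up numbers, not measurements) of 1,000 fills where five hit a 100ms spike. The sketch uses a simple nearest-rank percentile:

```python
import math
from typing import List


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# 995 fills at 200µs, 5 fills at 100ms (synthetic).
latencies = [0.0002] * 995 + [0.1] * 5

print(percentile(latencies, 99))    # 0.0002
print(percentile(latencies, 99.9))  # 0.1
```

The p99 reports 200µs and looks healthy, while the p99.9 surfaces the 100ms spikes.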
**Myth:** "We monitor latency, so we're fine."
**Reality:** Latency to where? You need to measure the complete path: tick-to-trade (market data received to order sent) and order-to-fill (order sent to fill confirmed).
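One way to capture both segments is to stamp each stage of an order's life with a monotonic clock and derive the intervals afterwards. A minimal sketch; the `OrderTimeline` class and the stage names are hypothetical, and timestamps are passed in explicitly here so the example is deterministic:

```python
import time
from typing import Dict, Optional


class OrderTimeline:
    """Per-order stage timestamps in nanoseconds (hypothetical helper)."""

    def __init__(self) -> None:
        self.stamps: Dict[str, int] = {}

    def mark(self, stage: str, t_ns: Optional[int] = None) -> None:
        # perf_counter_ns is monotonic, so intervals cannot go negative
        # when the wall clock is adjusted.
        self.stamps[stage] = time.perf_counter_ns() if t_ns is None else t_ns

    def interval_us(self, start: str, end: str) -> float:
        # Elapsed microseconds between two recorded stages.
        return (self.stamps[end] - self.stamps[start]) / 1_000


timeline = OrderTimeline()
timeline.mark("tick_received", 1_000_000)
timeline.mark("order_sent", 1_040_000)
timeline.mark("fill_confirmed", 1_490_000)

print(timeline.interval_us("tick_received", "order_sent"))   # 40.0  (tick-to-trade)
print(timeline.interval_us("order_sent", "fill_confirmed"))  # 450.0 (order-to-fill)
```

Each interval can then be observed into its own histogram, so a regression in strategy code shows up separately from a slow venue.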
**Myth:** "Position drift can wait for daily reconciliation."
**Reality:** If you're trading many orders per day and one fails silently, you could have an unhedged position for hours. Real-time reconciliation is mandatory.
---
## Grafana Dashboard Layout
Recommended panels:
```text
+------------------+------------------+------------------+
| Fill Latency     | Market Data      | Position Drift   |
| (Heatmap)        | Staleness        | (per symbol)     |
+------------------+------------------+------------------+
| Fill Rate %      | Order Flow       | PnL Real-time    |
| (by venue)       | (orders/sec)     | (streaming)      |
+------------------+------------------+------------------+
| Error Rate       | Quote/Trade      | System Alerts    |
| (by type)        | Ratio            | (last 24h)       |
+------------------+------------------+------------------+
```
---
## Alerting Rules
```yaml
# prometheus/alerts.yml
groups:
  - name: trading
    rules:
      - alert: FillLatencyHigh
        expr: histogram_quantile(0.99, sum(rate(order_fill_latency_seconds_bucket[5m])) by (le)) > 0.002
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Fill latency p99 above 2ms"

      # Threshold matches the SLO table: > 2ms for 10s
      - alert: MarketDataStale
        expr: max(market_data_staleness_seconds) > 0.002
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Market data > 2ms stale"

      - alert: PositionDrift
        expr: sum(position_drift_absolute) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Position mismatch detected"
```
---
## Practice Exercises
### Exercise 1: Implement Fill Latency
```python
# Add timing around your order flow
# Measure: order_created → order_sent → exchange_ack → fill_confirmed
```
### Exercise 2: Set Up Staleness Check
```python
# Compare exchange timestamp to local time
# Alert if difference > 2ms
```
### Exercise 3: Build a Grafana Dashboard
```text
# Create panels for:
# - Fill latency heatmap (last 1 hour)
# - Staleness gauge (current value)
# - Position reconciliation status
```
## Key Takeaways
- **Trading metrics differ from SRE metrics:** latency is a business metric.
- **p99.9 matters more than p99:** tail latency costs money.
- **Market data staleness is invisible:** you can be trading on old prices without knowing.
- **Position drift is a silent failure:** real-time reconciliation is mandatory.
## What’s Next?
Trading Metrics: What SRE Dashboards Miss
Pro Version: For production implementation, see Monitoring Trading Systems