# Trading System Metrics That Actually Matter

Fill latency, position drift, market data staleness. The SLOs that prevent losses — not just track uptime.

Intermediate • 20 min read

## What you'll learn

- Identify trading-specific metrics beyond standard SRE
- Define SLOs for fill latency and market data staleness
- Configure Prometheus/Grafana for trading dashboards
- Build alerts that prevent losses, not just track outages

## 📚 Prerequisites

Before this lesson, you should understand:

- Observability: The Physics of Seeing

## Beyond Uptime: Trading SLOs

Standard SRE dashboards track uptime, error rates, and latency. For trading, that’s not enough.

```text
Web app SLO:  99.9% availability, p99 < 200ms
Trading SLO:  99.99% availability, p99 < 500µs,
              fill rate > 95%, market data staleness < 1ms
```

If your market data is 5ms stale, you're trading on old prices. That's not an "outage" — but it costs money.
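To put the two availability targets above in perspective, the arithmetic below converts each into a yearly downtime budget. It is a small illustrative calculation (the function name is made up for this sketch), and it shows what availability alone cannot express: neither number says anything about how fresh your prices are while the system is "up".

```python
# Convert an availability target into an allowed-downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes per year a service may be down and still meet the target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


for target in (99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/year")
```

The 99.9% target allows roughly 526 minutes of downtime per year, and 99.99% roughly 53 minutes.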

---

## The Four Trading Metrics

| Metric | What It Measures | Why It Matters |
|--------|-----------------|----------------|
| **Fill Latency** | Time from order to execution | Slower = worse prices |
| **Market Data Staleness** | Age of latest price data | Stale data = wrong decisions |
| **Position Drift** | Expected vs. actual positions | Detects execution failures |
| **Quote-to-Trade Ratio** | Orders sent per executed trade | Indicates strategy health |
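Quote-to-trade ratio and fill rate are both simple ratios over running counts. The sketch below shows one way to compute them. The class and field names are hypothetical, and the definitions (orders sent per execution, filled orders per order sent) are assumptions that you should adapt to how your venue reports fills.

```python
from dataclasses import dataclass


@dataclass
class OrderFlowCounts:
    """Running totals for one venue; field names are illustrative."""
    orders_sent: int = 0
    orders_filled: int = 0  # orders that received at least one fill
    trades: int = 0         # individual executions

    def fill_rate(self) -> float:
        """Fraction of sent orders that filled; 0.0 before any order is sent."""
        return self.orders_filled / self.orders_sent if self.orders_sent else 0.0

    def quote_to_trade_ratio(self) -> float:
        """Orders sent per execution; infinity if nothing has traded yet."""
        return self.orders_sent / self.trades if self.trades else float("inf")


counts = OrderFlowCounts(orders_sent=400, orders_filled=384, trades=384)
print(f"fill rate: {counts.fill_rate():.1%}")
print(f"quote-to-trade: {counts.quote_to_trade_ratio():.2f}")
```

In a Prometheus setup you would more likely export the raw counts as counters and compute the ratios in PromQL, so the ratio can be taken over any time window.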

---

## The Core Distinction

In trading, performance IS correctness. A web page that loads in 1s vs 100ms annoys users. A trade that executes in 10ms vs 100µs means you got a worse price — or no fill at all.

Your monitoring system needs to treat latency like a business metric, not just a technical one.

---

## Key Metrics Implementation

### 1. Fill Latency

```python
# Prometheus metrics (Python example)
from prometheus_client import Histogram

fill_latency = Histogram(
    'order_fill_latency_seconds',
    'Time from order submission to fill confirmation',
    buckets=[0.0001, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1],  # 100µs to 100ms, with edges at the 2ms and 20ms alert thresholds
    labelnames=['exchange', 'instrument']
)

# In your order handler
with fill_latency.labels(exchange='binance', instrument='BTC-USD').time():
    submit_and_wait_for_fill(order)
```
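The `with ... .time()` pattern above assumes the handler blocks until the fill arrives. Many order gateways deliver fills asynchronously, so here is a minimal sketch of an alternative that pairs submit and fill events by order ID. `FillLatencyTracker` and its method names are hypothetical, and the sketch measures only up to the first fill of each order.

```python
import time


class FillLatencyTracker:
    """Pairs each fill with its submission time, keyed by order ID."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock    # monotonic clock, injectable for tests
        self._submitted = {}   # order_id -> submit time in seconds

    def on_submit(self, order_id: str) -> None:
        """Call immediately before the order is sent."""
        self._submitted[order_id] = self._clock()

    def on_fill(self, order_id: str):
        """Return latency in seconds, or None for an unknown order ID."""
        start = self._submitted.pop(order_id, None)
        if start is None:
            return None
        return self._clock() - start


tracker = FillLatencyTracker()
tracker.on_submit("order-1")
latency = tracker.on_fill("order-1")
print(f"latency: {latency * 1e6:.1f} µs")
```

The returned value can then be passed to the histogram's `observe()` method with the same `exchange` and `instrument` labels. Orders that never fill stay in the dictionary, so a production version would need to evict them on cancel or reject.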

### 2. Market Data Staleness

```python
from prometheus_client import Gauge
import time

data_staleness = Gauge(
    'market_data_staleness_seconds',
    'Age of latest market data',
    labelnames=['exchange', 'symbol']
)

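# In your market data handler.
# Assumes exchange_timestamp is in epoch seconds and that the local clock is
# synchronized with the exchange; any clock offset shows up as staleness.
def on_tick(symbol, exchange_timestamp, exchange='binance'):
    staleness = time.time() - exchange_timestamp
    data_staleness.labels(exchange=exchange, symbol=symbol).set(staleness)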
```
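One caveat with setting the gauge inside the tick handler: if the feed stops, no tick arrives, and the gauge stays frozen at its last (healthy) value. A possible workaround, sketched below with a hypothetical `StalenessTracker`, is to store the timestamp of the newest tick and compute the age whenever the value is read.

```python
import time


class StalenessTracker:
    """Stores the newest tick timestamp per symbol; age is computed on read."""

    def __init__(self, clock=time.time):
        self._clock = clock     # wall clock, injectable for tests
        self._last_tick = {}    # symbol -> exchange timestamp (epoch seconds)

    def on_tick(self, symbol: str, exchange_timestamp: float) -> None:
        self._last_tick[symbol] = exchange_timestamp

    def staleness(self, symbol: str) -> float:
        """Age of the newest tick in seconds; keeps growing while the feed is silent."""
        last = self._last_tick.get(symbol)
        if last is None:
            return float("inf")  # no tick seen yet
        # Clamp at zero so a slightly fast exchange clock never reports negative age.
        return max(0.0, self._clock() - last)


tracker = StalenessTracker()
tracker.on_tick("BTC-USD", time.time() - 0.001)
print(f"staleness: {tracker.staleness('BTC-USD') * 1000:.2f} ms")
```

With `prometheus_client`, the gauge's `set_function()` hook should let you call `tracker.staleness(symbol)` at scrape time instead of calling `set()` per tick. Treat that wiring as an assumption to verify against your client version.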

### 3. Position Drift

```python
position_drift = Gauge(
    'position_drift_absolute',
    'Difference between expected and actual position',
    labelnames=['symbol']
)

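# Periodic reconciliation.
# tracked_symbols, internal_position and query_exchange_position stand in for
# your own state and exchange client.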
def reconcile_positions():
    for symbol in tracked_symbols:
        expected = internal_position[symbol]
        actual = query_exchange_position(symbol)
        drift = abs(expected - actual)
        position_drift.labels(symbol=symbol).set(drift)
```
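The loop above only checks symbols you already track, so a position that exists at the exchange but not in your internal book would go unnoticed. The sketch below compares the union of both books and uses `Decimal` to avoid float rounding producing phantom drift. `compute_drift` and the sample positions are made up for illustration.

```python
from decimal import Decimal


def compute_drift(expected: dict, actual: dict) -> dict:
    """Absolute drift per symbol across both books; a missing symbol counts as zero."""
    zero = Decimal("0")
    symbols = set(expected) | set(actual)
    return {
        symbol: abs(expected.get(symbol, zero) - actual.get(symbol, zero))
        for symbol in sorted(symbols)
    }


internal = {"BTC-USD": Decimal("1.5"), "ETH-USD": Decimal("10")}
exchange = {"BTC-USD": Decimal("1.5"), "ETH-USD": Decimal("9"), "SOL-USD": Decimal("3")}

drifted = {s: d for s, d in compute_drift(internal, exchange).items() if d > 0}
print(drifted)
```

Here `ETH-USD` drifts by 1 and `SOL-USD` by 3, the latter being a position the internal book never knew about.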

---

## SLO Definitions

| Metric | Target | Alert Threshold | Business Impact |
|--------|--------|-----------------|-----------------|
| **Fill Latency p99** | < 1ms | > 2ms for 1min | Worse prices |
| **Fill Latency p99.9** | < 10ms | > 20ms for 30s | Strategy disabled |
| **Market Data Staleness** | < 500µs | > 2ms for 10s | Wrong pricing |
| **Position Drift** | 0 | > 0 for 5min | Inventory risk |

The "$10K/day lost" figures in many monitoring articles are illustrative — actual impact depends entirely on your strategy's position sizes and markets.

---

## Prometheus Queries

### Fill Latency Percentiles

```promql
# P99 fill latency by exchange
histogram_quantile(0.99,
  sum(rate(order_fill_latency_seconds_bucket[5m])) by (le, exchange)
)

# P99.9 for catching tail latency
histogram_quantile(0.999,
  sum(rate(order_fill_latency_seconds_bucket[5m])) by (le, exchange)
)
```
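`histogram_quantile` does not see individual samples. It interpolates linearly inside whichever bucket contains the requested rank, so the result is only as precise as your bucket boundaries. The function below is a simplified re-implementation of that interpolation for classic buckets, written for illustration rather than as a copy of Prometheus's code. The sample counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # rank lies beyond the last finite bucket
            if count == prev_count:
                return bound       # empty bucket, nothing to interpolate
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# 1000 fills: 985 finished within 1ms, the remaining 15 within 5ms.
cumulative = [(0.0001, 100), (0.0005, 700), (0.001, 985), (0.005, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, cumulative)
print(f"estimated p99: {p99 * 1000:.2f} ms")
```

The estimate comes out at about 2.33ms even if all 15 slow fills actually took 1.1ms, because nothing separates 1ms from 5ms. Placing a bucket boundary at each alert threshold keeps the comparison against that threshold meaningful.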

### Market Data Freshness

```promql
# Alert if any symbol is stale > 2ms
max(market_data_staleness_seconds) by (exchange) > 0.002
```

### Position Reconciliation

```promql
# Any position drift is a problem
sum(position_drift_absolute) > 0
```

---

## Common Misconceptions

**Myth:** "p99 latency is enough."
**Reality:** p99 hides your worst 1%. In trading, one 100ms spike during volatile markets can cause significant losses. Track p99.9 or even p99.99.

**Myth:** "We monitor latency, so we're fine."
**Reality:** Latency to where? You need to measure the complete path: tick-to-trade (market data received to order sent) and order-to-fill (order sent to fill confirmed).

**Myth:** "Position drift can wait for daily reconciliation."
**Reality:** If you're trading many orders per day and one fails silently, you could have an unhedged position for hours. Real-time reconciliation is mandatory.
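To make "latency to where?" concrete, the sketch below timestamps each stage of a single order and reports the time spent between consecutive stages. The stage names follow the path used in Exercise 1. `StageTimer` itself is a hypothetical helper, not part of any library.

```python
import time

STAGES = ("order_created", "order_sent", "exchange_ack", "fill_confirmed")


class StageTimer:
    """Records a monotonic timestamp per stage for a single order."""

    def __init__(self, clock=time.perf_counter_ns):
        self._clock = clock   # nanosecond monotonic clock, injectable for tests
        self._marks = {}      # stage name -> timestamp in nanoseconds

    def mark(self, stage: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self._marks[stage] = self._clock()

    def segments_ns(self) -> dict:
        """Elapsed nanoseconds between consecutive stages that were both marked."""
        out = {}
        for start, end in zip(STAGES, STAGES[1:]):
            if start in self._marks and end in self._marks:
                out[f"{start}->{end}"] = self._marks[end] - self._marks[start]
        return out


timer = StageTimer()
for stage in STAGES:
    timer.mark(stage)  # in real code, call mark() where each event happens
print(timer.segments_ns())
```

Exporting each segment as its own histogram lets you see whether a slow fill came from your own code, the network, or the exchange.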

---

## Grafana Dashboard Layout

Recommended panels:

```text
+------------------+------------------+------------------+
|  Fill Latency    |  Market Data     |  Position Drift  |
|  (Heatmap)       |  Staleness       |  (per symbol)    |
+------------------+------------------+------------------+
|  Fill Rate %     |  Order Flow      |  PnL Real-time   |
|  (by venue)      |  (orders/sec)    |  (streaming)     |
+------------------+------------------+------------------+
|  Error Rate      |  Quote/Trade     |  System Alerts   |
|  (by type)       |  Ratio           |  (last 24h)      |
+------------------+------------------+------------------+
```

---

## Alerting Rules

```yaml
# prometheus/alerts.yml
groups:
  - name: trading
    rules:
      - alert: FillLatencyHigh
        expr: histogram_quantile(0.99, sum(rate(order_fill_latency_seconds_bucket[5m])) by (le)) > 0.002
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Fill latency p99 above 2ms"

      - alert: MarketDataStale
        expr: max(market_data_staleness_seconds) > 0.002
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Market data > 2ms stale"

      - alert: PositionDrift
        expr: sum(position_drift_absolute) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Position mismatch detected"
```
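The `for:` clause means the expression must stay true across consecutive rule evaluations for the full duration before the alert fires. A single dip below the threshold resets the timer. The function below is a simplified model of that behavior, not Prometheus's actual state machine. The evaluation interval and sample values are invented.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which the alert would fire, or None.

    `samples` is an iterable of (timestamp_seconds, value) pairs in time
    order, one pair per rule evaluation.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: pending
            if ts - pending_since >= for_seconds:
                return ts           # held long enough: firing
        else:
            pending_since = None    # condition cleared: reset
    return None


# p99 fill latency evaluated every 15s; one good sample at t=30 resets the timer.
evaluations = [(0, 0.003), (15, 0.003), (30, 0.001), (45, 0.003),
               (60, 0.003), (75, 0.003), (90, 0.003), (105, 0.003)]
print(first_firing_time(evaluations, threshold=0.002, for_seconds=60))
```

With these samples the alert fires at t=105, not t=60, because the condition only holds continuously from t=45 onward. Keep that in mind when a latency metric flaps around its threshold.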

---

## Practice Exercises

### Exercise 1: Implement Fill Latency
```python
# Add timing around your order flow
# Measure: order_created → order_sent → exchange_ack → fill_confirmed
```

### Exercise 2: Set Up Staleness Check
```python
# Compare exchange timestamp to local time
# Alert if difference > 2ms
```

### Exercise 3: Build a Grafana Dashboard
```text
# Create panels for:
# - Fill latency heatmap (last 1 hour)
# - Staleness gauge (current value)
# - Position reconciliation status
```

---

## Key Takeaways

- **Trading metrics differ from SRE metrics:** latency is a business metric.
- **p99.9 matters more than p99:** tail latency costs money.
- **Market data staleness is invisible:** you're trading on old prices without knowing.
- **Position drift = silent failure:** real-time reconciliation is mandatory.

## What’s Next?

eBPF Profiling

Trading Metrics: What SRE Dashboards Miss

Pro Version: For production implementation, see Monitoring Trading Systems
