
# Kubernetes StatefulSets: Why Trading Systems Need State

Deep dive into StatefulSets vs Deployments, pod identity, PersistentVolumes, and graceful shutdown patterns for trading infrastructure.

6 min
#kubernetes #trading #statefulsets #eks #infrastructure #devops

Deployments assume pods are fungible. Any instance can handle any request. Trading systems are the opposite.

Your trading engine holds state: exchange connections, position tracking, order IDs. Kill a pod, lose the state, lose money. Restart with a different identity, create duplicate orders.

This post covers why StatefulSets are essential for trading, how they work internally, and the complete configuration pattern.

## The Problem {#the-problem}

Deployment failure mode:

```yaml
# WRONG
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trading-bot
spec:
  replicas: 3
```

**What happens:**
1. All 3 pods connect to Binance
2. All 3 receive same market data
3. All 3 try to execute same trade
4. 2/3 rejected as duplicates
5. Rate limits exhausted

**Root cause:** Pods have random names (`trading-bot-7d8f9-xyz`). No identity assignment. No leader election. No state persistence.

For the broader architecture context, see [First Principles](/blog/first-trading-infrastructure-principles). For kernel-level tuning on Kubernetes nodes, see [CPU Optimization](/blog/cpu-optimization-linux-latency).


## Background: Kubernetes Scheduling {#background}

### How Deployments Work

Deployments manage ReplicaSets ([controller source](https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/deployment)):

```text
Deployment → ReplicaSet → Pods
```

**Key behaviors:**
- Pods get random suffixes
- Any pod can be killed first during scale-down
- PersistentVolumeClaims are shared (if any)
- No ordering guarantees

### How StatefulSets Work

StatefulSets provide ordered, persistent identity ([controller source](https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/statefulset)):

```text
StatefulSet → Pods with stable names

    pod-0, pod-1, pod-2 (always)
```

**Key behaviors:**
- Pods get ordinal names: `{statefulset}-0`, `{statefulset}-1`
- Ordered creation: 0 must be Running before 1 starts
- Ordered deletion: N-1 deleted before N-2
- Stable network identity via headless service
- Per-pod PersistentVolumeClaims

### Why This Matters for Trading

| Requirement | Deployment | StatefulSet |
|-------------|------------|-------------|
| Stable identity | No | Yes |
| Per-pod storage | Shared only | Per-pod |
| Ordered scaling | No | Yes |
| Network identity | Random | Stable DNS |


## Fix 1: StatefulSets for Identity {#statefulsets}

### The Pattern

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
  namespace: trading
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel  # All start together (fast)
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine
    spec:
      containers:
      - name: engine
        image: trading-engine:latest
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        # No ASSIGNED_MARKET env var needed: the application reads
        # POD_NAME and derives its own assignment.
        #   trading-engine-0 → BTC
        #   trading-engine-1 → ETH
        #   trading-engine-2 → SOL
```

### Application Logic

```python
import os

POD_NAME = os.environ.get('POD_NAME', 'trading-engine-0')
POD_ORDINAL = int(POD_NAME.split('-')[-1])

MARKET_ASSIGNMENTS = {
    0: ['BTCUSDT', 'BTCUSD'],
    1: ['ETHUSDT', 'ETHUSD'],
    2: ['SOLUSDT', 'SOLUSD'],
}

my_markets = MARKET_ASSIGNMENTS.get(POD_ORDINAL, [])
print(f"Pod {POD_ORDINAL} handling markets: {my_markets}")
```

### Expected Behavior

| Event | Result |
|-------|--------|
| Pod-0 crashes | Pod-0 restarts (same identity, same markets) |
| Scale to 4 | Pod-3 created, gets new market assignment |
| Scale to 2 | Pod-2 deleted first (reverse order) |
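
One caveat on the scale-to-4 row: with the static map above, `trading-engine-3` gets an empty assignment until you extend `MARKET_ASSIGNMENTS`. A sketch of one alternative (my illustration, not the post's prescription) stripes a single master list by ordinal so any replica count yields a deterministic slice:

```python
# Hypothetical master list; in practice this would come from config.
ALL_MARKETS = ['BTCUSDT', 'ETHUSDT', 'SOLUSDT', 'AVAXUSDT', 'DOGEUSDT']

def markets_for(ordinal: int, replicas: int) -> list[str]:
    # Pod i takes markets i, i+replicas, i+2*replicas, ...
    # Note: changing the replica count reshuffles assignments,
    # so drain positions before rescaling.
    return ALL_MARKETS[ordinal::replicas]

print(markets_for(0, 3))  # ['BTCUSDT', 'AVAXUSDT']
print(markets_for(3, 4))  # ['AVAXUSDT']
```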


## Fix 2: Headless Services {#headless}

### The Problem

ClusterIP services load-balance. You can't connect to a specific pod.

### The Fix

```yaml
apiVersion: v1
kind: Service
metadata:
  name: trading-headless
spec:
  clusterIP: None  # Headless
  selector:
    app: trading-engine
  ports:
  - port: 8080
    name: http
  - port: 9090
    name: metrics
```

### How It Works

With headless service, each pod gets stable DNS:
- `trading-engine-0.trading-headless.trading.svc.cluster.local`
- `trading-engine-1.trading-headless.trading.svc.cluster.local`

Your risk engine can connect directly to each trading engine:

```python
TRADING_ENGINES = [
    "trading-engine-0.trading-headless.trading.svc.cluster.local:8080",
    "trading-engine-1.trading-headless.trading.svc.cluster.local:8080",
]

# get_position() is whatever client call your risk engine uses;
# the point is that each address resolves to exactly one pod.
for engine in TRADING_ENGINES:
    position = get_position(engine)
```

No load balancer in the path. Direct TCP connections.
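
You can see the direct path by resolving a pod's DNS name yourself; a minimal sketch (only resolvable from inside the cluster):

```python
import socket

# Each headless-service A record maps to exactly one pod IP,
# so this returns the pod's address with no load balancer involved.
host = "trading-engine-0.trading-headless.trading.svc.cluster.local"
info = socket.getaddrinfo(host, 8080, proto=socket.IPPROTO_TCP)
print(info[0][4])  # (pod_ip, 8080)
```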


## Fix 3: Persistent Volumes {#pv}

### The Problem

Trading engines need persistent state:
- Order history (for reconciliation)
- Position snapshots (for crash recovery)
- WAL logs (for replay)

Without persistence, restart = lost state.

### The Fix

```yaml
volumeClaimTemplates:
- metadata:
    name: trading-data
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "gp3-encrypted"
    resources:
      requests:
        storage: 50Gi
```

**StorageClass for EKS:**

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"
```

The `iops` and `throughput` values above are gp3's baselines (3,000 IOPS, 125 MiB/s); raise them if WAL writes or replay are I/O-bound.

### How It Works

Each pod gets its own PVC:
- `trading-data-trading-engine-0`
- `trading-data-trading-engine-1`

PVCs persist across pod restarts. Delete pod → PVC remains → New pod gets same PVC.
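
A minimal recovery sketch built on that guarantee, assuming the volume is mounted at `/data` and a hypothetical JSON snapshot file:

```python
import json
import os

STATE_FILE = "/data/position_snapshot.json"  # lives on this pod's own PVC

def save_state(state: dict) -> None:
    # Write-then-rename so a crash mid-write never leaves a torn snapshot.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, STATE_FILE)

def load_state() -> dict:
    # On restart, the replacement pod mounts the same PVC and finds
    # whatever the previous incarnation flushed before dying.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {}  # first boot: no snapshot yet
```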

For EBS optimization, see [Storage Deep Dive](/blog/storage-io-linux-latency#ebs).


## Fix 4: Graceful Shutdown {#shutdown}

### The Problem

Default termination: SIGTERM → wait 30s → SIGKILL.

Trading needs:
1. Cancel open orders (5-30s)
2. Wait for exchange confirmations (10s)
3. Flush state (1s)

Worst case that adds up to 40+ seconds, so the default 30-second window isn't enough when the exchange is slow.

### The Fix

```yaml
spec:
  terminationGracePeriodSeconds: 120  # 2 minutes
  containers:
  - name: engine
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # Signal application to stop trading
            curl -X POST http://localhost:8080/shutdown

            # Wait for order cancellations
            sleep 60

            # Final state flush happens in SIGTERM handler
```

Kubernetes runs the preStop hook before sending SIGTERM, and the hook's `sleep 60` counts against `terminationGracePeriodSeconds`, so the 120-second budget covers both the hook and the SIGTERM handler.

### Application Pattern

```python
import signal
import sys
import time

shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True  # main loop checks this and stops placing orders

    # Cancel all open orders.
    # get_open_orders(), cancel_order(), save_state_to_disk() are
    # your own trading-layer calls.
    for order in get_open_orders():
        cancel_order(order)

    # Wait for confirmations
    while get_open_orders():
        time.sleep(1)

    # Flush state
    save_state_to_disk()

    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```
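
The preStop hook above POSTs to a `/shutdown` endpoint, which the engine has to expose itself; a minimal standard-library sketch (the path and port are this post's example values):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

shutdown_requested = False  # same flag the SIGTERM handler sets

class ShutdownHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        global shutdown_requested
        if self.path == "/shutdown":
            shutdown_requested = True  # main loop stops placing new orders
            self.send_response(200)
        else:
            self.send_response(404)
        self.end_headers()

# Serve the control endpoint beside the trading loop.
server = HTTPServer(("0.0.0.0", 8080), ShutdownHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
```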


## Fix 5: Pod Disruption Budgets {#pdb}

### The Problem

Kubernetes can evict pods during:
- Node upgrades
- Cluster autoscaler decisions
- Spot instance reclaims

Without protection, all pods could be evicted simultaneously.

### The Fix

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trading-pdb
spec:
  minAvailable: 2  # At least 2 pods always running
  selector:
    matchLabels:
      app: trading-engine
```

### How It Works

**Voluntary disruptions** (upgrades, autoscaler) respect PDB:
- Want to evict pod-0
- Check PDB: 3 running, need 2 minimum
- Eviction allowed (3-1=2 ≥ 2)
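
That check is plain arithmetic; a sketch:

```python
def eviction_allowed(ready_pods: int, min_available: int) -> bool:
    # A voluntary eviction may proceed only if removing one pod
    # still leaves at least min_available ready.
    return ready_pods - 1 >= min_available

assert eviction_allowed(3, 2)       # 3 running, floor of 2 -> allowed
assert not eviction_allowed(2, 2)   # already at the floor -> blocked
```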

**Involuntary disruptions** (node crash) don't check PDB. You need multi-AZ for that.


## Complete StatefulSet Example {#complete-example}

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
  namespace: trading
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      terminationGracePeriodSeconds: 120

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - trading
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: trading-engine
              topologyKey: topology.kubernetes.io/zone

      containers:
      - name: engine
        image: trading-engine:v1.2.3
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName

        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"

        ports:
        - containerPort: 8080
          name: http
        - containerPort: 9090
          name: metrics

        volumeMounts:
        - name: trading-data
          mountPath: /data

        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - "curl -X POST localhost:8080/shutdown && sleep 60"

  volumeClaimTemplates:
  - metadata:
      name: trading-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "gp3-encrypted"
      resources:
        requests:
          storage: 50Gi

```

## Design Philosophy {#design-philosophy}

### Stateless vs Stateful

Kubernetes was designed for stateless workloads. The original patterns assumed:

- Ephemeral pods
- Shared state in databases
- Any pod handles any request

Trading is inherently stateful:

- Exchange connections are stateful (WebSocket)
- Position tracking requires in-memory state
- Order IDs need persistence

StatefulSets bridge this gap.

### The Tradeoff

| Deployment | StatefulSet |
|------------|-------------|
| Simple scaling | Ordered scaling |
| Fast rollouts | Careful rollouts |
| No identity | Stable identity |
| Shared state | Per-pod state |

StatefulSets are more complex. That complexity is the cost of correctness.


## Audit Your Infrastructure

Running trading on Kubernetes? The underlying nodes still need kernel tuning. Check out `latency-audit` to verify CPU governors, memory settings, and network configurations on your node pools.

## Continue Reading

Continue exploring with these related deep dives:

| Topic | Next Post |
|-------|-----------|
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| The 5 kernel settings that cost you latency | Linux Defaults That Cost You Latency |