
# Kubernetes StatefulSets: Why Trading Systems Need State

Deep dive into StatefulSets vs Deployments, pod identity, PersistentVolumes, and graceful shutdown patterns for trading infrastructure.

6 min
#kubernetes #trading #statefulsets #eks #infrastructure #devops

Deployments assume pods are fungible. Any instance can handle any request. Trading systems are the opposite.

Your trading engine holds state: exchange connections, position tracking, order IDs. Kill a pod, lose the state, lose money. Restart with a different identity, create duplicate orders.

This post covers why StatefulSets are essential for trading, how they work internally, and the complete configuration pattern.

## The Problem {#the-problem}

Deployment failure mode:

```yaml
# WRONG
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trading-bot
spec:
  replicas: 3
```

**What happens:**
1. All 3 pods connect to Binance
2. All 3 receive same market data
3. All 3 try to execute same trade
4. 2/3 rejected as duplicates
5. Rate limits exhausted

**Root cause:** Pods have random names (`trading-bot-7d8f9-xyz`). No identity assignment. No leader election. No state persistence.

For the broader architecture context, see [First Principles](/blog/first-trading-infrastructure-principles). For kernel-level tuning on Kubernetes nodes, see [CPU Optimization](/blog/cpu-optimization-linux-latency).


## Background: Kubernetes Scheduling {#background}

### How Deployments Work

Deployments manage ReplicaSets ([controller source](https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/deployment)):

```text
Deployment → ReplicaSet → Pods
```

**Key behaviors:**
- Pods get random suffixes
- Any pod can be killed first during scale-down
- PersistentVolumeClaims are shared (if any)
- No ordering guarantees

### How StatefulSets Work

StatefulSets provide ordered, persistent identity ([controller source](https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/statefulset)):

```text
StatefulSet → Pods with stable names

    pod-0, pod-1, pod-2 (always)
```

**Key behaviors:**
- Pods get ordinal names: `{statefulset}-0`, `{statefulset}-1`
- Ordered creation: 0 must be Running before 1 starts
- Ordered deletion: N-1 deleted before N-2
- Stable network identity via headless service
- Per-pod PersistentVolumeClaims

### Why This Matters for Trading

| Requirement | Deployment | StatefulSet |
|-------------|------------|-------------|
| Stable identity | No | Yes |
| Per-pod storage | Shared only | Per-pod |
| Ordered scaling | No | Yes |
| Network identity | Random | Stable DNS |


## Fix 1: StatefulSets for Identity {#statefulsets}

### The Pattern

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
  namespace: trading
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel  # All start together (fast)
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine
    spec:
      containers:
      - name: engine
        image: trading-engine:latest
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        # No ASSIGNED_MARKET env var needed: the application reads
        # POD_NAME and derives its own assignment.
        #   trading-engine-0 → BTC
        #   trading-engine-1 → ETH
        #   trading-engine-2 → SOL
```

### Application Logic

```python
import os

POD_NAME = os.environ.get('POD_NAME', 'trading-engine-0')
POD_ORDINAL = int(POD_NAME.split('-')[-1])

MARKET_ASSIGNMENTS = {
    0: ['BTCUSDT', 'BTCUSD'],
    1: ['ETHUSDT', 'ETHUSD'],
    2: ['SOLUSDT', 'SOLUSD'],
}

my_markets = MARKET_ASSIGNMENTS.get(POD_ORDINAL, [])
print(f"Pod {POD_ORDINAL} handling markets: {my_markets}")
```

### Expected Behavior

| Event | Result |
|-------|--------|
| Pod-0 crashes | Pod-0 restarts (same identity, same markets) |
| Scale to 4 | Pod-3 created, gets new market assignment |
| Scale to 2 | Pod-2 deleted first (reverse order) |
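
One caveat on the scale-to-4 row: with the static map above, `trading-engine-3` gets an empty assignment until you extend `MARKET_ASSIGNMENTS`. A sketch of one alternative (my illustration, not the post's prescription) stripes a single master list by ordinal so any replica count yields a deterministic slice:

```python
# Hypothetical master list; in practice this would come from config.
ALL_MARKETS = ['BTCUSDT', 'ETHUSDT', 'SOLUSDT', 'AVAXUSDT', 'DOGEUSDT']

def markets_for(ordinal: int, replicas: int) -> list[str]:
    # Pod i takes markets i, i+replicas, i+2*replicas, ...
    # Note: changing the replica count reshuffles assignments,
    # so drain positions before rescaling.
    return ALL_MARKETS[ordinal::replicas]

print(markets_for(0, 3))  # ['BTCUSDT', 'AVAXUSDT']
print(markets_for(3, 4))  # ['AVAXUSDT']
```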


## Fix 2: Headless Services {#headless}

### The Problem

ClusterIP services load-balance. You can't connect to a specific pod.

### The Fix

```yaml
apiVersion: v1
kind: Service
metadata:
  name: trading-headless
spec:
  clusterIP: None  # Headless
  selector:
    app: trading-engine
  ports:
  - port: 8080
    name: http
  - port: 9090
    name: metrics
```

### How It Works

With headless service, each pod gets stable DNS:
- `trading-engine-0.trading-headless.trading.svc.cluster.local`
- `trading-engine-1.trading-headless.trading.svc.cluster.local`

Your risk engine can connect directly to each trading engine:

```python
TRADING_ENGINES = [
    "trading-engine-0.trading-headless.trading.svc.cluster.local:8080",
    "trading-engine-1.trading-headless.trading.svc.cluster.local:8080",
]

# get_position() is whatever client call your risk engine uses;
# the point is that each address resolves to exactly one pod.
for engine in TRADING_ENGINES:
    position = get_position(engine)
```

No load balancer in the path. Direct TCP connections.
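
You can see the direct path by resolving a pod's DNS name yourself; a minimal sketch (only resolvable from inside the cluster):

```python
import socket

# Each headless-service A record maps to exactly one pod IP,
# so this returns the pod's address with no load balancer involved.
host = "trading-engine-0.trading-headless.trading.svc.cluster.local"
info = socket.getaddrinfo(host, 8080, proto=socket.IPPROTO_TCP)
print(info[0][4])  # (pod_ip, 8080)
```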


## Fix 3: Persistent Volumes {#pv}

### The Problem

Trading engines need persistent state:
- Order history (for reconciliation)
- Position snapshots (for crash recovery)
- WAL logs (for replay)

Without persistence, restart = lost state.

### The Fix

```yaml
volumeClaimTemplates:
- metadata:
    name: trading-data
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "gp3-encrypted"
    resources:
      requests:
        storage: 50Gi
```

**StorageClass for EKS:**

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"
```

The `iops` and `throughput` values above are gp3's baselines (3,000 IOPS, 125 MiB/s); raise them if WAL writes or replay are I/O-bound.

### How It Works

Each pod gets its own PVC:
- `trading-data-trading-engine-0`
- `trading-data-trading-engine-1`

PVCs persist across pod restarts. Delete pod → PVC remains → New pod gets same PVC.
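
A minimal recovery sketch built on that guarantee, assuming the volume is mounted at `/data` and a hypothetical JSON snapshot file:

```python
import json
import os

STATE_FILE = "/data/position_snapshot.json"  # lives on this pod's own PVC

def save_state(state: dict) -> None:
    # Write-then-rename so a crash mid-write never leaves a torn snapshot.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, STATE_FILE)

def load_state() -> dict:
    # On restart, the replacement pod mounts the same PVC and finds
    # whatever the previous incarnation flushed before dying.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {}  # first boot: no snapshot yet
```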

For EBS optimization, see [Storage Deep Dive](/blog/storage-io-linux-latency#ebs).


## Fix 4: Graceful Shutdown {#shutdown}

### The Problem

Default termination: SIGTERM → wait 30s → SIGKILL.

Trading needs:
1. Cancel open orders (5-30s)
2. Wait for exchange confirmations (10s)
3. Flush state (1s)

Worst case that adds up to 40+ seconds, so the default 30-second window isn't enough when the exchange is slow.

### The Fix

```yaml
spec:
  terminationGracePeriodSeconds: 120  # 2 minutes
  containers:
  - name: engine
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # Signal application to stop trading
            curl -X POST http://localhost:8080/shutdown

            # Wait for order cancellations
            sleep 60

            # Final state flush happens in SIGTERM handler
```

Kubernetes runs the preStop hook before sending SIGTERM, and the hook's `sleep 60` counts against `terminationGracePeriodSeconds`, so the 120-second budget covers both the hook and the SIGTERM handler.

### Application Pattern

```python
import signal
import sys
import time

shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True  # main loop checks this and stops placing orders

    # Cancel all open orders.
    # get_open_orders(), cancel_order(), save_state_to_disk() are
    # your own trading-layer calls.
    for order in get_open_orders():
        cancel_order(order)

    # Wait for confirmations
    while get_open_orders():
        time.sleep(1)

    # Flush state
    save_state_to_disk()

    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```
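
The preStop hook above POSTs to a `/shutdown` endpoint, which the engine has to expose itself; a minimal standard-library sketch (the path and port are this post's example values):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

shutdown_requested = False  # same flag the SIGTERM handler sets

class ShutdownHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        global shutdown_requested
        if self.path == "/shutdown":
            shutdown_requested = True  # main loop stops placing new orders
            self.send_response(200)
        else:
            self.send_response(404)
        self.end_headers()

# Serve the control endpoint beside the trading loop.
server = HTTPServer(("0.0.0.0", 8080), ShutdownHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
```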


## Fix 5: Pod Disruption Budgets {#pdb}

### The Problem

Kubernetes can evict pods during:
- Node upgrades
- Cluster autoscaler decisions
- Spot instance reclaims

Without protection, all pods could be evicted simultaneously.

### The Fix

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trading-pdb
spec:
  minAvailable: 2  # At least 2 pods always running
  selector:
    matchLabels:
      app: trading-engine
```

### How It Works

**Voluntary disruptions** (upgrades, autoscaler) respect PDB:
- Want to evict pod-0
- Check PDB: 3 running, need 2 minimum
- Eviction allowed (3-1=2 ≥ 2)
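
That check is plain arithmetic; a sketch:

```python
def eviction_allowed(ready_pods: int, min_available: int) -> bool:
    # A voluntary eviction may proceed only if removing one pod
    # still leaves at least min_available ready.
    return ready_pods - 1 >= min_available

assert eviction_allowed(3, 2)       # 3 running, floor of 2 -> allowed
assert not eviction_allowed(2, 2)   # already at the floor -> blocked
```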

**Involuntary disruptions** (node crash) don't check PDB. You need multi-AZ for that.


## Complete StatefulSet Example {#complete-example}

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
  namespace: trading
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      terminationGracePeriodSeconds: 120

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - trading
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: trading-engine
              topologyKey: topology.kubernetes.io/zone

      containers:
      - name: engine
        image: trading-engine:v1.2.3
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName

        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"

        ports:
        - containerPort: 8080
          name: http
        - containerPort: 9090
          name: metrics

        volumeMounts:
        - name: trading-data
          mountPath: /data

        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - "curl -X POST localhost:8080/shutdown && sleep 60"

  volumeClaimTemplates:
  - metadata:
      name: trading-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "gp3-encrypted"
      resources:
        requests:
          storage: 50Gi

```

## Design Philosophy {#design-philosophy}

### Stateless vs Stateful

Kubernetes was designed for stateless workloads. The original patterns assumed:

- Ephemeral pods
- Shared state in databases
- Any pod handles any request

Trading is inherently stateful:

- Exchange connections are stateful (WebSocket)
- Position tracking requires in-memory state
- Order IDs need persistence

StatefulSets bridge this gap.

### The Tradeoff

| Deployment | StatefulSet |
|------------|-------------|
| Simple scaling | Ordered scaling |
| Fast rollouts | Careful rollouts |
| No identity | Stable identity |
| Shared state | Per-pod state |

StatefulSets are more complex. That complexity is the cost of correctness.


## Audit Your Infrastructure

Running trading on Kubernetes? The underlying nodes still need kernel tuning. Check out `latency-audit` to verify CPU governors, memory settings, and network configurations on your node pools.

## Continue Reading

Continue exploring with these related deep dives:

| Topic | Next Post |
|-------|-----------|
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| The 5 kernel settings that cost you latency | Linux Defaults That Cost You Latency |