
Infrastructure

Trading Infrastructure: First Principles That Scale

Architecture decisions that determine your latency ceiling. AWS, Kubernetes, monitoring, and security patterns for crypto trading systems.

6 min
#trading #infrastructure #aws #kubernetes #architecture #sre #crypto

Infrastructure decisions made in month one determine your latency ceiling for years.

I’ve built trading infrastructure at Akuna Capital and Gemini, and now at ZeroCopy. The pattern I see everywhere: teams obsess over algorithm optimization while running on misconfigured infrastructure. The algorithm saves 10µs. The infrastructure costs 100µs.

This post covers the foundational decisions: AWS architecture, Kubernetes patterns, monitoring, and security. First principles, not tweaks.

## The Problem {#the-problem}

Crypto trading infrastructure faces unique challenges:

| Challenge | Traditional HFT | Crypto Trading |
|-----------|-----------------|----------------|
| Location | Colocated, bare-metal | Cloud (AWS) required |
| Latency target | <10µs | 100µs-5ms acceptable |
| Protocol | Proprietary, FIX | WebSocket, REST, FIX varies |
| Uptime | Market hours | 24/7/365 |
| Key management | HSMs | Hot wallets, MPC |

The Physics: Traditional HFT optimizes for nanoseconds using kernel bypass (DPDK, RDMA) on dedicated hardware. Crypto trading operates on different physics: the limiting factor is network RTT to exchanges (50-200ms), not local processing. This means we optimize for reliability and observability first, then latency.

For kernel-level optimizations, see the deep dive series linked in the Reading Path at the end of this post.

## AWS Architecture for Trading {#aws}

### VPC Design

Trading VPCs need:

  1. Private subnets for trading engines (no public IPs)
  2. NAT gateways for outbound exchange connectivity
  3. VPC endpoints for AWS services (no internet traversal)
```hcl
# Terraform: Trading VPC
resource "aws_vpc" "trading" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.trading.id
  cidr_block        = cidrsubnet(aws_vpc.trading.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "trading-private-${count.index}"
  }
}

# VPC endpoint for Secrets Manager (no internet)
resource "aws_vpc_endpoint" "secrets" {
  vpc_id              = aws_vpc.trading.id
  service_name        = "com.amazonaws.${var.region}.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
```
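The `cidrsubnet` call above extends the /16 prefix by 8 bits, carving out /24s and selecting the `count.index`-th one. The same arithmetic with Python's stdlib `ipaddress` module, as a quick sanity check of what Terraform will plan:

```python
import ipaddress

# cidrsubnet("10.0.0.0/16", 8, i): extend the prefix by 8 bits (/16 -> /24)
# and take the i-th resulting subnet -- same arithmetic as Terraform's cidrsubnet().
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(prefixlen_diff=8))

print(subnets[0])  # 10.0.0.0/24 -> trading-private-0
print(subnets[1])  # 10.0.1.0/24 -> trading-private-1
```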

## Instance Selection

| Use Case | Instance Type | Why |
|----------|---------------|-----|
| Trading engine | c6in.xlarge | Network-optimized (up to 25 Gbps on xlarge) |
| Market data | c6in.xlarge | Same; the network is the bottleneck |
| Risk engine | r6i.xlarge | Memory-optimized for state |
| Monitoring | t3.large | Cost-effective, not latency-critical |

**Citation:** [AWS Instance Types](https://aws.amazon.com/ec2/instance-types/).

### Placement Groups

**Critical for inter-instance latency:**

```hcl
resource "aws_placement_group" "trading" {
  name     = "trading-cluster"
  strategy = "cluster"  # Same rack
}

resource "aws_instance" "trading" {
  instance_type   = "c6in.xlarge"
  placement_group = aws_placement_group.trading.id
  subnet_id       = aws_subnet.private[0].id
}
```

**Cluster placement** puts instances on the same rack. This meaningfully reduces inter-instance network latency compared to random AZ placement, where you're crossing more switching fabric.

**Trade-off:** Single AZ = single point of failure. Acceptable for trading engines; DR handled at application level.


## Kubernetes Patterns {#kubernetes}

### Why StatefulSets

Trading workloads need:
- Persistent identity (pod-0 handles BTC, pod-1 handles ETH)
- Ordered scaling (risk engines start before trading)
- Persistent storage (state survives restarts)

**Deployments don't provide these.** See [Kubernetes Deep Dive](/blog/kubernetes-for-trading-statefulsets) for full details.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine  # must match spec.selector or the API rejects the StatefulSet
    spec:
      containers:
      - name: engine
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        # POD_NAME = trading-engine-0, trading-engine-1, etc.
```
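Inside the container, that stable ordinal is what pins a shard to a pod. A minimal sketch of the mapping, assuming a hypothetical `SYMBOL_BY_ORDINAL` table:

```python
import os

# Hypothetical shard map: which symbol each StatefulSet ordinal owns.
SYMBOL_BY_ORDINAL = {0: "BTC-USDT", 1: "ETH-USDT", 2: "SOL-USDT"}

def assigned_symbol(pod_name: str) -> str:
    """Derive the shard from the StatefulSet ordinal: trading-engine-1 -> 1."""
    ordinal = int(pod_name.rsplit("-", 1)[1])
    return SYMBOL_BY_ORDINAL[ordinal]

print(assigned_symbol(os.environ.get("POD_NAME", "trading-engine-0")))
```

Because the ordinal survives restarts, pod-1 always comes back owning the same symbol and the same persistent volume.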

### Resource Configuration

```yaml
resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"
```

**Requests vs limits:** Requests are guaranteed. Limits are maximums. For trading:
- Set request = expected usage
- Set limit = 2x request (room for bursts without OOM kill)
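The 2x rule is easy to enforce in CI. A sketch that parses the common quantity suffixes (a simplified subset of the Kubernetes quantity format) and checks the headroom:

```python
def parse_quantity(q: str) -> float:
    """Parse a simplified subset of Kubernetes quantities: Gi/Mi memory, m (millicores) CPU."""
    if q.endswith("Gi"):
        return float(q[:-2]) * 1024 ** 3
    if q.endswith("Mi"):
        return float(q[:-2]) * 1024 ** 2
    if q.endswith("m"):
        return float(q[:-1]) / 1000.0
    return float(q)

def has_burst_headroom(request: str, limit: str, factor: float = 2.0) -> bool:
    """True if limit >= factor * request (room for bursts without an OOM kill)."""
    return parse_quantity(limit) >= factor * parse_quantity(request)

print(has_burst_headroom("4Gi", "8Gi"))      # True
print(has_burst_headroom("2000m", "3000m"))  # False: only 1.5x
```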

### Node Affinity

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type
            operator: In
            values:
            - trading
```

**Dedicated node groups** prevent resource contention with other workloads.


## Multi-Exchange Connectivity {#exchanges}

### Architecture: Connector Per Exchange

Each exchange has different:
- Rate limits
- Message formats
- Authentication
- Reconnection behavior

**Design principle:** Fault isolation. Binance down shouldn't affect Coinbase.

```yaml
# Per-exchange deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: connector-binance
spec:
  replicas: 2  # Hot standby
  selector:              # required for Deployments; must match template labels
    matchLabels:
      app: connector-binance
  template:
    metadata:
      labels:
        app: connector-binance
    spec:
      containers:
      - name: connector
        env:
        - name: EXCHANGE
          value: "binance"
        - name: WS_ENDPOINT
          value: "wss://stream.binance.com:9443"
```

## WebSocket Reliability

WebSockets silently disconnect. TCP keepalive isn't reliable. See [Orderbook Infrastructure](/blog/orderbook-reconstruction-submillisecond) for resilience patterns:

- Heartbeat monitoring
- Staleness detection
- Automatic reconnection with backoff
- Prometheus metrics for reliability
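The reconnection piece is worth sketching: exponential backoff with full jitter, so a fleet of connectors doesn't hammer the exchange in lockstep after a venue-side drop. Parameters here are illustrative, not prescriptive:

```python
import random

def backoff_delays(base_s: float = 1.0, cap_s: float = 30.0, attempts: int = 6):
    """Yield capped, fully-jittered exponential backoff delays in seconds."""
    for attempt in range(attempts):
        # Full jitter: uniform over [0, min(cap, base * 2^attempt)]
        yield random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))

# Usage: sleep through these between WebSocket reconnect attempts,
# resetting the sequence once a connection survives its first heartbeat.
delays = list(backoff_delays())
```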


## Monitoring That Matters {#monitoring}

### The Mistake: Infrastructure Metrics

CPU, memory, disk: these are **necessary but not sufficient**. A server can sit at 5% CPU while:
- Fills are 10x slower than expected
- Positions have drifted from exchange
- Market data is stale

### Trading-Specific SLOs

See [Monitoring Deep Dive](/blog/monitoring-trading-systems-metrics) for complete details.

**Essential metrics:**

```python
from prometheus_client import Histogram, Gauge, Counter

# Fill latency (P50, P95, P99)
FILL_LATENCY = Histogram(
    'trading_fill_latency_seconds',
    'Order submission to fill confirmation',
    ['exchange'],
    buckets=[0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
)

# Position drift (internal vs exchange)
POSITION_DRIFT = Gauge(
    'trading_position_drift_percent',
    'Difference between calculated and exchange position',
    ['exchange', 'symbol']
)

# Market data staleness
MARKET_DATA_AGE = Gauge(
    'trading_market_data_age_seconds',
    'Time since last orderbook update',
    ['exchange', 'symbol']
)
```
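Of the three, staleness is the easiest to get wrong: it has to be tracked on the consumer's own monotonic clock, not on exchange timestamps. A minimal sketch (the threshold is hypothetical; tune it per venue):

```python
import time

STALE_AFTER_S = 2.0  # hypothetical threshold; tune per venue and symbol

class StalenessTracker:
    """Track time since the last orderbook update on a monotonic local clock."""

    def __init__(self):
        self.last_update = time.monotonic()

    def on_update(self) -> None:
        # Called for every orderbook message; also the natural place to
        # update a gauge like MARKET_DATA_AGE.
        self.last_update = time.monotonic()

    def age_s(self) -> float:
        return time.monotonic() - self.last_update

    def is_stale(self) -> bool:
        return self.age_s() > STALE_AFTER_S
```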

## Alert Hierarchy

```text
PAGE IMMEDIATELY (wake me at 2am):
├── Position drift > 1%
├── No fills in 5 minutes (market hours)
├── WebSocket down > 30 seconds
└── Loss limit exceeded

SLACK (business hours):
├── Fill latency P99 > 1s
├── Rejection rate > 3%
└── Rate limit > 70%
```
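Those paging thresholds are better encoded than remembered. A sketch of the predicate, with the thresholds copied from the hierarchy above:

```python
def should_page(drift_pct: float, s_since_fill: float, ws_down_s: float,
                loss_limit_hit: bool, market_open: bool = True) -> bool:
    """Mirror the PAGE IMMEDIATELY tier: any one condition wakes someone up."""
    return (
        abs(drift_pct) > 1.0                      # position drift > 1%
        or (market_open and s_since_fill > 300)   # no fills in 5 minutes
        or ws_down_s > 30                         # WebSocket down > 30s
        or loss_limit_hit                         # loss limit exceeded
    )

print(should_page(0.2, 60, 0, False))  # False: all within tolerance
print(should_page(1.5, 60, 0, False))  # True: position drift
```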


## Security at Scale {#security}

### Defense in Depth

**Layers:**
1. VPC isolation (private subnets)
2. Security groups (minimal ports)
3. Secrets management (Secrets Manager, no env vars)
4. Key rotation (automated)
5. Audit logging (CloudTrail)

### API Key Management

```hcl
# AWS Secrets Manager
resource "aws_secretsmanager_secret" "exchange_keys" {
  name = "trading/exchange-api-keys"
}

# Automatic rotation
resource "aws_secretsmanager_secret_rotation" "api_keys" {
  secret_id           = aws_secretsmanager_secret.exchange_keys.id
  rotation_lambda_arn = aws_lambda_function.rotator.arn

  rotation_rules {
    automatically_after_days = 30
  }
}
```

## Hot Wallet Security

**Principle:** Minimize hot wallet exposure.

- **Cold storage:** 95%+ of funds, air-gapped
- **Hot wallets:** Trading capital only
- **MPC:** No single point of compromise

Treat a hot wallet compromise as an eventual certainty, not an edge case, and size hot balances accordingly.


## CI/CD for Trading {#cicd}

### Zero-Downtime Deployments

**ArgoCD rollout strategy:**

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: trading-engine
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: trading-health
      - setWeight: 50
      - pause: {duration: 5m}
```

### Pre-Deployment Checks

```yaml
# GitHub Actions
jobs:
  pre-deploy:
    steps:
    - name: Run latency-audit
      run: ./scripts/latency-audit.sh --json

    - name: Test exchange connectivity
      run: ./scripts/test_exchanges.sh

    - name: Validate configs
      run: ./scripts/validate_configs.py
```

## Design Philosophy {#design-philosophy}

### First Principles

**1. Latency is a system property, not a code property.**

Your algorithm runs in 10µs. But:

- Network adds 50µs (kernel stack)
- Memory adds 100µs (THP compaction)
- CPU adds 50µs (C-state wake)

Total: 210µs. The algorithm is 5% of the problem.
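The arithmetic is worth making explicit:

```python
# Latency budget from the components above, in microseconds.
budget_us = {
    "algorithm": 10,
    "kernel network stack": 50,
    "THP compaction stall": 100,
    "C-state wakeup": 50,
}
total_us = sum(budget_us.values())
algo_share = budget_us["algorithm"] / total_us

print(total_us)             # 210
print(f"{algo_share:.0%}")  # 5%
```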

**2. Reliability enables performance.**

You can’t optimize a system that’s down. Build reliable first, fast second.

**3. Observability drives optimization.**

You can’t fix what you can’t measure. Instrument everything.

**4. Security is non-negotiable.**

One breach erases years of profits. Defense in depth, always.

### When to Break Rules

These patterns are for production trading systems. For:

- **Development:** Defaults are fine
- **Backtesting:** Throughput matters more
- **Paper trading:** Reliability testing, not latency

## Continue Reading

### Reading Path

Continue exploring with these related deep dives:

| Topic | Next Post |
|-------|-----------|
| The 5 kernel settings that cost you latency | Linux Defaults That Cost You Latency |
| StatefulSets, pod placement, EKS patterns | Kubernetes StatefulSets: Why Trading Systems Need State |
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |