# Linux Defaults That Cost You Latency
Deep dive into vm.swappiness, THP compaction, and C-states. Kernel internals, measurements, and the design philosophy behind low-latency Linux tuning.
The default Linux kernel is optimized for throughput — maximizing total work completed — not latency — minimizing individual response time. This is intentional. The kernel developers designed for the common case: web servers, databases, batch processing where throughput matters more than any single request.
Trading is not the common case. Every server I’ve configured for low-latency work at Akuna Capital, Gemini, or ZeroCopy needed the same five settings changed. The defaults aren’t wrong — they’re wrong for this specific workload. This post explains why, with kernel internals, not just the sysctl commands.
## The Problem {#the-problem}
Every Linux distribution ships with defaults designed for general-purpose workloads:
| Setting | Default Value | Latency Impact | Root Cause |
|---|---|---|---|
| vm.swappiness | 60 | 10-100µs per page fault | Anonymous page reclamation |
| THP | always | 10-50ms compaction stalls | khugepaged defragmentation |
| CPU governor | powersave | 10-50µs frequency ramp | DVFS transitions |
| C-states | enabled | 50-100µs wake latency | Voltage/clock restoration |
| NIC offloads | enabled | 5-50µs packet batching | GRO/LRO coalescing |
Why do these defaults exist? They save power and maximize throughput. The kernel developers made reasonable tradeoffs for 99% of workloads. Trading systems are in the other 1%.
For deep dives into each subsystem, see:
- CPU Optimization - Governors, C-states, NUMA
- Memory Tuning - THP, swappiness, huge pages
- Network Optimization - Offloads, IRQ affinity
- Storage I/O - Schedulers, Direct I/O
## Background: How Linux Memory Management Works {#background}
Before diving into fixes, you need to understand how Linux manages memory. This context explains why the defaults hurt latency.
### The Page Cache and Anonymous Pages
Linux divides memory into two categories:
- **File-backed pages (page cache):** cached file contents. Can be evicted by dropping them (clean) or writing back first (dirty).
- **Anonymous pages:** heap, stack, and mmap’d memory without a file backing. Can only be evicted by swapping to disk.
The kernel maintains LRU (Least Recently Used) lists for both types. When memory pressure occurs, the kswapd daemon scans these lists looking for pages to reclaim (kernel source: mm/vmscan.c).
### The Memory Pressure Response
When free memory drops below a threshold, the kernel escalates through three stages:
1. **Background reclamation (kswapd):** a kernel thread wakes and starts scanning LRU lists
2. **Direct reclamation:** if kswapd can’t keep up, the allocating process itself must wait while memory is freed
3. **OOM killer:** the last resort; processes are killed to free memory
Direct reclamation is the latency killer. Your trading thread asks for memory, and instead of getting it immediately, it waits while the kernel frees memory. This can take 10-100µs for page cache eviction or 1-10ms for swap operations.
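You can check whether a box is already paying this cost by reading the `allocstall` counters in `/proc/vmstat`. A minimal sketch (counter naming varies slightly by kernel version; newer kernels split it per zone as `allocstall_normal`, `allocstall_movable`, etc., which is why this sums the prefix):

```shell
# Sum the allocstall* counters: each tick is one allocation that had to
# perform direct reclaim inline instead of getting memory immediately.
grep '^allocstall' /proc/vmstat | awk '{sum += $2} END {print "direct reclaim stalls:", sum+0}'
```

A value that keeps growing under load means your threads are paying reclaim latency directly on the allocation path.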
For more on memory internals, see Memory Tuning Deep Dive.
## Fix 1: Disable Aggressive Swapping {#swappiness}
### The Problem
With vm.swappiness = 60 (default), the kernel treats file-backed and anonymous pages roughly equally when deciding what to evict. This means your application heap can be swapped to disk even when there’s plenty of page cache that could be evicted instead.
The kernel code: In mm/vmscan.c, the get_scan_count() function calculates how many pages to scan from each LRU list. The swappiness value directly influences this ratio (kernel source).
```c
// Simplified from mm/vmscan.c
anon_prio = swappiness;
file_prio = 200 - swappiness;
```
With swappiness=60, anonymous pages get a scan priority of 60 against 140 for file pages, so they're scanned at roughly 43% of the file-page rate. With swappiness=0, anonymous pages are only scanned under extreme memory pressure.
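The split is just arithmetic, so you can tabulate how the two priorities move as swappiness changes. A throwaway sketch (illustrative only, not kernel code):

```shell
# Tabulate anon/file scan priorities from the simplified formula above:
#   anon_prio = swappiness, file_prio = 200 - swappiness
for s in 0 60 100; do
  anon=$s
  file=$((200 - s))
  echo "swappiness=$s -> anon_prio=$anon file_prio=$file"
done
# swappiness=60 -> anon_prio=60 file_prio=140
```

At swappiness=100 the two lists are scanned evenly; at 0 the anonymous list is effectively off-limits.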
### The Fix
```bash
# Check current value
cat /proc/sys/vm/swappiness
# Output: 60 (default)
# Set to 0 for latency-critical systems
echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Verify
sysctl vm.swappiness
```
**Ansible automation:**
```yaml
# Ansible
- name: Set swappiness to 0
  sysctl:
    name: vm.swappiness
    value: '0'
    state: present
    reload: yes
```
### Why It Works (Kernel Internals)
With swappiness=0:
- Anonymous pages stay in RAM unless the system is critically low on memory
- Page cache is evicted first (this is safe: file contents can be re-read)
- Your trading heap never gets swapped unless the alternative is OOM
**Verification:**
```bash
# Watch for swap activity (should be 0)
vmstat 1 | awk '{print $7, $8}' # si (swap in), so (swap out)
# Check current swap usage
free -h
```
### Expected Improvement
- **Eliminates 10-100µs page fault stalls** from swap reads (measured on NVMe)
- EBS/network storage: eliminates 1-5ms stalls
**Citation:** Page fault latency measured using eBPF tracing. See [Brendan Gregg's Memory Flame Graphs](https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html) for methodology.
## Fix 2: Disable Transparent Huge Pages {#thp}
### The Problem
Transparent Huge Pages (THP) automatically promotes 4KB pages to 2MB pages, reducing TLB misses. Sounds good. The problem is *how* it does this.
The `khugepaged` kernel thread continuously scans memory looking for contiguous 4KB pages it can merge into 2MB pages. This requires:
1. **Memory compaction:** Moving pages around to create contiguous regions
2. **Process stalling:** Holding mmap_sem while promoting pages
**The killer:** Compaction can stall your process for **10-50 milliseconds**. Not microseconds; milliseconds. During a THP compaction event, your trading thread is frozen.
### The Kernel Internals
THP is managed by the `khugepaged` kernel thread ([kernel source: mm/khugepaged.c](https://github.com/torvalds/linux/blob/master/mm/khugepaged.c)). When enabled, it:
1. Scans process address spaces every `khugepaged_scan_sleep_millisecs` (default: 10000ms)
2. Attempts to collapse contiguous pages into huge pages
3. May trigger memory compaction if huge pages aren't available
**The compaction problem:** Memory compaction ([mm/compaction.c](https://github.com/torvalds/linux/blob/master/mm/compaction.c)) migrates pages between zones to create contiguous regions. This holds locks that can block memory allocation.
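Before disabling anything, it's worth baselining how much THP and compaction activity the box actually sees. A sketch reading the relevant `/proc/vmstat` counters (these names are stable on recent kernels, but some may be absent depending on config, hence the fallback):

```shell
# compact_stall: allocations that ran compaction inline (the 10-50ms risk).
# thp_fault_alloc / thp_collapse_alloc: huge pages built at fault time
# vs. collapsed later by khugepaged.
for key in compact_stall thp_fault_alloc thp_collapse_alloc; do
  val=$(awk -v k="$key" '$1 == k {print $2}' /proc/vmstat 2>/dev/null)
  echo "$key=${val:-unavailable}"
done
```

A nonzero, growing `compact_stall` is the smoking gun: some allocation path waited for compaction.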
### The Fix
```bash
# Check current status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Output: [always] madvise never
# Disable THP entirely
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# Verify
grep -E 'AnonHugePages|HugePages_' /proc/meminfo
```
**Make persistent (EC2 user_data):**
```bash
#!/bin/bash
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```
**For Kubernetes (DaemonSet):**
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disable-thp
spec:
  selector:
    matchLabels:
      name: disable-thp
  template:
    metadata:
      labels:
        name: disable-thp
    spec:
      hostPID: true
      containers:
        - name: disable-thp
          image: busybox
          command: ['sh', '-c', 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && sleep infinity']
          securityContext:
            privileged: true
          volumeMounts:
            - name: sys
              mountPath: /sys
      volumes:
        - name: sys
          hostPath:
            path: /sys
```
### Why It Works (Kernel Internals)
Disabling THP:
- Stops `khugepaged` from scanning your address space
- Prevents compaction from holding mmap_sem
- Eliminates the 10-50ms stall risk
**Verification:**
```bash
# Watch for compaction activity
watch -n1 'grep -E "compact_|thp_" /proc/vmstat'
# Profile with perf during suspected stalls
sudo perf record -g -a sleep 10
sudo perf report
```
### Expected Improvement
**Eliminates 10-50ms compaction stalls.** This is often the single biggest P99 improvement for trading systems.
**Citation:** THP compaction delays are documented in [kernel documentation](https://www.kernel.org/doc/Documentation/vm/transhuge.txt) and discussed extensively in Brendan Gregg's Linux performance work.
**Trade-off:** You lose automatic huge page benefits. For controlled huge page usage, see [explicit huge pages in Memory Tuning](/blog/memory-tuning-linux-latency#huge-pages).
## Fix 3: Lock CPU Frequency {#governors}
### The Problem
Modern CPUs use Dynamic Voltage and Frequency Scaling (DVFS) to save power. The CPU governor decides when to change frequency based on load.
**Governors explained:**
| Governor | Behavior | Latency Impact |
|----------|----------|----------------|
| powersave | Minimum frequency always | Maximum latency on first instruction |
| ondemand | Ramps up when busy | 10-50µs ramp time |
| performance | Maximum frequency always | No ramp latency |
The `ondemand` governor (common default) monitors CPU utilization and ramps frequency. The problem: **the first instructions after idle run at low frequency.**
### The Kernel Internals
CPU frequency scaling is managed by the cpufreq subsystem ([kernel source: drivers/cpufreq/](https://github.com/torvalds/linux/tree/master/drivers/cpufreq)). When using the `ondemand` governor:
1. A timer fires every `sampling_rate` microseconds (default: 10000)
2. The governor checks CPU utilization
3. If above threshold, frequency increases
4. If below, frequency decreases
**The latency:** Frequency changes require voltage changes. The hardware needs 10-50µs to stabilize at the new frequency.
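The size of that ramp is visible in sysfs. A sketch that prints the frequency span cpu0's governor can move across (paths depend on the cpufreq driver; many cloud VMs don't expose them at all, hence the fallback branch):

```shell
# Read cpu0's hardware min/max frequencies (reported in kHz by cpufreq).
d=/sys/devices/system/cpu/cpu0/cpufreq
if [ -r "$d/cpuinfo_min_freq" ] && [ -r "$d/cpuinfo_max_freq" ]; then
  min=$(cat "$d/cpuinfo_min_freq")
  max=$(cat "$d/cpuinfo_max_freq")
  echo "cpu0 ramps from $((min / 1000)) MHz to $((max / 1000)) MHz"
else
  echo "cpufreq sysfs not exposed (VM or unsupported driver)"
fi
```

The wider that span, the bigger the penalty on the first instructions after idle under `ondemand`.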
### The Fix
```bash
# Check current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Output: ondemand (or powersave)
# Set all cores to performance
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance | sudo tee $cpu
done
# Verify frequency is at maximum
watch -n1 'grep MHz /proc/cpuinfo | head -4'
```
**Ansible automation:**
```yaml
- name: Set CPU governor to performance
  shell: |
    for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > $cpu
    done
  become: yes
```
### Why It Works (Kernel Internals)
The `performance` governor bypasses the sampling mechanism entirely. It pins the CPU at `scaling_max_freq` and leaves it there.
**Verification:**
```bash
# Confirm frequency is stable at max
turbostat --interval 1 --show Core,CPU,Bzy_MHz
```
### Expected Improvement
**Eliminates 10-50µs frequency ramp latency** on first instructions after idle.
**Citation:** DVFS transition latencies documented in [Intel Software Developer's Manual](https://www.intel.com/content/www/us/en/developer/articles/technical/software-developers-manual.html), Vol. 3, Chapter 14.
**Connection:** For C-states (idle states), see the [CPU Deep Dive](/blog/cpu-optimization-linux-latency#cstates). For NUMA effects on frequency, see [CPU NUMA section](/blog/cpu-optimization-linux-latency#numa).
## Fix 4: Disable Deep C-States {#cstates}
### The Problem
Even with the `performance` governor, idle CPUs enter C-states to save power:
| C-State | What Happens | Wake Latency |
|---------|--------------|--------------|
| C0 | Active | 0 |
| C1 | Clock stopped | 1-5µs |
| C1E | Clock + voltage reduced | 5-10µs |
| C3 | L1/L2 cache cold | 30-50µs |
| C6 | Voltage cut, state saved to RAM | 50-100µs |
**The problem:** Your trading thread is idle for 1ms waiting for market data. The CPU enters C6. Market data arrives. The CPU takes 50-100µs to wake up.
### The Kernel Internals
C-state management is handled by the `intel_idle` driver ([kernel source: drivers/idle/intel_idle.c](https://github.com/torvalds/linux/blob/master/drivers/idle/intel_idle.c)). When a CPU has no work:
1. The scheduler calls `do_idle()`
2. `do_idle()` selects a C-state based on expected idle time
3. The CPU enters the selected state
4. On interrupt, the CPU wakes and resumes
**The selection algorithm:** The cpuidle governor (menu or ladder) predicts idle time and picks the deepest state that can wake within the expected time. For unpredictable trading workloads, this prediction is often wrong.
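The menu of states the governor chooses from, and each state's advertised wake cost, is exposed in cpuidle sysfs. A sketch for cpu0 (the directory is absent on many VMs, hence the fallback):

```shell
# Each stateN directory advertises a name and an exit latency in microseconds;
# these are the costs the menu governor weighs against its idle-time prediction.
for st in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
  if [ -d "$st" ]; then
    printf '%s: exit latency %sus\n' "$(cat "$st/name")" "$(cat "$st/latency")"
  else
    echo "cpuidle sysfs not exposed"
  fi
done
```

If you see C6 listed with a large exit latency, that is the stall you're exposed to whenever the predictor guesses a long idle period wrong.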
### The Fix
**Option 1: Kernel boot parameters (recommended)**
```bash
# Add to GRUB_CMDLINE_LINUX in /etc/default/grub
processor.max_cstate=1 intel_idle.max_cstate=0
```
Setting `intel_idle.max_cstate=0` disables the intel_idle driver, so the acpi_idle fallback enforces `processor.max_cstate=1`. This limits idle states to C1 only: the clock stops but voltage stays on.
**Option 2: Runtime (temporary)**
```bash
# Disable each C-state beyond C1
for cpu in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
echo 1 | sudo tee $cpu
done
```
### Why It Works (Kernel Internals)
With `max_cstate=1`:
- CPUs enter C1 when idle (1-5µs wake)
- Never enter C3/C6 (50-100µs wake)
- Power consumption increases, but latency is predictable
**Verification:**
```bash
# Check current C-state residency
turbostat --interval 1 --show Core,CPU%c1,CPU%c3,CPU%c6
# Should show 0% for C3 and C6
```
### Expected Improvement
**Reduces worst-case wake latency from 50-100µs to 1-5µs.**
**Citation:** C-state latencies from [Intel Power Management Reference](https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/c-state-residency.html).
**Trade-off:** Higher power consumption. On AWS, this increases instance cost. See [the design philosophy section](#design-philosophy) for when this tradeoff makes sense.
## Fix 5: Disable NIC Offloads {#offloads}
### The Problem
Network interface cards have offload features that batch packets to reduce CPU load:
| Offload | What It Does | Latency Impact |
|---------|--------------|----------------|
| GRO | Batches incoming packets | 5-50µs delay |
| LRO | Batches incoming packets (legacy) | 5-50µs delay |
| TSO | Batches outgoing packets | Minimal for small packets |
| GSO | Generic segmentation offload | Minimal for small packets |
**The problem:** Your exchange sends a market data packet. The NIC waits to see if more packets are coming so it can batch them together. **Your packet sits in the NIC for 5-50µs waiting for friends that may never arrive.**
### The Kernel Internals
GRO is implemented in `net/core/dev.c` ([kernel source](https://github.com/torvalds/linux/blob/master/net/core/dev.c)). The NIC driver calls `napi_gro_receive()` which:
1. Holds the packet in a GRO list
2. Waits for more packets from the same flow
3. Merges packets into a larger buffer
4. Delivers to the stack when flushed
**The flush triggers:** Timer expiry OR softirq batch complete OR driver-specific thresholds.
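One of those timers is tunable per device: `gro_flush_timeout` in sysfs (on kernels that support it, typically paired with `napi_defer_hard_irqs`) defers the GRO flush by up to that many nanoseconds; 0 means flush at the end of each NAPI poll. A quick sketch to audit it across interfaces:

```shell
# Print each interface's GRO flush timeout; nonzero values add bounded
# batching delay on top of the softirq-driven flushes described above.
found=0
for dev in /sys/class/net/*; do
  f="$dev/gro_flush_timeout"
  if [ -r "$f" ]; then
    echo "$(basename "$dev"): gro_flush_timeout=$(cat "$f")ns"
    found=1
  fi
done
[ "$found" -eq 1 ] || echo "no network sysfs entries found"
```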
### The Fix
```bash
# Check current offloads
ethtool -k eth0 | grep -E 'offload|segmentation'
# Disable receive offloads
sudo ethtool -K eth0 gro off lro off
# Verify
ethtool -k eth0 | grep gro
```
**Ansible automation:**
```yaml
- name: Disable NIC offloads
  shell: |
    for iface in $(ls /sys/class/net | grep -v lo); do
      ethtool -K $iface gro off lro off 2>/dev/null || true
    done
  become: yes
```
### Why It Works (Kernel Internals)
With GRO/LRO off:
- Each packet triggers an immediate softirq
- No batching delay
- Trade-off: Higher CPU usage from more interrupts
**Verification:**
```bash
# Check for increased interrupt rate (expected)
watch -n1 'grep eth0 /proc/interrupts'
# Check for no drops (confirm hardware keeps up)
ethtool -S eth0 | grep -E 'drop|error'
```
### Expected Improvement
**Eliminates 5-50µs packet batching delay.**
**Citation:** GRO behavior documented in [kernel networking documentation](https://www.kernel.org/doc/Documentation/networking/segmentation-offloads.txt).
**Connection:** For IRQ affinity tuning, see [Network Deep Dive](/blog/network-optimization-linux-latency#irq-affinity).
## Design Philosophy {#design-philosophy}
### The Fundamental Tradeoff: Throughput vs Latency
Every optimization in this post trades throughput/power for latency:
| Optimization | What We Give Up | What We Get |
|--------------|-----------------|-------------|
| swappiness=0 | Page cache efficiency | Predictable heap |
| THP disabled | Automatic huge pages | No compaction stalls |
| performance governor | Power savings | No frequency ramp |
| C-state limits | Power savings | Fast wake-up |
| Offloads disabled | CPU efficiency | Immediate packet delivery |
**The Linux design principle:** The kernel defaults make the common case fast. Web servers benefit from more page cache. Batch jobs benefit from frequency scaling. The kernel is right to optimize for throughput; that's what most workloads need.
**Trading is different:** We have:
- Known, bounded memory requirements
- Irregular, bursty workloads that fool frequency governors
- Hard latency SLOs that P99 spikes violate
### When NOT to Apply These Changes
Not every system needs latency tuning:
1. **Batch processing:** Throughput matters more; keep defaults
2. **Development environments:** Don't waste power
3. **Memory-constrained systems:** swappiness=0 can trigger OOM
4. **Shared infrastructure:** These settings affect all processes
**The test:** If your SLO is in seconds, defaults are fine. If your SLO is in milliseconds, audit your kernel.
For the philosophical framework, see [First Principles of Trading Infrastructure](/blog/first-trading-infrastructure-principles).
## Putting It All Together {#putting-it-together}
### Quick Audit Commands
```bash
# Check all settings at once
echo "=== Swappiness ===" && sysctl vm.swappiness
echo "=== THP ===" && cat /sys/kernel/mm/transparent_hugepage/enabled
echo "=== Governor ===" && cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo "=== C-States ===" && cat /sys/devices/system/cpu/cpu0/cpuidle/state*/disable 2>/dev/null || echo "N/A"
echo "=== NIC Offloads ===" && ethtool -k eth0 2>/dev/null | grep -E 'gro:|lro:'
```
### Automated Audit: latency-audit
I built [**latency-audit**](/tools/latency-audit) to check all these settings at once. It audits kernel, CPU, memory, network, and storage configuration. See the [tool page](/tools/latency-audit) for usage.
### Terraform for AWS Fleet
```hcl
resource "aws_launch_template" "trading" {
  name_prefix   = "trading-"
  instance_type = "c6in.xlarge"

  user_data = base64encode(<<-EOF
    #!/bin/bash
    # Kernel tuning
    echo 'vm.swappiness = 0' >> /etc/sysctl.conf
    sysctl -p
    # THP
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    # CPU governor
    for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > $cpu
    done
    # NIC offloads
    for iface in $(ls /sys/class/net | grep -v lo); do
      ethtool -K $iface gro off lro off 2>/dev/null || true
    done
  EOF
  )
}
```
The kernel is optimized for throughput. Trading requires latency. Know the difference, and tune accordingly.
## Up Next in Linux Infrastructure Deep Dives
PTP or Die: Hardware Timestamping for Regulatory-Grade Time Sync
Why NTP is insufficient for HFT compliance, and how to implement IEEE 1588 PTPv2 with hardware timestamping to achieve sub-100ns accuracy.
## Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| I/O schedulers, Direct I/O, EBS tuning | I/O Schedulers: Why the Kernel Reorders Your Writes |
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |
| StatefulSets, pod placement, EKS patterns | Kubernetes StatefulSets: Why Trading Systems Need State |