# Linux Defaults That Cost You Latency
Deep dive into vm.swappiness, THP compaction, and C-states. Kernel internals, measurements, and the design philosophy behind low-latency Linux tuning.
The default Linux kernel is optimized for throughput — maximizing total work completed — not latency — minimizing individual response time. This is intentional. The kernel developers designed for the common case: web servers, databases, batch processing where throughput matters more than any single request.
Trading is not the common case. Every server I’ve configured for low-latency work at Akuna Capital, Gemini, or ZeroCopy needed the same five settings changed. The defaults aren’t wrong — they’re wrong for this specific workload. This post explains why, with kernel internals, not just the sysctl commands.
## The Problem {#the-problem}
Every Linux distribution ships with defaults designed for general-purpose workloads:
| Setting | Default Value | Latency Impact | Root Cause |
|---|---|---|---|
| vm.swappiness | 60 | 10-100µs per page fault | Anonymous page reclamation |
| THP | always | 10-50ms compaction stalls | khugepaged defragmentation |
| CPU governor | powersave | 10-50µs frequency ramp | DVFS transitions |
| C-states | enabled | 50-100µs wake latency | Voltage/clock restoration |
| NIC offloads | enabled | 5-50µs packet batching | GRO/LRO coalescing |
Why do these defaults exist? They save power and maximize throughput. The kernel developers made reasonable tradeoffs for 99% of workloads. Trading systems are in the other 1%.
For deep dives into each subsystem, see:
- CPU Optimization - Governors, C-states, NUMA
- Memory Tuning - THP, swappiness, huge pages
- Network Optimization - Offloads, IRQ affinity
- Storage I/O - Schedulers, Direct I/O
## Background: How Linux Memory Management Works {#background}
Before diving into fixes, you need to understand how Linux manages memory. This context explains why the defaults hurt latency.
### The Page Cache and Anonymous Pages
Linux divides memory into two categories:
- **File-backed pages (page cache):** cached file contents. Can be evicted by dropping them (clean) or writing back first (dirty).
- **Anonymous pages:** heap, stack, and mmap’d memory without a file backing. Can only be evicted by swapping to disk.
The kernel maintains LRU (Least Recently Used) lists for both types. When memory pressure occurs, the kswapd daemon scans these lists looking for pages to reclaim (kernel source: mm/vmscan.c).
### The Memory Pressure Response
When free memory drops below a threshold, the kernel escalates through three stages:
1. **Background reclamation (kswapd):** a kernel thread wakes and starts scanning LRU lists
2. **Direct reclamation:** if kswapd can’t keep up, the allocating process itself must wait while memory is freed
3. **OOM killer:** the last resort; processes are killed to free memory
Direct reclamation is the latency killer. Your trading thread asks for memory, and instead of getting it immediately, it waits while the kernel frees memory. This can take 10-100µs for page cache eviction or 1-10ms for swap operations.
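You can check whether a box is already paying this cost by reading the `allocstall` counters in `/proc/vmstat`. A minimal sketch (counter naming varies slightly by kernel version; newer kernels split it per zone as `allocstall_normal`, `allocstall_movable`, etc., which is why this sums the prefix):

```shell
# Sum the allocstall* counters: each tick is one allocation that had to
# perform direct reclaim inline instead of getting memory immediately.
grep '^allocstall' /proc/vmstat | awk '{sum += $2} END {print "direct reclaim stalls:", sum+0}'
```

A value that keeps growing under load means your threads are paying reclaim latency directly on the allocation path.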
For more on memory internals, see Memory Tuning Deep Dive.
## Fix 1: Disable Aggressive Swapping {#swappiness}
### The Problem
With vm.swappiness = 60 (default), the kernel treats file-backed and anonymous pages roughly equally when deciding what to evict. This means your application heap can be swapped to disk even when there’s plenty of page cache that could be evicted instead.
The kernel code: In mm/vmscan.c, the get_scan_count() function calculates how many pages to scan from each LRU list. The swappiness value directly influences this ratio (kernel source).
```c
// Simplified from mm/vmscan.c
anon_prio = swappiness;
file_prio = 200 - swappiness;
```
With swappiness=60, anonymous pages get a scan priority of 60 against 140 for file pages, so they're scanned at roughly 43% of the file-page rate. With swappiness=0, anonymous pages are only scanned under extreme memory pressure.
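The split is just arithmetic, so you can tabulate how the two priorities move as swappiness changes. A throwaway sketch (illustrative only, not kernel code):

```shell
# Tabulate anon/file scan priorities from the simplified formula above:
#   anon_prio = swappiness, file_prio = 200 - swappiness
for s in 0 60 100; do
  anon=$s
  file=$((200 - s))
  echo "swappiness=$s -> anon_prio=$anon file_prio=$file"
done
# swappiness=60 -> anon_prio=60 file_prio=140
```

At swappiness=100 the two lists are scanned evenly; at 0 the anonymous list is effectively off-limits.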
### The Fix
```bash
# Check current value
cat /proc/sys/vm/swappiness
# Output: 60 (default)
# Set to 0 for latency-critical systems
echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Verify
sysctl vm.swappiness
```
**Ansible automation:**
```yaml
# Ansible
- name: Set swappiness to 0
  sysctl:
    name: vm.swappiness
    value: '0'
    state: present
    reload: yes
```
### Why It Works (Kernel Internals)
With swappiness=0:
- Anonymous pages stay in RAM unless the system is critically low on memory
- Page cache is evicted first (this is safe: file contents can be re-read)
- Your trading heap never gets swapped unless the alternative is OOM
**Verification:**
```bash
# Watch for swap activity (should be 0)
vmstat 1 | awk '{print $7, $8}' # si (swap in), so (swap out)
# Check current swap usage
free -h
```
### Expected Improvement
- **Eliminates 10-100µs page fault stalls** from swap reads (measured on NVMe)
- EBS/network storage: eliminates 1-5ms stalls
**Citation:** Page fault latency measured using eBPF tracing. See [Brendan Gregg's Memory Flame Graphs](https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html) for methodology.
## Fix 2: Disable Transparent Huge Pages {#thp}
### The Problem
Transparent Huge Pages (THP) automatically promotes 4KB pages to 2MB pages, reducing TLB misses. Sounds good. The problem is *how* it does this.
The `khugepaged` kernel thread continuously scans memory looking for contiguous 4KB pages it can merge into 2MB pages. This requires:
1. **Memory compaction:** Moving pages around to create contiguous regions
2. **Process stalling:** Holding mmap_sem while promoting pages
**The killer:** Compaction can stall your process for **10-50 milliseconds**. Not microseconds; milliseconds. During a THP compaction event, your trading thread is frozen.
### The Kernel Internals
THP is managed by the `khugepaged` kernel thread ([kernel source: mm/khugepaged.c](https://github.com/torvalds/linux/blob/master/mm/khugepaged.c)). When enabled, it:
1. Scans process address spaces every `khugepaged_scan_sleep_millisecs` (default: 10000ms)
2. Attempts to collapse contiguous pages into huge pages
3. May trigger memory compaction if huge pages aren't available
**The compaction problem:** Memory compaction ([mm/compaction.c](https://github.com/torvalds/linux/blob/master/mm/compaction.c)) migrates pages between zones to create contiguous regions. This holds locks that can block memory allocation.
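Before disabling anything, it's worth baselining how much THP and compaction activity the box actually sees. A sketch reading the relevant `/proc/vmstat` counters (these names are stable on recent kernels, but some may be absent depending on config, hence the fallback):

```shell
# compact_stall: allocations that ran compaction inline (the 10-50ms risk).
# thp_fault_alloc / thp_collapse_alloc: huge pages built at fault time
# vs. collapsed later by khugepaged.
for key in compact_stall thp_fault_alloc thp_collapse_alloc; do
  val=$(awk -v k="$key" '$1 == k {print $2}' /proc/vmstat 2>/dev/null)
  echo "$key=${val:-unavailable}"
done
```

A nonzero, growing `compact_stall` is the smoking gun: some allocation path waited for compaction.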
### The Fix
```bash
# Check current status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Output: [always] madvise never
# Disable THP entirely
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# Verify
grep -E 'AnonHugePages|HugePages_' /proc/meminfo
```
**Make persistent (EC2 user_data):**
```bash
#!/bin/bash
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```
**For Kubernetes (DaemonSet):**
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disable-thp
spec:
  selector:
    matchLabels:
      name: disable-thp
  template:
    metadata:
      labels:
        name: disable-thp
    spec:
      hostPID: true
      containers:
        - name: disable-thp
          image: busybox
          command: ['sh', '-c', 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && sleep infinity']
          securityContext:
            privileged: true
          volumeMounts:
            - name: sys
              mountPath: /sys
      volumes:
        - name: sys
          hostPath:
            path: /sys
```
### Why It Works (Kernel Internals)
Disabling THP:
- Stops `khugepaged` from scanning your address space
- Prevents compaction from holding mmap_sem
- Eliminates the 10-50ms stall risk
**Verification:**
```bash
# Watch for compaction activity
watch -n1 'grep -E "compact_|thp_" /proc/vmstat'
# Profile with perf during suspected stalls
sudo perf record -g -a sleep 10
sudo perf report
```
### Expected Improvement
**Eliminates 10-50ms compaction stalls.** This is often the single biggest P99 improvement for trading systems.
**Citation:** THP compaction delays are documented in [kernel documentation](https://www.kernel.org/doc/Documentation/vm/transhuge.txt) and discussed extensively in Brendan Gregg's Linux performance work.
**Trade-off:** You lose automatic huge page benefits. For controlled huge page usage, see [explicit huge pages in Memory Tuning](/blog/memory-tuning-linux-latency#huge-pages).
## Fix 3: Lock CPU Frequency {#governors}
### The Problem
Modern CPUs use Dynamic Voltage and Frequency Scaling (DVFS) to save power. The CPU governor decides when to change frequency based on load.
**Governors explained:**
| Governor | Behavior | Latency Impact |
|----------|----------|----------------|
| powersave | Minimum frequency always | Maximum latency on first instruction |
| ondemand | Ramps up when busy | 10-50µs ramp time |
| performance | Maximum frequency always | No ramp latency |
The `ondemand` governor (common default) monitors CPU utilization and ramps frequency. The problem: **the first instructions after idle run at low frequency.**
### The Kernel Internals
CPU frequency scaling is managed by the cpufreq subsystem ([kernel source: drivers/cpufreq/](https://github.com/torvalds/linux/tree/master/drivers/cpufreq)). When using the `ondemand` governor:
1. A timer fires every `sampling_rate` microseconds (default: 10000)
2. The governor checks CPU utilization
3. If above threshold, frequency increases
4. If below, frequency decreases
**The latency:** Frequency changes require voltage changes. The hardware needs 10-50µs to stabilize at the new frequency.
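The size of that ramp is visible in sysfs. A sketch that prints the frequency span cpu0's governor can move across (paths depend on the cpufreq driver; many cloud VMs don't expose them at all, hence the fallback branch):

```shell
# Read cpu0's hardware min/max frequencies (reported in kHz by cpufreq).
d=/sys/devices/system/cpu/cpu0/cpufreq
if [ -r "$d/cpuinfo_min_freq" ] && [ -r "$d/cpuinfo_max_freq" ]; then
  min=$(cat "$d/cpuinfo_min_freq")
  max=$(cat "$d/cpuinfo_max_freq")
  echo "cpu0 ramps from $((min / 1000)) MHz to $((max / 1000)) MHz"
else
  echo "cpufreq sysfs not exposed (VM or unsupported driver)"
fi
```

The wider that span, the bigger the penalty on the first instructions after idle under `ondemand`.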
### The Fix
```bash
# Check current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Output: ondemand (or powersave)
# Set all cores to performance
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance | sudo tee $cpu
done
# Verify frequency is at maximum
watch -n1 'grep MHz /proc/cpuinfo | head -4'
```
**Ansible automation:**
```yaml
- name: Set CPU governor to performance
  shell: |
    for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > $cpu
    done
  become: yes
```
### Why It Works (Kernel Internals)
The `performance` governor bypasses the sampling mechanism entirely. It pins the CPU at `scaling_max_freq` and leaves it there.
**Verification:**
```bash
# Confirm frequency is stable at max
turbostat --interval 1 --show Core,CPU,Bzy_MHz
```
### Expected Improvement
**Eliminates 10-50µs frequency ramp latency** on first instructions after idle.
**Citation:** DVFS transition latencies documented in [Intel Software Developer's Manual](https://www.intel.com/content/www/us/en/developer/articles/technical/software-developers-manual.html), Vol. 3, Chapter 14.
**Connection:** For C-states (idle states), see the [CPU Deep Dive](/blog/cpu-optimization-linux-latency#cstates). For NUMA effects on frequency, see [CPU NUMA section](/blog/cpu-optimization-linux-latency#numa).
## Fix 4: Disable Deep C-States {#cstates}
### The Problem
Even with the `performance` governor, idle CPUs enter C-states to save power:
| C-State | What Happens | Wake Latency |
|---------|--------------|--------------|
| C0 | Active | 0 |
| C1 | Clock stopped | 1-5µs |
| C1E | Clock + voltage reduced | 5-10µs |
| C3 | L1/L2 cache cold | 30-50µs |
| C6 | Voltage cut, state saved to RAM | 50-100µs |
**The problem:** Your trading thread is idle for 1ms waiting for market data. The CPU enters C6. Market data arrives. The CPU takes 50-100µs to wake up.
### The Kernel Internals
C-state management is handled by the `intel_idle` driver ([kernel source: drivers/idle/intel_idle.c](https://github.com/torvalds/linux/blob/master/drivers/idle/intel_idle.c)). When a CPU has no work:
1. The scheduler calls `do_idle()`
2. `do_idle()` selects a C-state based on expected idle time
3. The CPU enters the selected state
4. On interrupt, the CPU wakes and resumes
**The selection algorithm:** The cpuidle governor (menu or ladder) predicts idle time and picks the deepest state that can wake within the expected time. For unpredictable trading workloads, this prediction is often wrong.
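The menu of states the governor chooses from, and each state's advertised wake cost, is exposed in cpuidle sysfs. A sketch for cpu0 (the directory is absent on many VMs, hence the fallback):

```shell
# Each stateN directory advertises a name and an exit latency in microseconds;
# these are the costs the menu governor weighs against its idle-time prediction.
for st in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
  if [ -d "$st" ]; then
    printf '%s: exit latency %sus\n' "$(cat "$st/name")" "$(cat "$st/latency")"
  else
    echo "cpuidle sysfs not exposed"
  fi
done
```

If you see C6 listed with a large exit latency, that is the stall you're exposed to whenever the predictor guesses a long idle period wrong.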
### The Fix
**Option 1: Kernel boot parameters (recommended)**
```bash
# Add to GRUB_CMDLINE_LINUX in /etc/default/grub
processor.max_cstate=1 intel_idle.max_cstate=0
```
Setting `intel_idle.max_cstate=0` disables the intel_idle driver, so the acpi_idle fallback enforces `processor.max_cstate=1`. This limits idle states to C1 only: the clock stops but voltage stays on.
**Option 2: Runtime (temporary)**
```bash
# Disable each C-state beyond C1
for cpu in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
echo 1 | sudo tee $cpu
done
```
### Why It Works (Kernel Internals)
With `max_cstate=1`:
- CPUs enter C1 when idle (1-5µs wake)
- Never enter C3/C6 (50-100µs wake)
- Power consumption increases, but latency is predictable
**Verification:**
```bash
# Check current C-state residency
turbostat --interval 1 --show Core,CPU%c1,CPU%c3,CPU%c6
# Should show 0% for C3 and C6
```
### Expected Improvement
**Reduces worst-case wake latency from 50-100µs to 1-5µs.**
**Citation:** C-state latencies from [Intel Power Management Reference](https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/c-state-residency.html).
**Trade-off:** Higher power consumption. On AWS, this increases instance cost. See [the design philosophy section](#design-philosophy) for when this tradeoff makes sense.
## Fix 5: Disable NIC Offloads {#offloads}
### The Problem
Network interface cards have offload features that batch packets to reduce CPU load:
| Offload | What It Does | Latency Impact |
|---------|--------------|----------------|
| GRO | Batches incoming packets | 5-50µs delay |
| LRO | Batches incoming packets (legacy) | 5-50µs delay |
| TSO | Batches outgoing packets | Minimal for small packets |
| GSO | Generic segmentation offload | Minimal for small packets |
**The problem:** Your exchange sends a market data packet. The NIC waits to see if more packets are coming so it can batch them together. **Your packet sits in the NIC for 5-50µs waiting for friends that may never arrive.**
### The Kernel Internals
GRO is implemented in `net/core/dev.c` ([kernel source](https://github.com/torvalds/linux/blob/master/net/core/dev.c)). The NIC driver calls `napi_gro_receive()` which:
1. Holds the packet in a GRO list
2. Waits for more packets from the same flow
3. Merges packets into a larger buffer
4. Delivers to the stack when flushed
**The flush triggers:** Timer expiry OR softirq batch complete OR driver-specific thresholds.
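One of those timers is tunable per device: `gro_flush_timeout` in sysfs (on kernels that support it, typically paired with `napi_defer_hard_irqs`) defers the GRO flush by up to that many nanoseconds; 0 means flush at the end of each NAPI poll. A quick sketch to audit it across interfaces:

```shell
# Print each interface's GRO flush timeout; nonzero values add bounded
# batching delay on top of the softirq-driven flushes described above.
found=0
for dev in /sys/class/net/*; do
  f="$dev/gro_flush_timeout"
  if [ -r "$f" ]; then
    echo "$(basename "$dev"): gro_flush_timeout=$(cat "$f")ns"
    found=1
  fi
done
[ "$found" -eq 1 ] || echo "no network sysfs entries found"
```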
### The Fix
```bash
# Check current offloads
ethtool -k eth0 | grep -E 'offload|segmentation'
# Disable receive offloads
sudo ethtool -K eth0 gro off lro off
# Verify
ethtool -k eth0 | grep gro
```
**Ansible automation:**
```yaml
- name: Disable NIC offloads
  shell: |
    for iface in $(ls /sys/class/net | grep -v lo); do
      ethtool -K $iface gro off lro off 2>/dev/null || true
    done
  become: yes
```
### Why It Works (Kernel Internals)
With GRO/LRO off:
- Each packet triggers an immediate softirq
- No batching delay
- Trade-off: Higher CPU usage from more interrupts
**Verification:**
```bash
# Check for increased interrupt rate (expected)
watch -n1 'grep eth0 /proc/interrupts'
# Check for no drops (confirm hardware keeps up)
ethtool -S eth0 | grep -E 'drop|error'
```
### Expected Improvement
**Eliminates 5-50µs packet batching delay.**
**Citation:** GRO behavior documented in [kernel networking documentation](https://www.kernel.org/doc/Documentation/networking/segmentation-offloads.txt).
**Connection:** For IRQ affinity tuning, see [Network Deep Dive](/blog/network-optimization-linux-latency#irq-affinity).
## Design Philosophy {#design-philosophy}
### The Fundamental Tradeoff: Throughput vs Latency
Every optimization in this post trades throughput/power for latency:
| Optimization | What We Give Up | What We Get |
|--------------|-----------------|-------------|
| swappiness=0 | Page cache efficiency | Predictable heap |
| THP disabled | Automatic huge pages | No compaction stalls |
| performance governor | Power savings | No frequency ramp |
| C-state limits | Power savings | Fast wake-up |
| Offloads disabled | CPU efficiency | Immediate packet delivery |
**The Linux design principle:** The kernel defaults make the common case fast. Web servers benefit from more page cache. Batch jobs benefit from frequency scaling. The kernel is right to optimize for throughput; that's what most workloads need.
**Trading is different:** We have:
- Known, bounded memory requirements
- Irregular, bursty workloads that fool frequency governors
- Hard latency SLOs that P99 spikes violate
### When NOT to Apply These Changes
Not every system needs latency tuning:
1. **Batch processing:** Throughput matters more; keep defaults
2. **Development environments:** Don't waste power
3. **Memory-constrained systems:** swappiness=0 can trigger OOM
4. **Shared infrastructure:** These settings affect all processes
**The test:** If your SLO is in seconds, defaults are fine. If your SLO is in milliseconds, audit your kernel.
For the philosophical framework, see [First Principles of Trading Infrastructure](/blog/first-trading-infrastructure-principles).
## Putting It All Together {#putting-it-together}
### Quick Audit Commands
```bash
# Check all settings at once
echo "=== Swappiness ===" && sysctl vm.swappiness
echo "=== THP ===" && cat /sys/kernel/mm/transparent_hugepage/enabled
echo "=== Governor ===" && cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo "=== C-States ===" && cat /sys/devices/system/cpu/cpu0/cpuidle/state*/disable 2>/dev/null || echo "N/A"
echo "=== NIC Offloads ===" && ethtool -k eth0 2>/dev/null | grep -E 'gro:|lro:'
```
### Automated Audit: latency-audit
I built [**latency-audit**](/tools/latency-audit) to check all these settings at once. It audits kernel, CPU, memory, network, and storage configuration. See the [tool page](/tools/latency-audit) for usage.
### Terraform for AWS Fleet
```hcl
resource "aws_launch_template" "trading" {
  name_prefix   = "trading-"
  instance_type = "c6in.xlarge"

  user_data = base64encode(<<-EOF
    #!/bin/bash
    # Kernel tuning
    echo 'vm.swappiness = 0' >> /etc/sysctl.conf
    sysctl -p
    # THP
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    # CPU governor
    for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > $cpu
    done
    # NIC offloads
    for iface in $(ls /sys/class/net | grep -v lo); do
      ethtool -K $iface gro off lro off 2>/dev/null || true
    done
  EOF
  )
}
```
The kernel is optimized for throughput. Trading requires latency. Know the difference, and tune accordingly.
## Up Next in Linux Infrastructure Deep Dives
PTP or Die: Hardware Timestamping for Regulatory-Grade Time Sync
Why NTP is insufficient for HFT compliance, and how to implement IEEE 1588 PTPv2 with hardware timestamping to achieve sub-100ns accuracy.
## Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| I/O schedulers, Direct I/O, EBS tuning | I/O Schedulers: Why the Kernel Reorders Your Writes |
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |
| StatefulSets, pod placement, EKS patterns | Kubernetes StatefulSets: Why Trading Systems Need State |