The Physics of Data: Kernel Bypass, SoftIRQs & Ring Buffers
Why the Linux Kernel is too slow for 10Gbps. The physics of DMA Ring Buffers, SoftIRQ Latency, and bypassing the OS with AF_XDP.
🎯 What You'll Learn
- Deconstruct a Network Packet's Journey (NIC -> RAM)
- Measure Interrupt Coalescing Latency (The 50µs Tax)
- Tune Ring Buffers for Zero-Loss vs Zero-Latency
- Implement Kernel Bypass with `AF_XDP`
- Trace SoftIRQ CPU usage (`si` in top)
## Introduction
At 10Gbps with minimum-size 64-byte packets, a packet arrives roughly every 67 nanoseconds (about 14.88 million packets per second). The Linux kernel takes several microseconds just to handle an interrupt. The math doesn’t work for high-rate, low-latency traffic.
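The 67 ns figure can be reproduced with a few lines of arithmetic. The sketch below assumes the usual Ethernet on-wire overhead of 20 bytes per frame (preamble, start-of-frame delimiter and inter-frame gap); the function names are illustrative only.

```c
/* On-wire bytes that accompany every Ethernet frame but are not part of it:
 * 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap. */
#define WIRE_OVERHEAD_BYTES 20u

/* Nanoseconds between back-to-back frames of frame_bytes on a link_gbps link. */
double frame_interval_ns(double link_gbps, unsigned frame_bytes)
{
    double bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8.0;
    return bits_on_wire / link_gbps;    /* 1 Gbps moves exactly 1 bit per ns */
}

/* Maximum packets per second the link can deliver at line rate. */
double line_rate_pps(double link_gbps, unsigned frame_bytes)
{
    return 1e9 / frame_interval_ns(link_gbps, frame_bytes);
}
```

`frame_interval_ns(10.0, 64)` works out to 672 bits / 10 Gbps = 67.2 ns, which is the per-packet budget everything else in this lesson has to fit into.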
Standard Linux networking is “Interrupt Driven.” High-frequency trading networking is “Polling Driven.” This lesson covers the NIC hardware architecture and how to bypass the kernel network stack.
## DMA Ring Buffers
The NIC does not “give” the packet to the CPU. It writes the packet into RAM using DMA (Direct Memory Access), at a location taken from a circular queue of descriptors called a Ring Buffer.
- RX Ring: NIC writes incoming packets here.
- TX Ring: CPU writes outgoing packets here.
- The Doorbell: A memory-mapped register on the NIC that the CPU writes to signal new TX data is ready.
Sizing trade-off: If the Ring is too small -> Packet Loss (Microbursts). If the Ring is too big -> Bufferbloat (Old data stuck in queue). For low-latency work, you want small rings serviced extremely fast.
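To make the ring mechanics concrete, here is a minimal single-threaded sketch of a fixed-size ring with a producer index (the NIC's role) and a consumer index (the CPU's role). It is a model, not driver code: the struct and function names are hypothetical, the slots hold plain integers instead of DMA descriptors, and a real ring needs memory barriers and a doorbell write.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u                  /* must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

struct ring {
    uint32_t head;                    /* next slot the producer (NIC) fills */
    uint32_t tail;                    /* next slot the consumer (CPU) reads */
    uint32_t slots[RING_SIZE];        /* stand-in for packet descriptors */
    uint64_t drops;                   /* packets lost because the ring was full */
};

/* Producer side: returns false and counts a drop when the ring is full. */
bool ring_push(struct ring *r, uint32_t pkt)
{
    if (r->head - r->tail == RING_SIZE) {
        r->drops++;
        return false;
    }
    r->slots[r->head & RING_MASK] = pkt;
    r->head++;
    return true;
}

/* Consumer side: returns false when the ring is empty. */
bool ring_pop(struct ring *r, uint32_t *pkt)
{
    if (r->head == r->tail)
        return false;
    *pkt = r->slots[r->tail & RING_MASK];
    r->tail++;
    return true;
}
```

The indices run freely and are masked only on access, so `head - tail` is always the current occupancy. A larger `RING_SIZE` delays the first drop but lets more stale packets queue up, which is the sizing trade-off described above.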
## Interrupt Coalescing: The 50µs Tax
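To reduce CPU load, NICs wait for a batch of packets before interrupting the CPU.
A typical default is `rx-usecs: 50` (wait up to 50µs after a packet arrives before firing an interrupt), though the value is driver-dependent and some drivers adapt it automatically.
For latency-sensitive workloads, that added delay is unacceptable.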
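The cost of each setting can be estimated with a simplified model: with `rx-usecs` set to N, the NIC raises at most one interrupt per N microseconds, and with 0 it raises one per packet. The function below is a rough sketch under that assumption. It ignores NAPI polling, which masks interrupts under sustained load, and its name is made up for illustration.

```c
/* Upper bound on RX interrupts per second for a given rx-usecs value.
 * rx_usecs == 0 models "interrupt per packet", so the bound is the packet rate. */
double max_irqs_per_sec(double pkts_per_sec, unsigned rx_usecs)
{
    if (rx_usecs == 0)
        return pkts_per_sec;
    double timer_bound = 1e6 / rx_usecs;   /* one interrupt per coalescing window */
    return pkts_per_sec < timer_bound ? pkts_per_sec : timer_bound;
}
```

At one million packets per second, this gives at most 20,000 interrupts per second with `rx-usecs 50`, versus up to 1,000,000 with `rx-usecs 0`. The first packet in each window pays for that saving by waiting up to 50µs.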
```bash
# Check current settings
ethtool -c eth0
# Low-latency setting: Interrupt immediately
ethtool -C eth0 rx-usecs 0 rx-frames 1
```
**Consequence:** CPU usage increases significantly, because every packet can now raise its own interrupt. You're trading CPU cycles for latency.
---
## SoftIRQs: The Bottom Half
When the Hard IRQ fires, the CPU acknowledges it quickly and defers the rest.
The "Real Work" (TCP/IP processing) happens in a **SoftIRQ** (Software Interrupt).
In `top`, this is the `si` value (`%si`) on the CPU summary line.
If `%si` sits near 100% on a core, that core cannot keep up and you are likely dropping packets.
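The same information is available per CPU in `/proc/softirqs`, where the `NET_RX` row counts receive SoftIRQs. As a sketch, the helper below parses one row of that file into an array of per-CPU counters; the function name and calling convention are illustrative only.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse one row of /proc/softirqs, e.g. "      NET_RX:   1200   3400".
 * Returns the number of per-CPU counters stored in out, or 0 if the row
 * does not belong to the softirq called name. */
size_t parse_softirq_row(const char *line, const char *name,
                         unsigned long long *out, size_t max)
{
    while (isspace((unsigned char)*line))
        line++;                         /* rows are right-aligned with spaces */
    size_t n = strlen(name);
    if (strncmp(line, name, n) != 0 || line[n] != ':')
        return 0;
    const char *p = line + n + 1;
    size_t count = 0;
    while (count < max) {
        char *end;
        unsigned long long v = strtoull(p, &end, 10);
        if (end == p)
            break;                      /* no more numbers on this row */
        out[count++] = v;
        p = end;
    }
    return count;
}
```

Reading the file twice, one second apart, and subtracting the counters gives a per-core SoftIRQ rate. A single core with a much larger delta than the others suggests the receive load is not being spread.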
**Tuning: Multi-Queue Hashing (RSS)**
Distribute the load across cores using the Receive Side Scaling hash (IP, Port, Protocol).
```bash
ethtool -X eth0 equal 4 # Spread across 4 RX Queues
```
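Conceptually, RSS hashes each packet's flow tuple and uses the low bits of the hash to index an indirection table of queue numbers. The sketch below models that lookup. The table size is a hypothetical value, and the hash is FNV-1a as a simple stand-in; NICs typically use a keyed Toeplitz hash instead. Filling the table round-robin is one way to get the even spread that `equal 4` asks for.

```c
#include <stddef.h>
#include <stdint.h>

#define RETA_SIZE 128u   /* hypothetical table size; real sizes are NIC-specific */

/* Fill the indirection table round-robin over the first n_queues queues. */
void reta_fill_equal(uint8_t reta[RETA_SIZE], unsigned n_queues)
{
    for (unsigned i = 0; i < RETA_SIZE; i++)
        reta[i] = (uint8_t)(i % n_queues);
}

/* Stand-in flow hash (FNV-1a) over the raw tuple bytes. Not the Toeplitz hash. */
uint32_t flow_hash(const uint8_t *tuple, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= tuple[i];
        h *= 16777619u;
    }
    return h;
}

/* Pick the RX queue: the hash indexes the indirection table. */
unsigned rss_queue(const uint8_t reta[RETA_SIZE], uint32_t hash)
{
    return reta[hash % RETA_SIZE];
}
```

Because the queue depends only on the tuple, every packet of a flow lands on the same queue and therefore the same core, which keeps per-flow ordering while spreading different flows across cores.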
---
## Kernel Bypass: AF_XDP
Why let the kernel process TCP/IP if you just want the raw UDP multicast packet?
**AF_XDP** (XDP Socket) allows userspace to read directly from the DMA Ring Buffer.
**Performance comparison (typical, varies by hardware):**
* Standard Socket: ~15µs latency.
* AF_XDP: ~2µs latency.
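```c
// Concept code: AF_XDP setup (error handling omitted)
// 1. Create the XSK (XDP socket), bound to one RX queue of eth0.
//    umem, rx, tx and cfg must be set up beforehand.
xsk_socket__create(&xsk, "eth0", queue_id, umem, &rx, &tx, &cfg);
// 2. Load a BPF program that redirects packets to the XSK.
//    It runs in the NIC driver's RX path, before the kernel network stack.
bpf_program__set_type(prog, BPF_PROG_TYPE_XDP);
```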
In zero-copy mode, which requires driver support, packet data is never copied into kernel memory. The NIC writes it straight into the “UMEM” (Userspace Memory) region, hence “zero copy”.
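The reason no copy is needed is that a received packet is described only by an offset and a length inside UMEM. The sketch below models that addressing with a hypothetical `rx_desc` struct, loosely modelled on the kernel's `struct xdp_desc`, and an assumed 2048-byte frame size. It does not call into libxdp.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_SIZE 2048u   /* assumed UMEM chunk size */
#define NUM_FRAMES 4u

/* Hypothetical RX descriptor: where the packet sits in UMEM and how long it is. */
struct rx_desc {
    uint64_t addr;         /* byte offset from the start of the UMEM area */
    uint32_t len;          /* packet length in bytes */
};

/* Translate a descriptor into a pointer inside the UMEM area. Nothing is copied. */
const uint8_t *umem_frame_data(const uint8_t *umem_area, const struct rx_desc *d)
{
    return umem_area + d->addr;
}

/* Index of the fixed-size frame that contains a given UMEM offset. */
uint64_t umem_frame_index(uint64_t addr)
{
    return addr / FRAME_SIZE;
}
```

The application reads the packet in place through the returned pointer and, when finished, hands the frame back so the NIC can reuse it.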
## Practice Exercises
Exercise 1: The Coalescing Experiment (Beginner)
Task: Run a `sockperf` ping-pong test.
Action: Measure RTT with `rx-usecs 50` vs `rx-usecs 0`.
Result: Expect RTT to improve by roughly 30-50µs, depending on the NIC and driver.
Exercise 2: Ring Buffer Tuning (Intermediate)
Task: Inspect the current ring sizes with `ethtool -g eth0`.
Action: Set `ethtool -G eth0 rx 4096` (max throughput) vs `rx 64` (low latency).
Risk: A small ring risks drops during bursts. Check `ethtool -S eth0 | grep -i drop` (counter names vary by driver).
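To get a feel for the two ring sizes in this exercise, it helps to estimate how long each one can absorb a line-rate burst before overflowing. The sketch below assumes nothing drains the ring during the burst and uses the usual 20 bytes of Ethernet on-wire overhead per frame; the function name is illustrative.

```c
/* Time in microseconds that a ring of n_desc descriptors can absorb a
 * line-rate burst of frame_bytes-sized frames if the CPU drains nothing. */
double ring_absorb_us(unsigned n_desc, double link_gbps, unsigned frame_bytes)
{
    /* 20 bytes = preamble + start-of-frame delimiter + inter-frame gap */
    double ns_per_frame = (frame_bytes + 20u) * 8.0 / link_gbps;
    return n_desc * ns_per_frame / 1000.0;
}
```

At 10Gbps with 64-byte packets, a 64-entry ring fills in about 4.3µs while a 4096-entry ring takes about 275µs. The small ring only works if the consumer is never away for more than a few microseconds.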
Exercise 3: SoftIRQ Profiling (Advanced)
Task: Run high-bandwidth traffic (`iperf`).
Action: Run `mpstat -P ALL 1` and watch the `%soft` column.
Goal: Confirm the load is balanced across cores (RSS is working).
## Knowledge Check
- What is the “Doorbell” in NIC terminology?
- Why does `rx-usecs 0` increase CPU usage?
- What does `%si` mean in `top`?
- How does AF_XDP achieve Zero Copy?
- What happens if the RX Ring is full?
## Answers
- A memory-mapped register on the NIC that signals new TX data is ready.
- More Interrupts. The CPU wakes up for every single packet.
- SoftIRQ Time. Time spent processing protocol stacks (TCP/IP).
- UMEM. The NIC writes directly to userspace-registered memory.
- Packet Drop. The NIC discards the packet immediately.
## Summary
- Interrupts: Too slow for 10GbE at line rate.
- Polling: The approach for deterministic latency.
- Ring Buffers: The queue between NIC and RAM.
- AF_XDP: The modern way to bypass the kernel network stack.
Pro Version: For production-grade implementation details, see network-optimization-linux-latency