
Infrastructure

Network Optimization: Kernel Bypass and the Art of Busy Polling

How the Linux network stack adds latency, and the interrupt coalescing, busy polling, and AF_XDP techniques that reduce it.

3 min
#network #dpdk #af_xdp #busy-polling #latency #nic

On a standard Linux server with a Mellanox ConnectX-6, ping reports ~120µs RTT. After disabling interrupt coalescing and enabling busy polling, the same hardware can reach the ~18µs range.

The Linux network stack is optimized for throughput, not latency. Every packet traverses multiple software layers before reaching your application. Each layer adds jitter.

This post documents the techniques to reduce your network RTT without resorting to full kernel bypass.

1. The Physics of Network Latency

When a packet arrives, the NIC holds it in a hardware buffer. It waits for either:

  1. Interrupt Coalescing Timeout (e.g., 100µs): The NIC batches interrupts to reduce CPU load.
  2. Interrupt Coalescing Threshold (e.g., 64 packets): The NIC interrupts when the buffer is full.

This is great for throughput. It is terrible for latency.

Packet Arrives → Wait (up to 100µs) → Interrupt → CPU → NAPI Poll → Socket Buffer → read()

2. Options at Each Layer

| Approach | RTT Impact | CPU Cost | Verdict |
|---|---|---|---|
| A. Default (Coalescing On) | Baseline (~120µs) | Low | Optimized for throughput. |
| B. Coalescing Off | −30µs | Medium | Better, but still has syscall overhead. |
| C. Coalescing Off + Busy Polling | Down to ~18µs | High (1 core) | Near-kernel-bypass performance. |

3. Busy Polling Configuration

Step 1: Disable Interrupt Coalescing

```bash
# Disable adaptive coalescing
sudo ethtool -C eth0 adaptive-rx off adaptive-tx off

# Set coalescing to minimum
sudo ethtool -C eth0 rx-usecs 0 rx-frames 1
```

Step 2: Enable Busy Polling

Busy polling makes the `recvmsg()` syscall spin-wait on the NIC's RX queue instead of sleeping for an interrupt.

```bash
# /etc/sysctl.conf — comments on their own lines;
# sysctl.conf does not support trailing comments.
# Poll for up to 50µs before blocking (busy_read covers read()).
net.core.busy_poll = 50
net.core.busy_read = 50

sudo sysctl -p
```

Step 3: Set Socket Option

Your application must opt-in.

```c
int timeout = 50; // microseconds
setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &timeout, sizeof(timeout));
```

**Verification:**

```bash
# Before tuning: ~120µs
ping -c 100 <target>

# After: significantly lower, varies by NIC, driver, and kernel version
# The ~18µs figure assumes Mellanox mlx5 with a modern kernel
```


4. Checking Your Configuration

Before tuning, audit your current state:

```bash
# Check coalescing settings
ethtool -c eth0

# Check busy_poll sysctl values
sysctl net.core.busy_poll net.core.busy_read

# Check the CPU frequency governor on latency-critical CPUs
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```

5. Trade-offs

  1. CPU Burn: Busy polling consumes 100% of a CPU core while waiting for packets. This is acceptable for latency-critical trading paths; it is wasteful for general-purpose servers.
  2. Driver Support: Not all NIC drivers support busy polling. Mellanox (mlx5) and Intel (ixgbe, i40e) do. AWS ENA has limited support.
  3. Kernel Version: Busy polling performance improved significantly in Linux 4.4+. Use a modern kernel.

6. The Core Principle

The network stack is a trade-off between latency and efficiency. The kernel defaults to efficiency because most users care about throughput, not P99.

By disabling coalescing and enabling busy polling, you are telling the kernel: “I will pay the CPU cost. Give me my packets immediately.”

For HFT, a 100µs improvement is worth burning a core. For most applications, it is not. Know your SLA.

Continue Reading

Continue exploring with these related deep dives:

| Topic | Next Post |
|---|---|
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| WebSocket infrastructure & orderbook design | Market Data Infrastructure: WebSocket Patterns That Scale |
| The 5 kernel settings that cost you latency | Linux Defaults That Cost You Latency |
| StatefulSets, pod placement, EKS patterns | Kubernetes StatefulSets: Why Trading Systems Need State |
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |