

The Sub-50µs Cloud Lie: How to Actually Get Deterministic Latency on AWS

Why cloud providers' latency claims don't match production, and the exact kernel bypass techniques that get you to deterministic sub-50µs RTT on c6i.metal instances.

#aws #latency #kernel-bypass #dpdk #hft #nitro

Cloud providers advertise “sub-millisecond” latency. The fine print: measured in a test environment, between adjacent instances, with nothing else running.

Production is not a test environment. AWS deployments typically see 150-200µs RTT out of the box on standard ENA networking. Getting to sub-50µs requires bypassing the kernel entirely — and understanding exactly why the abstraction layers add latency.

This post documents the techniques. If you’re running latency-sensitive workloads on AWS, the default configuration is likely costing you well over 100µs of avoidable round-trip latency.

1. The Physics of Virtualization Jitter

Cloud latency jitter comes from three sources, all invisible to application code:

Source 1: Hypervisor Scheduling (The VMExit Tax)

Every privileged instruction (I/O, timer access) triggers a VMExit. The CPU traps to the hypervisor, context switches, and returns. On KVM, a single VMExit costs ~1µs.

Source 2: Noisy Neighbors (The Steal Time Tax)

Even on “dedicated” instances, the hypervisor’s management plane runs on your cores. This is invisible unless you check mpstat for %steal.

Source 3: NIC Virtualization (The VNIC Tax)

Standard ENAs route packets through the hypervisor’s virtual switch. Each packet incurs a copy and a context switch.

Packet Arrives → Hypervisor vSwitch → Copy to Guest → Interrupt → vCPU → Application

The Tax: Each hop adds ~20-40µs of non-deterministic delay.

2. Options

| Approach | P99 Latency | Complexity | Verdict |
|----------|-------------|------------|---------|
| A. Standard ENA (Default) | ~180µs | Low | Baseline. Unacceptable for HFT. |
| B. ENA Express (AWS Feature) | ~100µs | Low | Marginal improvement. Still hypervisor-bound. |
| C. c6i.metal + DPDK | ~47µs | High | Full kernel bypass. |

Why Metal? On .metal instances, the Nitro card presents the NIC directly via SR-IOV. There is no hypervisor vSwitch. The NIC is your hardware.

The ~47µs P99 figure is what DPDK benchmarks show on c6i.metal with a tight polling loop. Your result depends on workload, core pinning, and kernel configuration.

3. DPDK on Nitro

Bypass the kernel entirely using DPDK (Data Plane Development Kit).

Step 1: Bind the NIC to DPDK

```bash
# Load the vfio-pci driver
sudo modprobe vfio-pci

# Unbind from the kernel driver
sudo dpdk-devbind.py -u 0000:00:06.0

# Bind to DPDK
sudo dpdk-devbind.py -b vfio-pci 0000:00:06.0
```
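DPDK also needs hugepages reserved before the application starts; the devbind steps above assume this is already done. A typical setup sketch, assuming 2MB pages — the 2GB total is illustrative, size it to your mempools:

```shell
# Reserve 1024 x 2MB hugepages (2GB total) for DPDK's memory pools
echo 1024 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Mount hugetlbfs so DPDK can map them (many distros already do this at boot)
sudo mkdir -p /dev/hugepages
sudo mount -t hugetlbfs nodev /dev/hugepages
```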

Step 2: Poll in Userspace

Instead of waiting for interrupts, poll the NIC's RX ring in a tight loop.

```c
// DPDK pseudo-code
while (1) {
    nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);
    if (nb_rx > 0) {
        process_packets(pkts, nb_rx);
    }
}
```text

Verification:

  Before: ping -c 100 <target> shows a P99 around 200µs.
  After: a custom DPDK rte_rdtsc() loop shows P99 in the 40-50µs range.


4. Verifying Your Kernel State

Before you touch DPDK, make sure your kernel isn't fighting you:

```bash
# Check for irqbalance (moves interrupts, adds jitter)
systemctl status irqbalance

# Check nohz_full is set for latency-critical cores
cat /sys/devices/system/cpu/nohz_full

# Check transparent huge pages (causes allocation stalls)
cat /sys/kernel/mm/transparent_hugepage/enabled

# Check CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```

A misconfigured kernel will limit the gains you see from DPDK.
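If any of those checks fail, the fixes are a mix of boot parameters and runtime toggles. A sketch — the core IDs 2-7 are illustrative, substitute the cores you actually pin your polling threads to, and re-run update-grub and reboot for the boot-line change to take effect:

```shell
# /etc/default/grub — isolate the latency-critical cores from the scheduler,
# the timer tick, and RCU callbacks
GRUB_CMDLINE_LINUX="isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7"

# Runtime toggles
sudo systemctl stop irqbalance && sudo systemctl disable irqbalance
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```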

5. Trade-offs

  1. Observability Loss: DPDK packets don’t appear in tcpdump or iptables. You need custom tooling for any debugging.
  2. CPU Cost: Polling burns 100% of a dedicated core. Budget this into your instance sizing.
  3. Operational Complexity: Your team must understand userspace networking. This is a real skills requirement, not a one-time setup.

6. The Core Insight

The cloud is not slow. Your assumptions about the cloud are slow.

Every abstraction has a cost. The kernel’s networking stack was designed for generality, not for sub-100µs latency. When you demand determinism, you pay the complexity tax of bypassing the abstraction.

The question is not “Can we run latency-sensitive workloads on AWS?” It’s “Are we willing to operate at the metal level while paying cloud prices?”
