
The Physics of Memory: TLB, HugePages & page_fault_latency

Virtual Memory is an illusion. The physics of the Translation Lookaside Buffer (TLB), why 4KB pages are too small for HFT, and the cost of a Page Table Walk.

Intermediate · 45 min read

🎯 What You'll Learn

  • Deconstruct Virtual Memory (VA -> PA Translation)
  • Measure the latency of a TLB Miss (Page Table Walk Physics)
  • Implement 1GB HugePages (`hugepagesz=1G`)
  • Audit `thp_collapse_scan` (The 100ms Latency Spike)
  • Tune NUMA Locality with `numactl`

Introduction

The kernel doesn’t hand you a contiguous block of RAM; physical memory is fragmented. The CPU’s MMU, programmed by the kernel, maintains the illusion of Virtual Memory by mapping the address space in 4KB pages. For a 32GB address space, that’s roughly 8 million page-table entries.

Every memory access must consult this table. If the mapping is not cached in the CPU's TLB, the hardware must walk the page tables in slower RAM. This lesson covers TLB physics and why 1GB pages matter for latency-sensitive workloads.


TLB Reach

The TLB (Translation Lookaside Buffer) is a tiny cache inside the CPU core that stores virtual-to-physical address mappings.

  • L1 TLB Size: ~64 entries (varies by processor; check cpuid).
  • Access Time: 1-2 ns.

The Math of Reach:

  • With 4KB Pages: 64 × 4 KB = 256 KB reach.
  • With 1GB HugePages: 64 × 1 GB = 64 GB reach.

If your trading engine uses 4GB of RAM with 4KB pages, the TLB covers roughly 0.006% of your memory. You will suffer TLB Thrashing: nearly every access misses, forcing a page table walk (up to ~400 cycles on modern Intel when the walk itself misses the caches).


The Silent Killer: Transparent Huge Pages (THP)

The kernel tries to help via a background thread (khugepaged) that merges 4KB pages into 2MB pages automatically. This is harmful for latency-sensitive workloads.

  1. Stop-the-World: The kernel locks the memory region to merge pages.
  2. Spike: Your application freezes for 10ms - 200ms during compaction.
  3. Randomness: It happens unpredictably.

Rule for HFT: Disable THP.

```bash
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```

---

## Code: Allocating 1GB HugePages

For the hot path, 2MB pages are often sufficient, but 1GB pages stretch TLB reach even further for very large working sets.

### 1. Boot Parameters
Edit `/etc/default/grub`:
```bash
# Reserve 4 x 1GB Pages. Total 4GB.
GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=4"
```
*1GB pages must be reserved at boot. By the time the OS runs, RAM is too fragmented to find contiguous 1GB regions.*

### 2. Consuming HugePages (C/C++)
Standard `malloc()` won't use hugepages. You need `mmap` with `MAP_HUGETLB`.

```c
#include <stdio.h>      // perror
#include <stdlib.h>     // exit
#include <sys/mman.h>   // mmap, mlock

void* get_huge_memory() {
    size_t size = 1024 * 1024 * 1024;  // 1GB

    void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (ptr == MAP_FAILED) {
        perror("mmap failed. Did you reserve HugePages at boot?");
        exit(1);
    }

    // Lock to prevent swap
    if (mlock(ptr, size) == -1) {
        perror("mlock failed");
    }

    return ptr;
}
```

---

## NUMA: Remote RAM is Slow RAM

On dual-socket servers, half the RAM is near CPU 0, half near CPU 1.
Accessing "remote node" RAM travels over the QPI/UPI interconnect.
**Typical penalty:** ~40ns latency (roughly 30% slower for memory-bound code).

```bash
# Run on CPU Node 0, use only RAM from Node 0
numactl --cpunodebind=0 --membind=0 ./my_trading_app
```

Practice Exercises

Exercise 1: TLB Miss Hunt (Beginner)

  • Task: perf stat -e dTLB-load-misses ./app
  • Action: Run with standard malloc vs HugePages.
  • Observation: Watch misses drop significantly.

Exercise 2: The THP Scan (Intermediate)

  • Task: Enable THP (always).
  • Action: Run a memory-fragmentation script and watch grep thp_collapse_scan /proc/vmstat.
  • Goal: See the counters incrementing (kernel interference).

Exercise 3: NUMA Benchmarking (Advanced)

  • Task: numactl --cpunodebind=0 --membind=1 ./app (run on CPU node 0, force remote memory).
  • Action: Measure bandwidth and latency.
  • Compare: Run again with --membind=0 (local).


Knowledge Check

  1. What is the TLB Reach of 4KB pages vs 1GB pages (with a 64-entry L1 TLB)?
  2. Why is khugepaged bad for latency?
  3. Can you allocate 1GB pages while the system is running?
  4. What syscall locks memory to prevent swapping?
  5. What is the latency penalty of Remote NUMA access?
Answers
  1. 256KB vs 64GB. (Assuming 64-entry L1 TLB).
  2. Compaction Stalls. It pauses memory access to merge pages.
  3. Rarely. Memory is usually too fragmented. Do it at boot.
  4. mlock().
  5. ~40ns. (Roughly 30% slower than local NUMA access).

Summary

  • TLB: The most important cache for large working sets.
  • HugePages: The only way to increase TLB reach.
  • THP: Good for web servers, bad for trading.
  • NUMA: Local RAM is fast RAM.

Pro Version: See the full research: Memory Tuning for Linux Latency

What’s Next?

Now that you understand the physics of memory, learn how to bypass the kernel entirely:

The Sovereign Architect - Why kernel bypass and zero-copy are mandatory for sub-100µs latency.
