The Physics of Matching Engines: Speed & Determinism
Why the fastest engines in the world are Single-Threaded. Ring Buffers, CPU Cache alignment, and the LMAX Disruptor.
🎯 What You'll Learn
- Deconstruct the 'Single-Threaded Determinism' rule
- Analyze the LMAX Disruptor Pattern (Ring Buffers)
- Calculate Cache Miss penalties (`std::map` vs Arrays)
- Trace a packet from NIC to Matching Logic (Kernel Bypass)
- Architect a Lock-Free Order Processing Pipeline
Introduction
In Web Development, we scale by adding threads. In High Frequency Trading, adding a thread is a death sentence.
Why? Locks. A context switch costs roughly 2 microseconds; lock contention costs roughly 5 microseconds. A modern Matching Engine matches an order in about 50 nanoseconds. Take a single contended lock and you are 100x slower.
This lesson explores the physics of building the fastest software on Earth.
The Physics: Single-Threaded Determinism
Markets must be Fair.
If Client A sends an order at t=1 and Client B sends at t=2, A must execute before B.
If you have multi-threaded matching, Thread B might win a race condition against Thread A. This is illegal.
The Solution: The Sequencer. All incoming packets are serialized into a single stream. A Single Thread processes this stream. No locks are needed because only one thread owns the memory.
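A minimal Python sketch of the idea (the class name, queue choice, and shutdown-by-count are illustrative assumptions, not how a real exchange is built): every gateway thread pushes into one inbound queue, and a single matcher thread drains it in arrival order, stamping each order with a sequence number.

```python
import queue

class Sequencer:
    """Serializes orders from many gateway threads into one ordered stream."""
    def __init__(self):
        self.inbound = queue.Queue()  # arrival order = execution order
        self.seq = 0                  # monotonically increasing sequence number

    def submit(self, order):
        # Safe to call from any network/gateway thread.
        self.inbound.put(order)

    def run_matcher(self, match_fn, n_orders):
        # The ONLY thread that touches matching state: no locks on the book.
        for _ in range(n_orders):
            order = self.inbound.get()
            self.seq += 1
            match_fn(self.seq, order)
```

Because exactly one thread mutates the book, "A before B" is guaranteed by the queue's FIFO order rather than by lock acquisition order.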
Deep Dive: The Ring Buffer (LMAX Disruptor)
How do you pass data between the Network Thread (Writer) and the Matching Thread (Reader) without a lock? The Ring Buffer.
Imagine a circular array of 1 million slots.
- Writer: “I wrote to slot 100. I am moving my cursor to 101.”
- Reader: “I see the Writer is at 101. I can safely read 100.”
Physics:
- Memory Barrier: Used instead of Locks (Cost: ~10ns vs ~5000ns).
- Cache Locality: The array is contiguous in RAM, so the CPU pre-fetcher pulls the next order into L1 Cache before you even ask for it.
The Enemy: Cache Misses
Why `std::map` (a red-black tree) is banned in HFT:
It is a node-based container. Node A sits at address 0x100; Node B sits at 0x9000.
To traverse the tree, the CPU must fetch 0x100, stall ~100ns waiting on RAM, then fetch 0x9000.
The Fix: Flat Vectors.
We pre-allocate a `std::vector<Order>`.
Scanning a vector can be 100x faster than walking a tree, even though the algorithm is O(n) vs O(log n), because of Hardware Prefetching.
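Python cannot show cache effects directly, but a hedged sketch can show the data layout HFT prefers: price levels kept in one contiguous sorted array (searched with `bisect`) instead of a node-based tree. The class name and parallel-array design are illustrative assumptions.

```python
import bisect

class FlatBook:
    """Bid price levels kept in contiguous sorted arrays (ascending)."""
    def __init__(self):
        self.prices = []  # flat, cache-friendly layout: one contiguous block
        self.sizes = []   # parallel array: resting size at each price level

    def add(self, price, size):
        i = bisect.bisect_left(self.prices, price)
        if i < len(self.prices) and self.prices[i] == price:
            self.sizes[i] += size           # existing level: accumulate
        else:
            self.prices.insert(i, price)    # O(n) shift, but stays contiguous
            self.sizes.insert(i, size)

    def best_bid(self):
        return self.prices[-1] if self.prices else None
```

The insert shifts elements, which looks expensive on paper; in a real C++ flat book that shift is a `memmove` over cache-resident memory, which is exactly the trade this section describes.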
Code: The Disruptor Pattern
A conceptual implementation of a lock-free ring buffer.
```python
class RingBuffer:
    def __init__(self, size=1024):
        self.buffer = [None] * size
        self.size = size
        self.writer_cursor = 0
        self.reader_cursor = 0

    def write(self, data):
        # In C++, we would use memory_order_release here
        next_slot = (self.writer_cursor + 1) % self.size
        if next_slot == self.reader_cursor:
            raise Exception("Buffer Full")
        self.buffer[self.writer_cursor] = data
        self.writer_cursor = next_slot  # Atomic commit

    def read(self):
        # In C++, we would use memory_order_acquire here
        if self.reader_cursor == self.writer_cursor:
            return None  # Empty
        data = self.buffer[self.reader_cursor]
        self.reader_cursor = (self.reader_cursor + 1) % self.size
        return data
```
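A quick standalone demo of the full/empty discipline (the class is repeated here so the snippet runs on its own): the buffer deliberately sacrifices one slot, so a size-4 ring holds only 3 items, which is how the single writer and single reader can distinguish "full" from "empty" without a lock.

```python
# Repeated from the listing above so this snippet runs standalone.
class RingBuffer:
    def __init__(self, size=1024):
        self.buffer = [None] * size
        self.size = size
        self.writer_cursor = 0
        self.reader_cursor = 0

    def write(self, data):
        next_slot = (self.writer_cursor + 1) % self.size
        if next_slot == self.reader_cursor:
            raise Exception("Buffer Full")
        self.buffer[self.writer_cursor] = data
        self.writer_cursor = next_slot

    def read(self):
        if self.reader_cursor == self.writer_cursor:
            return None
        data = self.buffer[self.reader_cursor]
        self.reader_cursor = (self.reader_cursor + 1) % self.size
        return data

rb = RingBuffer(size=4)            # one slot sacrificed: capacity is 3
for order in ("A", "B", "C"):
    rb.write(order)
try:
    rb.write("D")                  # 4th write collides with the reader cursor
except Exception as e:
    print(e)                       # Buffer Full
assert rb.read() == "A"            # freeing one slot...
rb.write("D")                      # ...lets the writer wrap around to slot 3
```

In the real Disruptor the cursors are cache-line-padded atomic longs rather than Python ints, but the wrap-around arithmetic is the same.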
Practice Exercises
Exercise 1: Cache Miss Math (Beginner)
Scenario: L1 Cache hit = 1ns. RAM access = 100ns. Task: Algorithm A does 100 L1 hits. Algorithm B does 2 RAM accesses. Which is faster? (A = 100ns. B = 200ns. The “slower” algorithm wins if it fits in cache).
Exercise 2: Lock Contention (Intermediate)
Scenario: Two threads fight for a Mutex. Task: Why does the OS put the waiting thread to sleep? What happens when it is woken (a Context Switch)? Why is this unacceptable? (Sleeping costs >10us of latency.)
Exercise 3: Sequence Numbers (Advanced)
Task: The Matching Engine crashes. How do you recover? Solution: Replay the Journal. Since the engine is deterministic, replaying the same input stream produces the exact same state.
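The recovery property in Exercise 3 can be sketched with a toy deterministic engine (the class, journal format, and position arithmetic are illustrative assumptions): because the engine reads no wall clock and uses no randomness, replaying the journal from an empty state reproduces the pre-crash state exactly.

```python
class ToyEngine:
    """Toy deterministic 'engine': state is a pure function of the input stream."""
    def __init__(self):
        self.position = 0
        self.last_seq = 0

    def apply(self, seq, side, qty):
        # No clock reads, no randomness: same inputs => same state.
        assert seq == self.last_seq + 1, "gap in journal"
        self.last_seq = seq
        self.position += qty if side == "BUY" else -qty

def replay(journal):
    """Crash recovery: rebuild state by re-applying the journal in order."""
    engine = ToyEngine()
    for seq, side, qty in journal:
        engine.apply(seq, side, qty)
    return engine

journal = [(1, "BUY", 100), (2, "SELL", 30), (3, "BUY", 10)]
a = replay(journal)
b = replay(journal)  # a second replay lands on the identical state
assert (a.position, a.last_seq) == (b.position, b.last_seq) == (80, 3)
```

The sequence-number check is the important detail: it is what lets recovery detect a gap in the journal instead of silently rebuilding the wrong book.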
Knowledge Check
- Why are Matching Engines single-threaded?
- What is a Ring Buffer?
- Why do HFTs avoid `new`/`malloc` at runtime?
- What is a “Memory Barrier”?
- Why is a flat array faster than a tree for small datasets?
Answers
- Determinism and Speed. No race conditions, no lock overhead.
- Circular Queue. A fixed-size array used to pass data between threads lock-free.
- OS Overhead. Allocation involves the Kernel and can trigger Page Faults.
- CPU Instruction. It forces the CPU to flush write buffers, ensuring other cores see the data. Cheaper than a lock.
- Cache Locality. Arrays are contiguous; Trees are scattered in RAM.
Summary
- Logic: Single-Threaded > Multi-Threaded.
- Data: Arrays > Trees.
- Communication: Ring Buffers > Queues.