
The Physics of Matching Engines: Speed & Determinism

Why the fastest engines in the world are Single-Threaded. Ring Buffers, CPU Cache alignment, and the LMAX Disruptor.

Intermediate · 45 min read

🎯 What You'll Learn

  • Deconstruct the 'Single-Threaded Determinism' rule
  • Analyze the LMAX Disruptor Pattern (Ring Buffers)
  • Calculate Cache Miss penalties (`std::map` vs Arrays)
  • Trace a packet from NIC to Matching Logic (Kernel Bypass)
  • Architect a Lock-Free Order Processing Pipeline

Introduction

In Web Development, we scale by adding threads. In High Frequency Trading, adding a thread is a death sentence.

Why? Locks. A context switch costs roughly 2 microseconds. Lock contention costs around 5 microseconds. A modern Matching Engine matches an order in about 50 nanoseconds. Touch a contended lock once and you are 100x slower.
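The arithmetic behind that "100x" claim can be checked directly (the figures are the illustrative numbers from the text, not measurements):

```python
# Rough latency budget, in nanoseconds (illustrative figures, not benchmarks)
MATCH_NS = 50                # time to match one order
LOCK_CONTENTION_NS = 5_000   # 5 microseconds of lock contention
CONTEXT_SWITCH_NS = 2_000    # 2 microseconds per context switch

slowdown = LOCK_CONTENTION_NS / MATCH_NS
print(f"One contended lock costs {slowdown:.0f}x a full match")  # 100x
```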

This lesson explores the physics of building the fastest software on Earth.


The Physics: Single-Threaded Determinism

Markets must be fair. If Client A sends an order at t=1 and Client B sends one at t=2, A must execute before B. With multi-threaded matching, Thread B might win a race condition against Thread A and execute first. That violates price-time priority, and on a regulated venue it is illegal.

The Solution: The Sequencer. All incoming packets are serialized into a single stream. A Single Thread processes this stream. No locks are needed because only one thread owns the memory.
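A minimal sketch of the sequencer idea (the `Sequencer` class and `stamp` method are illustrative names, not from any specific exchange):

```python
import itertools

class Sequencer:
    """Serializes all inbound orders into one totally ordered stream."""
    def __init__(self):
        self._next_seq = itertools.count(1)
        self.stream = []  # in production: the ring buffer / journal

    def stamp(self, order):
        # Every order gets a globally unique, monotonically increasing
        # sequence number. Fairness = processing in exactly this order.
        seq = next(self._next_seq)
        self.stream.append((seq, order))
        return seq

seq = Sequencer()
seq.stamp({"client": "A", "side": "buy"})   # sequence 1
seq.stamp({"client": "B", "side": "sell"})  # sequence 2
# The single matching thread consumes seq.stream in order: A strictly before B.
```

Because only the matching thread ever mutates the order book, the book needs no locks at all; all concurrency is pushed to the edges.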


Deep Dive: The Ring Buffer (LMAX Disruptor)

How do you pass data between the Network Thread (Writer) and the Matching Thread (Reader) without a lock? The Ring Buffer.

Imagine a circular array of 1 million slots.

  • Writer: “I wrote to slot 100. I am moving my cursor to 101.”
  • Reader: “I see the Writer is at 101. I can safely read 100.”

Physics:

  • Memory Barrier: Used instead of Locks (Cost: ~10ns vs ~5000ns).
  • Cache Locality: The array is contiguous in RAM, so the CPU pre-fetcher pulls the next order into L1 Cache before you even ask for it.

The Enemy: Cache Misses

Why std::map (Red-Black Tree) is banned in HFT: It is a “Node-Based” container. Node A is at address 0x100. Node B is at 0x9000. To traverse the tree, the CPU must fetch 0x100, stall ~100ns waiting on RAM, then fetch 0x9000 and stall again. Every pointer hop is a potential cache miss.

The Fix: Flat Vectors. We pre-allocate std::vector<Order>. For cache-sized datasets, scanning a vector can be 100x faster than walking a tree, even though the algorithm is O(N) vs O(log N), because of Hardware Prefetching.
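One common flat layout, sketched here in Python (the `FlatBook` class is illustrative; real engines do this in C++ with fixed-size structs): price levels live in one contiguous pre-allocated array, indexed directly by tick offset.

```python
class FlatBook:
    """Price levels in one contiguous, pre-allocated list.

    Index = (price - min_price) // tick. No nodes, no pointers:
    the whole book is linear in memory, so scans prefetch cleanly.
    """
    def __init__(self, min_price, max_price, tick):
        self.min_price = min_price
        self.tick = tick
        n_levels = (max_price - min_price) // tick + 1
        self.qty = [0] * n_levels  # pre-allocated: no malloc at runtime

    def _idx(self, price):
        return (price - self.min_price) // self.tick

    def add(self, price, qty):
        self.qty[self._idx(price)] += qty  # O(1), one array write

    def best_bid(self):
        # O(N) scan from the top of the book, but the memory access
        # pattern is linear, which is what hardware prefetchers love.
        for i in range(len(self.qty) - 1, -1, -1):
            if self.qty[i]:
                return self.min_price + i * self.tick
        return None

book = FlatBook(min_price=10_000, max_price=10_100, tick=1)
book.add(10_050, 5)
book.add(10_020, 3)
print(book.best_bid())  # 10050
```

The trade-off: you must bound the price range up front, which exchanges can do because instruments have price bands.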


Code: The Disruptor Pattern

A conceptual implementation of a lock-free ring buffer.

class RingBuffer:
    """Single-producer / single-consumer lock-free queue (conceptual).

    Correct only with exactly one writer thread and one reader thread.
    One slot is always left empty to distinguish "full" from "empty",
    so a buffer of size N holds at most N - 1 items.
    """
    def __init__(self, size=1024):
        # Real implementations use a power-of-two size so the modulo
        # below becomes a cheap bitwise AND: idx & (size - 1).
        self.buffer = [None] * size
        self.size = size
        self.writer_cursor = 0
        self.reader_cursor = 0

    def write(self, data):
        # In C++, the cursor store would use memory_order_release
        next_slot = (self.writer_cursor + 1) % self.size
        if next_slot == self.reader_cursor:
            raise BufferError("Buffer full")

        self.buffer[self.writer_cursor] = data
        self.writer_cursor = next_slot  # publish: reader may now see the slot

    def read(self):
        # In C++, the cursor load would use memory_order_acquire
        if self.reader_cursor == self.writer_cursor:
            return None  # empty

        data = self.buffer[self.reader_cursor]
        self.reader_cursor = (self.reader_cursor + 1) % self.size
        return data

Practice Exercises

Exercise 1: Cache Miss Math (Beginner)

Scenario: L1 Cache hit = 1ns. RAM access = 100ns. Task: Algorithm A does 100 L1 hits. Algorithm B does 2 RAM accesses. Which is faster? (A = 100ns. B = 200ns. The “slower” algorithm wins if it fits in cache).
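The exercise's answer, worked through in code (same figures as the scenario):

```python
L1_HIT_NS = 1   # L1 cache hit
RAM_NS = 100    # main-memory round trip

algo_a = 100 * L1_HIT_NS  # 100 cache hits  -> 100 ns
algo_b = 2 * RAM_NS       # 2 RAM accesses  -> 200 ns

print(algo_a, algo_b)  # 100 200
# Algorithm A wins despite doing 50x more operations:
# the instruction count lies; the memory hierarchy decides.
```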

Exercise 2: Lock Contention (Intermediate)

Scenario: Two threads fight for a Mutex. Task: Why does the OS put the waiting thread to sleep? What happens when it wakes (a Context Switch)? Why is this unacceptable? (Sleeping and waking costs >10us of latency.)

Exercise 3: Sequence Numbers (Advanced)

Task: The Matching Engine crashes. How do you recover? Solution: Replay the Journal. Since the engine is deterministic, replaying the same input stream produces the exact same state.
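Journal replay only works because the engine is deterministic: state is a pure function of the input stream. A toy sketch (the `Engine` here tracks a single balance rather than an order book, purely for illustration):

```python
class Engine:
    """Toy deterministic engine: state is a pure function of its inputs."""
    def __init__(self):
        self.balance = 0

    def apply(self, event):
        if event["type"] == "deposit":
            self.balance += event["amount"]
        elif event["type"] == "withdraw":
            self.balance -= event["amount"]

# The journal: every sequenced input event, persisted before processing.
journal = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
]

live = Engine()
for e in journal:
    live.apply(e)

# Crash. Recovery = feed the same journal to a fresh engine instance.
recovered = Engine()
for e in journal:
    recovered.apply(e)

assert recovered.balance == live.balance == 70
```

No state snapshots are strictly required (though real engines take them to bound replay time); the journal plus determinism is enough to reconstruct everything.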


Knowledge Check

  1. Why are Matching Engines single-threaded?
  2. What is a Ring Buffer?
  3. Why do HFTs avoid new / malloc at runtime?
  4. What is a “Memory Barrier”?
  5. Why is a flat array faster than a tree for small datasets?
Answers
  1. Determinism and Speed. No race conditions, no lock overhead.
  2. Circular Queue. A fixed-size array used to pass data between threads lock-free.
  3. OS Overhead. Allocation involves the Kernel and can trigger Page Faults.
  4. CPU Instruction. It forces the CPU to flush write buffers, ensuring other cores see the data. Cheaper than a lock.
  5. Cache Locality. Arrays are contiguous; Trees are scattered in RAM.

Summary

  • Logic: Single-Threaded > Multi-Threaded.
  • Data: Arrays > Trees.
  • Communication: Ring Buffers > Queues.

Want to go deeper?

Weekly infrastructure insights for engineers who build trading systems.



Questions about this lesson? Working on related infrastructure?

Let's discuss