Load Balancing: The Physics of Queues
Why adding servers doesn't always make things faster. Little's Law, the Thundering Herd, and Layer 7 Traffic Shaping.
🎯 What You'll Learn
- Apply Little's Law ($L = \lambda W$) to system capacity
- Differentiate L4 (Packet) vs L7 (Request) Load Balancing
- Mitigate the 'Thundering Herd' problem
- Configure Nginx Upstream blocks
- Analyze Sticky Sessions vs Stateless Routing
Introduction
A load balancer is usually described as a traffic cop. It is better understood as a queue manager.
Every server is a queue.
- The CPU has a Run Queue.
- The Network Card has a Ring Buffer.
- The Database has a Lock Queue.
If you understand queueing theory, you understand load balancing. If you don’t, you add servers until you go bankrupt.
Little’s Law
The fundamental law of system capacity is:

$L = \lambda W$

- $L$: Average number of items in the system (queue length).
- $\lambda$: Average arrival rate (requests per second).
- $W$: Average wait time (latency).

The Insight: If latency ($W$) doubles, say the database slows down, then queue length ($L$) doubles even if traffic ($\lambda$) stays constant. The load balancer's job is to detect this and stop sending requests to the slow server before it crashes.
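A two-line sanity check makes the arithmetic concrete (the numbers below are illustrative, not taken from the lesson):

```python
# Little's Law: L = lambda * W
arrival_rate = 200   # lambda: requests per second
latency = 0.25       # W: average seconds each request spends in the system

concurrent = arrival_rate * latency
print(concurrent)            # 50.0 -> server must hold 50 requests in flight

# The database degrades and latency doubles:
print(arrival_rate * 0.50)   # 100.0 -> queue length doubles, traffic unchanged
```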
L4 vs. L7: The Layers of Traffic
How deep does the load balancer look?
Layer 4 (Transport)
- What it sees: IP + Port. “Packet from 1.2.3.4 to 5.6.7.8”.
- Action: Forwards packets. Fast. Dumb.
- Example: LVS, Maglev.
Layer 7 (Application)
- What it sees: HTTP Headers, Cookies, URL. “GET /api/user?id=5”.
- Action: Terminates TCP, reads request, opens new TCP to backend. Smart. More overhead.
- Example: Nginx, HAProxy, AWS ALB.
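To make the contrast concrete, here is a toy sketch in Python, not a real proxy; the function names, pool labels, and path prefixes are all invented for illustration. The L4 balancer can only key on connection metadata, while the L7 balancer routes on the parsed request:

```python
import zlib

def l4_pick(src_ip: str, backends: list[str]) -> str:
    """L4: only the connection tuple is visible; the payload is opaque bytes."""
    return backends[zlib.crc32(src_ip.encode()) % len(backends)]

def l7_pick(path: str, headers: dict[str, str],
            pools: dict[str, list[str]]) -> str:
    """L7: TCP was terminated and the HTTP request parsed, so we can
    route on URL, headers, or cookies."""
    if path.startswith("/api/"):
        pool = pools["api"]            # CPU-heavy pool
    elif "session_id" in headers.get("cookie", ""):
        pool = pools["stateful"]       # keep logged-in users together
    else:
        pool = pools["static"]         # cheap static-content pool
    return pool[0]  # a real proxy runs a balancing algorithm within the pool
```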
The Thundering Herd
Imagine 10,000 users are waiting for a cache entry. The entry expires. Suddenly, 10,000 requests hit the backend DB simultaneously. The DB crashes. The LB retries. The DB stays dead.
Solutions:
- Request Coalescing: The LB holds 9,999 requests, sends one to the backend, and serves the result to all 10,000 (sketched after this list).
- Jitter: Add small random delays to desynchronize spikes.
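Here is a minimal sketch of request coalescing (the "singleflight" pattern) using Python threads. The `Coalescer` class and its `fetch` method are invented names, and real load balancers implement this inside the proxy layer:

```python
import threading

class Coalescer:
    """Collapse concurrent identical requests into one backend call.
    A sketch, not production code (errors are not propagated to waiters)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> (done_event, result_box)

    def fetch(self, key, load_fn):
        with self._lock:
            entry = self._in_flight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._in_flight[key] = entry
        done, box = entry
        if leader:
            try:
                box["value"] = load_fn(key)  # the one real backend call
            finally:
                with self._lock:
                    del self._in_flight[key]
                done.set()  # wake every coalesced waiter at once
        else:
            done.wait()
        return box["value"]
```

If 10,000 threads call `fetch("hot_key", load_from_db)` simultaneously, `load_from_db` runs exactly once. Jitter is even simpler: when writing a cache entry, set its TTL to `ttl + random.uniform(0, 0.1 * ttl)` so hot keys don't expire in lockstep.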
Code: Weighted Round Robin
A simple round-robin is often not enough when servers have different capacity. You need weights.
```python
class WeightedRR:
    def __init__(self, servers):
        # servers = {"srv1": 5, "srv2": 1, "srv3": 1}
        self.servers = servers
        self.state = {k: 0 for k in servers}

    def get_server(self):
        # Smooth weighted round robin (simplified Nginx-style algorithm):
        # every round, each server's current weight grows by its configured
        # weight; the highest current weight wins, and the winner "pays
        # back" the total weight so it cannot win every round.
        best = None
        total_weight = 0
        for srv, weight in self.servers.items():
            self.state[srv] += weight
            total_weight += weight
            if best is None or self.state[srv] > self.state[best]:
                best = srv
        self.state[best] -= total_weight
        return best

# This ensures "smooth" distribution, not "bursty" distribution.
# srv1 doesn't get 5 requests in a row; it's interleaved.
```
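A quick demo of the smoothing, assuming the class above:

```python
rr = WeightedRR({"srv1": 5, "srv2": 1, "srv3": 1})
print([rr.get_server() for _ in range(7)])
# ['srv1', 'srv1', 'srv2', 'srv1', 'srv3', 'srv1', 'srv1']
# srv1 still gets 5 of every 7 picks, but never 5 in a row.
```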
Practice Exercises
Exercise 1: Capacity Planning (Beginner)
Scenario:
- You process 1000 Req/Sec ($\lambda$).
- Avg Latency is 0.5 Sec ($W$).

Task: According to Little's Law, how many concurrent connections ($L$) must your server support?
Exercise 2: Nginx Config (Intermediate)
Task: Configure an Nginx upstream block that:
- Load balances 3 servers.
- Sends 2x traffic to srv_heavy.
- Marks a server “down” if it fails 3 times.
Exercise 3: Sticky Sessions (Advanced)
Scenario: A user logs in on Server A. Their session is in Server A’s RAM.
Task: Why does Round Robin break this? How does ip_hash fix it? What is the downside of ip_hash during a DDoS attack?
Knowledge Check
- What happens to system capacity if latency increases?
- Why is L7 Load Balancing slower than L4?
- What is the “Thundering Herd” problem?
- Why do we need health checks?
- Does adding more servers always fix high latency?
Answers
- Capacity drops (or queue length explodes). $L = \lambda W$: if $W$ goes up, $L$ goes up at the same $\lambda$.
- More work per request. L7 requires terminating the TCP connection, buffering the request, parsing headers, and creating a new connection. L4 just rewrites packets.
- Massive concurrency. When a cache expires, all concurrent requests hit the DB at once.
- To avoid black holes. Sending traffic to a dead server results in 100% error rates.
- No. If the bottleneck is the Database (shared resource), adding more web servers just increases the queue pressure on the DB.
Summary
- Little’s Law: Latency kills throughput.
- Algorithms: Use Weighted Round Robin for heterogeneous backends.
- Layers: L4 for speed, L7 for intelligence.