Load Balancing: The Physics of Queues
Why adding servers doesn't always make things faster. Little's Law, the Thundering Herd, and Layer 7 Traffic Shaping.
🎯 What You'll Learn
- Apply Little's Law ($L = \lambda W$) to system capacity
- Differentiate L4 (Packet) vs L7 (Request) Load Balancing
- Mitigate the 'Thundering Herd' problem
- Configure Nginx Upstream blocks
- Analyze Sticky Sessions vs Stateless Routing
Introduction
A load balancer is usually described as a traffic cop. It is better understood as a queue manager.
Every server is a queue.
- The CPU has a Run Queue.
- The Network Card has a Ring Buffer.
- The Database has a Lock Queue.
If you understand queueing theory, you understand load balancing. If you don’t, you add servers until you go bankrupt.
Little’s Law
The fundamental law of system capacity is:

$L = \lambda W$

- $L$: Average number of items in the system (queue length).
- $\lambda$: Average arrival rate (requests per second).
- $W$: Average wait time (latency).

The Insight: If latency ($W$) doubles, say the database slows down, then queue length ($L$) doubles even if traffic ($\lambda$) stays constant. The load balancer's job is to detect this and stop sending requests to the slow server before it crashes.
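A two-line sanity check makes the arithmetic concrete (the numbers below are illustrative, not taken from the lesson):

```python
# Little's Law: L = lambda * W
arrival_rate = 200   # lambda: requests per second
latency = 0.25       # W: average seconds each request spends in the system

concurrent = arrival_rate * latency
print(concurrent)            # 50.0 -> server must hold 50 requests in flight

# The database degrades and latency doubles:
print(arrival_rate * 0.50)   # 100.0 -> queue length doubles, traffic unchanged
```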
L4 vs. L7: The Layers of Traffic
How deep does the load balancer look?
Layer 4 (Transport)
- What it sees: IP + Port. “Packet from 1.2.3.4 to 5.6.7.8”.
- Action: Forwards packets. Fast. Dumb.
- Example: LVS, Maglev.
Layer 7 (Application)
- What it sees: HTTP Headers, Cookies, URL. “GET /api/user?id=5”.
- Action: Terminates TCP, reads request, opens new TCP to backend. Smart. More overhead.
- Example: Nginx, HAProxy, AWS ALB.
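To make the contrast concrete, here is a toy sketch in Python, not a real proxy; the function names, pool labels, and path prefixes are all invented for illustration. The L4 balancer can only key on connection metadata, while the L7 balancer routes on the parsed request:

```python
import zlib

def l4_pick(src_ip: str, backends: list[str]) -> str:
    """L4: only the connection tuple is visible; the payload is opaque bytes."""
    return backends[zlib.crc32(src_ip.encode()) % len(backends)]

def l7_pick(path: str, headers: dict[str, str],
            pools: dict[str, list[str]]) -> str:
    """L7: TCP was terminated and the HTTP request parsed, so we can
    route on URL, headers, or cookies."""
    if path.startswith("/api/"):
        pool = pools["api"]            # CPU-heavy pool
    elif "session_id" in headers.get("cookie", ""):
        pool = pools["stateful"]       # keep logged-in users together
    else:
        pool = pools["static"]         # cheap static-content pool
    return pool[0]  # a real proxy runs a balancing algorithm within the pool
```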
The Thundering Herd
Imagine 10,000 users are waiting for a cache entry. The entry expires. Suddenly, 10,000 requests hit the backend DB simultaneously. The DB crashes. The LB retries. The DB stays dead.
Solutions:
- Request Coalescing: The LB holds 9,999 requests, sends one to the backend, and serves the result to all 10,000 (sketched after this list).
- Jitter: Add small random delays to desynchronize spikes.
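Here is a minimal sketch of request coalescing (the "singleflight" pattern) using Python threads. The `Coalescer` class and its `fetch` method are invented names, and real load balancers implement this inside the proxy layer:

```python
import threading

class Coalescer:
    """Collapse concurrent identical requests into one backend call.
    A sketch, not production code (errors are not propagated to waiters)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> (done_event, result_box)

    def fetch(self, key, load_fn):
        with self._lock:
            entry = self._in_flight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._in_flight[key] = entry
        done, box = entry
        if leader:
            try:
                box["value"] = load_fn(key)  # the one real backend call
            finally:
                with self._lock:
                    del self._in_flight[key]
                done.set()  # wake every coalesced waiter at once
        else:
            done.wait()
        return box["value"]
```

If 10,000 threads call `fetch("hot_key", load_from_db)` simultaneously, `load_from_db` runs exactly once. Jitter is even simpler: when writing a cache entry, set its TTL to `ttl + random.uniform(0, 0.1 * ttl)` so hot keys don't expire in lockstep.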
Code: Weighted Round Robin
A simple round-robin is often not enough when servers have different capacity. You need weights.
```python
class WeightedRR:
    def __init__(self, servers):
        # servers = {"srv1": 5, "srv2": 1, "srv3": 1}
        self.servers = servers
        self.state = {k: 0 for k in servers}

    def get_server(self):
        # Smooth weighted round robin (simplified Nginx-style algorithm):
        # every round, each server's current weight grows by its configured
        # weight; the highest current weight wins, and the winner "pays
        # back" the total weight so it cannot win every round.
        best = None
        total_weight = 0
        for srv, weight in self.servers.items():
            self.state[srv] += weight
            total_weight += weight
            if best is None or self.state[srv] > self.state[best]:
                best = srv
        self.state[best] -= total_weight
        return best

# This ensures "smooth" distribution, not "bursty" distribution.
# srv1 doesn't get 5 requests in a row; it's interleaved.
```
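A quick demo of the smoothing, assuming the class above:

```python
rr = WeightedRR({"srv1": 5, "srv2": 1, "srv3": 1})
print([rr.get_server() for _ in range(7)])
# ['srv1', 'srv1', 'srv2', 'srv1', 'srv3', 'srv1', 'srv1']
# srv1 still gets 5 of every 7 picks, but never 5 in a row.
```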
Practice Exercises
Exercise 1: Capacity Planning (Beginner)
Scenario:
- You process 1000 Req/Sec ($\lambda$).
- Avg Latency is 0.5 Sec ($W$).

Task: According to Little's Law, how many concurrent connections ($L$) must your server support?
Exercise 2: Nginx Config (Intermediate)
Task: Configure an Nginx upstream block that:
- Load balances 3 servers.
- Sends 2x traffic to srv_heavy.
- Marks a server “down” if it fails 3 times.
Exercise 3: Sticky Sessions (Advanced)
Scenario: A user logs in on Server A. Their session is in Server A’s RAM.
Task: Why does Round Robin break this? How does ip_hash fix it? What is the downside of ip_hash during a DDoS attack?
Knowledge Check
- What happens to system capacity if latency increases?
- Why is L7 Load Balancing slower than L4?
- What is the “Thundering Herd” problem?
- Why do we need health checks?
- Does adding more servers always fix high latency?
Answers
- Capacity drops (or queue length explodes). $L = \lambda W$: if $W$ goes up, $L$ goes up at the same $\lambda$.
- More work per request. L7 requires terminating the TCP connection, buffering the request, parsing headers, and creating a new connection. L4 just rewrites packets.
- Massive concurrency. When a cache expires, all concurrent requests hit the DB at once.
- To avoid black holes. Sending traffic to a dead server results in 100% error rates.
- No. If the bottleneck is the Database (shared resource), adding more web servers just increases the queue pressure on the DB.
Summary
- Little’s Law: Latency kills throughput.
- Algorithms: Use Weighted Round Robin for heterogeneous backends.
- Layers: L4 for speed, L7 for intelligence.