The Physics of Networking: From NIC to Socket
Why your ping is 0.1ms but your app is 10ms. The physics of DMA, Ring Buffers, SoftIRQs (NET_RX), and the Socket Buffer (sk_buff).
🎯 What You'll Learn
- Trace a packet's physical journey (NIC -> RAM -> CPU)
- Deconstruct the `sk_buff` (Socket Buffer) structure
- Analyze the SoftIRQ Bottleneck (`ksoftirqd`)
- Explain NAPI (New API) Polling vs Interrupts
- Tune Sysctl for 10Gbps+ Throughput
Introduction
Most developers think networking happens in the “Cloud.” Systems Engineers know networking happens in a Ring Buffer.
When you send a request to a database, you aren’t just sending “data.” You are triggering a violent chain reaction of electrical interrupts, memory copies, and context switches.
This lesson traces the Physics of a Packet: The exact sequence of events from the moment a photon hits your Network Card to the moment your Node.js app fires a callback.
The Physics: The Path of Ingress (RX)
A packet arrives at the NIC (Network Interface Card). What happens next?
Phase 1: The Hardware (DMA)
The CPU is too slow to read packets one by one. Instead, the NIC uses Direct Memory Access (DMA) to write the packet directly into a pre-allocated space in RAM called the RX Ring Buffer.
- Physics: The packet is in RAM, but the CPU doesn’t know it yet.
- Action: The NIC fires a Hard Interrupt (IRQ) to wake up the CPU.
Phase 2: The SoftIRQ (NET_RX)
The Hard IRQ handler must be insanely fast. It cannot process TCP logic. It signals: “Hey kernel, there’s work to do,” triggers a SoftIRQ, and exits.
This keeps the CPU responsive. The heavy lifting happens later in ksoftirqd context.
Phase 3: NAPI Polling
In the old days, 10,000 packets meant 10,000 interrupts. This caused "Receive Livelock": the CPU spent 100% of its time handling interrupts and 0% processing data. The solution: NAPI (New API).
- First packet -> Interrupt.
- Kernel disables Interrupts for that NIC.
- Kernel Polls the Ring Buffer until it is empty.
- Re-enables Interrupts.
Deep Dive: The sk_buff (Socket Buffer)
The struct sk_buff is arguably the most important data structure in the Linux networking subsystem. It represents a packet.
Crucially, Linux never copies packet data if it can avoid it.
It just passes pointers to this sk_buff structure around.
The Anatomy
- `head`: Start of the allocated buffer.
- `data`: Start of the actual packet data (e.g., skips the Ethernet header).
- `tail`: End of the packet data.
- `end`: End of the allocated buffer.
Physics of Parsing:
When the kernel parses headers (Ethernet -> IP -> TCP), it doesn’t move memory. It just increments the data pointer.
skb_pull(skb, 14) -> effectively “strips” the Ethernet header by moving the pointer forward 14 bytes. Zero cost.
Strategy: Zero Copy Networking
Why is read() slow?
Because it forces a Context Switch AND a Memory Copy.
Packet (Kernel RAM) -> App Buffer (User RAM).
The Solution: sendfile() or mmap().
These syscalls allow the Kernel to send data from Disk -> Network without ever copying it to User Space.
- Result: The CPU does nothing. DMA handles Disk -> RAM, and RAM -> NIC.
- Throughput: 10Gbps+ on a single core.
Code: Tuning for High Throughput
To handle 10Gbps or high packet rates, default Linux settings are insufficient.
```bash
# 1. Enlarge the Ring Buffers (NIC hardware queues)
# Prevents packet drops at the hardware level during micro-bursts
ethtool -G eth0 rx 4096 tx 4096

# 2. Distribute packet processing across cores
# RPS (Receive Packet Steering): per-queue CPU mask, set via sysfs
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
# RFS (Receive Flow Steering): global flow table used alongside RPS
sysctl -w net.core.rps_sock_flow_entries=32768

# 3. Increase SoftIRQ budget
# Allow the kernel to process more packets before yielding the CPU
sysctl -w net.core.netdev_budget=600

# 4. Enlarge TCP Window limits (BDP - Bandwidth-Delay Product)
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
```
Practice Exercises
Exercise 1: The Drop (Beginner)
Task: Run ethtool -S eth0 | grep drop.
Observation: If rx_queue_0_drops is increasing, your Ring Buffer is too small. The kernel isn’t polling fast enough to empty the hardware queue.
Exercise 2: The SoftIRQ Storm (Intermediate)
Task: Run top and look at the %si (SoftIRQ) CPU usage field.
Scenario: If %si is 100% on one core (e.g., Cpu0), you are bottlenecked by interrupt handling.
Fix: Enable RPS (Receive Packet Steering) to spread the load to other cores.
Exercise 3: Zero Copy Benchmark (Advanced)
Task: Compare cat largefile > /dev/null vs sending it via sendfile() (using a tool like nginx).
Result: sendfile consumes drastically less CPU because the data never crosses the User/Kernel boundary.
Knowledge Check
- What performs the copy from NIC to RAM?
- Why did the “Interrupt Livelock” happen before NAPI?
- What does `skb_pull()` actually do to the memory?
- Why is `ksoftirqd` usage high during heavy network load?
- What is the "Bandwidth-Delay Product"?
Answers
- DMA (Direct Memory Access). The CPU is not involved.
- CPU Starvation. The CPU spent all its cycles entering/exiting interrupt handlers, doing no actual work.
- Nothing. It just increments a pointer (advances the start offset).
- SoftIRQ Processing. This is the kernel thread dedicated to processing the backlog of packets from the Ring Buffer.
- Buffer Size. Throughput * RTT. The amount of data “in flight” that needs to be buffered for max speed.
Summary
- RX Ring: The hardware parking lot.
- SoftIRQ: The kernel worker thread.
- sk_buff: The pointer-based packet structure.
- Zero Copy: The art of doing nothing.