Infrastructure
eBPF Profiling: Nanoseconds Without Adding Any
Deep dive into eBPF, bpftrace, and kernel tracing. How to measure latency at nanosecond precision without the observer effect.
You cannot debug a latency spike if your debugger causes the spike.
Traditional profiling tools add significant overhead: strace costs 100-500µs per syscall; perf sampling burns 5-20% of a CPU. In a trading system processing 100K messages/second, these tools destroy the very behavior you're trying to measure.
eBPF changes this. Run sandboxed programs in the kernel, measure at nanosecond precision, add nearly zero overhead.
## The Problem {#the-problem}
Traditional tools and their costs:
| Tool | Overhead | Why |
|---|---|---|
| strace | 100-500µs/syscall | ptrace stops process, copies data |
| ltrace | 200-800µs/call | Library interposition |
| perf (sampling) | 5-20% CPU | Interrupt-based sampling |
| gdb | Process stopped | Breakpoint traps |
The observer effect: The measurement changes the behavior. At trading speeds, this makes traditional tools useless for production debugging.
eBPF overhead: 100-500 nanoseconds per probe. 1000x less than strace.
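A back-of-envelope check makes that gap concrete. This sketch, using the overhead figures quoted above, computes what fraction of each wall-clock second the tracer itself consumes at a given event rate:

```python
# Back-of-envelope: fraction of each second spent inside the tracer
# at a given event rate and per-event overhead.
def tracer_load(events_per_sec: int, overhead_us: float) -> float:
    return events_per_sec * overhead_us / 1_000_000

# 100K syscalls/sec with strace at ~300us/event: 30 CPU-seconds of
# tracer work per wall-clock second -- physically impossible.
print(tracer_load(100_000, 300))   # 30.0
# Same workload with an eBPF probe at ~0.3us/event: ~3% of one core.
print(tracer_load(100_000, 0.3))  # 0.03
```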
For what to measure, see Monitoring Metrics. For kernel tuning that affects what you'll observe, see the deep dive series in the Reading Path at the end of this post.
## Background: What Is eBPF? {#background}
### The Architecture
eBPF (extended Berkeley Packet Filter) runs sandboxed programs in the Linux kernel (kernel source: kernel/bpf/).
```
User space program (bpftrace, BCC tool)
        ↓
BPF bytecode compilation
        ↓
Kernel verifier (safety checks)
        ↓
JIT compilation to native code
        ↓
Attached to kernel hook (kprobe, tracepoint, etc.)
        ↓
Runs on every event, collects data
        ↓
Maps (hash tables, arrays) export to user space
```
### Why It's Fast
1. **In-kernel execution:** No context switches to user space
2. **JIT compiled:** Runs at near-native speed
3. **Verified:** Can't crash the kernel
4. **No locks:** Lock-free data structures for metrics
### Available Hooks
| Hook Type | What It Traces | Example |
|-----------|---------------|---------|
| kprobe | Function entry | `kprobe:tcp_sendmsg` |
| kretprobe | Function return | `kretprobe:tcp_sendmsg` |
| tracepoint | Static kernel events | `tracepoint:sched:sched_switch` |
| uprobe | User-space function | `uprobe:/usr/bin/python:main` |
| USDT | User-defined traces | Application-specific |
**Citation:** eBPF architecture documented in [kernel documentation](https://www.kernel.org/doc/html/latest/bpf/index.html).
## Tool 1: bpftrace One-Liners {#bpftrace}
### Installation
```bash
# Ubuntu/Debian
sudo apt install bpftrace
# Amazon Linux 2023
sudo dnf install bpftrace
```
### Syscall Latency Histogram
```bash
# Distribution of syscall latencies (microseconds)
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  @usecs = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
```
**Output:**
```
@usecs:
[0] 12453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 4521 |@@@@@@@@@@@@@@@@@ |
[2, 4) 1234 |@@@@@ |
[4, 8) 342 |@ |
[8, 16) 45 | |
```
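To turn a histogram like this into a number you can alert on, parse the power-of-two buckets and take a conservative upper-bound percentile. A minimal parser sketch, assuming the bucket formats shown above (`[0]` for a single value, `[2, 4)` for a half-open range):

```python
import re

# Matches bpftrace hist() buckets: '[0]  12453 |@@|' or '[2, 4)  1234 |@|'
BUCKET_RE = re.compile(r'\[(\d+)(?:\]|,\s*(\d+)\))\s+(\d+)')

def parse_hist(text: str):
    """Return a list of (upper_bound, count) from bpftrace hist output."""
    buckets = []
    for m in BUCKET_RE.finditer(text):
        lo, hi, count = int(m.group(1)), m.group(2), int(m.group(3))
        upper = int(hi) if hi else lo + 1  # '[0]' covers only the value 0
        buckets.append((upper, count))
    return buckets

def percentile_upper_bound(buckets, p):
    """Smallest bucket upper bound that covers fraction p of all events."""
    total = sum(c for _, c in buckets)
    cum = 0
    for upper, count in buckets:
        cum += count
        if cum >= p * total:
            return upper
    return buckets[-1][0]

output = """
[0]                12453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                 4521 |@@@@@@@@@@@@@@@@@                        |
[2, 4)              1234 |@@@@@                                    |
[4, 8)               342 |@                                        |
[8, 16)               45 |                                         |
"""
print(percentile_upper_bound(parse_hist(output), 0.99))  # 8 -> p99 <= 8us
```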
### Per-Syscall Latency
```bash
# Which syscalls are slow?
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter {
  @start[tid] = nsecs;
  @syscall[tid] = args->id;
}
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  @latency[@syscall[tid]] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
  delete(@syscall[tid]);
}'
```
## Tool 2: Network Tracing {#network}
### TCP Send Latency
```bash
# How long does tcp_sendmsg take in kernel?
sudo bpftrace -e '
kprobe:tcp_sendmsg { @start[tid] = nsecs; }
kretprobe:tcp_sendmsg /@start[tid]/ {
  @tcp_send_ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'
```
### Packet Receive Path
```bash
# Latency from NIC to socket
sudo bpftrace -e '
tracepoint:net:netif_receive_skb { @rx_start[args->skbaddr] = nsecs; }
tracepoint:skb:consume_skb /@rx_start[args->skbaddr]/ {
  @rx_time_ns = hist(nsecs - @rx_start[args->skbaddr]);
  delete(@rx_start[args->skbaddr]);
}'
```
### Softirq Time
```bash
# Network RX softirq processing time (vec 3 = NET_RX)
sudo bpftrace -e '
tracepoint:irq:softirq_entry /args->vec == 3/ { @start[tid] = nsecs; }
tracepoint:irq:softirq_exit /args->vec == 3 && @start[tid]/ {
  @net_softirq_ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'
```
For network tuning that affects these measurements, see [Network Deep Dive](/blog/network-optimization-linux-latency).
## Tool 3: Scheduler Latency {#scheduler}
### Run Queue Latency
```bash
# How long do threads wait for CPU?
sudo bpftrace -e '
tracepoint:sched:sched_wakeup { @qtime[args->pid] = nsecs; }
tracepoint:sched:sched_switch {
  if (@qtime[args->next_pid]) {
    @runq_latency_us = hist((nsecs - @qtime[args->next_pid]) / 1000);
    delete(@qtime[args->next_pid]);
  }
}'
```
**Interpretation:**
- Spikes at 4ms = timer tick preemption (CONFIG_HZ=250)
- Long tail = CPU contention
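The 4ms figure falls straight out of the tick frequency. A one-liner sketch to predict where preemption spikes should land for your own kernel's CONFIG_HZ:

```python
def tick_period_ms(config_hz: int) -> float:
    """Timer tick period implied by CONFIG_HZ -- where preemption spikes land."""
    return 1000.0 / config_hz

print(tick_period_ms(250))    # 4.0 -> matches a spike at 4ms
print(tick_period_ms(1000))   # 1.0 -> CONFIG_HZ=1000 ticks every 1ms
```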
### Off-CPU Time
```bash
# Why is my thread not running?
sudo bpftrace -e '
tracepoint:sched:sched_switch {
  @off_start[args->prev_pid] = nsecs;
  if (@off_start[args->next_pid]) {
    @off_cpu_us[args->next_comm] = hist((nsecs - @off_start[args->next_pid]) / 1000);
    delete(@off_start[args->next_pid]);
  }
}'
```
For scheduler tuning, see [CPU Deep Dive](/blog/cpu-optimization-linux-latency#isolation).
## Tool 4: Memory Allocation {#memory}
### Malloc Latency
```bash
# How long does malloc take?
sudo bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
  @start[tid] = nsecs;
  @size[tid] = arg0;
}
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /@start[tid]/ {
  @malloc_ns[@size[tid]] = hist(nsecs - @start[tid]);
  delete(@start[tid]);
  delete(@size[tid]);
}'
```
### Page Faults
```bash
# Minor and major page fault latency
sudo bpftrace -e '
kprobe:handle_mm_fault { @start[tid] = nsecs; }
kretprobe:handle_mm_fault /@start[tid]/ {
  @page_fault_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
```
For memory tuning, see [Memory Deep Dive](/blog/memory-tuning-linux-latency).
## Tool 5: Block I/O {#block-io}
### Request Latency
```bash
# Block I/O request latency (issue to completion)
sudo bpftrace -e '
tracepoint:block:block_rq_issue { @issue[args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@issue[args->sector]/ {
  @io_latency_us = hist((nsecs - @issue[args->sector]) / 1000);
  delete(@issue[args->sector]);
}'
```
### I/O Size Distribution
```bash
# What sizes are we reading/writing?
sudo bpftrace -e '
tracepoint:block:block_rq_issue {
  @io_size_bytes = hist(args->bytes);
}'
```
For storage tuning, see [Storage Deep Dive](/blog/storage-io-linux-latency).
## Production Deployment {#production}
### Continuous Monitoring with BCC
```python
#!/usr/bin/env python3
# syscall_latency_exporter.py
from bcc import BPF
from prometheus_client import start_http_server, Histogram
import time
bpf_program = """
#include <uapi/linux/ptrace.h>
BPF_HASH(start, u32);
BPF_HISTOGRAM(dist);
TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
u32 pid = bpf_get_current_pid_tgid();
u64 ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
return 0;
}
TRACEPOINT_PROBE(raw_syscalls, sys_exit) {
u32 pid = bpf_get_current_pid_tgid();
u64 *tsp = start.lookup(&pid);
if (tsp) {
u64 delta = bpf_ktime_get_ns() - *tsp;
dist.increment(bpf_log2l(delta / 1000)); // microseconds
start.delete(&pid);
}
return 0;
}
"""
SYSCALL_LATENCY = Histogram(
'syscall_latency_microseconds',
'Syscall latency',
buckets=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
)
def main():
b = BPF(text=bpf_program)
start_http_server(9101)
while True:
time.sleep(10)
for k, v in b["dist"].items():
bucket_us = 2 ** k.value
SYSCALL_LATENCY.observe(bucket_us)
b["dist"].clear()
if __name__ == "__main__":
main()
```text
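The exporter stores a log2 bucket index in the BPF map and replays it into Prometheus in user space. A user-space sketch of that draining step, assuming (as the exporter's `2 ** k.value` line implies) that map slot *n* represents an upper bound of 2^n microseconds:

```python
# Mirror of the exporter's map-draining loop: each BPF map slot k holds a
# count of events whose latency fell in the 2**k microsecond bucket; we
# emit that bucket's upper bound once per counted event.
def drain(dist: dict) -> list:
    """Expand {slot: count} into the values the exporter would observe()."""
    observations = []
    for slot, count in sorted(dist.items()):
        bucket_us = 2 ** slot          # same mapping as `2 ** k.value`
        observations.extend([bucket_us] * count)
    return observations

print(drain({0: 2, 3: 1, 7: 1}))  # [1, 1, 8, 128]
```

Note this quantizes every event to its bucket's upper bound, so the exported histogram is an upper-bound approximation of the true distribution.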
### Kubernetes DaemonSet
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-exporter
spec:
  selector:
    matchLabels:
      app: ebpf-exporter
  template:
    metadata:
      labels:
        app: ebpf-exporter
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: exporter
          image: ebpf-exporter:latest
          securityContext:
            privileged: true
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
            - name: debug
              mountPath: /sys/kernel/debug
      volumes:
        - name: bpf
          hostPath:
            path: /sys/fs/bpf
        - name: debug
          hostPath:
            path: /sys/kernel/debug
```
## Design Philosophy {#design-philosophy}
### Measure, Don't Assume
Before optimizing:
1. **Measure current state** (where is latency?)
2. **Make change**
3. **Measure again** (did it help?)
**The pattern:** Most optimization efforts target the wrong thing. Measurement reveals the true bottleneck.
### The Observer Effect Quantified
| Tool | Overhead/Event | Events/sec Impact |
|------|----------------|-------------------|
| printf | 10-50µs | Severe |
| strace | 100-500µs | Unusable |
| perf | 0.1-1µs | 5-20% CPU |
| eBPF | 0.1-0.5µs | <1% CPU |
**The tradeoff:** eBPF requires kernel knowledge to write and interpret. Worth it for production observability.
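Another way to read the table above: at what event rate does each tool's per-event overhead saturate a full core by itself? A sketch using midpoint overheads from the table:

```python
def saturation_rate(overhead_us: float) -> float:
    """Events/sec at which per-event tracing overhead alone consumes one core."""
    return 1_000_000 / overhead_us

print(round(saturation_rate(300)))   # strace at ~300us/event: ~3,333 events/sec
print(round(saturation_rate(0.3)))   # eBPF at ~0.3us/event: ~3.3M events/sec
```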
### When to Use Each Tool
| Situation | Tool |
|-----------|------|
| Quick debugging | bpftrace one-liner |
| Production monitoring | BCC + Prometheus |
| Detailed analysis | perf + flame graphs |
| Syscall list | strace (development only) |
---
## Baseline Your Kernel Before Profiling
Profiling reveals where latency goes. But if your kernel is misconfigured, the profiler will show you symptoms of the wrong problem. Before you start measuring, check:
```bash
# CPU governor (should be 'performance' for latency work)
cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor
# Transparent huge pages (should be 'never' or 'madvise')
cat /sys/kernel/mm/transparent_hugepage/enabled
# irqbalance (should be stopped on latency-critical systems)
systemctl status irqbalance Up Next in Linux Infrastructure Deep Dives
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |