Infrastructure
eBPF Profiling: Nanoseconds Without Adding Any
Deep dive into eBPF, bpftrace, and kernel tracing. How to measure latency at nanosecond precision without the observer effect.
You cannot debug a latency spike if your debugger causes the spike.
Traditional profiling tools add significant overhead: strace costs 100-500µs per syscall; perf sampling burns 5-20% of a CPU. In a trading system processing 100K messages/second, these tools destroy the very behavior you're trying to measure.
eBPF changes this. Run sandboxed programs in the kernel, measure at nanosecond precision, add nearly zero overhead.
## The Problem {#the-problem}
Traditional tools and their costs:
| Tool | Overhead | Why |
|---|---|---|
| strace | 100-500µs/syscall | ptrace stops process, copies data |
| ltrace | 200-800µs/call | Library interposition |
| perf (sampling) | 5-20% CPU | Interrupt-based sampling |
| gdb | Process stopped | Breakpoint traps |
The observer effect: The measurement changes the behavior. At trading speeds, this makes traditional tools useless for production debugging.
eBPF overhead: 100-500 nanoseconds per probe. 1000x less than strace.
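A back-of-envelope check makes that gap concrete. This sketch, using the overhead figures quoted above, computes what fraction of each wall-clock second the tracer itself consumes at a given event rate:

```python
# Back-of-envelope: fraction of each second spent inside the tracer
# at a given event rate and per-event overhead.
def tracer_load(events_per_sec: int, overhead_us: float) -> float:
    return events_per_sec * overhead_us / 1_000_000

# 100K syscalls/sec with strace at ~300us/event: 30 CPU-seconds of
# tracer work per wall-clock second -- physically impossible.
print(tracer_load(100_000, 300))   # 30.0
# Same workload with an eBPF probe at ~0.3us/event: ~3% of one core.
print(tracer_load(100_000, 0.3))  # 0.03
```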
For what to measure, see Monitoring Metrics. For kernel tuning that affects what you'll observe, see the deep dive series in the Reading Path at the end of this post.
## Background: What Is eBPF? {#background}
### The Architecture
eBPF (extended Berkeley Packet Filter) runs sandboxed programs in the Linux kernel (kernel source: kernel/bpf/).
```
User space program (bpftrace, BCC tool)
        ↓
BPF bytecode compilation
        ↓
Kernel verifier (safety checks)
        ↓
JIT compilation to native code
        ↓
Attached to kernel hook (kprobe, tracepoint, etc.)
        ↓
Runs on every event, collects data
        ↓
Maps (hash tables, arrays) export to user space
```
### Why It's Fast
1. **In-kernel execution:** No context switches to user space
2. **JIT compiled:** Runs at near-native speed
3. **Verified:** Can't crash the kernel
4. **No locks:** Lock-free data structures for metrics
### Available Hooks
| Hook Type | What It Traces | Example |
|-----------|---------------|---------|
| kprobe | Function entry | `kprobe:tcp_sendmsg` |
| kretprobe | Function return | `kretprobe:tcp_sendmsg` |
| tracepoint | Static kernel events | `tracepoint:sched:sched_switch` |
| uprobe | User-space function | `uprobe:/usr/bin/python:main` |
| USDT | User-defined traces | Application-specific |
**Citation:** eBPF architecture documented in [kernel documentation](https://www.kernel.org/doc/html/latest/bpf/index.html).
## Tool 1: bpftrace One-Liners {#bpftrace}
### Installation
```bash
# Ubuntu/Debian
sudo apt install bpftrace
# Amazon Linux 2023
sudo dnf install bpftrace
```
### Syscall Latency Histogram
```bash
# Distribution of syscall latencies (microseconds)
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  @usecs = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
```
**Output:**
```
@usecs:
[0] 12453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 4521 |@@@@@@@@@@@@@@@@@ |
[2, 4) 1234 |@@@@@ |
[4, 8) 342 |@ |
[8, 16) 45 | |
```
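To turn a histogram like this into a number you can alert on, parse the power-of-two buckets and take a conservative upper-bound percentile. A minimal parser sketch, assuming the bucket formats shown above (`[0]` for a single value, `[2, 4)` for a half-open range):

```python
import re

# Matches bpftrace hist() buckets: '[0]  12453 |@@|' or '[2, 4)  1234 |@|'
BUCKET_RE = re.compile(r'\[(\d+)(?:\]|,\s*(\d+)\))\s+(\d+)')

def parse_hist(text: str):
    """Return a list of (upper_bound, count) from bpftrace hist output."""
    buckets = []
    for m in BUCKET_RE.finditer(text):
        lo, hi, count = int(m.group(1)), m.group(2), int(m.group(3))
        upper = int(hi) if hi else lo + 1  # '[0]' covers only the value 0
        buckets.append((upper, count))
    return buckets

def percentile_upper_bound(buckets, p):
    """Smallest bucket upper bound that covers fraction p of all events."""
    total = sum(c for _, c in buckets)
    cum = 0
    for upper, count in buckets:
        cum += count
        if cum >= p * total:
            return upper
    return buckets[-1][0]

output = """
[0]                12453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                 4521 |@@@@@@@@@@@@@@@@@                        |
[2, 4)              1234 |@@@@@                                    |
[4, 8)               342 |@                                        |
[8, 16)               45 |                                         |
"""
print(percentile_upper_bound(parse_hist(output), 0.99))  # 8 -> p99 <= 8us
```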
### Per-Syscall Latency
```bash
# Which syscalls are slow?
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter {
  @start[tid] = nsecs;
  @syscall[tid] = args->id;
}
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  @latency[@syscall[tid]] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
  delete(@syscall[tid]);
}'
```
## Tool 2: Network Tracing {#network}
### TCP Send Latency
```bash
# How long does tcp_sendmsg take in kernel?
sudo bpftrace -e '
kprobe:tcp_sendmsg { @start[tid] = nsecs; }
kretprobe:tcp_sendmsg /@start[tid]/ {
  @tcp_send_ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'
```
### Packet Receive Path
```bash
# Latency from NIC to socket
sudo bpftrace -e '
tracepoint:net:netif_receive_skb { @rx_start[args->skbaddr] = nsecs; }
tracepoint:skb:consume_skb /@rx_start[args->skbaddr]/ {
  @rx_time_ns = hist(nsecs - @rx_start[args->skbaddr]);
  delete(@rx_start[args->skbaddr]);
}'
```
### Softirq Time
```bash
# Network RX softirq processing time (vec 3 = NET_RX)
sudo bpftrace -e '
tracepoint:irq:softirq_entry /args->vec == 3/ { @start[tid] = nsecs; }
tracepoint:irq:softirq_exit /args->vec == 3 && @start[tid]/ {
  @net_softirq_ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'
```
For network tuning that affects these measurements, see [Network Deep Dive](/blog/network-optimization-linux-latency).
## Tool 3: Scheduler Latency {#scheduler}
### Run Queue Latency
```bash
# How long do threads wait for CPU?
sudo bpftrace -e '
tracepoint:sched:sched_wakeup { @qtime[args->pid] = nsecs; }
tracepoint:sched:sched_switch {
  if (@qtime[args->next_pid]) {
    @runq_latency_us = hist((nsecs - @qtime[args->next_pid]) / 1000);
    delete(@qtime[args->next_pid]);
  }
}'
```
**Interpretation:**
- Spikes at 4ms = timer tick preemption (CONFIG_HZ=250)
- Long tail = CPU contention
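The 4ms figure falls straight out of the tick frequency. A one-liner sketch to predict where preemption spikes should land for your own kernel's CONFIG_HZ:

```python
def tick_period_ms(config_hz: int) -> float:
    """Timer tick period implied by CONFIG_HZ -- where preemption spikes land."""
    return 1000.0 / config_hz

print(tick_period_ms(250))    # 4.0 -> matches a spike at 4ms
print(tick_period_ms(1000))   # 1.0 -> CONFIG_HZ=1000 ticks every 1ms
```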
### Off-CPU Time
```bash
# Why is my thread not running?
sudo bpftrace -e '
tracepoint:sched:sched_switch {
  @off_start[args->prev_pid] = nsecs;
  if (@off_start[args->next_pid]) {
    @off_cpu_us[args->next_comm] = hist((nsecs - @off_start[args->next_pid]) / 1000);
    delete(@off_start[args->next_pid]);
  }
}'
```
For scheduler tuning, see [CPU Deep Dive](/blog/cpu-optimization-linux-latency#isolation).
## Tool 4: Memory Allocation {#memory}
### Malloc Latency
```bash
# How long does malloc take?
sudo bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
  @start[tid] = nsecs;
  @size[tid] = arg0;
}
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /@start[tid]/ {
  @malloc_ns[@size[tid]] = hist(nsecs - @start[tid]);
  delete(@start[tid]);
  delete(@size[tid]);
}'
```
### Page Faults
```bash
# Minor and major page fault latency
sudo bpftrace -e '
kprobe:handle_mm_fault { @start[tid] = nsecs; }
kretprobe:handle_mm_fault /@start[tid]/ {
  @page_fault_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
```
For memory tuning, see [Memory Deep Dive](/blog/memory-tuning-linux-latency).
## Tool 5: Block I/O {#block-io}
### Request Latency
```bash
# Block I/O request latency (issue to completion)
sudo bpftrace -e '
tracepoint:block:block_rq_issue { @issue[args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@issue[args->sector]/ {
  @io_latency_us = hist((nsecs - @issue[args->sector]) / 1000);
  delete(@issue[args->sector]);
}'
```
### I/O Size Distribution
```bash
# What sizes are we reading/writing?
sudo bpftrace -e '
tracepoint:block:block_rq_issue {
  @io_size_bytes = hist(args->bytes);
}'
```
For storage tuning, see [Storage Deep Dive](/blog/storage-io-linux-latency).
## Production Deployment {#production}
### Continuous Monitoring with BCC
```python
#!/usr/bin/env python3
# syscall_latency_exporter.py
from bcc import BPF
from prometheus_client import start_http_server, Histogram
import time
bpf_program = """
#include <uapi/linux/ptrace.h>
BPF_HASH(start, u32);
BPF_HISTOGRAM(dist);
TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
u32 pid = bpf_get_current_pid_tgid();
u64 ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
return 0;
}
TRACEPOINT_PROBE(raw_syscalls, sys_exit) {
u32 pid = bpf_get_current_pid_tgid();
u64 *tsp = start.lookup(&pid);
if (tsp) {
u64 delta = bpf_ktime_get_ns() - *tsp;
dist.increment(bpf_log2l(delta / 1000)); // microseconds
start.delete(&pid);
}
return 0;
}
"""
SYSCALL_LATENCY = Histogram(
'syscall_latency_microseconds',
'Syscall latency',
buckets=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
)
def main():
b = BPF(text=bpf_program)
start_http_server(9101)
while True:
time.sleep(10)
for k, v in b["dist"].items():
bucket_us = 2 ** k.value
SYSCALL_LATENCY.observe(bucket_us)
b["dist"].clear()
if __name__ == "__main__":
main()
```text
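The exporter stores a log2 bucket index in the BPF map and replays it into Prometheus in user space. A user-space sketch of that draining step, assuming (as the exporter's `2 ** k.value` line implies) that map slot *n* represents an upper bound of 2^n microseconds:

```python
# Mirror of the exporter's map-draining loop: each BPF map slot k holds a
# count of events whose latency fell in the 2**k microsecond bucket; we
# emit that bucket's upper bound once per counted event.
def drain(dist: dict) -> list:
    """Expand {slot: count} into the values the exporter would observe()."""
    observations = []
    for slot, count in sorted(dist.items()):
        bucket_us = 2 ** slot          # same mapping as `2 ** k.value`
        observations.extend([bucket_us] * count)
    return observations

print(drain({0: 2, 3: 1, 7: 1}))  # [1, 1, 8, 128]
```

Note this quantizes every event to its bucket's upper bound, so the exported histogram is an upper-bound approximation of the true distribution.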
### Kubernetes DaemonSet
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-exporter
spec:
  selector:
    matchLabels:
      app: ebpf-exporter
  template:
    metadata:
      labels:
        app: ebpf-exporter
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: exporter
          image: ebpf-exporter:latest
          securityContext:
            privileged: true
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
            - name: debug
              mountPath: /sys/kernel/debug
      volumes:
        - name: bpf
          hostPath:
            path: /sys/fs/bpf
        - name: debug
          hostPath:
            path: /sys/kernel/debug
```
## Design Philosophy {#design-philosophy}
### Measure, Don't Assume
Before optimizing:
1. **Measure current state** (where is latency?)
2. **Make change**
3. **Measure again** (did it help?)
**The pattern:** Most optimization efforts target the wrong thing. Measurement reveals the true bottleneck.
### The Observer Effect Quantified
| Tool | Overhead/Event | Events/sec Impact |
|------|----------------|-------------------|
| printf | 10-50µs | Severe |
| strace | 100-500µs | Unusable |
| perf | 0.1-1µs | 5-20% CPU |
| eBPF | 0.1-0.5µs | <1% CPU |
**The tradeoff:** eBPF requires kernel knowledge to write and interpret. Worth it for production observability.
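Another way to read the table above: at what event rate does each tool's per-event overhead saturate a full core by itself? A sketch using midpoint overheads from the table:

```python
def saturation_rate(overhead_us: float) -> float:
    """Events/sec at which per-event tracing overhead alone consumes one core."""
    return 1_000_000 / overhead_us

print(round(saturation_rate(300)))   # strace at ~300us/event: ~3,333 events/sec
print(round(saturation_rate(0.3)))   # eBPF at ~0.3us/event: ~3.3M events/sec
```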
### When to Use Each Tool
| Situation | Tool |
|-----------|------|
| Quick debugging | bpftrace one-liner |
| Production monitoring | BCC + Prometheus |
| Detailed analysis | perf + flame graphs |
| Syscall list | strace (development only) |
---
## Baseline Your Kernel Before Profiling
Profiling reveals where latency goes. But if your kernel is misconfigured, the profiler will show you symptoms of the wrong problem. Before you start measuring, check:
```bash
# CPU governor (should be 'performance' for latency work)
cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor
# Transparent huge pages (should be 'never' or 'madvise')
cat /sys/kernel/mm/transparent_hugepage/enabled
# irqbalance (should be stopped on latency-critical systems)
systemctl status irqbalance Up Next in Linux Infrastructure Deep Dives
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |