
# eBPF Profiling: Nanoseconds Without Adding Any

Deep dive into eBPF, bpftrace, and kernel tracing. How to measure latency at nanosecond precision without the observer effect.

#ebpf #profiling #observability #linux #latency #kernel #monitoring

You cannot debug a latency spike if your debugger causes the spike.

Traditional profiling tools add significant overhead: strace costs 100-500µs per syscall, perf sampling uses 5-20% CPU. In a trading system processing 100K messages/second, these tools destroy the behavior you’re trying to measure.

eBPF changes this. Run sandboxed programs in the kernel, measure at nanosecond precision, add nearly zero overhead.

## The Problem {#the-problem}

Traditional tools and their costs:

| Tool | Overhead | Why |
|------|----------|-----|
| strace | 100-500µs/syscall | ptrace stops the process, copies data |
| ltrace | 200-800µs/call | Library interposition |
| perf (sampling) | 5-20% CPU | Interrupt-based sampling |
| gdb | Process stopped | Breakpoint traps |

**The observer effect:** The measurement changes the behavior. At trading speeds, this makes traditional tools useless for production debugging.

**eBPF overhead:** 100-500 nanoseconds per probe, roughly 1000x less than strace.

For what to measure, see Monitoring Metrics. For kernel tuning that affects what you’ll observe, see the deep-dive series linked at the end of this post.

## Background: What Is eBPF? {#background}

### The Architecture

eBPF (extended Berkeley Packet Filter) runs sandboxed programs in the Linux kernel (kernel source: kernel/bpf/).

```
User space program (bpftrace, bcc tool)
          ↓
BPF bytecode compilation
          ↓
Kernel verifier (safety checks)
          ↓
JIT compilation to native code
          ↓
Attached to kernel hook (kprobe, tracepoint, etc.)
          ↓
Runs on every event, collects data
          ↓
Maps (hash tables, arrays) export to user space
```

### Why It's Fast

1. **In-kernel execution:** No context switches to user space
2. **JIT compiled:** Runs at near-native speed
3. **Verified:** Can't crash the kernel
4. **No locks:** Lock-free data structures for metrics
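
A quick way to sanity-check two of these claims on your own box before trusting any measurements: confirm the BPF JIT is enabled and ask bpftrace which kernel features it can see. (The sysctl and the `--info` flag are standard, but exact output varies by kernel and bpftrace version.)

```bash
# 1 = JIT enabled (near-native speed), 0 = interpreted bytecode
sysctl net.core.bpf_jit_enable

# Kernel BPF features visible to bpftrace: helpers, map types, probe types
sudo bpftrace --info
```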

### Available Hooks

| Hook Type | What It Traces | Example |
|-----------|---------------|---------|
| kprobe | Function entry | `kprobe:tcp_sendmsg` |
| kretprobe | Function return | `kretprobe:tcp_sendmsg` |
| tracepoint | Static kernel events | `tracepoint:sched:sched_switch` |
| uprobe | User-space function | `uprobe:/usr/bin/python:main` |
| USDT | User-defined traces | Application-specific |

**Citation:** eBPF architecture documented in [kernel documentation](https://www.kernel.org/doc/html/latest/bpf/index.html).
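
You don't need to memorize hook names; bpftrace enumerates what the running kernel exposes. A few listing commands worth knowing (`-l` lists probes, `-lv` also prints a tracepoint's arguments, i.e. the `args->` fields used throughout this post):

```bash
# List probes matching a pattern
sudo bpftrace -l 'tracepoint:syscalls:*' | head
sudo bpftrace -l 'kprobe:tcp_*' | head

# Show the arguments available on a tracepoint
sudo bpftrace -lv 'tracepoint:raw_syscalls:sys_enter'
```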


## Tool 1: bpftrace One-Liners {#bpftrace}

### Installation

```bash
# Ubuntu/Debian
sudo apt install bpftrace

# Amazon Linux 2023
sudo dnf install bpftrace
```
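
A minimal smoke test that the install can compile, load, and run a probe (it just prints from the BEGIN probe and exits):

```bash
sudo bpftrace -e 'BEGIN { printf("bpftrace is working\n"); exit(); }'
```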

### Syscall Latency Histogram

```bash
# Distribution of syscall latencies (microseconds)
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  @usecs = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
```

**Output:**

```bash
@usecs:
[0]                  12453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                   4521 |@@@@@@@@@@@@@@@@@                                   |
[2, 4)                1234 |@@@@@                                               |
[4, 8)                 342 |@                                                   |
[8, 16)                 45 |                                                    |
```

### Per-Syscall Latency

```bash
# Which syscalls are slow?
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter {
  @start[tid] = nsecs;
  @syscall[tid] = args->id;
}
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  @latency[@syscall[tid]] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
  delete(@syscall[tid]);
}'
```
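
The `@latency` map is keyed by numeric syscall ID. If the audit userland is installed, `ausyscall` translates IDs to names for your architecture (package name and availability vary by distro):

```bash
# Map syscall ID 1 to its name (write on x86_64)
ausyscall 1

# Or dump the whole ID-to-name table
ausyscall --dump | head
```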


## Tool 2: Network Tracing {#network}

### TCP Send Latency

```bash
# How long does tcp_sendmsg take in kernel?
sudo bpftrace -e '
kprobe:tcp_sendmsg { @start[tid] = nsecs; }
kretprobe:tcp_sendmsg /@start[tid]/ {
  @tcp_send_ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'
```
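
In production you usually scope this to a single process so the histogram reflects your application rather than everything on the host. A sketch using a predicate on `comm` (the process name below is a placeholder):

```bash
# Same measurement, restricted to one process by name
sudo bpftrace -e '
kprobe:tcp_sendmsg /comm == "trading-engine"/ { @start[tid] = nsecs; }
kretprobe:tcp_sendmsg /@start[tid]/ {
  @tcp_send_ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'
```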

### Packet Receive Path

```bash
# Latency from NIC to socket
sudo bpftrace -e '
tracepoint:net:netif_receive_skb { @rx_start[args->skbaddr] = nsecs; }
tracepoint:net:consume_skb /@rx_start[args->skbaddr]/ {
  @rx_time_ns = hist(nsecs - @rx_start[args->skbaddr]);
  delete(@rx_start[args->skbaddr]);
}'
```

### Softirq Time

```bash
# Network softirq processing time
sudo bpftrace -e '
tracepoint:irq:softirq_entry /args->vec == 3/ { @start[tid] = nsecs; }
tracepoint:irq:softirq_exit /args->vec == 3 && @start[tid]/ {
  @net_softirq_ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'
```

For network tuning that affects these measurements, see [Network Deep Dive](/blog/network-optimization-linux-latency).


## Tool 3: Scheduler Latency {#scheduler}

### Run Queue Latency

```bash
# How long do threads wait for CPU?
sudo bpftrace -e '
tracepoint:sched:sched_wakeup { @qtime[args->pid] = nsecs; }
tracepoint:sched:sched_switch /args->prev_state == 0/ {
  if (@qtime[args->next_pid]) {
    @runq_latency_us = hist((nsecs - @qtime[args->next_pid]) / 1000);
    delete(@qtime[args->next_pid]);
  }
}'
```

**Interpretation:**
- Spikes at 4ms = timer tick preemption (CONFIG_HZ=250)
- Long tail = CPU contention
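
To check which tick rate your kernel was actually built with (the 4ms spike only lines up with CONFIG_HZ=250), inspect the build config; the path below is the usual Debian/Ubuntu location and may differ on other distros:

```bash
# Timer tick frequency of the running kernel (250 -> 4ms tick)
grep 'CONFIG_HZ=' /boot/config-$(uname -r)
```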

### Off-CPU Time

```bash
# Why is my thread not running?
sudo bpftrace -e '
tracepoint:sched:sched_switch {
  @off_start[args->prev_pid] = nsecs;
  if (@off_start[args->next_pid]) {
    @off_cpu_us[args->next_comm] = hist((nsecs - @off_start[args->next_pid]) / 1000);
    delete(@off_start[args->next_pid]);
  }
}'
```
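
When the histogram shows a long off-CPU tail and you need to know *why*, bcc's `offcputime` aggregates wait time by stack trace. Tool name and path vary by distro (e.g. `offcputime-bpfcc` on Ubuntu); the PID below is a placeholder:

```bash
# Off-CPU time with stack traces for one process, over a 10-second window
sudo /usr/share/bcc/tools/offcputime -p 12345 10
```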

For scheduler tuning, see [CPU Deep Dive](/blog/cpu-optimization-linux-latency#isolation).


## Tool 4: Memory Allocation {#memory}

### Malloc Latency

```bash
# How long does malloc take?
sudo bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
  @start[tid] = nsecs;
  @size[tid] = arg0;
}
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /@start[tid]/ {
  @malloc_ns[@size[tid]] = hist(nsecs - @start[tid]);
  delete(@start[tid]);
  delete(@size[tid]);
}'
```

### Page Faults

```bash
# Minor and major page fault latency
sudo bpftrace -e '
kprobe:handle_mm_fault { @start[tid] = nsecs; }
kretprobe:handle_mm_fault /@start[tid]/ {
  @page_fault_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
```

For memory tuning, see [Memory Deep Dive](/blog/memory-tuning-linux-latency).


## Tool 5: Block I/O {#block-io}

### Request Latency

```bash
# Block I/O latency by device
sudo bpftrace -e '
tracepoint:block:block_rq_issue { @issue[args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@issue[args->sector]/ {
  @io_latency_us = hist((nsecs - @issue[args->sector]) / 1000);
  delete(@issue[args->sector]);
}'
```
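
If you just want this histogram without writing the one-liner, bcc's `biolatency` performs essentially the same measurement (tool name and path vary by distro):

```bash
# Block I/O latency histogram, one 10-second interval
sudo /usr/share/bcc/tools/biolatency 10 1
```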

### I/O Size Distribution

```bash
# What sizes are we reading/writing?
sudo bpftrace -e '
tracepoint:block:block_rq_issue {
  @io_size_bytes = hist(args->bytes);
}'
```

For storage tuning, see [Storage Deep Dive](/blog/storage-io-linux-latency).


## Production Deployment {#production}

### Continuous Monitoring with BCC

```python
#!/usr/bin/env python3
# syscall_latency_exporter.py
from bcc import BPF
from prometheus_client import start_http_server, Histogram
import time

bpf_program = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32);
BPF_HISTOGRAM(dist);

TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
    u32 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&pid, &ts);
    return 0;
}

TRACEPOINT_PROBE(raw_syscalls, sys_exit) {
    u32 pid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&pid);
    if (tsp) {
        u64 delta = bpf_ktime_get_ns() - *tsp;
        dist.increment(bpf_log2l(delta / 1000));  // microseconds
        start.delete(&pid);
    }
    return 0;
}
"""

SYSCALL_LATENCY = Histogram(
    'syscall_latency_microseconds',
    'Syscall latency',
    buckets=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
)

def main():
    b = BPF(text=bpf_program)
    start_http_server(9101)

    while True:
        time.sleep(10)
        for k, v in b["dist"].items():
            bucket_us = 2 ** k.value
            # one observation per event counted in this log2 bucket
            # (the bucket value stands in for each individual sample)
            for _ in range(v.value):
                SYSCALL_LATENCY.observe(bucket_us)
        b["dist"].clear()

if __name__ == "__main__":
    main()
```
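
Running it requires root (to load the BPF program) and exposes standard Prometheus metrics on the port opened in the script (9101). The filename matches the header comment above:

```bash
# Start the exporter, then verify the histogram is being served
sudo python3 syscall_latency_exporter.py &
curl -s localhost:9101/metrics | grep syscall_latency
```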

### Kubernetes DaemonSet

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-exporter
spec:
  selector:
    matchLabels:
      name: ebpf-exporter
  template:
    metadata:
      labels:
        name: ebpf-exporter
    spec:
      hostPID: true
      hostNetwork: true
      containers:
      - name: exporter
        image: ebpf-exporter:latest
        securityContext:
          privileged: true
        volumeMounts:
        - name: bpf
          mountPath: /sys/fs/bpf
        - name: debug
          mountPath: /sys/kernel/debug
      volumes:
      - name: bpf
        hostPath:
          path: /sys/fs/bpf
      - name: debug
        hostPath:
          path: /sys/kernel/debug
```
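
Deployment is a standard DaemonSet rollout; the manifest filename below is a placeholder, and the label matches the pod template above:

```bash
# Roll out to every node and confirm one pod per node is running
kubectl apply -f ebpf-exporter-daemonset.yaml
kubectl get pods -l name=ebpf-exporter -o wide
```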


## Design Philosophy {#design-philosophy}

### Measure, Don't Assume

Before optimizing:
1. **Measure current state** (where is latency?)
2. **Make change**
3. **Measure again** (did it help?)

**The pattern:** Most optimization efforts target the wrong thing. Measurement reveals the true bottleneck.
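
A minimal sketch of that loop using the syscall-latency one-liner from earlier. It assumes `timeout -s INT` so bpftrace prints its maps on exit, and that 30 seconds captures representative traffic:

```bash
# The syscall-latency program from Tool 1, reused for both captures
PROG='
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  @usecs = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'

# 1. Baseline before the change
sudo timeout -s INT 30 bpftrace -e "$PROG" > before.txt

# 2. Apply exactly one tuning change, then capture again
sudo timeout -s INT 30 bpftrace -e "$PROG" > after.txt

# 3. Compare the histograms
diff before.txt after.txt
```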

### The Observer Effect Quantified

| Tool | Overhead/Event | Events/sec Impact |
|------|----------------|-------------------|
| printf | 10-50µs | Severe |
| strace | 100-500µs | Unusable |
| perf | 0.1-1µs | 5-20% CPU |
| eBPF | 0.1-0.5µs | <1% CPU |

**The tradeoff:** eBPF requires kernel knowledge to write and interpret. Worth it for production observability.

### When to Use Each Tool

| Situation | Tool |
|-----------|------|
| Quick debugging | bpftrace one-liner |
| Production monitoring | BCC + Prometheus |
| Detailed analysis | perf + flame graphs |
| Syscall list | strace (development only) |



---

## Baseline Your Kernel Before Profiling

Profiling reveals where latency goes. But if your kernel is misconfigured, the profiler will show you symptoms of the wrong problem. Before you start measuring, check:

```bash
# CPU governor (should be 'performance' for latency work)
cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor

# Transparent huge pages (should be 'never' or 'madvise')
cat /sys/kernel/mm/transparent_hugepage/enabled

# irqbalance (should be stopped on latency-critical systems)
systemctl status irqbalance
```

## Continue Reading

Continue exploring with these related deep dives:

| Topic | Next Post |
|-------|-----------|
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |