
Observability: The Physics of Seeing

Why you can't debug what you can't see. Metrics, Logs, Traces, and the observer effect in monitoring.

Beginner · 35 min read

🎯 What You'll Learn

  • Differentiate Logs (high-cardinality events) vs Metrics (aggregatable numbers)
  • Deconstruct the 'Observer Effect' in tracing overhead
  • Design Histogram Buckets for Latency (SLO tracking)
  • Trace a Distributed Request via SpanContext
  • Analyze the cost of High Cardinality

Introduction

In a monolith, debugging is tail -f /var/log/syslog. In a distributed system, that log file doesn’t exist.

Your request hit 50 microservices. One of them failed. Which one? Without observability, you’re guessing.

Observability is not “Monitoring” (checking if the server is up). Observability is asking arbitrary questions about your system without shipping new code.


Sampling & Overhead

Measuring the system changes the system:

  • Logging: Writes to disk (I/O blocking).
  • Tracing: Adds headers and network calls (latency).
  • Sidecars: Consume CPU/Memory (resource contention).

The Solution: Sampling. You don’t trace 100% of requests. You trace 0.1% (head sampling) or keep only the “interesting” traces — errors and slow requests (tail sampling).
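The difference between the two strategies is when the decision is made. A minimal sketch (the function names and the 500ms "slow" threshold are illustrative, not from any particular tracing library):

```python
import random

def head_sample(rate=0.001):
    """Head sampling: decide before the request runs.
    Nothing is known about the request yet, so the choice is random."""
    return random.random() < rate

def tail_sample(trace):
    """Tail sampling: decide after the request completes.
    Keep only the 'interesting' traces: errors and slow requests."""
    return trace["error"] or trace["duration_ms"] > 500

# A 1.9s request is always kept by tail sampling...
assert tail_sample({"error": False, "duration_ms": 1900})
# ...but head sampling at 0.1% would almost certainly have dropped it.
```

The trade-off: tail sampling keeps the traces you actually want, but it requires buffering every trace until the request finishes, which is exactly the overhead head sampling avoids.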


The Three Pillars

1. Metrics (The Dashboard)

  • What they are: Aggregatable numbers. Count, Gauge, Histogram.
  • Advantage: Cheap. Storing “1 Million Requests” takes the same space as “1 Request”.
  • Failure mode: High Cardinality.
    • Good Label: status="200" (low cardinality — a few possible values).
    • Bad Label: user_id="847382" (high cardinality — millions of possible values).
    • Result: Your Prometheus server runs out of memory indexing millions of time series.
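The explosion is multiplicative: every unique combination of label values is a separate time series. A back-of-the-envelope sketch (the label names and counts are illustrative):

```python
def series_count(label_cardinalities):
    """Total time series = product of per-label cardinalities,
    since each unique label combination is its own series."""
    total = 1
    for n in label_cardinalities.values():
        total *= n
    return total

# Low-cardinality labels: manageable.
good = series_count({"status": 5, "method": 4, "path": 50})
# 5 * 4 * 50 = 1,000 series

# Add user_id and the same metric explodes.
bad = series_count({"status": 5, "method": 4, "path": 50, "user_id": 1_000_000})
# 1,000,000,000 series, each indexed in memory.
```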

2. Logs (The Truth)

  • What they are: High-fidelity event records.
  • Advantage: Infinite detail. “User 5 bought Item 9 at 14:32:00.123”.
  • Failure mode: Cost. Indexing 1TB of logs/day is expensive.
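That "infinite detail" in practice means structured (usually JSON) logs, where every field is queryable. A minimal sketch; the field names are illustrative:

```python
import datetime
import json

def log_event(**fields):
    """Emit one structured log line. Unlike a metric label,
    a high-cardinality field like user_id is fine here."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        **fields,
    }
    line = json.dumps(record)
    print(line)
    return line

line = log_event(level="info", event="purchase", user_id=5, item_id=9)
```

This is also where the cost comes from: every one of those fields gets indexed, and at 1TB/day the index is often bigger than the data.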

3. Traces (The Story)

  • What they are: Causal chains of spans. Each span carries its own SpanID plus the ParentSpanID of the span that called it.
  • Advantage: Finding the bottleneck. “Why did this take 2 seconds? The Redis cache miss took 1.9s.”
  • Failure mode: Broken Context. If one middleware drops the trace headers, the trace breaks.
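The context that must not be dropped is small: in the W3C Trace Context standard it is a single traceparent header, version-traceid-spanid-flags. A minimal sketch of propagation (real tracing SDKs do this for you; the helper names here are illustrative):

```python
import re
import secrets

def make_traceparent(trace_id=None):
    """Build a W3C traceparent header. The trace_id is shared by every
    span in the request; the span_id is new at each hop."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars
    return f"00-{trace_id}-{span_id}-01", trace_id

def extract_trace_id(header):
    """What downstream middleware must do: read the header and copy it
    onto outgoing requests. Drop it once and the trace breaks here."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}", header)
    return m.group(1) if m else None

header, trace_id = make_traceparent()
# A downstream service recovers the same trace_id and continues the chain.
assert extract_trace_id(header) == trace_id
```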

Code: The Histogram

The most misunderstood metric type. How do you calculate “99th Percentile Latency” across 100 servers? You can’t average per-server percentiles: a percentile is only meaningful over the full distribution, and averaging throws that distribution away. You must use Buckets.

# Prometheus conceptual implementation
class Histogram:
    def __init__(self):
        # Cumulative buckets: each counts observations <= its bound
        # (Prometheus "le" semantics). Bounds define the resolution.
        self.buckets = {
            0.1: 0,           # <= 100ms
            0.5: 0,           # <= 500ms
            1.0: 0,           # <= 1s
            float("inf"): 0   # everything
        }
        self.sum = 0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        for bound in self.buckets:
            if value <= bound:
                self.buckets[bound] += 1

# When querying:
# histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Prometheus interpolates the buckets to estimate the value.
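That interpolation step can be sketched in a few lines. This mirrors the idea behind PromQL's histogram_quantile (linear interpolation inside the bucket that contains the target rank), operating on cumulative counts like the class above; it is a conceptual sketch, not Prometheus's exact implementation:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative {le_bound: count} buckets
    by linear interpolation within the bucket containing the target rank."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound in bounds:
        count = buckets[bound]
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into the +Inf bucket
            # Assume observations are spread evenly inside this bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# 90 requests under 100ms, 8 more under 500ms, 2 more under 1s:
p99 = histogram_quantile(0.99, {0.1: 90, 0.5: 98, 1.0: 100, float("inf"): 100})
# p99 lands inside the 0.5s-1.0s bucket, interpolated to 0.75s.
```

This is why bucket bounds matter: the estimate can never be more precise than the bucket it lands in, so bounds should straddle your SLO thresholds.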

Push (Telegraf) vs Pull (Prometheus)

  • Push: Agent sends data to central server.
    • Pros: Good for short-lived jobs (Lambda), bypasses firewalls.
    • Cons: Can overload the server if many agents push at once.
  • Pull: Server scrapes targets.
    • Pros: Server controls the load. Explicit inventory.
    • Cons: Need to discover targets (Service Discovery).

Pull (Prometheus) is the standard for Kubernetes workloads.
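The pull model is simple to sketch: the service just exposes its current counters at /metrics in the Prometheus text exposition format, and the server scrapes on its own schedule. A minimal stdlib-only sketch (the metric name and handler are illustrative; real services use a client library):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # incremented by application code elsewhere

def render_metrics():
    """Render current state in the Prometheus text exposition format.
    Nothing is pushed; the server controls when this is read."""
    return (
        "# HELP http_requests_total Total HTTP requests.\n"
        "# TYPE http_requests_total counter\n"
        f"http_requests_total {REQUEST_COUNT}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # keep the sketch quiet
        pass

# Scrape it once, the way Prometheus would:
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
scraped = urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics").read().decode()
server.shutdown()
```

Note the inversion: the service holds only its current state, and the scraper decides the sampling interval. That is what lets the server throttle its own load.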


Practice Exercises

Exercise 1: Cardinality Explosion (Beginner)

Scenario: You add a label client_ip to your http_requests_total metric. Task: Why does your Prometheus memory usage jump dramatically? (How many unique IPs hit your site?)

Exercise 2: Sampling Rates (Intermediate)

Scenario: 10,000 Req/Sec. Tracing adds 1ms overhead per request. Task: If you sample 100% of requests, how much total CPU time is wasted per second on tracing? What is a reasonable sampling rate?

Exercise 3: SLO Calculation (Advanced)

Scenario: SLO: “99% of requests < 200ms”. Task: Write a PromQL query using http_request_duration_seconds_bucket to check if this SLO is being met.


Knowledge Check

  1. Why can’t you calculate P99 from avg_latency metrics?
  2. What is a “Span Context”?
  3. Why is logging “User ID” in a Metric label dangerous?
  4. How does Tail Sampling differ from Head Sampling?
  5. Which pillar is best for debugging a specific customer complaint?
Answers
  1. Math. Averages hide outliers. P99 requires distribution data (Histograms).
  2. Metadata. TraceID and ParentSpanID passed in HTTP Headers to correlate services.
  3. High Cardinality. It creates millions of time series, crashing the database.
  4. Decision Timing. Head decides before the request starts (random). Tail decides after it ends (keep only errors/slow ones).
  5. Logs. (Or Traces if sampled). Metrics won’t show you the specific user’s error message.

Summary

  • You get what you pay for. Logs are expensive but detailed. Metrics are cheap but vague.
  • Cardinality kills. Watch your labels.
  • Context is King. A trace without context is just noise.
