Lab 20 · Observability: Metrics, Logging & SLOs

Run it: make lab-20
Source: labs/lab-20-observability/main.go

The Problem

A CDN you cannot observe is a CDN you cannot operate. Without metrics:

You don’t know your cache hit ratio is degrading
You don’t know latency spiked at 3 AM while you slept
You can’t tell if a deploy improved or degraded performance
You can’t commit to SLAs because you can’t measure whether you are meeting your SLOs

Production CDN observability has three pillars:

Metrics: numeric time-series data (Prometheus)
Structured logs: machine-parseable event records (slog)
Traces: distributed request tracking (OpenTelemetry — not in this lab)

Prometheus: The Metrics System

Prometheus uses a pull model: the metrics server scrapes your application’s /metrics endpoint at regular intervals (typically 15–60s). Your application doesn’t push; it exposes a snapshot of current state.
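
To make the pull model concrete, here is a minimal sketch that hand-rolls a /metrics handler with only the standard library. It is not how the lab or client_golang implements the endpoint; the metric name demo_requests_total and the helpers metricsHandler and scrape are hypothetical. The point is that the application only renders its current state when asked, and the scraper decides when to ask.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requests is a hypothetical in-process counter. The application only
// increments it; it never sends the value anywhere on its own.
var requests atomic.Int64

// metricsHandler renders a snapshot of current state in the Prometheus
// text exposition format each time it is called.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintln(w, "# HELP demo_requests_total Total requests handled.")
	fmt.Fprintln(w, "# TYPE demo_requests_total counter")
	fmt.Fprintf(w, "demo_requests_total %d\n", requests.Load())
}

// scrape stands in for the Prometheus server: it pulls one snapshot
// from the handler and returns the response body.
func scrape() string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	http.HandlerFunc(metricsHandler).ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	requests.Add(3) // three requests were served between scrapes
	fmt.Print(scrape())
}
```

Each scrape sees whatever the counter holds at that moment, so the resolution of the resulting time series is set by the scrape interval, not by the application.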

Metric Types

Counter: a monotonically increasing value. It never decreases, except for a reset to zero when the process restarts.

var requestsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "cdn_requests_total",
        Help: "Total number of requests served",
    },
    []string{"method", "status", "cache"},
)

// Increment on each request
requestsTotal.WithLabelValues("GET", "200", "hit").Inc()

Use counters for: request count, bytes transferred, error count, cache hits.

Gauge: a value that can go up or down. It represents current state.

var cacheSize = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "cdn_cache_size_bytes",
    Help: "Current cache size in bytes",
})

// Set on cache eviction/addition
cacheSize.Set(float64(currentSize))

Use gauges for: active connections, cache size, queue depth, goroutine count.

Histogram: counts observations into configurable buckets. Percentiles are not stored; they are estimated from the bucket counts at query time with histogram_quantile.

var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "cdn_request_duration_seconds",
        Help:    "Request duration distribution",
        Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5},
    },
    []string{"cache"},  // "hit" or "miss"
)

// Record each request's duration
start := time.Now()
// ... serve request ...
requestDuration.WithLabelValues(cacheStatus).Observe(time.Since(start).Seconds())

Use histograms for: request latency, response size, queue wait time.
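
The bucket semantics are easier to see in a small sketch. The type below is a simplified stand-in, not the client_golang implementation: it keeps cumulative counts directly, which is how buckets appear on /metrics (each le bucket counts every observation less than or equal to its bound). The names histogram, newHistogram, and observe are hypothetical.

```go
package main

import "fmt"

// histogram is a simplified stand-in for a Prometheus histogram.
type histogram struct {
	upperBounds []float64 // sorted "le" bounds; the +Inf bucket is implicit
	counts      []uint64  // cumulative: counts[i] = observations <= upperBounds[i]
	count       uint64    // total observations, i.e. the +Inf bucket
	sum         float64   // sum of all observed values
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{upperBounds: bounds, counts: make([]uint64, len(bounds))}
}

// observe increments every bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.counts[i]++
		}
	}
	h.count++
	h.sum += v
}

func main() {
	h := newHistogram([]float64{0.005, 0.05, 0.5})
	for _, v := range []float64{0.002, 0.004, 0.03, 0.2, 1.7} {
		h.observe(v)
	}
	for i, ub := range h.upperBounds {
		fmt.Printf("le=%g\t%d\n", ub, h.counts[i])
	}
	fmt.Printf("le=+Inf\t%d\n", h.count)
}
```

Because only bucket counts survive, any percentile computed later can be no more precise than the bucket boundaries, which is why the boundaries should bracket the latencies you care about.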

The Cardinality Trap

Cardinality = the number of unique combinations of label values. High cardinality is Prometheus’s kryptonite.

// WRONG — user_id can have millions of values!
requestsTotal.WithLabelValues("GET", "200", userId).Inc()
// → Millions of time series → Prometheus OOM → pager at 3 AM

Safe labels:

HTTP method: 5 values (GET, POST, PUT, DELETE, HEAD)
HTTP status code category: 5 values (1xx–5xx) or discrete codes (~30 values)
Cache status: 3 values (hit, miss, bypass)
Region: 10–20 values (US, EU, APAC, …)
Host: only if you have a bounded number of hosts

Never use as labels:

User IDs, session IDs, account IDs
Full URL paths with IDs embedded (/user/123/profile)
IP addresses
Trace IDs, request IDs (use logs for per-request data)
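
If a path-like label is genuinely needed, one common workaround is to collapse embedded IDs into a placeholder before the value is used as a label. The sketch below is an illustration under assumptions, not part of the lab: normalizePath and the ":id" placeholder are hypothetical, and it only handles all-digit segments.

```go
package main

import (
	"fmt"
	"strings"
)

// isDigits reports whether s is non-empty and consists only of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// normalizePath replaces every all-digit path segment with ":id" so that
// /user/123/profile and /user/456/profile map to the same label value.
func normalizePath(p string) string {
	segs := strings.Split(p, "/")
	for i, s := range segs {
		if isDigits(s) {
			segs[i] = ":id"
		}
	}
	return strings.Join(segs, "/")
}

func main() {
	fmt.Println(normalizePath("/user/123/profile"))
	fmt.Println(normalizePath("/image/hero.jpg"))
}
```

This only bounds cardinality if the set of route shapes is itself bounded; arbitrary user-supplied paths would still need a fixed allowlist or a catch-all value.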

Rule of thumb: any label that can have more than ~1000 distinct values in production will cause cardinality explosion.
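
The arithmetic behind the explosion is a simple product. As a rough sketch (seriesCount is a hypothetical helper, and the result is an upper bound because not every combination occurs in practice):

```go
package main

import "fmt"

// seriesCount returns the worst-case number of time series for one metric:
// the product of the number of distinct values of each label.
func seriesCount(valuesPerLabel []int) int {
	total := 1
	for _, n := range valuesPerLabel {
		total *= n
	}
	return total
}

func main() {
	// method (5) x status (30) x cache (3)
	fmt.Println(seriesCount([]int{5, 30, 3}))
	// method (5) x status (30) x user ID (1,000,000)
	fmt.Println(seriesCount([]int{5, 30, 1_000_000}))
}
```

With the safe labels the bound is 450 series; swapping the cache label for a million user IDs raises it to 150 million, and a histogram multiplies that again by its number of buckets.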

CDN Metrics Catalog

The lab implements these metrics:

// === Request counters ===
cdn_requests_total{method, status, cache}    // "cache" ∈ {hit, miss, bypass}
cdn_bytes_served_total{cache}                // bytes served, labeled by cache status

// === Latency ===
cdn_request_duration_seconds{cache}          // histogram, per cache status

// === Cache state ===
cdn_cache_entries                            // gauge: count of items in cache
cdn_cache_size_bytes                         // gauge: bytes used

// === Origin ===
cdn_origin_requests_total{status}            // requests forwarded to origin
cdn_origin_duration_seconds                  // histogram: origin TTFB

// === Compression ===
cdn_compression_ratio                        // histogram: compressed/uncompressed

Structured Access Logging with slog

Go 1.21 introduced log/slog, a structured logging package. Every request should produce a structured JSON log line:

logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

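// Per-request log (inside middleware). start, status, bytesWritten, and
// cacheStatus are assumed to be captured by the middleware around the handler.
logger.Info("request",
    "method",    r.Method,
    "path",      r.URL.Path,
    "status",    status,
    "bytes",     bytesWritten,
    "duration",  time.Since(start).Milliseconds(), // milliseconds
    "cache",     cacheStatus,
    "ip",        r.RemoteAddr, // "host:port" of the connecting peer
    "ua",        r.Header.Get("User-Agent"),
    "referer",   r.Header.Get("Referer"), // empty string if the header is absent
)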

Output (pretty-printed here for readability; the JSON handler writes each record on a single line):

{
  "time": "2025-01-15T14:23:01Z",
  "level": "INFO",
  "msg": "request",
  "method": "GET",
  "path": "/image/hero.jpg",
  "status": 200,
  "bytes": 102400,
  "duration": 3,
  "cache": "hit",
  "ip": "1.2.3.4:54321",
  "ua": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
  "referer": ""
}

Structured logs enable direct processing in log aggregators (Loki, Splunk, Elasticsearch) without parsing regex patterns.
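
A handler cannot read the status code or byte count back from http.ResponseWriter, so access-log middleware usually wraps the writer. The following is a minimal sketch of that pattern, not the lab's actual middleware: statusRecorder, accessLog, and demo are hypothetical names, the duration_ms key is chosen here for clarity, and the cache field is omitted because it depends on the cache layer.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusRecorder wraps a ResponseWriter to capture status and bytes written.
type statusRecorder struct {
	http.ResponseWriter
	status int
	bytes  int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func (r *statusRecorder) Write(b []byte) (int, error) {
	n, err := r.ResponseWriter.Write(b)
	r.bytes += n
	return n, err
}

// accessLog emits one structured log record per request.
func accessLog(logger *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Default to 200: handlers that never call WriteHeader send 200.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		logger.Info("request",
			"method", r.Method,
			"path", r.URL.Path,
			"status", rec.status,
			"bytes", rec.bytes,
			"duration_ms", time.Since(start).Milliseconds(),
		)
	})
}

// demo sends one request through the middleware and returns the decoded
// fields of the resulting JSON log line.
func demo() (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	h := accessLog(logger, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusNotFound)
		_, _ = w.Write([]byte("nope"))
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/item/7", nil))

	var fields map[string]any
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["method"], fields["path"], fields["status"], fields["bytes"])
}
```

Note that a wrapper like this hides optional interfaces such as http.Flusher on the underlying writer; production middleware typically forwards them.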

Key PromQL Recipes

Cache Hit Ratio

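# Hit ratio over the last 5 minutes.
# sum() is required: without it, the division only matches series with
# identical label sets, so every hit series is divided by itself.
sum(rate(cdn_requests_total{cache="hit"}[5m]))
/
sum(rate(cdn_requests_total[5m]))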

Target: > 0.90 (90% hit ratio). Below 0.80 indicates a caching problem.

Byte Hit Ratio

# Bytes served from cache vs. total bytes served
sum(rate(cdn_bytes_served_total{cache="hit"}[5m]))
/
sum(rate(cdn_bytes_served_total[5m]))

Byte hit ratio is more meaningful than request hit ratio for billing purposes (CDN vendors charge for bytes to/from origin).

p99 Latency

# 99th percentile request latency
histogram_quantile(0.99,
  sum(rate(cdn_request_duration_seconds_bucket[5m])) by (le)
)

p99 Latency by Cache Status

# Compare hit vs. miss latency
histogram_quantile(0.99,
  sum(rate(cdn_request_duration_seconds_bucket[5m])) by (le, cache)
)

Expect cache hits to be 5–100x faster than misses.

Error Rate (5xx)

# Fraction of responses that are 5xx (multiply by 100 for a percentage)
sum(rate(cdn_requests_total{status=~"5.."}[5m]))
/
sum(rate(cdn_requests_total[5m]))

Origin Request Rate

# Origin requests per second across all statuses (should be low relative to total)
sum(rate(cdn_origin_requests_total[5m]))

Requests Per Second

sum(rate(cdn_requests_total[1m]))

SLOs and Error Budgets

An SLO (Service Level Objective) defines the target reliability:

SLO: 99.9% of requests return a successful response (2xx/3xx),
     and p99 latency stays under 500ms, measured over 30 days

An error budget is the amount of failure the SLO permits, here 0.1% of requests. Expressed as the equivalent full-outage time over the window:

30-day error budget = 30 * 24 * 60 * 60 * (1 - 0.999) = 2592 seconds = 43.2 minutes

If your error budget is consumed, you stop feature deployments and focus on reliability until the budget refills.
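
The budget arithmetic can be expressed two ways, as time or as a number of failed requests. A small sketch, with hypothetical helper names and an assumed traffic volume of 10 million requests for the request-based view:

```go
package main

import (
	"fmt"
	"math"
)

// errorBudgetSeconds returns the budget as equivalent full-outage seconds
// over a window of windowDays for the given SLO target (e.g. 0.999).
func errorBudgetSeconds(windowDays int, slo float64) float64 {
	windowSeconds := float64(windowDays) * 24 * 60 * 60
	return windowSeconds * (1 - slo)
}

// allowedFailures returns how many requests may fail out of totalRequests
// before the SLO is violated.
func allowedFailures(totalRequests int64, slo float64) int64 {
	return int64(math.Round(float64(totalRequests) * (1 - slo)))
}

func main() {
	secs := errorBudgetSeconds(30, 0.999)
	fmt.Printf("time budget: %.0f s (%.1f min)\n", secs, secs/60)
	fmt.Printf("request budget: %d of 10000000\n", allowedFailures(10_000_000, 0.999))
}
```

For a request-based SLO like the one above, the request count is the more direct measure: a brief outage at peak traffic burns more budget than a longer one at 3 AM.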

SLO Burn Rate

The burn rate measures how fast you’re consuming the error budget:

# 1-hour burn rate (how fast are we consuming monthly budget?)
(
  sum(rate(cdn_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(cdn_requests_total[1h]))
)
/ (1 - 0.999)  # error budget fraction

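A burn rate of 1.0 means the budget is being consumed at exactly the sustainable rate and runs out at the end of the 30-day window. A burn rate of 14.4 consumes 2% of the monthly budget in one hour and exhausts it in about 50 hours, so page immediately. The Google SRE Workbook recommends multi-window alerting: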

Fast burn (1h + 5m windows): alert for rapid consumption
Slow burn (3d + 6h windows): alert for gradual degradation
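
The relationship between burn rate, time to exhaustion, and the two-window check can be sketched as follows. The helper names are hypothetical; the 14.4 threshold for the 1h + 5m pair is the one discussed above, and requiring both windows to exceed it is what keeps an already-recovered incident from continuing to page.

```go
package main

import "fmt"

// burnRate is the observed error ratio divided by the budget fraction.
func burnRate(errorRatio, slo float64) float64 {
	return errorRatio / (1 - slo)
}

// hoursToExhaustion is how long a full budget lasts at a constant burn rate.
func hoursToExhaustion(windowDays int, rate float64) float64 {
	return float64(windowDays) * 24 / rate
}

// budgetConsumed is the fraction of the budget spent after running at
// the given burn rate for the given number of hours.
func budgetConsumed(rate, hours float64, windowDays int) float64 {
	return rate * hours / (float64(windowDays) * 24)
}

// shouldPage fires only when both the long and the short window exceed
// the threshold: the long window shows the burn is significant, the short
// window shows it is still happening.
func shouldPage(longWindowRate, shortWindowRate, threshold float64) bool {
	return longWindowRate > threshold && shortWindowRate > threshold
}

func main() {
	rate := burnRate(0.0144, 0.999) // 1.44% of requests failing
	fmt.Printf("burn rate: %.1f\n", rate)
	fmt.Printf("budget gone in: %.0f h\n", hoursToExhaustion(30, rate))
	fmt.Printf("consumed in 1h: %.0f%%\n", 100*budgetConsumed(rate, 1, 30))
	fmt.Println("page:", shouldPage(20, 16, 14.4))
}
```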

Grafana Dashboard

The lab exposes:

/metrics — Prometheus metrics endpoint
/metrics/cache — JSON cache diagnostics

Point Grafana at Prometheus and import a CDN dashboard. The docker-compose.yml in Lab 21 wires up the full stack (Prometheus + Grafana).

Try It

make lab-20

# Send some traffic to generate metrics
for i in $(seq 1 100); do
  curl -s http://localhost:8080/item/$((RANDOM % 20)) -o /dev/null
done

# View Prometheus metrics
curl -s http://localhost:8080/metrics | grep cdn_

# View cache diagnostics
curl -s http://localhost:8080/metrics/cache | python3 -m json.tool

# Compute hit ratio manually from raw counters
HITS=$(curl -s http://localhost:8080/metrics | grep 'cdn_requests_total{.*cache="hit"' | awk '{sum+=$2} END{print sum}')
TOTAL=$(curl -s http://localhost:8080/metrics | grep 'cdn_requests_total' | grep -v '^#' | awk '{sum+=$2} END{print sum}')
echo "Hit ratio: $(echo "scale=3; $HITS/$TOTAL" | bc)"
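
The same calculation can be done in Go by parsing the exposition text. This is a rough sketch rather than a full parser: hitRatio is a hypothetical helper, it assumes sample lines carry no trailing timestamp and no spaces inside label values, and the sample input in main is made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// hitRatio sums every cdn_requests_total sample in the exposition text and
// returns hits/total. The bool is false when there is no traffic to divide by.
func hitRatio(exposition string) (float64, bool) {
	var hits, total float64
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := sc.Text()
		// Skips # HELP / # TYPE comments and all other metrics.
		if !strings.HasPrefix(line, "cdn_requests_total{") {
			continue
		}
		fields := strings.Fields(line)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		total += v
		if strings.Contains(line, `cache="hit"`) {
			hits += v
		}
	}
	if total == 0 {
		return 0, false
	}
	return hits / total, true
}

func main() {
	sample := `# TYPE cdn_requests_total counter
cdn_requests_total{cache="hit",method="GET",status="200"} 80
cdn_requests_total{cache="miss",method="GET",status="200"} 20
`
	if ratio, ok := hitRatio(sample); ok {
		fmt.Printf("Hit ratio: %.3f\n", ratio)
	}
}
```

Unlike the PromQL recipe, this is a lifetime ratio computed from raw counter values, so it includes everything since the process started rather than the last 5 minutes.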

The Hitchhiker's Guide to CDNs