Lab 20 · Observability: Metrics, Logging & SLOs
Run it:
make lab-20
Source: labs/lab-20-observability/main.go
The Problem
A CDN you cannot observe is a CDN you cannot operate. Without metrics:
- You don’t know your cache hit ratio is degrading
- You don’t know latency spiked at 3 AM while you slept
- You can’t tell if a deploy improved or degraded performance
- You can’t commit to SLAs because you can’t measure whether you are meeting your SLOs
Production CDN observability has three pillars:
- Metrics: numeric time-series data (Prometheus)
- Structured logs: machine-parseable event records (slog)
- Traces: distributed request tracking (OpenTelemetry — not in this lab)
Prometheus: The Metrics System
Prometheus uses a pull model: the metrics server scrapes your
application’s /metrics endpoint at regular intervals (typically 15–60s).
Your application doesn’t push; it exposes a snapshot of current state.
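The pull model can be illustrated without any client library. The sketch below is a hand-rolled stand-in, not the lab's code: `requests`, `metricsHandler`, and `scrape` are hypothetical names, and a real service would normally let the Prometheus Go client render the exposition format. It only shows that the application holds state in memory and that the value is read when a scrape arrives.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requests is a hypothetical in-process counter; it only ever goes up.
var requests atomic.Int64

// metricsHandler writes the current counter value in the Prometheus text
// exposition format. Nothing is pushed: the value is read at scrape time.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintf(w, "# TYPE cdn_requests_total counter\n")
	fmt.Fprintf(w, "cdn_requests_total %d\n", requests.Load())
}

// scrape plays the role of the Prometheus server: it performs one GET
// against the handler and returns the body it would ingest.
func scrape() string {
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	body, _ := io.ReadAll(rec.Result().Body)
	return string(body)
}

func main() {
	requests.Add(3) // three requests served since startup
	fmt.Print(scrape())
}
```

Between scrapes the application does no metrics I/O at all, which is why the scrape interval bounds the resolution of every graph built on these series.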
Metric Types
Counter — monotonically increasing. Never decreases.
var requestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "cdn_requests_total",
Help: "Total number of requests served",
},
[]string{"method", "status", "cache"},
)
// Increment on each request
requestsTotal.WithLabelValues("GET", "200", "hit").Inc()
Use counters for: request count, bytes transferred, error count, cache hits.
Gauge — can go up or down. Represents current state.
var cacheSize = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "cdn_cache_size_bytes",
Help: "Current cache size in bytes",
})
// Set on cache eviction/addition
cacheSize.Set(float64(currentSize))
Use gauges for: active connections, cache size, queue depth, goroutine count.
Histogram — samples observations into buckets. Calculates percentiles.
var requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "cdn_request_duration_seconds",
Help: "Request duration distribution",
Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5},
},
[]string{"cache"}, // "hit" or "miss"
)
// Record each request's duration
start := time.Now()
// ... serve request ...
requestDuration.WithLabelValues(cacheStatus).Observe(time.Since(start).Seconds())
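Use histograms for: request latency, response size, queue wait time. The histogram itself only stores bucket counts; percentiles are estimated from those counts at query time with histogram_quantile, so place bucket boundaries around the latencies you care about.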
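To make the bucket behaviour concrete, here is a minimal sketch of a cumulative histogram. The `histogram` type and its methods are simplified, hypothetical stand-ins rather than the client library's implementation; the point is that each observation increments every bucket whose upper bound is at or above the value, which is how the exposed `_bucket` series behave.

```go
package main

import "fmt"

// histogram is a simplified stand-in for a Prometheus histogram.
type histogram struct {
	bounds []float64 // finite upper bounds, ascending (the "le" values)
	counts []uint64  // counts[i] = observations <= bounds[i]; last slot is +Inf
	sum    float64   // running sum of all observed values
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe increments every bucket whose upper bound is >= v, so the
// counts are cumulative.
func (h *histogram) observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.counts[len(h.bounds)]++ // the +Inf bucket counts every observation
	h.sum += v
}

func main() {
	h := newHistogram([]float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5})
	for _, d := range []float64{0.0008, 0.003, 0.004, 0.12, 3.0} {
		h.observe(d)
	}
	for i, b := range h.bounds {
		fmt.Printf("le=%g count=%d\n", b, h.counts[i])
	}
	fmt.Printf("le=+Inf count=%d sum=%g\n", h.counts[len(h.bounds)], h.sum)
}
```

An observation above the largest finite bound (3.0 here) lands only in the +Inf bucket, so the query side cannot tell how far above 2.5 s it was.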
The Cardinality Trap
Cardinality = the number of unique combinations of label values. High cardinality is Prometheus’s kryptonite.
// WRONG — user_id can have millions of values!
requestsTotal.WithLabelValues("GET", "200", userId).Inc()
// → Millions of time series → Prometheus OOM → pager at 3 AM
Safe labels:
- HTTP method: 5 values (GET, POST, PUT, DELETE, HEAD)
- HTTP status code category: 5 values (1xx–5xx) or discrete codes (~30 values)
- Cache status: 3 values (hit, miss, bypass)
- Region: 10–20 values (US, EU, APAC, …)
- Host: only if you have a bounded number of hosts
Never use as labels:
- User IDs, session IDs, account IDs
- Full URL paths with IDs embedded (/user/123/profile)
- IP addresses
- Trace IDs, request IDs (use logs for per-request data)
Rule of thumb: any label that can have more than ~1000 distinct values in production will cause cardinality explosion.
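The arithmetic behind the trap is a simple product, and one common mitigation is to normalize values before using them as labels. The sketch below assumes two hypothetical helpers, `seriesCount` and `normalizePath`; the `:id` placeholder is an illustrative convention, not something the lab defines.

```go
package main

import (
	"fmt"
	"strings"
)

// seriesCount returns the worst-case number of time series for one metric:
// the product of the number of distinct values of each label.
func seriesCount(valuesPerLabel ...int) int {
	n := 1
	for _, v := range valuesPerLabel {
		n *= v
	}
	return n
}

// normalizePath replaces purely numeric path segments with ":id" so that
// /user/123/profile and /user/456/profile share a single label value.
func normalizePath(p string) string {
	segs := strings.Split(p, "/")
	for i, s := range segs {
		if s != "" && strings.Trim(s, "0123456789") == "" {
			segs[i] = ":id"
		}
	}
	return strings.Join(segs, "/")
}

func main() {
	fmt.Println(seriesCount(5, 30, 3))         // method x status x cache
	fmt.Println(seriesCount(5, 30, 1_000_000)) // third label replaced by a user ID
	fmt.Println(normalizePath("/user/123/profile"))
}
```

Even a normalized path is only a safe label when the set of routes is itself bounded; for anything per-request, the access log is the better home.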
CDN Metrics Catalog
The lab implements these metrics:
// === Request counters ===
cdn_requests_total{method, status, cache} // "cache" ∈ {hit, miss, bypass}
cdn_bytes_served_total{cache} // bytes served, labelled by cache status only
// === Latency ===
cdn_request_duration_seconds{cache} // histogram, per cache status
// === Cache state ===
cdn_cache_entries // gauge: count of items in cache
cdn_cache_size_bytes // gauge: bytes used
// === Origin ===
cdn_origin_requests_total{status} // requests forwarded to origin
cdn_origin_duration_seconds // histogram: origin TTFB
// === Compression ===
cdn_compression_ratio // histogram: compressed/uncompressed
Structured Access Logging with slog
Go 1.21 introduced log/slog, a structured logging package. Every
request should produce a structured JSON log line:
logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
// Per-request log (inside middleware)
logger.Info("request",
"method", r.Method,
"path", r.URL.Path,
"status", status,
"bytes", bytesWritten,
"duration", time.Since(start).Milliseconds(),
"cache", cacheStatus,
"ip", r.RemoteAddr,
"ua", r.Header.Get("User-Agent"),
"referer", r.Header.Get("Referer"),
)
Output (pretty-printed here for readability; the JSON handler writes each record on a single line):
{
"time": "2025-01-15T14:23:01Z",
"level": "INFO",
"msg": "request",
"method": "GET",
"path": "/image/hero.jpg",
"status": 200,
"bytes": 102400,
"duration": 3,
"cache": "hit",
"ip": "1.2.3.4:54321",
"ua": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
}
Structured logs can be ingested directly by log aggregators (Loki, Splunk, Elasticsearch) without regex-based parsing.
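The logging call above relies on a status code and byte count that `net/http` does not hand to middleware directly. One common way to obtain them is to wrap the `http.ResponseWriter`. The following is a sketch under assumptions: `statusRecorder`, `accessLog`, and `logOneRequest` are hypothetical names, and reading the cache status from an `X-Cache` response header is an assumed convention rather than the lab's confirmed behaviour.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusRecorder wraps http.ResponseWriter to capture the status code and
// the number of body bytes written.
type statusRecorder struct {
	http.ResponseWriter
	status int
	bytes  int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func (r *statusRecorder) Write(b []byte) (int, error) {
	n, err := r.ResponseWriter.Write(b)
	r.bytes += n
	return n, err
}

// accessLog emits one structured record per request after the handler returns.
func accessLog(logger *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Default to 200: handlers that never call WriteHeader send 200 implicitly.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		logger.Info("request",
			"method", r.Method,
			"path", r.URL.Path,
			"status", rec.status,
			"bytes", rec.bytes,
			"duration", time.Since(start).Milliseconds(),
			"cache", rec.Header().Get("X-Cache"), // assumed header set by the cache layer
		)
	})
}

// logOneRequest runs a single request through the middleware and returns
// the decoded JSON log record.
func logOneRequest() map[string]any {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	h := accessLog(logger, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Cache", "hit")
		w.Write([]byte("hello"))
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/image/hero.jpg", nil))
	var rec map[string]any
	json.Unmarshal(buf.Bytes(), &rec)
	return rec
}

func main() {
	rec := logOneRequest()
	fmt.Println(rec["method"], rec["path"], rec["status"], rec["bytes"], rec["cache"])
}
```

Logging after `next.ServeHTTP` returns is what makes the status, byte count, and duration available in a single record.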
Key PromQL Recipes
Cache Hit Ratio
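# Hit ratio over the last 5 minutes.
# sum() collapses the method/status/cache labels so both sides of the
# division match; without it only the cache="hit" series pair up and the
# result is always 1.
sum(rate(cdn_requests_total{cache="hit"}[5m]))
/
sum(rate(cdn_requests_total[5m]))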
Target: > 0.90 (90% hit ratio). Below 0.80 indicates a caching problem.
Byte Hit Ratio
# Bytes served from cache vs. total bytes served
sum(rate(cdn_bytes_served_total{cache="hit"}[5m]))
/
sum(rate(cdn_bytes_served_total[5m]))
Byte hit ratio is more meaningful than request hit ratio for billing purposes (CDN vendors charge for bytes to/from origin).
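A small worked example shows how far the two ratios can diverge. The sketch below uses a hypothetical `sample` type and `ratioOverWindow` helper, with made-up traffic numbers; it computes the ratio from counter increases, which is what dividing two rates over the same window amounts to.

```go
package main

import "fmt"

// sample is a hypothetical pair of counter readings taken at one scrape.
type sample struct {
	hit, total float64
}

// ratioOverWindow divides the increase of the hit counter by the increase
// of the total counter between two scrapes.
func ratioOverWindow(start, end sample) float64 {
	dTotal := end.total - start.total
	if dTotal <= 0 {
		return 0 // no traffic in the window (or a counter reset)
	}
	return (end.hit - start.hit) / dTotal
}

func main() {
	// Made-up window: 1,000 requests, 950 hits and 50 misses.
	reqs := ratioOverWindow(sample{0, 0}, sample{950, 1000})

	// Assume hits average 10 KB and misses average 2 MB.
	hitBytes := 950 * 10_000.0
	missBytes := 50 * 2_000_000.0
	byteRatio := ratioOverWindow(sample{0, 0}, sample{hitBytes, hitBytes + missBytes})

	fmt.Printf("request hit ratio: %.3f\n", reqs)
	fmt.Printf("byte hit ratio:    %.3f\n", byteRatio)
}
```

With these numbers the request hit ratio is 0.95 while the byte hit ratio is below 0.09, because the few misses are large objects.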
p99 Latency
# 99th percentile request latency
histogram_quantile(0.99,
sum(rate(cdn_request_duration_seconds_bucket[5m])) by (le)
)
p99 Latency by Cache Status
# Compare hit vs. miss latency
histogram_quantile(0.99,
sum(rate(cdn_request_duration_seconds_bucket[5m])) by (le, cache)
)
Expect cache hits to be 5–100x faster than misses.
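It helps to know roughly how a quantile is derived from bucket counts, because the result is an estimate. The sketch below is a simplified approximation of the approach (linear interpolation inside the bucket that contains the target rank), not Prometheus's actual source; `quantile` and its parameters are hypothetical names.

```go
package main

import "fmt"

// quantile estimates the q-quantile from cumulative bucket counts using
// linear interpolation inside the bucket that contains the target rank.
// bounds holds the finite upper bounds in ascending order; counts has one
// extra trailing element for the +Inf bucket.
func quantile(q float64, bounds, counts []float64) float64 {
	total := counts[len(counts)-1]
	if total == 0 {
		return 0
	}
	rank := q * total
	for i, upper := range bounds {
		if counts[i] >= rank {
			lower, below := 0.0, 0.0
			if i > 0 {
				lower, below = bounds[i-1], counts[i-1]
			}
			inBucket := counts[i] - below
			if inBucket == 0 {
				return lower // only reachable when rank is 0
			}
			return lower + (upper-lower)*(rank-below)/inBucket
		}
	}
	// The rank falls in the +Inf bucket: report the largest finite bound.
	return bounds[len(bounds)-1]
}

func main() {
	bounds := []float64{0.1, 0.5, 1}
	counts := []float64{90, 98, 100, 100} // cumulative, last is +Inf
	fmt.Printf("p50 ~ %.4f s\n", quantile(0.50, bounds, counts))
	fmt.Printf("p99 ~ %.4f s\n", quantile(0.99, bounds, counts))
}
```

Here the p99 estimate lands at the midpoint of the 0.5–1 s bucket even if both slow requests actually took 0.51 s, which is why coarse buckets near the SLO threshold make percentile graphs misleading.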
Error Rate (5xx)
# Fraction of responses that are 5xx (multiply by 100 for a percentage)
sum(rate(cdn_requests_total{status=~"5.."}[5m]))
/
sum(rate(cdn_requests_total[5m]))
Origin Request Rate
# Origin requests per second across all statuses (should be low relative to total)
sum(rate(cdn_origin_requests_total[5m]))
Requests Per Second
sum(rate(cdn_requests_total[1m]))
SLOs and Error Budgets
An SLO (Service Level Objective) defines the target reliability:
SLO: 99.9% of requests return a successful response (2xx/3xx)
and p99 latency stays under 500 ms, measured over 30 days
An error budget is the allowed amount of failure. Expressed as time, a 99.9% target over 30 days allows:
30-day error budget = 30 * 24 * 60 * 60 * (1 - 0.999) = 2592 seconds = 43.2 minutes
If your error budget is consumed, you stop feature deployments and focus on reliability until the budget refills.
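The budget arithmetic can be expressed both in time and in requests. The helpers below (`budgetSeconds`, `budgetRequests`, `remaining`) are hypothetical names for illustration, and the traffic figures in `main` are invented.

```go
package main

import "fmt"

// budgetSeconds returns the time-based error budget: the seconds of full
// outage the SLO tolerates over the window.
func budgetSeconds(slo float64, windowDays int) float64 {
	return float64(windowDays) * 24 * 60 * 60 * (1 - slo)
}

// budgetRequests returns the request-based budget: how many requests out
// of total may fail while still meeting the SLO.
func budgetRequests(slo float64, total int64) float64 {
	return float64(total) * (1 - slo)
}

// remaining returns the fraction of the request budget still unspent.
// A negative result means the budget is overspent.
func remaining(slo float64, total, failed int64) float64 {
	b := budgetRequests(slo, total)
	if b == 0 {
		return 0
	}
	return 1 - float64(failed)/b
}

func main() {
	fmt.Printf("time budget:    %.1f s\n", budgetSeconds(0.999, 30))
	fmt.Printf("request budget: %.0f failures per 1M requests\n", budgetRequests(0.999, 1_000_000))
	fmt.Printf("remaining:      %.2f\n", remaining(0.999, 1_000_000, 250))
}
```

The request-based form is usually the one to alert on for a CDN, since a brief outage at peak traffic costs far more budget than the same outage at 3 AM.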
SLO Burn Rate
The burn rate measures how fast you’re consuming the error budget:
# 1-hour burn rate (how fast are we consuming monthly budget?)
(
sum(rate(cdn_requests_total{status=~"5.."}[1h]))
/
sum(rate(cdn_requests_total[1h]))
)
/ (1 - 0.999) # error budget fraction
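A burn rate of 1.0 means the budget is being consumed at exactly the sustainable rate: it runs out precisely at the end of the 30-day window. A burn rate of 14.4 consumes 2% of the 30-day budget in one hour and would exhaust it in about 50 hours (720 h / 14.4), so page immediately. The Google SRE Workbook recommends multi-window alerting: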
- Fast burn (1h + 5m windows): alert for rapid consumption
- Slow burn (3d + 6h windows): alert for gradual degradation
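The multi-window idea can be sketched in a few lines. The function names below are hypothetical, and the inputs in `main` are invented; the rule shown is that an alert fires only when both the long and the short window exceed the threshold, so a spike that has already ended does not page anyone.

```go
package main

import "fmt"

// burnRate divides the observed error ratio by the error budget fraction.
func burnRate(errorRatio, slo float64) float64 {
	return errorRatio / (1 - slo)
}

// budgetConsumed returns the fraction of the budget spent when a burn rate
// is sustained for the given hours out of a window of windowHours.
func budgetConsumed(burn, hours, windowHours float64) float64 {
	return burn * hours / windowHours
}

// shouldAlert requires both windows to exceed the threshold: the long
// window shows the burn is significant, the short one that it is ongoing.
func shouldAlert(longBurn, shortBurn, threshold float64) bool {
	return longBurn > threshold && shortBurn > threshold
}

func main() {
	const slo = 0.999
	long := burnRate(0.02, slo)   // 2% errors over the last hour
	short := burnRate(0.018, slo) // 1.8% errors over the last 5 minutes
	fmt.Printf("1h burn: %.1f, 5m burn: %.1f\n", long, short)
	fmt.Printf("page: %v\n", shouldAlert(long, short, 14.4))
	fmt.Printf("budget spent by 1h at 14.4x: %.2f\n", budgetConsumed(14.4, 1, 30*24))
}
```

The same `shouldAlert` check with a lower threshold and the 3d + 6h pair covers the slow-burn case.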
Grafana Dashboard
The lab exposes:
- /metrics: Prometheus metrics endpoint
- /metrics/cache: JSON cache diagnostics
Point Grafana at Prometheus and import a CDN dashboard. The docker-compose.yml
in Lab 21 wires up the full stack (Prometheus + Grafana).
Try It
make lab-20
# Send some traffic to generate metrics
for i in $(seq 1 100); do
curl -s http://localhost:8080/item/$((RANDOM % 20)) -o /dev/null
done
# View Prometheus metrics
curl -s http://localhost:8080/metrics | grep cdn_
# View cache diagnostics
curl -s http://localhost:8080/metrics/cache | python3 -m json.tool
# Compute hit ratio manually from raw counters
# Sum across all method/status combinations so HITS is a single number
HITS=$(curl -s http://localhost:8080/metrics | grep 'cdn_requests_total{.*cache="hit"' | awk '{sum+=$2} END{print sum}')
TOTAL=$(curl -s http://localhost:8080/metrics | grep 'cdn_requests_total' | grep -v '^#' | awk '{sum+=$2} END{print sum}')
echo "Hit ratio: $(echo "scale=3; $HITS/$TOTAL" | bc)"