Lab 21 · The Full System
Run it:
make lab-21
Source: labs/lab-21-full-system/main.go
Compose: labs/lab-21-full-system/docker-compose.yml
The Architecture
This final lab wires together everything from Labs 1–20 into a production-representative CDN system. It is a microcosm of how real CDNs such as Cloudflare, Fastly, and Amazon CloudFront are structured.
              ┌──────────────────────────────────────────────────────────────────┐
              │                            CDN System                            │
              │                                                                  │
Internet ──>  │  Edge NYC (:8080) ──\                                            │
              │  (singleflight,      \                                           │
              │   signed URL verify,   →  Shield (:8082) ──>  Origin (:9001)     │
              │   30s TTL, metrics)   /   (singleflight,                         │
              │                      /     300s TTL,                             │
              │  Edge LHR (:8081) ──/      metrics)                              │
              │  (same config)                                                   │
              │                                                                  │
              │  Prometheus (:9090)    Grafana (:3000)                           │
              └──────────────────────────────────────────────────────────────────┘
Component Responsibilities
| Component | Port | Role |
|---|---|---|
| Origin | :9001 | Source of truth. Serves all content. Simulates 50ms processing delay. |
| Shield | :8082 | Aggregation layer. One connection to origin for many edge requests. 300s TTL. |
| Edge NYC | :8080 | User-facing edge in New York. Validates signed URLs. 30s TTL. |
| Edge LHR | :8081 | User-facing edge in London. Same config as NYC. 30s TTL. |
| Prometheus | :9090 | Scrapes metrics from all nodes. |
| Grafana | :3000 | Dashboards over Prometheus. |
Multi-Tier TTL Design
The TTL cascade is intentional and critical:
User ── Edge (30s TTL) ── Shield (300s TTL) ── Origin
Why Edge TTL < Shield TTL?
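The edge serves users directly, so its short TTL limits how long an edge copy is reused before the edge goes back to the shield. Within each 30-second window, the edge collapses requests from many users into one request to the shield. Note that the edge TTL alone does not bound end-to-end freshness: a shield copy can itself be up to 300 seconds old, so unless the object is purged or the edge discounts the age already accrued at the shield, an origin update can take up to 330 seconds (30s + 300s) to reach users.
The shield’s 300s TTL means that, for any given piece of content, the shield makes at most one request to origin every 5 minutes. A popular item might be requested by 10,000 users/minute across both edges, yet origin sees only one request every 5 minutes for that item.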
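Without shield (one popular item, 2 edges):
Each edge refetches an expired item at most once per 30s TTL window
→ 2 edges × 2 misses/min = 4 origin requests/min
(every edge miss → origin request)
With shield (300s TTL):
2 edges × 2 misses/min = 4 edge misses/min
→ All go to shield
→ Shield hit ratio ~95% (only 1 miss per 5 min, i.e. 0.2 of 4 requests/min)
→ 0.2 requests/min reach origin
This is a 20× reduction in origin load with two edges. The 10,000 users/min figure drops out because cached items are refetched per TTL window, not per user. The factor grows by 10× per edge (300s / 30s), since edge misses scale with edge count while the shield’s origin traffic stays constant.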
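The per-item bound can be cross-checked from the TTLs alone. The sketch below assumes each cache refetches a continuously requested item at most once per TTL window (which is what singleflight plus caching gives you); the function name `fetchesPerMin` is illustrative and not part of the lab code.

```go
package main

import "fmt"

// fetchesPerMin returns the upper bound on upstream fetches per minute for
// one continuously requested item, given `caches` independent caches that
// each refetch at most once per ttlSeconds.
func fetchesPerMin(caches int, ttlSeconds float64) float64 {
	return float64(caches) * 60.0 / ttlSeconds
}

func main() {
	const edges = 2
	withoutShield := fetchesPerMin(edges, 30) // each edge goes straight to origin
	withShield := fetchesPerMin(1, 300)       // only the shield talks to origin

	fmt.Printf("without shield: %.1f origin fetches/min\n", withoutShield)
	fmt.Printf("with shield:    %.1f origin fetches/min\n", withShield)
	fmt.Printf("reduction:      %.0fx\n", withoutShield/withShield)
}
```

Changing `edges` shows how the benefit of the shield grows with the number of edge locations, since the shield's origin traffic does not depend on edge count.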
Singleflight at Two Layers
Both edge and shield run singleflight.Group:
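import (
	"net/http"

	"golang.org/x/sync/singleflight"
)

type CachingProxy struct {
	cache  *Cache
	origin string
	group  singleflight.Group // deduplicates concurrent misses
}

func (p *CachingProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	key := cacheKey(r)
	if item, ok := p.cache.Get(key); ok {
		serveFromCache(w, item)
		return
	}
	// Multiple concurrent requests for the same key?
	// singleflight collapses them into ONE upstream request.
	result, err, _ := p.group.Do(key, func() (interface{}, error) {
		item, err := p.fetchFromUpstream(r)
		if err != nil {
			return nil, err
		}
		p.cache.Set(key, item) // store once, inside the collapsed call
		return item, nil
	})
	if err != nil {
		// Every collapsed caller receives the same upstream error.
		http.Error(w, "upstream fetch failed", http.StatusBadGateway)
		return
	}
	serveFromCache(w, result.(*CacheItem))
}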
The thundering herd cascade: without singleflight, a popular item expiring while 1,000 requests are in flight at an edge would send 1,000 concurrent requests to the shield and, if the shield’s copy had also expired, 1,000 concurrent requests to origin. Singleflight at the edge reduces 1,000 → 1 per edge node. Singleflight at the shield then reduces the 2 concurrent edge misses (one per edge) → 1 request to origin.
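To make the collapsing mechanism concrete, here is a minimal stdlib-only sketch of request coalescing. It is a simplified stand-in for `singleflight.Group`, not the `golang.org/x/sync` implementation; the `Coalescer` type, the `waiting` helper, and `upstreamCalls` are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call is one in-flight fetch; waiters counts callers piggybacking on it.
type call struct {
	wg      sync.WaitGroup
	val     string
	err     error
	waiters int
}

// Coalescer is a simplified, stdlib-only stand-in for singleflight.Group.
type Coalescer struct {
	mu    sync.Mutex
	calls map[string]*call
}

// Do runs fn at most once per key at a time; concurrent callers share its
// result. shared reports whether this caller reused another caller's fetch.
func (c *Coalescer) Do(key string, fn func() (string, error)) (val string, shared bool, err error) {
	c.mu.Lock()
	if c.calls == nil {
		c.calls = make(map[string]*call)
	}
	if inflight, ok := c.calls[key]; ok {
		inflight.waiters++
		c.mu.Unlock()
		inflight.wg.Wait() // block until the leader finishes
		return inflight.val, true, inflight.err
	}
	cl := &call{}
	cl.wg.Add(1)
	c.calls[key] = cl
	c.mu.Unlock()

	cl.val, cl.err = fn() // only the leader reaches upstream

	c.mu.Lock()
	delete(c.calls, key)
	c.mu.Unlock()
	cl.wg.Done()
	return cl.val, false, cl.err
}

// waiting reports how many callers are currently blocked on key.
func (c *Coalescer) waiting(key string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cl, ok := c.calls[key]; ok {
		return cl.waiters
	}
	return 0
}

// upstreamCalls fires n concurrent requests for one key and returns how many
// of them reached the (simulated) upstream.
func upstreamCalls(n int) int64 {
	var c Coalescer
	var fetches int64
	release := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Do("/item/42", func() (string, error) {
				atomic.AddInt64(&fetches, 1)
				<-release // hold the fetch open so the other callers pile up
				return "body", nil
			})
		}()
	}
	for c.waiting("/item/42") < n-1 { // wait until every other caller has joined
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return atomic.LoadInt64(&fetches)
}

func main() {
	fmt.Println("upstream fetches for 1000 concurrent requests:", upstreamCalls(1000))
}
```

The first caller for a key becomes the leader and performs the fetch; everyone arriving while it is in flight waits on the same result. The real `singleflight.Group` additionally handles panics and offers `DoChan` and `Forget`, which this sketch omits.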
Signed URL Verification
The edge validates HMAC-signed URLs before serving any content:
func (e *Edge) verifySignedURL(r *http.Request) bool {
	q := r.URL.Query()
	sig := q.Get("sig")
	if sig == "" {
		return false // or true for public content
	}
	expires, err := strconv.ParseInt(q.Get("expires"), 10, 64)
	if err != nil {
		return false // missing or malformed expiry
	}
	if time.Now().Unix() > expires {
		return false // expired
	}
	key, ok := e.signingKeys[q.Get("keyver")]
	if !ok {
		return false // unknown key version
	}
	// The signature covers the path and expiry, so neither can be altered.
	canonical := fmt.Sprintf("GET\n%s\n%d\n", r.URL.Path, expires)
	expected := computeHMAC(key, canonical)
	// Constant-time comparison avoids leaking the signature via timing.
	return hmac.Equal([]byte(sig), []byte(expected))
}
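The shield and origin do not re-verify; they trust the edge. This is a common trust-boundary design: validation happens once at the first authorized boundary, not repeatedly at every tier. It holds only if the shield and origin are unreachable from the public internet. The lab’s Compose file publishes their ports to the host, which is convenient for inspection but would bypass verification in production.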
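The signing side is the mirror image of verification. The sketch below shows one way a signed URL could be produced and checked end to end. It assumes `computeHMAC` is hex-encoded HMAC-SHA256 and that keys are stored as `map[string][]byte`; the lab source does not show either, so treat both as assumptions. `signURL`, `canonical`, and `verify` are illustrative names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// computeHMAC returns the hex-encoded HMAC-SHA256 of msg (assumed encoding).
func computeHMAC(key []byte, msg string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(msg))
	return hex.EncodeToString(mac.Sum(nil))
}

// canonical mirrors the string the edge signs: method, path, expiry.
func canonical(path string, expires int64) string {
	return fmt.Sprintf("GET\n%s\n%d\n", path, expires)
}

// signURL builds a signed URL for path on base, valid until expires (Unix seconds).
func signURL(base, path, keyver string, key []byte, expires int64) string {
	q := url.Values{}
	q.Set("expires", strconv.FormatInt(expires, 10))
	q.Set("keyver", keyver)
	q.Set("sig", computeHMAC(key, canonical(path, expires)))
	return base + path + "?" + q.Encode()
}

// verify re-derives the signature from the URL and compares in constant time.
func verify(rawURL string, keys map[string][]byte, now int64) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	q := u.Query()
	expires, err := strconv.ParseInt(q.Get("expires"), 10, 64)
	if err != nil || now > expires {
		return false
	}
	key, ok := keys[q.Get("keyver")]
	if !ok {
		return false
	}
	want := computeHMAC(key, canonical(u.Path, expires))
	return hmac.Equal([]byte(q.Get("sig")), []byte(want))
}

func main() {
	keys := map[string][]byte{"v1": []byte("demo-secret")} // demo key only
	now := time.Now().Unix()
	signed := signURL("http://localhost:8080", "/article/1", "v1", keys["v1"], now+300)
	fmt.Println(signed)
	fmt.Println("valid now:", verify(signed, keys, now))
	fmt.Println("valid after expiry:", verify(signed, keys, now+301))
}
```

Carrying `keyver` in the URL lets the edge hold several keys at once, so a key can be rotated without invalidating URLs that were signed with the previous version.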
Docker Compose
# labs/lab-21-full-system/docker-compose.yml
services:
  origin:
    build: .
    command: ["./cdn-lab21", "-role=origin", "-addr=:9001"]
    ports: ["9001:9001"]
  shield:
    build: .
    command: ["./cdn-lab21", "-role=shield", "-addr=:8082", "-upstream=http://origin:9001"]
    ports: ["8082:8082"]
    depends_on: [origin]
  edge-nyc:
    build: .
    command: ["./cdn-lab21", "-role=edge", "-addr=:8080", "-upstream=http://shield:8082", "-pop=NYC"]
    ports: ["8080:8080"]
    depends_on: [shield]
  edge-lhr:
    build: .
    command: ["./cdn-lab21", "-role=edge", "-addr=:8081", "-upstream=http://shield:8082", "-pop=LHR"]
    ports: ["8081:8081"]
    depends_on: [shield]
  prometheus:
    image: prom/prometheus:latest
    volumes: ["./prometheus.yml:/etc/prometheus/prometheus.yml"]
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    depends_on: [prometheus]
Prometheus Configuration
# labs/lab-21-full-system/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'cdn-edge'
    static_configs:
      - targets: ['edge-nyc:8080', 'edge-lhr:8081']
  - job_name: 'cdn-shield'
    static_configs:
      - targets: ['shield:8082']
  - job_name: 'cdn-origin'
    static_configs:
      - targets: ['origin:9001']
Observing the System Under Load
With the system running, generate load and observe the cascade:
# Generate 1000 requests across 50 unique URLs
for i in $(seq 1 1000); do
    curl -s "http://localhost:8080/item/$((RANDOM % 50))" -o /dev/null
done
# Check metrics at each tier
# Edge NYC hit ratio
curl -s http://localhost:8080/metrics | grep cdn_requests_total
# Shield hit ratio
curl -s http://localhost:8082/metrics | grep cdn_requests_total
# Origin request count (should be tiny compared to edge total)
curl -s http://localhost:9001/metrics | grep cdn_requests_total
You should see:
- Edge hit ratio: ~80–90% (after warmup)
- Shield hit ratio: ~95–99%
- Origin requests: ~1–5% of edge total
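The `grep` output gives raw counters, so the hit ratio still has to be computed. A small sketch of that calculation over Prometheus text-format output follows. It assumes the counter carries a `result="hit"` / `result="miss"` label; the lab's actual label names are not shown here, so adjust the match strings to whatever `/metrics` really emits.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// hitRatio scans Prometheus text-format output and returns hits/(hits+misses)
// for cdn_requests_total. The result="hit"/"miss" labels are hypothetical.
// ok is false when no matching samples were found.
func hitRatio(metrics string) (ratio float64, ok bool) {
	var hits, misses float64
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "cdn_requests_total{") {
			continue // skips # HELP / # TYPE lines and other metrics
		}
		end := strings.LastIndex(line, "}")
		if end < 0 {
			continue // malformed sample
		}
		rest := strings.Fields(line[end+1:]) // value, then optional timestamp
		if len(rest) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(rest[0], 64)
		if err != nil {
			continue
		}
		labels := line[:end]
		switch {
		case strings.Contains(labels, `result="hit"`):
			hits += v
		case strings.Contains(labels, `result="miss"`):
			misses += v
		}
	}
	if hits+misses == 0 {
		return 0, false
	}
	return hits / (hits + misses), true
}

func main() {
	sample := `# TYPE cdn_requests_total counter
cdn_requests_total{pop="NYC",result="hit"} 950
cdn_requests_total{pop="NYC",result="miss"} 50`
	if r, ok := hitRatio(sample); ok {
		fmt.Printf("hit ratio: %.2f\n", r)
	}
}
```

In Grafana the same ratio would normally be expressed as a PromQL query over `rate()` of the counters; this sketch is only for quick checks against a single `curl` snapshot.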
Failure Modes & Resilience
Origin failure
Origin down → Shield’s fetch fails (connection error, timeout, or 5xx)
→ Shield serves its stale copy under stale-if-error (from Cache-Control)
→ Edge receives the stale response and serves it to users
This is the “stale-if-error” pattern from Lab 7, applied system-wide. Users see slightly stale content rather than errors.
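The decision a cache makes on each lookup under this pattern can be sketched as follows. This is a minimal illustration, not the lab's cache: the `entry` type, the `lookup` and `serveAt` functions, and the one-hour stale window are assumptions chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// entry is a cached response with its freshness and stale-if-error windows.
type entry struct {
	body       string
	storedAt   time.Time
	ttl        time.Duration // fresh for this long
	staleIfErr time.Duration // may be served this long past ttl if upstream fails
}

// lookup returns the body to serve and a label describing where it came from.
// A real cache would also store the refetched body; that is omitted here.
func lookup(e *entry, now time.Time, fetch func() (string, error)) (body, source string, err error) {
	if e != nil && now.Sub(e.storedAt) <= e.ttl {
		return e.body, "fresh", nil
	}
	fetched, fetchErr := fetch()
	if fetchErr == nil {
		return fetched, "refetched", nil
	}
	// Upstream failed: fall back to the stale copy if it is within the window.
	if e != nil && now.Sub(e.storedAt) <= e.ttl+e.staleIfErr {
		return e.body, "stale-if-error", nil
	}
	return "", "", fetchErr
}

// serveAt simulates a shield lookup for an entry of the given age.
func serveAt(ageSeconds int, originUp bool) string {
	now := time.Now()
	e := &entry{
		body:       "v1",
		storedAt:   now.Add(-time.Duration(ageSeconds) * time.Second),
		ttl:        300 * time.Second, // shield TTL from this lab
		staleIfErr: time.Hour,         // hypothetical stale window
	}
	fetch := func() (string, error) {
		if !originUp {
			return "", errors.New("origin unreachable")
		}
		return "v2", nil
	}
	_, source, err := lookup(e, now, fetch)
	if err != nil {
		return "error"
	}
	return source
}

func main() {
	fmt.Println("age 10s, origin down:   ", serveAt(10, false))
	fmt.Println("age 400s, origin up:    ", serveAt(400, true))
	fmt.Println("age 400s, origin down:  ", serveAt(400, false))
	fmt.Println("age 5000s, origin down: ", serveAt(5000, false))
}
```

The stale window is bounded on purpose: once it is exceeded, returning an error is usually safer than serving arbitrarily old content.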
Shield failure
Shield down → Edge cannot reach upstream
→ Edge serves stale (if available) or 503
In production, the shield tier has multiple nodes behind a load balancer. A single shield failure routes to another shield node.
Edge failure
Edge-NYC down → Geo routing redirects NYC users to Edge-LHR
→ Higher latency but service continues
This is the health-check failover from Lab 15. Each edge registers with the geo-routing layer and is removed from rotation when health checks fail.
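The removal-from-rotation logic can be sketched in a few lines. This is an illustration under assumptions, not the Lab 15 code: the `Edge` type, the `Route` function, and the threshold of three consecutive failed checks are all hypothetical.

```go
package main

import "fmt"

// unhealthyAfter is a hypothetical threshold: an edge leaves rotation after
// this many consecutive failed health checks.
const unhealthyAfter = 3

// Edge tracks the health-check state of one edge location.
type Edge struct {
	Name     string
	failures int // consecutive failed health checks
}

// Report records the outcome of one health check.
func (e *Edge) Report(ok bool) {
	if ok {
		e.failures = 0 // any success puts the edge back in rotation
		return
	}
	e.failures++
}

// Healthy reports whether the edge is still in rotation.
func (e *Edge) Healthy() bool { return e.failures < unhealthyAfter }

// Route returns the first healthy edge from a list ordered nearest-first,
// or nil if every edge is out of rotation.
func Route(byProximity []*Edge) *Edge {
	for _, e := range byProximity {
		if e.Healthy() {
			return e
		}
	}
	return nil
}

func main() {
	nyc, lhr := &Edge{Name: "NYC"}, &Edge{Name: "LHR"}
	forNYCUsers := []*Edge{nyc, lhr} // nearest first

	fmt.Println("serving from:", Route(forNYCUsers).Name)
	for i := 0; i < unhealthyAfter; i++ {
		nyc.Report(false)
	}
	fmt.Println("after NYC fails its checks:", Route(forNYCUsers).Name)
}
```

Requiring several consecutive failures before removal avoids flapping on a single dropped probe, at the cost of a slightly slower failover.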
Path to Production
To harden this system for real traffic:
- Replace in-memory cache with Redis: enables shared cache state across edge instances and survives restarts
- Add TLS termination: automatic certificate provisioning via the ACME protocol (for example, Let’s Encrypt)
- Add rate limiting: token bucket per IP/user with Redis-backed counters
- Add WAF rules: block common attack patterns (SQLi, XSS, path traversal)
- Add CDN purge API: authenticated endpoint to purge cache keys by tag
- Add distributed tracing: OpenTelemetry spans across edge → shield → origin
- Add chaos testing: kill origin/shield randomly to validate resilience
Try It
# Start the full system with Docker Compose
cd labs/lab-21-full-system
docker compose up --build
# In another terminal: generate a signed URL and fetch content
SIGNED_URL=$(curl -s "http://localhost:8080/sign?path=/article/1&ttl=300")
curl -sv "$SIGNED_URL"
# View Prometheus metrics (open is macOS; use xdg-open on Linux)
open http://localhost:9090
# View Grafana (default credentials: admin/admin)
open http://localhost:3000
# Generate load test
for i in $(seq 1 5000); do
curl -s "http://localhost:8080/item/$((RANDOM % 100))" -o /dev/null &
done
wait
# Observe the request waterfall through the tiers
curl -s http://localhost:8080/metrics | grep cdn_requests_total | head -5
curl -s http://localhost:8082/metrics | grep cdn_requests_total | head -5
curl -s http://localhost:9001/metrics | grep cdn_requests_total | head -5