Lab 21 · The Full System

Run it: make lab-21
Source: labs/lab-21-full-system/main.go
Compose: labs/lab-21-full-system/docker-compose.yml


The Architecture

This final lab wires together everything from Labs 1–20 into a production-representative CDN system. It is a microcosm of how real CDNs like Cloudflare, Fastly, and AWS CloudFront are structured.

                 ┌─────────────────────────────────────────────────┐
                 │                  CDN System                      │
                 │                                                   │
  Internet  ──>  │  Edge NYC (:8080)  ──\                           │
                 │  (singleflight,         \                         │
                 │   signed URL verify,     → Shield (:8082)  ──>  Origin (:9001)
                 │   30s TTL, metrics)     /  (singleflight,
                 │                        /   300s TTL,
                 │  Edge LHR (:8081)  ──/    metrics)
                 │  (same config)            
                 │                                                   │
                 │  Prometheus (:9090)  Grafana (:3000)             │
                 └─────────────────────────────────────────────────┘

Component Responsibilities

Component  | Port  | Role
Origin     | :9001 | Source of truth. Serves all content. Simulates 50ms processing delay.
Shield     | :8082 | Aggregation layer. One connection to origin for many edge requests. 300s TTL.
Edge NYC   | :8080 | User-facing edge in New York. Validates signed URLs. 30s TTL.
Edge LHR   | :8081 | User-facing edge in London. Same config as NYC. 30s TTL.
Prometheus | :9090 | Scrapes metrics from all nodes.
Grafana    | :3000 | Dashboards over Prometheus.

Multi-Tier TTL Design

The TTL cascade is intentional and critical:

User ── Edge (30s TTL) ── Shield (300s TTL) ── Origin

Why Edge TTL < Shield TTL?

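The edge serves users directly. Its short TTL means an edge copy is never more than 30 seconds older than the shield’s copy, while the edge still collapses requests from many users into one request to the shield. End-to-end freshness is bounded by the sum of the two TTLs: if an edge refills from a shield copy that is about to expire, a user can see content up to 330 seconds old (30s + 300s) unless the object is purged or the edge subtracts the time the object already spent at the shield.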

The shield’s 300s TTL means: for any given piece of content, the shield makes at most one request to origin per 5 minutes. A popular item might be requested by 10,000 users/minute across both edges — the shield ensures origin sees only 1 request every 5 minutes for that item.

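Without shield (one popular item, 2 edges):
  Each edge refetches once per 30s TTL = 2 misses/min per edge
  → 2 edges × 2 misses/min = 4 requests/min to origin
  (every edge miss → origin request)

With shield (300s TTL):
  Same 4 edge misses/min
  → All go to shield
  → Shield refetches once per 300s, so its hit ratio is ~95% (1 miss in 20)
  → 0.2 requests/min reach origin (1 every 5 minutes)

This is a 20× reduction in origin load with two edges. User volume does not change these numbers: 10,000 users/min still produce only one miss per TTL per edge. The factor grows with the number of edges, because origin load stays at one request per 5 minutes per item no matter how many edges sit in front of the shield.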
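A per-item way to sanity-check the effect of the shield: in steady state each cache refetches a hot item about once per TTL, so the upstream request rate depends on TTLs and node counts rather than on user traffic. The sketch below assumes that steady state and that request coalescing is enabled at every tier; `refetchesPerMin` is a hypothetical helper, not part of the lab source.

```go
package main

import "fmt"

// refetchesPerMin returns how often a group of caches refetches one hot item
// from upstream, assuming each cache refetches exactly once per TTL.
// Hypothetical helper for illustration only.
func refetchesPerMin(nodes int, ttlSeconds float64) float64 {
	return float64(nodes) * 60.0 / ttlSeconds
}

func main() {
	const edges = 2
	edgeMisses := refetchesPerMin(edges, 30) // edge misses per minute, per item
	withShield := refetchesPerMin(1, 300)    // shield refetches per minute, per item

	fmt.Printf("origin req/min without shield: %.1f\n", edgeMisses)
	fmt.Printf("origin req/min with shield:    %.1f\n", withShield)
	fmt.Printf("shield hit ratio:              %.0f%%\n", 100*(1-withShield/edgeMisses))
	fmt.Printf("origin load reduction:         %.0fx\n", edgeMisses/withShield)
}
```

Changing `edges` shows why the shield matters more as the edge fleet grows: the first figure scales with the number of edges, the second does not.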


Singleflight at Two Layers

Both edge and shield run singleflight.Group:

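import (
    "net/http"

    "golang.org/x/sync/singleflight"
)

type CachingProxy struct {
    cache  *Cache
    origin string
    group  singleflight.Group // deduplicates concurrent misses
}

func (p *CachingProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    key := cacheKey(r)

    if item, ok := p.cache.Get(key); ok {
        serveFromCache(w, item)
        return
    }

    // Multiple concurrent requests for the same key?
    // singleflight collapses them into ONE upstream request.
    result, err, _ := p.group.Do(key, func() (interface{}, error) {
        return p.fetchFromUpstream(r)
    })
    if err != nil {
        // Every coalesced caller receives the same upstream error.
        http.Error(w, "upstream fetch failed", http.StatusBadGateway)
        return
    }

    // Checked assertion: a nil or unexpected result must not panic the handler.
    item, ok := result.(*CacheItem)
    if !ok {
        http.Error(w, "upstream fetch failed", http.StatusBadGateway)
        return
    }
    p.cache.Set(key, item)
    serveFromCache(w, item)
}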

The thundering herd cascade: without singleflight, 1,000 concurrent user requests for an item that has just expired at an edge would become 1,000 concurrent requests to the shield and, if the shield’s copy has expired too, 1,000 concurrent requests to origin. Singleflight at the edge reduces 1,000 concurrent misses → 1 upstream request per edge node. Singleflight at the shield reduces the 2 concurrent edge misses (one per edge) → 1 shield request to origin.
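The collapsing behaviour can be demonstrated without any network. The sketch below uses `coalescer`, a simplified standard-library stand-in for `singleflight.Group` (the real type also reports whether a result was shared and supports `Forget`), and holds the single upstream fetch open until every other caller is waiting on it. All names here are hypothetical and not taken from the lab source.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-flight fetch that later callers can wait on.
type call struct {
	done chan struct{}
	dups int // callers waiting on this fetch
	val  string
	err  error
}

// coalescer is a simplified stand-in for singleflight.Group.
type coalescer struct {
	mu    sync.Mutex
	calls map[string]*call
}

// Do runs fn at most once per key at a time; concurrent callers share the result.
func (g *coalescer) Do(key string, fn func() (string, error)) (string, error) {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		c.dups++
		g.mu.Unlock()
		<-c.done // a fetch is already in flight: wait for its result
		return c.val, c.err
	}
	c := &call{done: make(chan struct{})}
	g.calls[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()
	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	close(c.done)
	return c.val, c.err
}

// waiters reports how many callers are blocked on the in-flight fetch for key.
func (g *coalescer) waiters(key string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	if c, ok := g.calls[key]; ok {
		return c.dups
	}
	return 0
}

// simulateMisses fires n concurrent misses for one key and returns how many
// of them reached the simulated upstream.
func simulateMisses(n int) int64 {
	var g coalescer
	var upstream int64
	release := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Do("/item/42", func() (string, error) {
				atomic.AddInt64(&upstream, 1)
				<-release // keep the fetch open until everyone has piled up
				return "body", nil
			})
		}()
	}
	for g.waiters("/item/42") < n-1 {
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return atomic.LoadInt64(&upstream)
}

func main() {
	fmt.Println("upstream requests for 1000 concurrent misses:", simulateMisses(1000))
}
```

Because the fetch is only released once the other 999 callers are blocked on it, exactly one request reaches the upstream in this simulation.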


Signed URL Verification

The edge validates HMAC-signed URLs before serving any content:

func (e *Edge) verifySignedURL(r *http.Request) bool {
    q := r.URL.Query()

    sig := q.Get("sig")
    if sig == "" {
        return false // or true for public content
    }

    expires, err := strconv.ParseInt(q.Get("expires"), 10, 64)
    if err != nil || time.Now().Unix() > expires {
        return false // missing, malformed, or expired
    }

    key, ok := e.signingKeys[q.Get("keyver")]
    if !ok {
        return false // unknown key version
    }

    // The signature covers the method, path, and expiry.
    canonical := fmt.Sprintf("GET\n%s\n%d\n", r.URL.Path, expires)
    expected := computeHMAC(key, canonical)

    // Constant-time comparison avoids leaking the signature through timing.
    return hmac.Equal([]byte(sig), []byte(expected))
}

The shield and origin do not re-verify — they trust the edge. This is the standard trust boundary design: validation happens at the first authorized boundary, not repeatedly at every tier.


Docker Compose

# labs/lab-21-full-system/docker-compose.yml
services:
  origin:
    build: .
    command: ["./cdn-lab21", "-role=origin", "-addr=:9001"]
    ports: ["9001:9001"]

  shield:
    build: .
    command: ["./cdn-lab21", "-role=shield", "-addr=:8082", "-upstream=http://origin:9001"]
    ports: ["8082:8082"]
    depends_on: [origin]

  edge-nyc:
    build: .
    command: ["./cdn-lab21", "-role=edge", "-addr=:8080", "-upstream=http://shield:8082", "-pop=NYC"]
    ports: ["8080:8080"]
    depends_on: [shield]

  edge-lhr:
    build: .
    command: ["./cdn-lab21", "-role=edge", "-addr=:8081", "-upstream=http://shield:8082", "-pop=LHR"]
    ports: ["8081:8081"]
    depends_on: [shield]

  prometheus:
    image: prom/prometheus:latest
    volumes: ["./prometheus.yml:/etc/prometheus/prometheus.yml"]
    ports: ["9090:9090"]

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    depends_on: [prometheus]

Prometheus Configuration

# labs/lab-21-full-system/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cdn-edge'
    static_configs:
      - targets: ['edge-nyc:8080', 'edge-lhr:8081']

  - job_name: 'cdn-shield'
    static_configs:
      - targets: ['shield:8082']

  - job_name: 'cdn-origin'
    static_configs:
      - targets: ['origin:9001']
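Prometheus scrapes the `/metrics` path on each target by default, so every node needs a handler there that emits the text exposition format. The sketch below shows a minimal standard-library version of such a handler; the lab itself may use the Prometheus Go client instead, and the `result` label and the `counters` type are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// counters holds per-node request counts. The "result" label below is an
// assumption; the lab's real label names may differ.
type counters struct {
	hits, misses atomic.Int64
}

// render formats the counters in the Prometheus text exposition format.
func (c *counters) render() string {
	return "# TYPE cdn_requests_total counter\n" +
		fmt.Sprintf("cdn_requests_total{result=\"hit\"} %d\n", c.hits.Load()) +
		fmt.Sprintf("cdn_requests_total{result=\"miss\"} %d\n", c.misses.Load())
}

// ServeHTTP lets counters be mounted directly as the /metrics handler.
func (c *counters) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
	fmt.Fprint(w, c.render())
}

func main() {
	c := &counters{}
	c.hits.Add(9)
	c.misses.Add(1)

	// In a node this would be mux.Handle("/metrics", c); here the handler is
	// exercised in-process so the sketch runs without binding a port.
	rec := httptest.NewRecorder()
	c.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```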

Observing the System Under Load

With the system running, generate load and observe the cascade:

# Generate 1000 requests across 50 unique URLs
for i in $(seq 1 1000); do
  curl -s "http://localhost:8080/item/$((RANDOM % 50))" -o /dev/null
done

# Check metrics at each tier
# Edge NYC hit ratio
curl -s http://localhost:8080/metrics | grep cdn_requests_total

# Shield hit ratio  
curl -s http://localhost:8082/metrics | grep cdn_requests_total

# Origin request count (should be tiny compared to edge total)
curl -s http://localhost:9001/metrics | grep cdn_requests_total

You should see:

  • Edge hit ratio: ~80–90% (after warmup)
  • Shield hit ratio: ~95–99%
  • Origin requests: ~1–5% of edge total
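If a node only exposes request totals, each tier's hit ratio can be derived from the totals of adjacent tiers, since every miss at one tier shows up as a request at the next. The sketch below assumes all shield traffic comes from the edges being counted; the totals are illustrative values, not measured output from the lab, and `hitRatio` is a hypothetical helper.

```go
package main

import "fmt"

// hitRatio derives a tier's hit ratio from the requests it received and the
// requests it forwarded upstream (its misses).
func hitRatio(received, forwarded float64) float64 {
	if received == 0 {
		return 0
	}
	return 1 - forwarded/received
}

func main() {
	// Illustrative totals only, not measured output from the lab.
	edge, shield, origin := 1000.0, 200.0, 10.0

	fmt.Printf("edge hit ratio:   %.0f%%\n", 100*hitRatio(edge, shield))
	fmt.Printf("shield hit ratio: %.0f%%\n", 100*hitRatio(shield, origin))
	fmt.Printf("origin share:     %.1f%% of edge requests\n", 100*origin/edge)
}
```

The origin share is the product of the two miss ratios, which is why a modest hit ratio at each tier still leaves origin with a very small fraction of user traffic.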

Failure Modes & Resilience

Origin failure

Origin down → Shield gets 502/503 from origin
           → Shield returns stale-if-error (from Cache-Control)
           → Edge returns stale content to users

This is the “stale-if-error” pattern from Lab 7, applied system-wide. Users see slightly stale content rather than errors.

Shield failure

Shield down → Edge cannot reach upstream
           → Edge serves stale (if available) or 503

In production, the shield tier has multiple nodes behind a load balancer. A single shield failure routes to another shield node.

Edge failure

Edge-NYC down → Geo routing redirects NYC users to Edge-LHR
             → Higher latency but service continues

This is the health-check failover from Lab 15. Each edge registers with the geo-routing layer and is removed from rotation when health checks fail.
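The routing decision behind this failover can be sketched as a preference list per region plus a failure streak per edge. The threshold of three consecutive failed probes and the `edge` and `route` names are assumptions for illustration, not values taken from Lab 15.

```go
package main

import "fmt"

// unhealthyAfter is an assumed threshold: consecutive failed probes before
// an edge is pulled from rotation.
const unhealthyAfter = 3

type edge struct {
	name     string
	failures int // consecutive failed health checks
}

// recordProbe updates the failure streak from one health-check result.
func (e *edge) recordProbe(ok bool) {
	if ok {
		e.failures = 0
		return
	}
	e.failures++
}

func (e *edge) healthy() bool { return e.failures < unhealthyAfter }

// route returns the first healthy edge in the client's preference order
// (nearest first); ok is false when no edge is in rotation.
func route(preference []*edge) (name string, ok bool) {
	for _, e := range preference {
		if e.healthy() {
			return e.name, true
		}
	}
	return "", false
}

func main() {
	nyc, lhr := &edge{name: "edge-nyc"}, &edge{name: "edge-lhr"}
	nycUsers := []*edge{nyc, lhr} // NYC users prefer NYC, then LHR

	fmt.Println(route(nycUsers)) // edge-nyc true
	for i := 0; i < unhealthyAfter; i++ {
		nyc.recordProbe(false)
	}
	fmt.Println(route(nycUsers)) // edge-lhr true
	nyc.recordProbe(true)
	fmt.Println(route(nycUsers)) // edge-nyc true
}
```

Requiring several consecutive failures before removing an edge trades a slightly slower failover for protection against flapping on a single lost probe.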


Path to Production

To harden this system for real traffic:

  1. Replace in-memory cache with Redis: enables shared cache state across edge instances and survives restarts
  2. Add TLS termination: automatic certificate provisioning via Let’s Encrypt or another ACME-compatible CA
  3. Add rate limiting: token bucket per IP/user with Redis-backed counters
  4. Add WAF rules: block common attack patterns (SQLi, XSS, path traversal)
  5. Add CDN purge API: authenticated endpoint to purge cache keys by tag
  6. Add distributed tracing: OpenTelemetry spans across edge → shield → origin
  7. Add chaos testing: kill origin/shield randomly to validate resilience

Try It

# Start the full system with Docker Compose
cd labs/lab-21-full-system
docker compose up --build

# In another terminal: generate signed URL and fetch content
TOKEN=$(curl -s "http://localhost:8080/sign?path=/article/1&ttl=300")
curl -s "$TOKEN" -v

# View Prometheus metrics (`open` is macOS; use xdg-open on Linux)
open http://localhost:9090

# View Grafana (default credentials: admin/admin)
open http://localhost:3000

# Generate load test
for i in $(seq 1 5000); do
  curl -s "http://localhost:8080/item/$((RANDOM % 100))" -o /dev/null &
done
wait

# Observe the request waterfall through the tiers
curl -s http://localhost:8080/metrics | grep cdn_requests_total | head -5
curl -s http://localhost:8082/metrics | grep cdn_requests_total | head -5  
curl -s http://localhost:9001/metrics | grep cdn_requests_total | head -5