Lab 15 · Geographic Routing & PoP Failover

Run it: make lab-15
Source: labs/lab-15-geo-routing/main.go


The Problem

A CDN node in Singapore is of little use to a user in Berlin. The Singapore → Berlin round trip is roughly 160 ms, whereas a Frankfurt PoP would serve Berlin in ~5 ms.

Geographic routing — directing each user to the nearest CDN PoP — is one of the most impactful optimizations in CDN infrastructure. The difference between 160 ms and 5 ms TTFB is the difference between a bounced visitor and a retained one.


Routing Mechanisms

1. Anycast BGP (used by Cloudflare, Fastly)

The same IP address is announced from every PoP via BGP. Internet routing automatically directs packets to the topologically nearest PoP:

209.91.64.22 announced from:
  - Frankfurt PoP → European users reach Frankfurt
  - Tokyo PoP → Asian users reach Tokyo
  - Chicago PoP → US Midwest users reach Chicago

BGP anycast routing is handled entirely by the internet’s routing infrastructure. The CDN operator’s job is to configure BGP announcements correctly and monitor AS path lengths.

Advantage: Zero application-level routing logic. Failover is automatic (BGP withdraws the broken PoP’s announcement).

Disadvantage: BGP convergence is slow (~30–180 seconds for a prefix withdrawal to propagate globally). A PoP that goes down may continue receiving traffic for minutes.

DNS-level failover is faster (~30 seconds with low TTL), but requires additional coordination.

2. GeoDNS (used by many second-tier CDNs)

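DNS returns different IP addresses based on the geographic region of the querying IP address (the client’s recursive resolver, or the client’s subnet when EDNS Client Subnet is in use):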

User from Germany resolves cdn.example.com:
  → DNS returns 203.0.113.10 (Frankfurt PoP)

User from Japan resolves cdn.example.com:
  → DNS returns 203.0.113.20 (Tokyo PoP)

Advantage: Simple to implement; works with any CDN infrastructure.

Disadvantage: DNS caching (TTLs of 60–300 s) makes failover slow. Until the cached record expires, users who resolved the old IP keep connecting to the dead PoP and see connection timeouts or refused connections.
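
The GeoDNS lookup above can be sketched as a simple table lookup. This is a minimal illustration, not how any particular DNS server implements it: the two addresses come from the example above, while regionPoP, defaultPoP, and resolveGeo are hypothetical names, and the fallback choice is an assumption.

```go
package main

import "fmt"

// regionPoP maps an ISO country code to the address a GeoDNS server would
// answer with. The table layout is an assumption made for illustration.
var regionPoP = map[string]string{
	"DE": "203.0.113.10", // Frankfurt PoP
	"JP": "203.0.113.20", // Tokyo PoP
}

// defaultPoP is a hypothetical fallback for countries without an entry.
const defaultPoP = "203.0.113.10"

// resolveGeo returns the A-record value for a client in the given country.
func resolveGeo(country string) string {
	if ip, ok := regionPoP[country]; ok {
		return ip
	}
	return defaultPoP
}

func main() {
	fmt.Println(resolveGeo("DE")) // Frankfurt address
	fmt.Println(resolveGeo("JP")) // Tokyo address
	fmt.Println(resolveGeo("BR")) // no entry, so the fallback is returned
}
```

Failing over in this model means editing the table, but resolvers that already cached the old answer will not see the change until its TTL runs out.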

3. Application-Layer Routing (HTTP Redirect)

User → cdn.example.com → Routing server
                           → 302 Redirect to "ams01.cdn.example.com"

This lab implements application-layer routing. A routing server receives all requests, calculates the optimal PoP, and either redirects or proxies to it.
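
A minimal sketch of the redirect variant, assuming the PoP host has already been chosen. redirectHandler and probe are hypothetical names, and the https scheme is an assumption; the lab's own handler in main.go may differ.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// redirectHandler sends every request to the same path and query on popHost
// with a 302. How popHost is chosen is out of scope here; it is passed in.
func redirectHandler(popHost string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		target := "https://" + popHost + r.URL.RequestURI()
		http.Redirect(w, r, target, http.StatusFound)
	})
}

// probe runs one GET through the handler in memory and reports the status
// code and Location header, so no listening socket is needed.
func probe(popHost, requestURI string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, requestURI, nil)
	redirectHandler(popHost).ServeHTTP(rec, req)
	return rec.Code, rec.Header().Get("Location")
}

func main() {
	code, loc := probe("ams01.cdn.example.com", "/img/logo.png?v=2")
	fmt.Println(code, loc)
}
```

A redirect costs the client one extra round trip to the routing server, which is the trade-off against proxying the request through it.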


Haversine Distance Calculation

The lab computes geographic distance using the haversine formula, which gives the great-circle distance between two points on a sphere:

func haversine(lat1, lon1, lat2, lon2 float64) float64 {
    const R = 6371 // Earth radius in km
    
    φ1 := lat1 * math.Pi / 180
    φ2 := lat2 * math.Pi / 180
    Δφ := (lat2 - lat1) * math.Pi / 180
    Δλ := (lon2 - lon1) * math.Pi / 180
    
    a := math.Sin(Δφ/2)*math.Sin(Δφ/2) +
         math.Cos(φ1)*math.Cos(φ2)*
         math.Sin(Δλ/2)*math.Sin(Δλ/2)
    
    c := 2 * math.Atan2(math.Sqrt(a), math.Sqrt(1-a))
    return R * c // distance in km
}
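
As a quick sanity check, the function can be run on two of the lab's PoP locations. The sketch below repeats the function in compact form so it compiles on its own; the sign convention (south and west negative) matches the query parameters used later in this lab.

```go
package main

import (
	"fmt"
	"math"
)

// haversine mirrors the lab's function: great-circle distance in km.
func haversine(lat1, lon1, lat2, lon2 float64) float64 {
	const R = 6371 // mean Earth radius in km
	toRad := func(deg float64) float64 { return deg * math.Pi / 180 }
	φ1, φ2 := toRad(lat1), toRad(lat2)
	Δφ, Δλ := toRad(lat2-lat1), toRad(lon2-lon1)
	a := math.Sin(Δφ/2)*math.Sin(Δφ/2) +
		math.Cos(φ1)*math.Cos(φ2)*math.Sin(Δλ/2)*math.Sin(Δλ/2)
	return R * 2 * math.Atan2(math.Sqrt(a), math.Sqrt(1-a))
}

func main() {
	// London is 0.13°W and New York is 74.00°W, so both longitudes are negative.
	d := haversine(51.51, -0.13, 40.71, -74.00)
	fmt.Printf("London -> New York: %.0f km\n", d) // roughly 5,570 km
}
```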

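Given the client’s location, find the closest healthy PoP:

// nearestPoP returns the closest healthy PoP, or nil if none is healthy.
// It works with pointers because PoP contains an atomic.Bool, which must
// not be copied after first use.
func nearestPoP(clientLat, clientLon float64, pops []PoP) *PoP {
    var nearest *PoP
    minDist := math.MaxFloat64
    for i := range pops {
        pop := &pops[i]
        if !pop.healthy.Load() {
            continue // skip unhealthy PoPs
        }
        d := haversine(clientLat, clientLon, pop.Lat, pop.Lon)
        if d < minDist {
            minDist = d
            nearest = pop
        }
    }
    return nearest
}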

The 5 PoPs

The lab simulates 5 geographically distributed PoPs:

PoP   City        Coords                Port
NYC   New York    40.71°N, 74.00°W      :9010
LHR   London      51.51°N, 0.13°W       :9011
NRT   Tokyo       35.65°N, 139.76°E     :9012
SYD   Sydney      33.87°S, 151.21°E     :9013
GRU   São Paulo   23.43°S, 46.47°W      :9014

Health Checking & Failover

Each PoP exposes a /health endpoint. The router runs periodic health checks:

type PoP struct {
    Name    string
    Addr    string
    Lat     float64
    Lon     float64
    healthy atomic.Bool
}

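func (r *Router) healthCheckLoop() {
    // A timeout keeps a hung PoP from blocking its probe forever (2 s is an arbitrary choice).
    client := &http.Client{Timeout: 2 * time.Second}
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        for i := range r.pops {
            pop := &r.pops[i] // pointer: atomic.Bool must not be copied
            go func() {
                resp, err := client.Get(pop.Addr + "/health")
                if err != nil {
                    pop.healthy.Store(false)
                    return
                }
                resp.Body.Close() // release the connection for reuse
                pop.healthy.Store(resp.StatusCode == http.StatusOK)
            }()
        }
    }
}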

Storing the health state in an atomic.Bool means reads in the routing hot path need no lock. Health checks run concurrently with requests, so a PoP failure is reflected in routing within roughly one health-check interval (5 seconds here).

When the nearest PoP is unhealthy, routing falls back to the next-nearest healthy PoP automatically.
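
One way to picture the fallback is as a nearest-first ranking that is walked until a healthy PoP turns up. The sketch below is an illustration under assumptions, not the lab's implementation: labPoPs, failoverOrder, and pick are hypothetical names, the PoP struct is a trimmed copy without Addr, and atomic.Bool requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"sync/atomic"
)

type PoP struct {
	Name     string
	Lat, Lon float64
	healthy  atomic.Bool
}

// haversine returns the great-circle distance in km.
func haversine(lat1, lon1, lat2, lon2 float64) float64 {
	rad := func(d float64) float64 { return d * math.Pi / 180 }
	a := math.Pow(math.Sin(rad(lat2-lat1)/2), 2) +
		math.Cos(rad(lat1))*math.Cos(rad(lat2))*math.Pow(math.Sin(rad(lon2-lon1)/2), 2)
	return 6371 * 2 * math.Atan2(math.Sqrt(a), math.Sqrt(1-a))
}

// labPoPs builds the five PoPs from the table above, all initially healthy.
func labPoPs() []*PoP {
	pops := []*PoP{
		{Name: "NYC", Lat: 40.71, Lon: -74.00},
		{Name: "LHR", Lat: 51.51, Lon: -0.13},
		{Name: "NRT", Lat: 35.65, Lon: 139.76},
		{Name: "SYD", Lat: -33.87, Lon: 151.21},
		{Name: "GRU", Lat: -23.43, Lon: -46.47},
	}
	for _, p := range pops {
		p.healthy.Store(true)
	}
	return pops
}

// failoverOrder returns a copy of pops sorted nearest-first for the client.
func failoverOrder(lat, lon float64, pops []*PoP) []*PoP {
	ranked := append([]*PoP(nil), pops...)
	sort.Slice(ranked, func(i, j int) bool {
		return haversine(lat, lon, ranked[i].Lat, ranked[i].Lon) <
			haversine(lat, lon, ranked[j].Lat, ranked[j].Lon)
	})
	return ranked
}

// pick returns the first healthy PoP in failover order, or nil if none is.
func pick(lat, lon float64, pops []*PoP) *PoP {
	for _, p := range failoverOrder(lat, lon, pops) {
		if p.healthy.Load() {
			return p
		}
	}
	return nil
}

func main() {
	pops := labPoPs()
	fmt.Println(pick(51.51, -0.13, pops).Name) // LHR
	pops[1].healthy.Store(false)               // simulate an LHR outage
	fmt.Println(pick(51.51, -0.13, pops).Name) // NYC
}
```

Sorting on every request is unnecessary for five PoPs, where a single linear scan does the same job, but the ranked list makes the failover order explicit.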


Real-World PoP Selection

Geographic distance is a proxy for network latency, but not a perfect one. BGP path length, network peering relationships, and inter-AS latency can cause a geographically farther PoP to have lower latency.
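
To see why distance is only a proxy, it helps to convert it into a physical lower bound on round-trip time. The sketch below assumes light travels through optical fiber at about 200,000 km/s (roughly two thirds of its vacuum speed); minRTTMs and fiberKmPerMs are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// fiberKmPerMs approximates the speed of light in optical fiber:
// about 200,000 km/s, i.e. 200 km per millisecond.
const fiberKmPerMs = 200.0

// minRTTMs returns a lower bound on round-trip time in milliseconds for a
// path of the given great-circle length. Real routes are longer than the
// great circle and add queuing and processing delay, so measured RTT is
// higher than this bound.
func minRTTMs(distanceKm float64) float64 {
	return 2 * distanceKm / fiberKmPerMs
}

func main() {
	// London -> New York is roughly 5,570 km along the great circle.
	fmt.Printf("%.1f ms\n", minRTTMs(5570)) // 55.7 ms
}
```

The gap between this bound and a measured RTT is exactly where peering, path length, and congestion show up.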

Production CDNs use active latency measurements:

  • Cloudflare Argo: routes traffic based on real-time network telemetry measured across the actual internet paths between PoPs
  • Fastly: uses Anycast BGP (network handles routing) plus performance-based override for known poor paths
  • AWS CloudFront: uses latency-based routing in Route 53
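
A measurement-based selector can be sketched as choosing the PoP with the lowest median RTT. This is only an illustration of the idea, not how any of the CDNs above implement it: medianMs and lowestLatencyPoP are hypothetical names, and the sample values are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// medianMs returns the median of a non-empty slice of RTT samples in
// milliseconds. It copies the input so the caller's slice is not reordered.
func medianMs(samples []float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// lowestLatencyPoP picks the PoP with the smallest median RTT. PoPs without
// samples are skipped; it returns "" if nothing was measured.
func lowestLatencyPoP(rtts map[string][]float64) string {
	best, bestMs := "", 0.0
	for name, samples := range rtts {
		if len(samples) == 0 {
			continue
		}
		if m := medianMs(samples); best == "" || m < bestMs {
			best, bestMs = name, m
		}
	}
	return best
}

func main() {
	// Made-up samples: the nearer PoP sits behind a congested path.
	rtts := map[string][]float64{
		"LHR": {90, 95, 140},
		"NYC": {71, 70, 74},
	}
	fmt.Println(lowestLatencyPoP(rtts)) // NYC
}
```

The median is used rather than the mean so that a single slow probe does not flip the decision.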

The haversine approach in this lab is a good first approximation (roughly within ~20% of actual latency in most cases) and costs almost nothing at runtime: a few trigonometric operations per PoP, with no probe traffic.


Client Location Detection

In production, client location comes from:

  1. IP geolocation: MaxMind GeoLite2 database or IP-API, maps IP → country/city/coords
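  2. CDN headers: Cloudflare can add CF-IPCountry, CF-IPCity, CF-IPLatitude, and CF-IPLongitude to requests (CF-IPCountry through its IP geolocation setting, the others through the "Add visitor location headers" Managed Transform)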
  3. GPS/browser API: browser can provide precise location (user permission required)
  4. CDN PoP metadata: the PoP itself knows its geographic location; route users to the PoP they connected to

The lab accepts lat/lon as query parameters for testability.
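
Query parameters are untrusted input, so they are worth validating before they reach the distance calculation. The following is a minimal sketch; parseLatLon is a hypothetical helper, and the lab's own parsing in main.go may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// parseLatLon extracts lat and lon from a raw query string such as
// "lat=51.51&lon=-0.13" and rejects missing, malformed, or out-of-range values.
func parseLatLon(rawQuery string) (lat, lon float64, err error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return 0, 0, err
	}
	if lat, err = strconv.ParseFloat(q.Get("lat"), 64); err != nil {
		return 0, 0, fmt.Errorf("lat: %w", err)
	}
	if lon, err = strconv.ParseFloat(q.Get("lon"), 64); err != nil {
		return 0, 0, fmt.Errorf("lon: %w", err)
	}
	// Negated range checks so that NaN, which ParseFloat accepts, fails too.
	if !(lat >= -90 && lat <= 90) || !(lon >= -180 && lon <= 180) {
		return 0, 0, fmt.Errorf("coordinates out of range: %v, %v", lat, lon)
	}
	return lat, lon, nil
}

func main() {
	lat, lon, err := parseLatLon("lat=51.51&lon=-0.13")
	fmt.Println(lat, lon, err) // 51.51 -0.13 <nil>

	_, _, err = parseLatLon("lat=95&lon=0")
	fmt.Println(err) // out-of-range latitude is rejected
}
```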


PoP Infrastructure Design

When selecting where to locate PoPs, the key criteria are:

  1. Internet Exchange Points (IXPs): co-locate at major IXPs (DE-CIX Frankfurt, AMS-IX Amsterdam, LINX London) for direct peering with hundreds of ISPs, reducing latency and cost
  2. Traffic density: PoPs near large populations (NYC, London, Tokyo, São Paulo, Mumbai) serve the most users
  3. Data center tier: Tier III or higher (Uptime Institute Tier III targets 99.982% availability, with redundant and concurrently maintainable power/cooling)
  4. Network diversity: multiple transit providers per PoP prevents single-provider outages from taking down the PoP

Try It

make lab-15

# Route a request from NYC (40.71, -74.00) — should go to NYC PoP
curl "http://localhost:8080/?lat=40.71&lon=-74.00" -v

# Route from London (51.51, -0.13) — should go to LHR PoP
curl "http://localhost:8080/?lat=51.51&lon=-0.13" -v

# Route from Tokyo — should go to NRT PoP
curl "http://localhost:8080/?lat=35.65&lon=139.76" -v

# Simulate LHR failure — London user should reroute to nearest healthy PoP
curl -X DELETE "http://localhost:8080/pops/LHR"
curl "http://localhost:8080/?lat=51.51&lon=-0.13" -v
# Should now route to NYC (the next-closest PoP to London, roughly 5,570 km away)