Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 06 · The Thundering Herd

Run it: make lab-06
Source: labs/lab-06-thundering-herd/main.go


The Problem

You have a cache. A popular URL’s TTL expires. In the next millisecond, 800 concurrent requests arrive for that URL. Your cache code:

func handler(w http.ResponseWriter, r *http.Request) {
    if entry, ok := cache.Get(key); ok {
        write(w, entry)   // HIT
        return
    }
    // MISS — 800 goroutines all reach here simultaneously
    resp := fetch(key)    // 800 fetches to origin — boom
    cache.Set(key, resp)
}

This is the thundering herd (also: cache stampede, dog pile effect). A single cache expiry event becomes an instant DDoS against your origin.

At YouTube scale (2023), a single CDN node may have 10,000 concurrent viewers for a popular video. When that cache entry expires, 10,000 simultaneous origin requests arrive in sub-millisecond windows. Most origins cannot handle this.


Solution: singleflight.Group

Go’s golang.org/x/sync/singleflight package provides the exact primitive needed: request collapsing (or request deduplication).

import "golang.org/x/sync/singleflight"

var group singleflight.Group

func fetch(key string) ([]byte, error) {
    result, err, shared := group.Do(key, func() (interface{}, error) {
        // This function runs ONCE, no matter how many concurrent
        // callers invoke group.Do with the same key.
        return fetchFromOrigin(key)
    })
    // 'shared' is true if this result was returned to multiple callers
    return result.([]byte), err
}

The semantics:

  • First caller with a given key triggers the actual fetch
  • All subsequent callers with the same key block and wait for the first caller’s result
  • When the fetch completes, all waiting callers receive the same result
  • No extra origin requests are made
800 concurrent misses for /video/popular
    │
    ├── Goroutine 1: starts group.Do("video/popular") → actual fetch
    ├── Goroutine 2: group.Do("video/popular") → blocks, waiting
    ├── Goroutine 3: group.Do("video/popular") → blocks, waiting
    ├── ...
    └── Goroutine 800: group.Do("video/popular") → blocks, waiting

[~80ms later: origin responds]

    └── All 800 goroutines receive the same result simultaneously
        → 1 origin request, not 800

The shared Return Value

group.Do returns three values: (v interface{}, err error, shared bool).

shared is true if the result was shared with other callers. This is useful for metrics — you can measure how many requests were collapsed:

result, err, shared := group.Do(key, fetch)
if shared {
    collapsedRequestsTotal.Inc()
}

Monitoring collapsed requests reveals thundering herd intensity. If you see thousands of requests being collapsed per second, your TTLs may be too low or your TTLs are expiring synchronously (lab 07 addresses this with staggered expiry).


Cascade Failure Without Singleflight

The lab demonstrates what happens without singleflight:

800 requests arrive at t=0
→ 800 goroutines all observe cache miss
→ 800 goroutines all call fetchFromOrigin()
→ Origin receives 800 simultaneous connections
→ Origin CPU spikes to 100%
→ Origin response time increases from 80ms to 5000ms
→ Each 800-request wave takes 5s instead of 80ms
→ Next cache entry expires while previous wave is still in-flight
→ Another 800 requests stampede
→ Origin never recovers (cascade failure)

This is a well-documented pattern in distributed systems and the root cause of many high-profile outages. Facebook described this exact failure mode in their 2010 memcache paper. Reddit’s 2012 outage was triggered by a thundering herd on a database backing a cached list.


Beyond singleflight: Production Patterns

1. Probabilistic Early Refresh (XFetch)

Instead of waiting for expiry, refresh slightly before expiry using probabilistic jitter. Each request has a small probability of triggering a background refresh:

// XFetch algorithm (Vattani et al., 2015)
func shouldRefresh(expiry time.Time, lastFetchDuration time.Duration, beta float64) bool {
    remaining := time.Until(expiry).Seconds()
    delta := lastFetchDuration.Seconds()
    return -delta * beta * math.Log(rand.Float64()) > remaining
}

When remaining approaches 0, math.Log(rand.Float64()) (which is negative) is multiplied by delta * beta (positive), and the result becomes likely to exceed remaining. Higher beta = more aggressive prefetching. This guarantees the cache is almost always warm.

2. Mutex per key (fine-grained locking)

type keyedMutex struct {
    mu    sync.Mutex
    locks map[string]*sync.Mutex
}

func (km *keyedMutex) Lock(key string) {
    km.mu.Lock()
    l, ok := km.locks[key]
    if !ok {
        l = &sync.Mutex{}
        km.locks[key] = l
    }
    km.mu.Unlock()
    l.Lock()
}

Only one goroutine per key can fetch from origin. Others wait. Simpler to reason about than singleflight but no result sharing (each waiter re-fetches independently when the mutex is released).

3. Background refresh with locked TTL

Keep serving the stale entry while refreshing in background, preventing any thundering herd entirely. See Lab 07 for the full stale-while-revalidate implementation.


singleflight vs. Caching

Note that singleflight is not a cache. It deduplicates in-flight requests, but once the first request completes, new requests will start a new group.Do call (the key is removed from the group after completion). singleflight + cache is the correct combination:

Request arrives
    │
    ▼
Cache hit? → serve immediately
    │ miss
    ▼
group.Do(key, fetch) → one origin request, all callers get result
    │
    ▼
cache.Set(key, result, ttl)
    │
    ▼
All waiters serve result

What to Measure

# Collapsed requests (singleflight saves)
rate(singleflight_shared_total[1m])

# Origin request rate — should be orders of magnitude lower than edge rate
rate(origin_requests_total[1m])

# Collapse ratio: how many requests per origin fetch
rate(edge_requests_total[1m]) / rate(origin_requests_total[1m])
# Target: 50-500x depending on content popularity

Try It

make lab-06

# The lab fires 800 concurrent requests to a cold cache entry
# Watch the output — it will show:
#   "800 concurrent requests → 1 origin request"
# vs the naive version which fires all 800