
Cache Stampede: Beat the Thundering Herd in Redis
Summary
Stop cache stampedes with locking, single-flight, and probabilistic early expiry.
It is 9:03 a.m. A marketing email just went out to two million users. They all click the same link, and your homepage tries to load the same hot cache key at the same instant. The key happens to expire at 9:03 a.m. In the next 50 milliseconds, forty thousand requests miss the cache, decide the value is stale, and slam your database with the identical expensive query. The database’s connection pool saturates, query latency climbs from 8 ms to 8 seconds, healthy requests start timing out, and your retry logic doubles the load. By 9:04 a.m. the site is down — not because traffic was unmanageable, but because a single expired key turned a cache into a stampede.
This is the cache stampede (also called the thundering herd or dog-piling), and it is one of the most common ways a perfectly healthy system falls over under load it should easily absorb. The dangerous part is that it is invisible in normal operation: your cache hit rate is 99.9%, dashboards are green, and everything looks fine… right up until one popular key expires under concurrency. This guide walks through why it happens, three production-grade fixes — distributed locking, single-flight request coalescing, and probabilistic early expiration — and exactly when to reach for each. By the end you will be able to design a cache layer that degrades gracefully instead of collapsing.
Why a stampede happens
The naive cache-aside (lazy loading) pattern is what almost every tutorial teaches, and it is the root of the problem. The logic reads: check the cache; on a miss, compute the value, write it back, and return it. Here it is in Python against Redis:
import redis, json, time
r = redis.Redis()
def get_product(product_id):
key = f"product:{product_id}"
cached = r.get(key)
if cached is not None:
return json.loads(cached) # cache hit
# --- cache miss: everyone who arrives here hits the DB ---
value = expensive_db_query(product_id) # ~800 ms aggregate query
r.set(key, json.dumps(value), ex=300) # TTL 5 minutes
return value
Read that miss branch again. There is nothing serializing it. If the key is cold — either it just expired or was evicted — and N requests arrive before the first one finishes the 800 ms query and writes back, then all N requests run the query. The cache provides zero protection at exactly the moment you need it most: under a burst of concurrent traffic on a hot key.
The math is brutal. Suppose a key is hit 5,000 times per second and the regeneration query takes 800 ms. During the 0.8 s regeneration window, roughly 5,000 × 0.8 = 4,000 requests pile onto the database for a value that one request could have produced. Your cache hit rate barely moves — it is still >99% over the day — but for those 800 ms the database sees a 4,000× amplification. That spike is what kills you.
Three distinct triggers cause the cold key in the first place, and they call for slightly different defenses:
- Expiration stampede — a high-traffic key reaches its TTL and expires while requests are in flight. The classic case.
- Eviction stampede — Redis hits its
maxmemorylimit and evicts a hot key under an LRU/LFU policy before its TTL, recreating the same race. - Cold-start stampede — a cache node restarts or a new region comes online with an empty cache, and the first wave of traffic finds everything missing at once.
Fix #1 — Distributed locking (mutex / request coalescing)
The most direct fix: when a miss happens, let exactly one request acquire a short-lived lock and regenerate the value. Everyone else either waits and re-reads the cache, or serves slightly stale data. Redis makes this cheap with SET key value NX PX ttl — an atomic “set if not exists with expiry” that doubles as a distributed lock.
import redis, json, time, uuid
r = redis.Redis()
def get_product(product_id, lock_wait=0.05, lock_ttl_ms=2000):
key = f"product:{product_id}"
lock_key = f"lock:{key}"
cached = r.get(key)
if cached is not None:
return json.loads(cached)
token = str(uuid.uuid4()) # unique owner token, see gotcha below
# SET NX PX = acquire lock only if nobody else holds it
got_lock = r.set(lock_key, token, nx=True, px=lock_ttl_ms)
if got_lock:
try:
value = expensive_db_query(product_id)
r.set(key, json.dumps(value), ex=300)
return value
finally:
# release ONLY if we still own the lock (atomic check-and-del)
release = (
"if redis.call('get', KEYS[1]) == ARGV[1] "
"then return redis.call('del', KEYS[1]) else return 0 end")
r.eval(release, 1, lock_key, token)
else:
# someone else is rebuilding -> wait briefly, then re-read
time.sleep(lock_wait)
cached = r.get(key)
if cached is not None:
return json.loads(cached)
# still cold: fall back to DB rather than block forever
return expensive_db_query(product_id)
Now, instead of 4,000 requests hitting the database, one acquires the lock and runs the query while the other 3,999 sleep 50 ms and re-read the freshly populated cache. The database amplification drops from 4,000× to roughly 1×.
The release step deserves attention. You must never blindly DEL the lock key in your finally block. If your regeneration overran the lock TTL, the lock may already have expired and been re-acquired by another worker — deleting it would release their lock and reopen the stampede. The Lua script above does an atomic compare-and-delete using a per-owner token, which is the only safe way to release a Redis lock.
The waiter dilemma
What should the 3,999 waiters do? You have three options, each a different trade-off between latency, freshness, and database protection:
- Wait and retry (shown above): sleep a few ms and re-read. Low DB load, but adds tail latency to every waiter and risks a retry storm if the rebuild is slow.
- Serve stale: keep the old value in a second key with a longer TTL and return it immediately while one worker refreshes in the background. Best latency, but readers see stale data for a moment.
- Fail fast: return a default or a 503 to waiters. Maximally protects the DB; appropriate for non-critical widgets, never for the core page.
Serving stale is almost always the right answer for read-heavy systems, which leads directly to the next two techniques.
Fix #2 — Single-flight (in-process coalescing)
A distributed lock protects against a stampede across your fleet, but it still pays a Redis round-trip per request and adds latency. Often the herd is concentrated within a single process: one busy API node may itself receive hundreds of concurrent requests for the same key. Single-flight collapses all of those into one in-flight call, sharing the single result across every caller — no Redis lock needed for the in-process case.
Go ships this in the standard extended library as golang.org/x/sync/singleflight. It is the canonical implementation and worth studying even if you work in another language:
package cache
import (
"encoding/json"
"golang.org/x/sync/singleflight"
)
var group singleflight.Group
func GetProduct(id string) (Product, error) {
// Do() guarantees the fn runs once for a given key even if called
// by 1,000 goroutines concurrently; all callers get the same result.
v, err, shared := group.Do("product:"+id, func() (interface{}, error) {
if cached, ok := redisGet("product:" + id); ok {
return cached, nil
}
p, err := expensiveDBQuery(id) // runs at most once per key
if err != nil {
return nil, err
}
redisSet("product:"+id, p, 300)
return p, nil
})
_ = shared // true if this caller piggybacked on an in-flight call
if err != nil {
return Product{}, err
}
return v.(Product), nil
}
The pattern in Python (with asyncio) is to keep a dict of in-flight futures keyed by cache key. The first coroutine creates the future and runs the query; later coroutines for the same key await the existing future instead of launching their own:
import asyncio, json
_inflight: dict[str, asyncio.Future] = {}
async def get_product(product_id):
key = f"product:{product_id}"
cached = await redis_get(key)
if cached is not None:
return json.loads(cached)
if key in _inflight:
return await _inflight[key] # piggyback on in-flight call
loop = asyncio.get_event_loop()
fut = loop.create_future()
_inflight[key] = fut
try:
value = await expensive_db_query(product_id)
await redis_set(key, json.dumps(value), ex=300)
fut.set_result(value)
return value
except Exception as e:
fut.set_exception(e)
raise
finally:
_inflight.pop(key, None) # always clean up the slot
Single-flight and distributed locking are complementary, not competing. The strongest production setups layer them: single-flight collapses the herd inside each node (cheap, zero network), and a Redis lock or lease coordinates the one survivor per node across the fleet. With a 50-node fleet, single-flight alone cuts a 50,000-request herd to 50 DB queries; adding the cross-node lock cuts it to 1.
Fix #3 — Probabilistic early expiration (XFetch)
Locking and single-flight both react to a stampede that is already happening. The most elegant fix prevents the cold key from ever existing: refresh hot keys before they expire, and stagger those refreshes so they never align. This is probabilistic early expiration, formalized as the XFetch algorithm in the 2015 paper “Optimal Probabilistic Cache Stampede Prevention” and used in production at Redis.
The idea: every time you read a key, you also store how long the regeneration took (delta) and the absolute time it was written. On each read you roll the dice — the closer the key is to expiry, and the more expensive it is to rebuild, the higher the chance that this reader volunteers to refresh early, while everyone else keeps using the still-valid cached value. The result: exactly one (statistically) early refresh, spread out over time, with no lock and no cold window.
import time, math, random, json, redis
r = redis.Redis()
def xfetch_get(product_id, ttl=300, beta=1.0):
key = f"product:{product_id}"
pipe = r.pipeline()
pipe.hmget(key, "value", "delta", "expiry")
value, delta, expiry = pipe.execute()[0]
now = time.time()
if value is not None:
delta = float(delta)
expiry = float(expiry)
# XFetch core: refresh early with probability that rises near expiry
# gap = delta * beta * -ln(random); refresh if now+gap >= expiry
if now - delta * beta * math.log(random.random()) < expiry:
return json.loads(value) # still fresh enough -> serve cached
# either a real miss, or this reader "won" the early-refresh lottery
start = time.time()
fresh = expensive_db_query(product_id)
delta = time.time() - start # how long regeneration took
expiry = time.time() + ttl
r.hset(key, mapping={
"value": json.dumps(fresh),
"delta": delta,
"expiry": expiry,
})
r.expire(key, ttl + 60) # hard TTL as a safety net
return fresh
The beta parameter tunes aggressiveness. With beta = 1.0 you get the paper’s optimal behavior. Raise it above 1 to refresh earlier and more eagerly (fewer stampedes, more background work); lower it below 1 to refresh later (closer to the TTL, more risk). Because the trigger is multiplied by delta, expensive keys are refreshed earlier than cheap ones automatically — the algorithm spends its prevention budget where regeneration hurts most.
XFetch shines for keys that are read constantly — a homepage, a pricing table, a feature-flag blob. For those, it eliminates the cold window entirely without any waiter latency. Its weakness is rarely read keys: if a key is read once a minute, the probabilistic trigger may never fire before the hard TTL, and you fall back to a plain miss. Pair XFetch with single-flight to cover that gap.
Choosing the right tool
These three techniques are not mutually exclusive; mature systems combine them. But if you need a default decision rule, match the technique to your traffic shape and consistency needs:
| Technique | Best for | DB load under herd | Reader latency | Staleness |
|---|---|---|---|---|
| Distributed lock | Cross-fleet hot keys, strong consistency | ~1x | Adds wait or retry | None (or controlled) |
| Single-flight | Many concurrent reads per node | ~1x per node | Near zero (piggyback) | None |
| Probabilistic early expiry | Constantly-read keys, read-heavy | ~1x, spread over time | Zero (always served) | Bounded, tunable |
| Serve-stale + bg refresh | Latency-critical reads | ~1x background | Zero | Brief, bounded |
A common, robust production stack looks like this: serve-stale as the baseline so readers are never blocked, single-flight inside each node to collapse the local herd, a Redis lease so only one node per cluster rebuilds, and XFetch on your hottest handful of keys to keep them perpetually warm. That combination survives expiration, eviction, and cold-start stampedes alike.
Production gotchas
The techniques are simple; the failure modes are subtle. These are the ones that bite teams in production:
- Releasing a lock you no longer own. Always release with an atomic check-and-delete keyed to a unique owner token. A blind
DELin afinallyblock will, sooner or later, delete another worker’s freshly acquired lock and reopen the stampede. - Lock TTL shorter than the query. If
lock_ttl_msis less than your worst-case regeneration time, the lock expires mid-rebuild, a second worker grabs it, and you are back to a herd. Set the lock TTL above your P99.9 query latency, and consider a lock-extension (watchdog) heartbeat for long rebuilds. - Retry storms among waiters. If thousands of waiters all sleep a fixed 50 ms and re-read, they wake up in lockstep and can create a secondary herd. Use jittered backoff (
sleep(base * (1 + random()))) so waiters spread out. - Negative caching forgotten. If a key legitimately has no value (a deleted product), a naive cache never stores anything, so every request is a permanent miss — a stampede that never ends. Cache the “not found” result with a short TTL too.
- Synchronized TTLs. Setting the exact same TTL on a batch of keys written together (e.g., a warm-up job) makes them all expire in the same second. Add random jitter to every TTL:
ex = base_ttl + random.randint(0, 60). - Trusting Redlock for correctness. Multi-node Redlock is fine as a best-effort stampede guard, but it is not a correctness-grade lock under network partitions and clock skew. For cache regeneration that is acceptable — a rare double-rebuild only wastes one query — but never use it where a double-execution would corrupt data.
- Local in-process caches multiplying the herd. If each node also keeps an in-memory LRU with its own TTL, a deploy that restarts all nodes clears every local cache at once — an instant cold-start herd against Redis. Stagger restarts and warm caches before taking traffic.
Quick reference
| Symptom / need | Reach for |
|---|---|
| One hot key expires, fleet-wide spike | Redis SET NX PX lock + serve-stale |
| Hundreds of concurrent reads on one node | Single-flight (x/sync/singleflight) |
| A few keys read thousands of times/sec | Probabilistic early expiration (XFetch) |
| Latency must never spike for readers | Serve-stale + background refresh |
| Keys written in a batch expire together | Add random jitter to every TTL |
| Deleted/empty entities re-queried | Negative caching with short TTL |
| Long regeneration overruns lock | Lock TTL > P99.9 query + watchdog extend |
Next steps
Start by instrumenting before you fix anything. Add a counter that increments on every cache miss and a histogram of regeneration time, then graph misses-per-second against your DB query rate. A stampede is obvious once you can see the amplification: a tiny blip in miss rate producing a huge blip in DB load. Most teams discover they have exactly two or three keys responsible for every incident.
Then apply the cheapest fix that covers those keys. For most read-heavy services that is serve-stale plus single-flight — a day of work that removes the entire class of incident. Layer in a Redis lease for cross-fleet coordination and XFetch on your top keys only if your traffic justifies it. Resist the urge to wrap every cache read in a distributed lock; the network round-trip and added latency are rarely worth it for cold keys that are not actually contended.
Finally, load-test the failure path explicitly. Write a test that expires a hot key and fires a thousand concurrent requests at it, then assert that your database sees one query, not a thousand. A stampede defense you have not load-tested is a stampede defense you do not have.
Comments
Be the first to comment