
Hedged Requests: Tame P99 Tail Latency at Scale
Summary
Send a backup when the first call is slow. Cut P99 tail latency without overloading services.
Your P50 looks great. Your P99 is on fire. A handful of slow calls keep dragging the experience down, and you have already squeezed the obvious wins out of caching, indexes, and pool sizes. This is where hedged requests earn their keep. They are a small, focused pattern that targets the long tail of latency directly instead of trying to make every request faster.
The idea was popularized by Google's "The Tail at Scale" paper (Dean and Barroso, 2013) and is now baked into systems like BigTable, Spanner, gRPC's hedging policy, and Envoy. As of May 2026, it is also one of the most useful tricks for cutting P99 latency in LLM inference fan-outs, multi-region reads, and any RPC graph where a single slow leaf node poisons the whole response. This guide walks through what hedging is, when to use it, how to implement it safely in Go, and the gotchas that bite teams in production.
What you will learn
- Why tail latency dominates user experience and how hedging targets it
- How to pick the right hedge delay from real P95/P99 data
- A production-ready Go hedger with context cancellation
- How to cap inflight requests so hedging does not become a self-DDoS
- Where hedging breaks: non-idempotent calls, sticky caches, hot keys
Why tail latency hurts more than averages
A single user-facing request often fans out to dozens of internal services. If each leaf call has a 1% chance of being slow, and your page hits 100 of them, the probability that at least one is slow is about 1 - 0.99^100 = 63%. That single slow call is what the user actually waits for. This is why an average latency of 50 ms can still feel terrible at the page level. The math punishes you for breadth.
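The fan-out arithmetic above is easy to replay for your own numbers. The following is a minimal sketch, assuming leaf calls are slow independently of one another (real systems often violate this); the function name `probAnySlow` is made up for illustration.

```go
package main

import "math"

// probAnySlow returns the probability that at least one of n leaf calls is
// slow, assuming each call is slow with probability p and that the calls
// are independent of each other.
func probAnySlow(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}
```

With `p = 0.01` and `n = 100` this evaluates to roughly 0.634, matching the 63% figure above. Dropping the fan-out to 10 leaves brings it down to roughly 0.096, which is why narrow request graphs feel the tail far less.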
There are three honest ways to fix tail latency: make every call faster (usually expensive), make the system tolerant of a missing reply (requires speculative or partial responses), or finish requests that are already taking too long by racing a backup. Hedged requests are the third option. You do not retry on failure; you race on slowness.
How a hedged request works
The pattern is roughly: send a request to replica A. If it has not returned within a small delay (say, your P95 latency), send the same request to replica B. Use whichever response comes back first and cancel the loser. The delay matters. Set it too low and you double your load for no benefit. Set it too high and the hedge fires too late to help.
A useful mental model: the first request runs the normal path; the hedge is a controlled, late-firing speculative duplicate that only costs you anything when the first request was about to be slow anyway. In well-tuned systems, the hedge firing rate is around 5%, roughly the fraction of calls that run past your P95.
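To make the race concrete, here is a small latency model of the pattern described above. It is a sketch, not a client: `hedgedLatency` is a hypothetical helper that takes the latency each attempt would have had on its own and returns what the caller observes, ignoring cancellation cost and network jitter.

```go
package main

import "time"

// hedgedLatency models a two-attempt hedge. first and backup are the
// latencies each attempt would take in isolation; delay is the hedge delay.
// It returns the latency the caller observes and whether the hedge fired.
func hedgedLatency(first, backup, delay time.Duration) (latency time.Duration, fired bool) {
	if first <= delay {
		// The first attempt finished before the hedge was due.
		return first, false
	}
	// The hedge starts at `delay` and finishes `backup` later.
	if viaBackup := delay + backup; viaBackup < first {
		return viaBackup, true
	}
	// The hedge fired but the original attempt still won the race.
	return first, true
}
```

For a first attempt of 350 ms, a backup of 60 ms, and a delay of 180 ms, the model returns 240 ms with the hedge fired. If the backup is slower than the time the first attempt has left, the hedge costs load and buys nothing, which is the case the delay tuning tries to keep rare.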
Prerequisites
- Familiarity with one mainstream language: Go, Java, or Python with asyncio
- A service that supports at least two equivalent backends or replicas (load-balanced)
- Operation must be safely repeatable: idempotent reads, or writes guarded by idempotency keys
- Metrics with at least P95 and P99 latency histograms (Prometheus, OpenTelemetry, etc.)
Step 1 — Measure your P95 baseline
Before you write a single line of hedging code, you need the P95 of the call you plan to hedge. The hedge delay will sit just above this number. If your only metric is an average, you do not have enough information yet. Pull a histogram from Prometheus, an APM, or a quick log-derived percentile.
# Quick P95 from a log file of latencies (ms, one per line).
import sys

latencies = sorted(float(x) for x in sys.stdin if x.strip())

def pct(p):
    # Nearest-rank percentile over the sorted samples.
    k = int(round((p / 100) * (len(latencies) - 1)))
    return latencies[k]

print(f"P50={pct(50):.1f} ms P95={pct(95):.1f} ms P99={pct(99):.1f} ms")
# Example output:
# P50=42.0 ms P95=180.0 ms P99=520.0 ms
If your P95 is 180 ms and your P99 is 520 ms, hedging at a delay of around 180 ms will fire on roughly 5% of calls and has a real chance of converting many of those slow ones into fast ones. If your P95 is already 30 ms, hedging is probably not the right tool: there is no fat tail to chase.
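Before committing to a delay, it can help to replay recorded latencies against a candidate value and see how often a hedge would have fired. The sketch below assumes latencies are available as millisecond floats, as in the log-derived example above; `fireRate` is a hypothetical helper name.

```go
package main

// fireRate returns the fraction of recorded latencies (in ms) that exceed
// delayMs, i.e. the share of calls that would have triggered a hedge.
// It returns 0 when there are no samples.
func fireRate(samplesMs []float64, delayMs float64) float64 {
	if len(samplesMs) == 0 {
		return 0
	}
	slow := 0
	for _, s := range samplesMs {
		if s > delayMs {
			slow++
		}
	}
	return float64(slow) / float64(len(samplesMs))
}
```

If the result for your chosen delay is far from the 5% you expected, the percentile estimate and the traffic you replayed probably come from different time windows.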
Step 2 — A minimal hedger in Go
The core implementation is short. It starts one call immediately, starts a second after the hedge delay, and returns whichever succeeds first. As soon as a winner is chosen, the shared context is cancelled, which stops the loser. Anything more elaborate (jitter, dynamic delay, hedging budgets) builds on this skeleton.
package hedge

import (
	"context"
	"time"
)

type Result struct {
	Body []byte
	Err  error
}

// DoFunc is the per-attempt call. It MUST honor ctx cancellation.
type DoFunc func(ctx context.Context) ([]byte, error)

// Hedge fires up to 2 attempts: one immediately, one after `delay`
// (or sooner, if the first attempt fails fast).
// The first non-error result wins; the loser is cancelled.
func Hedge(parent context.Context, delay time.Duration, do DoFunc) ([]byte, error) {
	ctx, cancel := context.WithCancel(parent)
	defer cancel() // stops whichever attempt is still running on return

	// Buffered so a losing attempt can always send its result and exit.
	results := make(chan Result, 2)
	attempt := func() {
		b, err := do(ctx)
		results <- Result{b, err}
	}
	go attempt()
	pending := 1 // attempts whose result has not been read yet
	var lastErr error

	timer := time.NewTimer(delay)
	defer timer.Stop()
	select {
	case r := <-results:
		if r.Err == nil {
			return r.Body, nil
		}
		// First attempt failed fast: fire the hedge now.
		lastErr = r.Err
		pending--
	case <-timer.C:
		// Delay elapsed; fire the hedge.
	case <-parent.Done():
		return nil, parent.Err()
	}

	go attempt()
	pending++

	// Take the first successful result, or the last error.
	// Reading only `pending` results avoids blocking forever when the
	// first attempt's result was already consumed above.
	for ; pending > 0; pending-- {
		r := <-results
		if r.Err == nil {
			return r.Body, nil
		}
		lastErr = r.Err
	}
	return nil, lastErr
}
Notice two things. First, every attempt shares the same cancellable context — when one wins, the other gets a cancel signal immediately, which is how you avoid paying full price for both calls. Second, the function is generic: callers pass a `DoFunc` that already targets a healthy backend (use your existing load balancer or pick a different replica explicitly). Putting backend selection inside Hedge is a common mistake; it tangles the policy with the mechanism.
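One way to keep backend selection outside the hedger while still sending the backup somewhere else is a small per-call picker that the `DoFunc` consults at the start of each attempt. This is a sketch under assumptions: the replica list is non-empty, `start` is non-negative, and `newAttemptPicker` is a hypothetical helper, not part of any load-balancing library.

```go
package main

import "sync/atomic"

// newAttemptPicker returns a function that yields a different replica on
// each invocation, walking the list from index start and wrapping around.
// Create one picker per logical call so that the first attempt and its
// hedge never share a replica (as long as attempts <= len(replicas)).
// Assumes replicas is non-empty and start >= 0.
func newAttemptPicker(replicas []string, start int) func() string {
	var attempts atomic.Int64 // safe to call from both attempt goroutines
	return func() string {
		i := int(attempts.Add(1)-1) + start
		return replicas[i%len(replicas)]
	}
}
```

A caller would build the picker once per hedged call, with `start` chosen at random or round-robin across calls to spread load, and have the `DoFunc` call it to choose the target host for that attempt.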
Step 3 — Wire it to a real HTTP call
// Assumes imports: context, fmt, io, net/http, time.
client := &http.Client{Timeout: 2 * time.Second}
doCall := func(ctx context.Context) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://api.example.com/v1/items/42", nil)
	if err != nil {
		return nil, err
	}
	// Idempotency key for any write would go here.
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 500 {
		return nil, fmt.Errorf("upstream %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
body, err := Hedge(ctx, 180*time.Millisecond, doCall)
// Example: when first call is slow (350ms) and hedge call is fast (60ms),
// total latency is ~240ms instead of 350ms.
Step 4 — Pick the hedge delay from data, not vibes
Hardcoded delays drift. A safer pattern is to keep a rolling P95 estimator in your client and use it as the delay. A simple t-digest, HDR histogram, or even a 200-sample reservoir is enough. Recompute every minute. The delay should track real production behavior, not assumptions from a quiet test environment.
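// Assumes imports: sort, sync, time.
type RollingP95 struct {
	mu      sync.Mutex
	samples []time.Duration // ring buffer of the most recent observations
	cap     int             // maximum number of samples to keep
	cursor  int             // next slot to overwrite once the buffer is full
}

// Observe records one latency sample, evicting the oldest when full.
func (r *RollingP95) Observe(d time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.cap <= 0 {
		return // zero-value struct: nothing to store, avoid indexing an empty slice
	}
	if len(r.samples) < r.cap {
		r.samples = append(r.samples, d)
		return
	}
	r.samples[r.cursor] = d
	r.cursor = (r.cursor + 1) % r.cap
}

// P95 returns the 95th percentile of the current window.
func (r *RollingP95) P95() time.Duration {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.samples) < 50 {
		return 200 * time.Millisecond // safe default until enough data arrives
	}
	sorted := append([]time.Duration(nil), r.samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[int(0.95*float64(len(sorted)-1))]
}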
Hedge firing rate should land around 5%. If you see 20%, your delay is too short or upstream really did degrade — page someone. If you see 0.1%, your delay is too long and the hedge is decorative.
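Tracking the firing rate takes only a few counters. The sketch below is one possible shape, not a metrics library: `hedgeStats` and its methods are hypothetical names, and the 20% and 0.1% cutoffs are simply the two examples from the paragraph above, used here as assumed alert thresholds.

```go
package main

import "sync/atomic"

// hedgeStats counts calls, hedges fired, and hedges that won the race.
type hedgeStats struct {
	calls atomic.Int64
	fired atomic.Int64
	won   atomic.Int64
}

// record notes one finished call. won is only counted when a hedge fired.
func (s *hedgeStats) record(fired, won bool) {
	s.calls.Add(1)
	if fired {
		s.fired.Add(1)
		if won {
			s.won.Add(1)
		}
	}
}

// fireRate is hedges fired divided by total calls (0 when there is no data).
func (s *hedgeStats) fireRate() float64 {
	calls := s.calls.Load()
	if calls == 0 {
		return 0
	}
	return float64(s.fired.Load()) / float64(calls)
}

// winRate is hedges won divided by hedges fired (0 when none fired).
func (s *hedgeStats) winRate() float64 {
	fired := s.fired.Load()
	if fired == 0 {
		return 0
	}
	return float64(s.won.Load()) / float64(fired)
}

// verdict classifies the fire rate using the assumed 20% / 0.1% cutoffs.
func (s *hedgeStats) verdict() string {
	if s.calls.Load() == 0 {
		return "no data"
	}
	switch r := s.fireRate(); {
	case r >= 0.20:
		return "too hot: delay too short or upstream degraded"
	case r <= 0.001:
		return "too cold: delay too long"
	default:
		return "ok"
	}
}
```

In a real service these counters would be exported through whatever metrics system you already run; the win rate is worth watching alongside the fire rate, because a hedge that fires often but rarely wins is pure extra load.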
Step 5 — Cap inflight calls so hedging cannot snowball
The dangerous failure mode of hedging is congestion collapse. If the downstream service is already overloaded, hedging will fire on most calls, double the offered load, and push the service further into the red. The cure is a hedging budget: an upper bound on the fraction of calls allowed to issue a hedge.
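// Assumes import: sync.
type Budget struct {
	mu       sync.Mutex
	window   []bool // ring of recent calls; true = hedged
	cap      int
	cursor   int
	maxRatio float64 // e.g. 0.10 means 10% can hedge
}

// Allow reports whether one more hedge fits inside the budget.
// It only reads the window; Record is the single place that writes to it.
func (b *Budget) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.window) < b.cap {
		return true // still warming up: not enough history to judge
	}
	used := 0
	for _, v := range b.window {
		if v {
			used++
		}
	}
	return float64(used)/float64(b.cap) < b.maxRatio
}

// Record notes one finished call and whether it issued a hedge.
func (b *Budget) Record(hedged bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.cap <= 0 {
		return // zero-value struct: nothing to track
	}
	if len(b.window) < b.cap {
		b.window = append(b.window, hedged)
		return
	}
	b.window[b.cursor] = hedged
	b.cursor = (b.cursor + 1) % b.cap
}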
gRPC's hedging policy is built on the same ideas: the service config exposes a `hedgingPolicy` with `maxAttempts`, `hedgingDelay`, and `nonFatalStatusCodes`, and bounds the extra load through retry throttling. If you are already on gRPC and your client library implements hedging (support varies by language), configure it in your service config JSON and skip writing your own. Only build a custom hedger when you have non-gRPC traffic or need behavior the spec does not cover.
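For comparison with the ratio-based window, here is a sketch of a token-bucket throttle modeled loosely on the throttling scheme described in the gRPC retry design: failures drain a token, successes refill a fraction of one, and hedges are paused while the bucket is at or below half full. This is an illustration of the idea, not gRPC's code; the type and method names are hypothetical.

```go
package main

import (
	"math"
	"sync"
)

// throttle is a token bucket that pauses hedging while the upstream
// looks unhealthy.
type throttle struct {
	mu         sync.Mutex
	tokens     float64
	maxTokens  float64
	tokenRatio float64 // tokens regained per successful attempt
}

// newThrottle starts with a full bucket.
func newThrottle(maxTokens, tokenRatio float64) *throttle {
	return &throttle{tokens: maxTokens, maxTokens: maxTokens, tokenRatio: tokenRatio}
}

// onResult updates the bucket after an attempt completes: a failure costs
// one token, a success earns tokenRatio, clamped to [0, maxTokens].
func (t *throttle) onResult(ok bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if ok {
		t.tokens = math.Min(t.maxTokens, t.tokens+t.tokenRatio)
	} else {
		t.tokens = math.Max(0, t.tokens-1)
	}
}

// allowHedge reports whether a hedge may be sent right now: only while
// the bucket is more than half full.
func (t *throttle) allowHedge() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.tokens > t.maxTokens/2
}
```

The design difference from the sliding window is what drives the decision: the window limits how many hedges you send, while the token bucket reacts to how many attempts fail. Under a brownout the bucket shuts hedging off quickly and reopens it only after a run of successes.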
Step 6 — Make calls safe to repeat
Hedging assumes the operation is safe to run twice. For reads, this is usually fine. For writes, you must add an idempotency key. Generate a UUID on the client, send it with both attempts, and have the server deduplicate. Without this, hedging a payment endpoint will eventually double-charge a real customer. This is one of the most common production incidents caused by naïve hedging.
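idemKey := uuid.NewString() // github.com/google/uuid
doCall := func(ctx context.Context) ([]byte, error) {
	// Fresh reader per attempt: a shared io.Reader would be drained by the first one.
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, payURL, bytes.NewReader(payload))
	if err != nil { return nil, err }
	req.Header.Set("Idempotency-Key", idemKey) // SAME key on both attempts
	resp, err := client.Do(req)
	if err != nil { return nil, err }
	defer resp.Body.Close()
	return io.ReadAll(resp.Body) // server returns cached result for duplicates
}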
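The server side of that contract is where the deduplication actually happens. Because a hedge and its original often arrive within milliseconds of each other, the server has to handle a duplicate that shows up while the first copy is still executing, not just after it finished. The following is a minimal in-memory sketch under stated assumptions: a single process, no expiry of old keys, and errors cached along with successes. The names `deduper` and `do` are hypothetical.

```go
package main

import "sync"

// dedupEntry holds the outcome for one idempotency key. The sync.Once
// guarantees the operation runs a single time; concurrent duplicates
// block inside Do until the first caller finishes.
type dedupEntry struct {
	once sync.Once
	body []byte
	err  error
}

// deduper maps idempotency keys to their (possibly in-flight) result.
type deduper struct {
	mu      sync.Mutex
	entries map[string]*dedupEntry
}

func newDeduper() *deduper {
	return &deduper{entries: make(map[string]*dedupEntry)}
}

// do runs fn at most once per key and returns the same result to every
// caller that presents that key.
func (d *deduper) do(key string, fn func() ([]byte, error)) ([]byte, error) {
	d.mu.Lock()
	e, ok := d.entries[key]
	if !ok {
		e = &dedupEntry{}
		d.entries[key] = e
	}
	d.mu.Unlock()

	e.once.Do(func() { e.body, e.err = fn() })
	return e.body, e.err
}
```

A production version would keep the keys in a shared store with a retention window and decide deliberately whether failed attempts should be replayable; this sketch only shows the run-once shape.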
Common pitfalls
- Hedging non-idempotent writes without an idempotency key — duplicates land on the database.
- Forgetting to cancel the loser — you keep paying CPU and connection cost on the second backend.
- Hedge delay shorter than P50: you double the load on more than half of your requests for little benefit.
- Hedge to the same replica — if the hot node is the problem, hedging back to it solves nothing. Always pick a different replica or rely on a load balancer that re-selects.
- No budget cap — under regional brownouts, hedging amplifies the outage into a full incident.
- Hedging through a sticky cache — both attempts share the same slow upstream miss; you doubled load and gained zero ms.
- Treating timeouts and hedges as the same thing — they aren't. A timeout aborts; a hedge races. Run both: hedge at P95, timeout at P99.9.
Quick reference
| Setting | Reasonable starting value | Notes |
|---|---|---|
| Hedge delay | Rolling P95 of the call | Recompute every 60 s from real traffic |
| Max attempts | 2 | Higher rarely worth the load |
| Hedge budget | 10% of recent calls | Plays the same role as gRPC retry throttling |
| Per-call timeout | P99.9 of the call | Hedge does not replace timeouts |
| Cancellation | Required | Share one context across attempts |
| Backend selection | Different replica | Hedging to the same node defeats it |
| Idempotency | Mandatory for writes | Use a stable client-generated key |
When NOT to hedge
- Cold-cache fan-outs: both attempts pay the same miss cost.
- Single-leader writes where there is only one backend that can serve the call.
- Capacity-constrained services already running near saturation.
- Cases where you can simply set a tighter SLO with the upstream team — fix the root cause if it's cheap.
Next steps
Once a basic hedger is in place, the natural extensions are: adaptive delay using a t-digest; tied requests, where the call is enqueued on two replicas at once and whichever starts executing first tells the other to cancel; and exposing hedge_fired and hedge_won as first-class metrics. Those two metrics are what you will live or die by: they tell you whether the pattern is paying for itself.
For deeper reading, Dean and Barroso's "The Tail at Scale" is still the clearest single source. The gRPC retry and hedging documentation covers the operational details. And for production code, study the Envoy and Linkerd configurations for hedge policies — they encode years of hard-won lessons in defaults you can copy directly.