
Backpressure in Streaming Systems: A Production Guide
Summary
Stop drowning consumers. Pull, drop, and credit-based backpressure with code.
Picture a Kafka consumer that processes 200 messages per second, fed by a producer that emits 5,000 per second. Within minutes the broker lag chart bends sharply upward, consumer memory balloons, GC pauses lengthen, and somewhere a pager goes off. The cause is not a bug in either component. The cause is the missing conversation between them. That conversation is backpressure: the protocol a slow consumer uses to tell a fast producer to slow down.
This guide walks through what backpressure actually means in production streaming systems, the four strategies you will choose between, runnable code in Node.js and Go, and the gotchas that separate a stable pipeline from one that fails over a long weekend.
Why this matters in 2026
AI workloads have made backpressure unavoidable. Token streams from LLM inference, embedding pipelines feeding vector databases, and event-sourced agent runtimes all push data through pipes of mismatched throughput. A naive design that buffers everything will eventually run out of memory; a design that drops everything will silently lose work. The real answer is somewhere in between, and it depends on whether the data is replayable, whether the work is idempotent, and how much latency you can absorb.
Prerequisites
- Comfortable reading Node.js and Go (we use both for short examples).
- Basic familiarity with one streaming system: Kafka, Redis Streams, NATS JetStream, RabbitMQ, or AWS Kinesis.
- An intuition for queues, batching, and async I/O. No reactive-streams background needed.
The four strategies, ranked by safety
Every backpressure design is a combination of these four primitives. You almost never use one in isolation, but understanding each on its own makes the trade-offs concrete.
1. Pull (consumer-driven)
The consumer asks for the next batch when it is ready. The producer never pushes; demand is explicit. Kafka uses a long-poll variant of this: consumer.poll(timeout) returns up to max.poll.records messages, and the broker holds the connection open until data is available or the timeout fires.
Pull is the safest strategy because the slow side is in charge. It also has the highest tail latency, since data sits on the producer until requested, and it requires durable storage on the producer side (the broker, in Kafka's case).
2. Credit-based push
The consumer grants the producer a budget — say, "send me up to 100 messages" — and the producer pushes until the budget is exhausted, then waits. As messages are processed, the consumer issues fresh credits. AMQP 1.0 (used by RabbitMQ in stream mode), gRPC streaming flow control, and Reactive Streams' Subscription.request(n) all use this model.
Credit-based push is what most modern reactive libraries (Project Reactor, Akka Streams, RxJava) implement. It combines push latency with pull safety. The price is a more complex protocol — both sides must agree on credit semantics — and bookkeeping for in-flight messages.
3. Buffering with bounded queues
An intermediate buffer absorbs short bursts. The producer writes; the consumer reads. Once the buffer is full, the producer blocks (or the queue rejects writes). Go channels with capacity, Java's ArrayBlockingQueue, and Node.js streams' highWaterMark are all bounded buffers.
Buffering is the simplest mental model and the most common production design. It only works as backpressure when the buffer is bounded and the producer respects the "full" signal. An unbounded buffer is not backpressure — it is a memory leak with extra steps.
4. Drop / sample / shed
When neither the producer nor the consumer can absorb the burst, throw work away. Drop the oldest, drop the newest, sample 1-in-N, or shed by priority. Metrics pipelines (StatsD, Prometheus remote write), real-time analytics, and observability backends lean heavily on dropping because the data is statistical: missing one sample out of 10,000 does not change the dashboard.
Dropping is the only strategy that gives you a constant-time, constant-memory response to overload. It is also the one most likely to cause an incident if applied to data that is not statistical — billing events, audit logs, financial trades.
Choosing a strategy: quick reference
| Data type | Replayable? | Idempotent consumer? | Strategy |
|---|---|---|---|
| Financial events | Yes (broker) | Yes | Pull + bounded buffer |
| AI token stream | No | No | Credit-based push |
| Metrics / telemetry | No | Yes | Drop oldest + sample |
| Audit logs | Yes (WAL) | Yes | Pull + DLQ for stuck batches |
| IoT sensor readings | No | Yes | Drop newest if buffer full |
| User-facing live feed | Sometimes | No | Buffer + drop oldest beyond N |
Walkthrough: a Node.js stream that respects backpressure
Node.js stream backpressure is the easiest version to reason about because the runtime exposes the signal directly. Below is a producer that reads from a fast source (a simulated network feed) and writes to a slow sink (a database with 50ms write latency).
// producer.js
const { Readable, Writable } = require('node:stream');
const { pipeline } = require('node:stream/promises');
// Fast source: emits 10,000 events as fast as Node will let it
class EventSource extends Readable {
constructor(total) {
super({ objectMode: true, highWaterMark: 16 });
this.i = 0;
this.total = total;
}
_read() {
if (this.i >= this.total) return this.push(null);
// Push one event; if push() returns false we MUST stop pushing
while (this.i < this.total) {
const ok = this.push({ id: this.i++, ts: Date.now() });
if (!ok) break; // backpressure signal from the consumer
}
}
}
// Slow sink: 50ms per write
class SlowDB extends Writable {
constructor() { super({ objectMode: true, highWaterMark: 4 }); }
_write(event, _enc, cb) {
setTimeout(() => {
console.log('wrote', event.id);
cb();
}, 50);
}
}
await pipeline(new EventSource(10_000), new SlowDB());
Two details carry the entire design. First, the source watches the return value of this.push(event). When the consumer's internal buffer hits its highWaterMark (four events here), push returns false and the source stops. Node will call _read again once buffer space frees up. Second, the sink reports completion via the cb() callback. Until cb fires, the slot stays occupied and the source stays paused. That is bounded buffering implemented in 30 lines.
A common mistake is to ignore the return value and keep pushing. Memory grows, garbage collection takes longer, the event loop stalls, and eventually the process is killed by the OOM killer. Always honor what push tells you.
Walkthrough: credit-based pacing in Go
Go has no built-in stream protocol, so you build backpressure with channels. A bounded channel acts as the buffer; a separate "credits" channel acts as the budget the consumer grants the producer.
// pacer.go
package main
import (
"context"
"fmt"
"time"
)
type Event struct{ ID int }
func producer(ctx context.Context, out chan<- Event, credits <-chan int) {
defer close(out)
id := 0
for {
select {
case <-ctx.Done():
return
case n, ok := <-credits:
if !ok { return }
for i := 0; i < n; i++ {
select {
case out <- Event{ID: id}:
id++
case <-ctx.Done():
return
}
}
}
}
}
func consumer(events <-chan Event, credits chan<- int, batch int) {
credits <- batch // initial grant
processed := 0
for e := range events {
time.Sleep(20 * time.Millisecond) // simulated work
fmt.Println("processed", e.ID)
processed++
if processed%batch == 0 {
credits <- batch // refill
}
}
}
func main() {
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
events := make(chan Event, 8)
credits := make(chan int, 1)
go producer(ctx, events, credits)
consumer(events, credits, 10)
}
The producer never sends without a credit. The consumer's credits <- batch send is what unlocks the next round of work. If the consumer stalls — say, on a slow downstream HTTP call — credits stop flowing, the producer parks on the credits channel, and memory stays flat. This is the same pattern gRPC server-streaming uses under the hood, just with HTTP/2 WINDOW_UPDATE frames instead of channels.
Mapping the model to real systems
Once the primitives click, you see them everywhere. A short tour:
- Kafka: pull, with broker-side durability. Tune
max.poll.recordsandmax.poll.interval.ms; if the consumer cannot finish a batch within the interval, it is kicked from the group. - Flink: credit-based push between operators, with adaptive buffer sizing based on observed throughput. The infamous "checkpoint barrier alignment" pause is backpressure being expressed across the DAG.
- Redis Streams: pull via
XREADGROUPwith consumer groups. No native flow control beyond the consumer's poll cadence. - NATS JetStream: pull or push consumers; push consumers expose
MaxAckPending, which is exactly a credit window. - gRPC streaming: HTTP/2 flow control. The receiver advertises a window; the sender stops at the boundary. Mostly invisible until you tune
InitialWindowSize. - AWS Kinesis: pull via
GetRecords; throughput limited per shard. Consumers usecheckpoint-style positions, similar to Kafka offsets.
Gotchas that bite in production
1. Unbounded buffers disguised as backpressure
Default Node.js streams have a highWaterMark of 16 objects (16KB for buffers). Default Go channels have capacity 0 (unbuffered, fully synchronous). Default Java LinkedBlockingQueue is unbounded. Audit every queue in your pipeline and set explicit caps. "Default" is rarely what you want at scale.
2. The bufferbloat trap
A 100,000-message in-memory buffer between two services smooths out bursts beautifully — until the consumer crashes. Now you have to decide: replay the buffer, drop it, or block the producer? If the buffer is in process memory, it dies with the process. If the buffer is in Kafka or another durable broker, it survives but takes hours to drain. Smaller buffers with explicit overflow policies are usually safer than large ones with implicit ones.
3. Hidden parallelism breaks the contract
Reactive frameworks let you write flatMap to fan out to N concurrent downstream calls. If you do not cap the concurrency (Project Reactor's flatMap(fn, concurrency) takes a second arg), one slow producer can spawn 10,000 in-flight calls and saturate your connection pool. Backpressure at the stream level does not propagate through unbounded fan-outs.
4. Auto-ack hides the problem
RabbitMQ in autoAck=true mode acknowledges as soon as the broker delivers, regardless of whether the consumer finished. This breaks every backpressure assumption: the broker thinks the work is done, the consumer is still chewing on it, and a crash loses the in-flight batch. Always use manual ack for any work you cannot afford to lose.
5. Drop policies that leak invariants
Dropping the oldest message when the buffer is full sounds simple. It also breaks if your consumer assumes monotonic timestamps, ordered keys, or per-entity sequencing. If you must drop, drop based on a key ("keep latest per user") rather than position, or shed entire low-priority topics rather than random messages from a critical one.
Make backpressure observable
You cannot tune what you cannot see. The minimum metrics for any streaming hop:
- Lag — messages or bytes between the producer head and the consumer position. In Kafka, this is consumer-group lag per partition.
- Buffer depth — current size of any in-process queue, exposed as a gauge. Alert at 80% capacity, page at 95%.
- Drop count — counter of messages shed, broken down by reason (buffer full, TTL expired, key sampled).
- Producer pause time — total time spent waiting on a backpressure signal. A non-zero number is healthy; a number that grows without bound is the warning sign.
- End-to-end latency — p50, p95, p99 from event creation to consumer commit. Backpressure trades latency for stability; you must know how much.
A worked decision: token streaming for an LLM proxy
Concrete example. You are building a proxy in front of a hosted LLM. Tokens stream from the model at 80 tokens/second. Your downstream client is a browser on a flaky mobile connection that occasionally pauses for two seconds. What do you do?
- Tokens are not replayable from the LLM provider once consumed. Drop is unacceptable — partial responses are worse than none.
- Buffering the entire response is bounded (LLM responses are capped) but expensive at scale (thousands of concurrent streams).
- Pull from the provider is impossible — the API is push-only via SSE.
- Conclusion: bounded per-connection buffer (say, 4KB), credit-based push downstream over WebSocket or SSE with explicit ack, and a circuit breaker that cancels the upstream LLM call if the buffer stays full for more than the connection's keep-alive interval.
That last point matters: if the client disappears, you must propagate the cancellation upstream. Otherwise you keep paying the LLM provider for tokens that nobody will read. Backpressure and cancellation are the same protocol seen from two angles.
Next steps
- Audit one production pipeline this week. List every queue, channel, and buffer between the entry point and the durable sink. Mark each as bounded or unbounded.
- Add a buffer-depth gauge to the deepest queue and watch it for a day. The shape of that graph will tell you more about your traffic than any load test.
- Read the Reactive Streams JVM specification (or the equivalent in your stack). The protocol is small, and once it clicks, every framework that implements it becomes legible.
- Pick one streaming hop and draw the contract on paper: who is the producer, who is the consumer, what happens on overload, what happens on consumer failure, what happens on producer failure. The exercise alone catches half the bugs.
Backpressure is not a feature you bolt on after a load test fails. It is a property of the contract between two components, decided early, made visible, and tested by deliberately starving and flooding each side. Build the contract, expose the signal, and the rest of your reliability work gets easier.
Comments
Be the first to comment