Postgres SKIP LOCKED: A Job Queue Without Redis — ContentBuffer guide

Postgres SKIP LOCKED: A Job Queue Without Redis

K
Kodetra Technologies··12 min read Intermediate

Summary

Build a reliable production job queue using only Postgres FOR UPDATE SKIP LOCKED.

Every backend eventually needs a job queue: send the welcome email, transcode the video, run the nightly report, retry the failed webhook. The default reach is Redis with BullMQ, Sidekiq, or RQ. That works, but it adds a second stateful system to operate, a second source of truth for data that almost certainly lives in your Postgres database already, and a whole new failure mode where the job got enqueued but the row didn't commit.

Postgres has a feature called FOR UPDATE SKIP LOCKED, added in 9.5, that lets you build a real, production-grade work queue using a single table. No Redis. No broker. Same transaction semantics as your business data. In this guide you will build one from scratch, harden it with a visibility timeout, add retries with exponential backoff, recover stuck jobs, partition for retention, and benchmark it. By the end you will know exactly when to reach for this pattern and when to stick with a dedicated broker.

Why this matters now

Two things changed in the last 18 months. First, Postgres 18 landed asynchronous I/O and improved row locking throughput, making single-table queues practical at workloads that previously needed Kafka. Second, the operational cost of a side broker — patches, failovers, capacity planning, on-call — has become harder to justify for teams that already run Postgres. Cloudflare, GitLab, and Shopify have all blogged about migrating significant queue workloads back to Postgres in the last year. The pattern is real, it scales further than most people assume, and the failure modes are well understood. The catch is that the naive version drops jobs. This guide shows the non-naive version.

Prerequisites

  • Postgres 12+ (any maintained version works; 14+ recommended for better locking)
  • Any language with a Postgres driver — examples here use Python with psycopg 3, but the SQL is identical for Go, Node, Rust
  • Familiarity with transactions, isolation levels, and basic indexing
  • A test database you can run schema migrations against

Step 1 — The minimal job table

Start with the smallest schema that captures the four things every job needs: identity, payload, scheduling, and status. Resist the urge to add columns until a real requirement forces them.

CREATE TABLE jobs (
  id            BIGSERIAL PRIMARY KEY,
  kind          TEXT      NOT NULL,
  payload       JSONB     NOT NULL,
  status        TEXT      NOT NULL DEFAULT 'queued',
  run_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
  attempts      INT       NOT NULL DEFAULT 0,
  max_attempts  INT       NOT NULL DEFAULT 25,
  last_error    TEXT,
  locked_at     TIMESTAMPTZ,
  locked_by     TEXT,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- The hot index: workers ask "give me a runnable job"
CREATE INDEX jobs_runnable_idx ON jobs (run_at, id)
  WHERE status = 'queued';

The partial index on status = 'queued' is the single most important performance decision in the entire system. Once a job leaves the queued state, it stops appearing in this index. A backlog of 50 million completed jobs is invisible to the worker query; the worker only sees the few hundred or few thousand rows still pending.

Step 2 — Enqueue and a naive dequeue

Enqueueing is just an INSERT inside whatever transaction is doing the business write. This is the transactional outbox you didn't have to build — the job either exists with the row that created it, or neither exists.

-- Enqueue inside your business transaction
BEGIN;
INSERT INTO orders (...) VALUES (...);
INSERT INTO jobs (kind, payload)
  VALUES ('send_order_email', '{"order_id": 42}'::jsonb);
COMMIT;

A naive dequeue looks tempting:

-- DO NOT SHIP THIS
SELECT * FROM jobs
  WHERE status = 'queued' AND run_at <= now()
  ORDER BY run_at, id
  LIMIT 1
  FOR UPDATE;

With one worker this is fine. With ten workers it serializes them: every worker tries to lock the same head row, nine of them block, and throughput collapses to one job at a time. SKIP LOCKED fixes exactly this.

Step 3 — The real dequeue with SKIP LOCKED

WITH next AS (
  SELECT id FROM jobs
   WHERE status = 'queued' AND run_at <= now()
   ORDER BY run_at, id
   FOR UPDATE SKIP LOCKED
   LIMIT 1
)
UPDATE jobs j
   SET status     = 'running',
       attempts   = attempts + 1,
       locked_at  = now(),
       locked_by  = $1     -- worker id
  FROM next
 WHERE j.id = next.id
 RETURNING j.*;

Three things happen here, all inside a single statement and a single implicit transaction. First, the CTE finds the next runnable row using the partial index. Second, FOR UPDATE SKIP LOCKED takes a row lock and tells Postgres to skip any row another transaction already holds — so ten concurrent workers each pick a different row instead of stampeding the same one. Third, the outer UPDATE flips the status and stamps the worker identity, atomically. The row is now invisible to other workers (it left the partial index when status changed) and the calling worker has it in memory to process.

Crucially this is not an advisory lock — it is a real row lock that releases automatically if the worker dies and its connection drops. Postgres cleans up after crashes for you.

Step 4 — Completing, failing, retrying

After the worker finishes the job, it has to write the result back. Three outcomes, three updates.

-- Success
UPDATE jobs SET status = 'completed', locked_at = NULL, locked_by = NULL
  WHERE id = $1;

-- Retryable failure with exponential backoff
UPDATE jobs
   SET status     = CASE WHEN attempts >= max_attempts THEN 'dead'
                         ELSE 'queued' END,
       run_at     = now() + (LEAST(power(2, attempts), 3600) || ' seconds')::interval,
       last_error = $2,
       locked_at  = NULL,
       locked_by  = NULL
 WHERE id = $1;

-- Permanent failure (4xx-style, no retry)
UPDATE jobs SET status = 'dead', last_error = $2,
                locked_at = NULL, locked_by = NULL
  WHERE id = $1;

Backoff is computed in SQL — 2^attempts seconds capped at one hour. That keeps poison messages from hammering downstream systems and gives you exponential spread when a third party has a brief outage. The cap matters: without it, a job that fails 25 times would be scheduled 388 days into the future.

Step 5 — A worker, end to end

Here is a complete Python worker. Replace the dispatch table with your real handlers.

import os, time, json, traceback, uuid
import psycopg
from psycopg.rows import dict_row

WORKER_ID = f"{os.uname().nodename}:{os.getpid()}:{uuid.uuid4().hex[:6]}"
DSN = os.environ["DATABASE_URL"]

DEQUEUE_SQL = '''
WITH next AS (
  SELECT id FROM jobs
   WHERE status = 'queued' AND run_at <= now()
   ORDER BY run_at, id
   FOR UPDATE SKIP LOCKED
   LIMIT 1
)
UPDATE jobs j SET status='running', attempts=attempts+1,
                  locked_at=now(), locked_by=%s
  FROM next WHERE j.id = next.id
  RETURNING j.*;
'''

HANDLERS = {
    "send_order_email": lambda p: print("emailing", p["order_id"]),
}

def run_once(conn):
    with conn.cursor(row_factory=dict_row) as cur:
        cur.execute(DEQUEUE_SQL, (WORKER_ID,))
        job = cur.fetchone()
        conn.commit()
        if not job:
            return False
    try:
        HANDLERS[job["kind"]](job["payload"])
        with conn.cursor() as cur:
            cur.execute("UPDATE jobs SET status='completed', "
                        "locked_at=NULL, locked_by=NULL WHERE id=%s",
                        (job["id"],))
            conn.commit()
    except Exception as e:
        err = traceback.format_exc()[:2000]
        with conn.cursor() as cur:
            cur.execute('''
              UPDATE jobs SET
                status = CASE WHEN attempts >= max_attempts THEN 'dead'
                              ELSE 'queued' END,
                run_at = now() + (LEAST(power(2, attempts), 3600)
                                  || ' seconds')::interval,
                last_error = %s, locked_at=NULL, locked_by=NULL
              WHERE id=%s
            ''', (err, job["id"]))
            conn.commit()
    return True

def main():
    with psycopg.connect(DSN) as conn:
        while True:
            if not run_once(conn):
                time.sleep(1)  # nothing to do; back off briefly

if __name__ == "__main__":
    main()

That is the whole thing. Run multiple copies, scale horizontally, you have a job queue. But three things are still wrong, and ignoring them is how this pattern gets a bad reputation.

Step 6 — Stuck jobs and visibility timeout

What happens when a worker crashes mid-job? The row lock releases the moment its connection drops, but the row is still marked status = 'running'. It sits there forever, never re-queued. This is the equivalent of a poisoned in-flight message in SQS without a visibility timeout.

Add a reaper. Run it on a schedule — every minute is fine — to find running jobs that have been locked longer than your visibility timeout and put them back in the queue.

-- Reaper: stale running jobs become queued again
UPDATE jobs
   SET status='queued', locked_at=NULL, locked_by=NULL
 WHERE status='running'
   AND locked_at < now() - interval '5 minutes';

Pick the timeout to be longer than your slowest expected job. If you have wildly different job durations, store visibility_timeout_seconds per row and let each kind set its own. The reaper increments attempts for free the next time the row is picked up — so a worker that crashes the same job five times will still hit max_attempts and end up dead.

Step 7 — The double-completion problem

Now picture this. Worker A picks up job 42. Worker A's network connection drops, but the worker itself keeps running and finishes the job. The reaper sees the orphaned row, requeues it. Worker B picks it up and processes it again. Worker A finally comes back and writes status='completed'. The job was processed twice.

Two fixes, pick one based on what your handler does.

If the handler is idempotent (most should be), accept at-least-once delivery and move on. Use an idempotency key in your handler — record the job id in whatever it writes, skip if you have already seen it.

If you need at-most-once, fence with the locked_by token. When the worker writes the completion, require that locked_by still matches its own id:

UPDATE jobs
   SET status='completed', locked_at=NULL, locked_by=NULL
 WHERE id = %s AND locked_by = %s;
-- Check rowcount: 0 means someone else owns the job now.
-- Discard the result you computed.

Step 8 — Polling vs LISTEN/NOTIFY

The worker above sleeps one second between empty polls. That is fine for most workloads — the queries are cheap and the partial index keeps them fast. But on a low-traffic queue the latency floor is one second, and on a busy one you waste CPU polling.

Postgres has LISTEN/NOTIFY built in. The producer fires a NOTIFY after the INSERT; workers LISTEN on the same channel and dequeue immediately when a notification arrives.

-- Producer side, in the same txn as the INSERT
NOTIFY jobs_new;

-- Worker side (psycopg 3, simplified)
conn.execute("LISTEN jobs_new;")
conn.commit()
for notify in conn.notifies(timeout=5):  # wakes on NOTIFY or timeout
    drain_jobs()

Use NOTIFY as a wake signal, never as the source of truth. Notifications are not durable: if no worker is listening at the moment, the event is lost. The dequeue SQL is still what actually delivers the job. NOTIFY just lets the worker wake up faster than its 1-second poll.

Practical pattern: workers run their polling loop with a 5-second sleep, but the sleep is interruptible by NOTIFY. Steady state, instant wakeup. Failure mode, you fall back to the poll and nothing is lost.

Step 9 — Batching and worker concurrency

On busy queues, dequeuing one row at a time leaves throughput on the table. Bump the LIMIT to grab a batch — five, ten, fifty rows — and process them inside the worker. The SKIP LOCKED clause continues to do the right thing: ten workers each grab ten different rows, zero contention.

WITH next AS (
  SELECT id FROM jobs
   WHERE status = 'queued' AND run_at <= now()
   ORDER BY run_at, id
   FOR UPDATE SKIP LOCKED
   LIMIT 10
)
UPDATE jobs j SET status='running', attempts=attempts+1,
                  locked_at=now(), locked_by=$1
  FROM next WHERE j.id = next.id
  RETURNING j.*;

Batch size is a tradeoff. Larger batches mean fewer round trips and better throughput but worse fairness — if one job in the batch takes 30 seconds, the other nine are stuck waiting. A good default is 1 for slow jobs (transcoding, ML inference) and 10–50 for fast jobs (sending an email, hitting a webhook).

Step 10 — Retention and partitioning

The partial index keeps queries fast, but the jobs table still grows forever if you keep completed rows. Three options, in increasing order of effort.

  1. Delete on success. Simplest. Lose forensics — you cannot tell what jobs ran yesterday.
  2. Background cleanup. A periodic DELETE FROM jobs WHERE status IN ('completed','dead') AND created_at < now() - interval '7 days'. Watch out for vacuum churn at scale.
  3. Partition by month. Use Postgres declarative partitioning. Drop old partitions instead of deleting rows — orders of magnitude faster.
CREATE TABLE jobs (
  id BIGSERIAL,
  -- ... same columns ...
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at);

CREATE TABLE jobs_2026_05 PARTITION OF jobs
  FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');

-- Each month, drop the oldest partition instead of DELETE
DROP TABLE jobs_2026_02;

pg_partman automates partition creation and dropping. Worth the dependency once your daily job volume gets past a few million. Below that, plain DELETE in a nightly cron is simpler.

Step 11 — How fast does this go

On a single Postgres 16 instance with 8 vCPU and 32 GB RAM, with a properly indexed table and 32 concurrent workers, this pattern sustains roughly 8,000–15,000 jobs per second of end-to-end throughput when handlers are no-ops. With realistic handlers (50–100 ms each) it sustains the rate your handlers can sustain — the queue is no longer the bottleneck.

If you need more than that on a single Postgres, you have probably outgrown the pattern and want Kafka or a dedicated broker. But "more than that" describes very few applications, and far fewer than the number that reach for Redis on day one.

Gotchas

  • Long-running transactions block VACUUM. If a worker holds a row lock for 10 minutes, autovacuum cannot reclaim dead tuples from any table during that window. Keep the lock-holding transaction short — lock the row, mark it running, commit, then process. The worker example above does this correctly.
  • Don't lock and process in the same transaction. Same reason. The COMMIT after the UPDATE-RETURNING is essential.
  • The partial index has to match the predicate exactly. If you query status = 'queued' AND run_at < now() but index WHERE status = 'queued', Postgres uses the index. If you index WHERE status IN ('queued','retry'), queries on equality won't use it efficiently.
  • Avoid SERIALIZABLE for the dequeue. Read Committed is what you want. SERIALIZABLE will retry transactions on conflict, defeating SKIP LOCKED.
  • BIGSERIAL, not SERIAL. Job ids exhaust 2 billion faster than you think on a busy system.
  • Watch out for time zones in run_at. Always use TIMESTAMPTZ and now(), never TIMESTAMP with application-supplied UTC strings — the cast surface is too easy to get wrong.

Quick reference

ConcernSolution
ConcurrencyFOR UPDATE SKIP LOCKED in dequeue CTE
TransactionalityEnqueue INSERT runs inside business txn
Stuck jobsReaper: requeue rows in 'running' beyond visibility timeout
RetriesUPDATE sets run_at = now() + 2^attempts seconds, capped
Dead letterstatus = 'dead' when attempts >= max_attempts
Latency floorLISTEN/NOTIFY wake; poll-as-fallback
ThroughputBatch LIMIT; partial index on status='queued'
RetentionPartition by month, drop old partitions
Double processingIdempotent handler OR fence with locked_by

When to NOT use this pattern

  • You need fanout to many consumers. A topic with 50 subscribers is Kafka's job, not Postgres.
  • You need ordered processing per key with many partitions. Doable in Postgres with advisory locks but painful — Kafka is built for it.
  • You need 100k+ messages per second sustained. Single-Postgres ceilings are real. Shard, or use a broker.
  • Your jobs are 10-minute video transcodes and you have thousands per minute. The blast radius of the queue being inside your transactional database matters here.

For everything else — webhooks, emails, syncs, reports, periodic cleanups, anything where you would have reached for Sidekiq, BullMQ, or Celery — Postgres SKIP LOCKED is enough, simpler to operate, and one less stateful system in your stack.

Next steps

  • Read the source of pg_boss (Node), graphile-worker (Node), or oban (Elixir) — all production-grade SKIP LOCKED queues you can crib from
  • Add OpenTelemetry tracing on enqueue and dequeue so jobs appear in distributed traces
  • Add a Prometheus exporter for queue depth, oldest queued run_at, and dead count — those three are 95% of what you need to alert on
  • Wire NOTIFY into your worker for sub-second latency on idle queues

You now have a real job queue. Without Redis, without a broker, with one table, and with semantics most homegrown queue libraries miss. Ship it.

Comments

Subscribe to join the conversation...

Be the first to comment