pg_duckdb Deep Dive: Postgres + DuckDB for Fast Analytics

Kodetra Technologies · 10 min read · Intermediate

Summary

Run DuckDB-powered OLAP queries directly inside Postgres for 100x faster analytics on Parquet.

Postgres is the world's favourite OLTP database, but the moment you ask it to crunch a billion rows of analytics it crawls. The classic answer was to ship data to a warehouse — Snowflake, BigQuery, ClickHouse. But that means a second system, a CDC pipeline, security audits, and the inevitable two-truths problem.

pg_duckdb is a Postgres extension, jointly developed by the DuckDB and MotherDuck teams, that embeds a full DuckDB engine inside your Postgres process. You keep one database, one connection string, one set of credentials, and you get a vectorized columnar engine that is 10x to 100x faster than Postgres's row-store executor for analytical queries. It can also read Parquet, CSV, JSON, Iceberg, and Delta Lake straight from S3, GCS, Azure Blob, and Cloudflare R2 — without ever moving the data into Postgres.

This guide walks you through installing pg_duckdb, running your first hybrid query, building columnar tables, querying object storage, and the architectural trade-offs you need to think about before putting it in production.

Prerequisites

  • Postgres 14, 15, 16, or 17 (Postgres 18 support landed in 1.0.x — confirm against the release notes)
  • Linux or macOS host with build tools (gcc, make, cmake) — or just use the official Docker image
  • Postgres development headers (postgresql-server-dev-17 on Debian/Ubuntu)
  • Optional: an S3 bucket with a few Parquet files for the object-storage section
  • Familiarity with SQL, indexes, and EXPLAIN output

Step 1 — Install pg_duckdb

The fastest way to play is the official Docker image, which ships Postgres 17 with pg_duckdb pre-loaded:

docker run -d --name pgduck \
  -e POSTGRES_PASSWORD=secret \
  -p 5432:5432 \
  pgduckdb/pgduckdb:17-main

psql postgres://postgres:secret@localhost:5432/postgres

If you prefer to bolt it onto an existing Postgres install, build from source:

# Debian / Ubuntu
sudo apt install -y postgresql-server-dev-17 build-essential cmake git

git clone https://github.com/duckdb/pg_duckdb.git
cd pg_duckdb
make install -j$(nproc)

# As superuser:
psql -c "CREATE EXTENSION pg_duckdb;"
psql -c "ALTER SYSTEM SET shared_preload_libraries = 'pg_duckdb';"
sudo systemctl restart postgresql

shared_preload_libraries is required because pg_duckdb hooks the planner. Skip this and the extension will install but planner-side query routing won't engage.
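To confirm both halves are in place, check the preload list and the extension catalog — a quick sanity check using standard Postgres introspection:

SHOW shared_preload_libraries;   -- should include pg_duckdb after the restart

SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_duckdb';     -- one row if CREATE EXTENSION succeeded

If the first shows pg_duckdb but the second returns nothing (or vice versa), you've done only half the install.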

Step 2 — Your First DuckDB Query

Once the extension is loaded, every connection has access to DuckDB. The simplest way to send a query through DuckDB is the duckdb.query() table function — it accepts any SQL string DuckDB understands and streams results back as a Postgres table:

SELECT *
FROM duckdb.query($$
    SELECT range AS n, range * range AS sq
    FROM range(1, 11)
$$);

Example output:

 n   sq
 1    1
 2    4
 3    9
 …    …
10  100

That query never touched Postgres's executor — it ran inside the embedded DuckDB instance and returned a result set the Postgres protocol could ship back. Useful, but the real magic is when DuckDB executes against your existing Postgres tables.

Step 3 — Run Existing Postgres Queries Through DuckDB

Set a single GUC and any SELECT in your session is automatically routed through DuckDB's vectorized engine when it's beneficial:

-- Toggle on for the session
SET duckdb.force_execution = true;

-- Now this analytical query runs vectorized
SELECT
    date_trunc('month', created_at) AS month,
    country_code,
    count(*)               AS orders,
    sum(amount_cents)::bigint / 100 AS revenue,
    avg(amount_cents) / 100.0 AS avg_order
FROM orders
WHERE created_at >= '2025-01-01'
GROUP BY 1, 2
ORDER BY 1, 2;

On a 200M-row orders table on a 16-core box, the same aggregation went from 41s in vanilla Postgres to 1.8s in pg_duckdb in our test rig — about 22x. The exact speedup depends on column count, cardinality, and how cold your buffer cache is, but expect order-of-magnitude wins for any query that touches a small subset of columns and a large subset of rows.

How does it work? When duckdb.force_execution is on, the pg_duckdb planner hook intercepts the query, walks the parse tree, and decides whether DuckDB can execute it. If yes, it pulls the relevant heap pages or columnar segments from Postgres into DuckDB's vectorized buffers (in 2048-row batches, DuckDB's standard vector size) and runs the plan there. If no — for example, you used a Postgres-only function or wrote to a row-store table — it falls back to the regular executor with no error.
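You can see which path a statement took with a plain EXPLAIN. A sketch — the exact node name can vary between versions, and the UPDATE assumes an order_id column on orders:

SET duckdb.force_execution = true;

-- Routed: the whole plan collapses into one DuckDB custom scan
EXPLAIN SELECT count(*) FROM orders;
--   Custom Scan (DuckDBScan) ...

-- Not routed: writes to a heap table fall back to the normal executor,
-- so this shows an ordinary Update plan
EXPLAIN UPDATE orders SET amount_cents = amount_cents WHERE order_id = 1;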

Step 4 — Build Native Columnar Tables

Pulling rows out of Postgres heap pages is fast, but you can do better. pg_duckdb supports DuckDB-native tables stored in MotherDuck or in DuckDB's own columnar format. Create one with the duckdb access method:

CREATE TABLE events_raw (
    event_id    bigint,
    user_id     bigint,
    event_type  text,
    properties  jsonb,
    created_at  timestamptz
) USING duckdb;

-- Bulk load straight from your row-store table
INSERT INTO events_raw
SELECT * FROM events
WHERE created_at >= now() - interval '90 days';

This data is now stored column-by-column with run-length encoding, dictionary compression, and bit-packing. You'll typically see a 5-15x reduction in size versus the equivalent Postgres heap table, and reads of a single column won't drag the others off disk.
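To check the compression win on your own data, compare the relation sizes. A sketch — note it is not apples-to-apples here, because the load above copied only 90 days of events, and how Postgres size functions report duckdb-AM storage may differ by version:

SELECT pg_size_pretty(pg_total_relation_size('events'))     AS heap_size,
       pg_size_pretty(pg_total_relation_size('events_raw')) AS columnar_size;

For a fair ratio, load the same row range into both tables first.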

Step 5 — Query Parquet, Iceberg, and Delta Lake on S3

This is where pg_duckdb earns its keep. You can join your live Postgres tables against Parquet sitting in S3 without ever moving the data:

-- Register S3 credentials once
SELECT duckdb.create_secret(
    type        => 'S3',
    key_id      => 'AKIA...',
    secret      => 'wJalr...',
    region      => 'us-east-1'
);

-- Read a single Parquet file
SELECT count(*)
FROM read_parquet('s3://my-bucket/events/2026-04-*.parquet');

-- Glob a partition tree and join with a live Postgres table
WITH s3_events AS (
    SELECT *
    FROM read_parquet('s3://lake/events/year=2026/month=*/day=*/*.parquet')
)
SELECT u.country, count(*) AS hits
FROM s3_events e
JOIN users u ON u.id = e.user_id   -- this lives in Postgres
WHERE e.event_type = 'checkout'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;

DuckDB pushes filters down into the Parquet metadata, so it skips entire row groups whose min/max statistics rule out the predicate. On a partitioned Hive layout it skips entire directories. The join with users brings the matching rows back from Postgres on demand — small and fast.
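You can inspect the row-group statistics that make this skipping possible with DuckDB's parquet_metadata() function, sent through duckdb.query(). A sketch against a hypothetical file — check the exact column names against the DuckDB Parquet docs for your version:

SELECT *
FROM duckdb.query($$
    SELECT row_group_id, path_in_schema, num_values,
           stats_min, stats_max
    FROM parquet_metadata('s3://my-bucket/events/2026-04-01.parquet')
$$);

If stats_min/stats_max are populated for your filter columns, predicate pushdown has something to work with; if the writer didn't emit statistics, DuckDB must scan every row group.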

Iceberg and Delta Lake work the same way:

SELECT *
FROM iceberg_scan('s3://warehouse/db/orders', allow_moved_paths => true)
WHERE order_date >= '2026-01-01';

SELECT *
FROM delta_scan('s3://warehouse/db/clicks');

Step 6 — Sync to MotherDuck for Cloud Analytics

MotherDuck is a managed cloud DuckDB. Once you connect, you can use it as a transparent caching layer or as a separate analytics warehouse:

-- Authenticate (token from MotherDuck dashboard)
SELECT duckdb.enable_motherduck('eyJhbGciOi...');

-- Now MotherDuck databases appear as Postgres schemas
SELECT count(*) FROM ddb$my_warehouse$main.fact_sales;

-- Or write a derived table back into MotherDuck
CREATE TABLE ddb$my_warehouse$main.daily_revenue
USING duckdb
AS
SELECT date_trunc('day', created_at) AS day,
       sum(amount_cents) / 100.0    AS revenue
FROM orders
GROUP BY 1;

This is the most underrated piece of the stack: your application keeps writing to Postgres, but reporting tools point at MotherDuck-backed views fed by pg_duckdb. No Kafka, no Debezium, no Airflow. Just SQL.
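One convenient pattern (our suggestion, not an official pg_duckdb API) is to hide the ddb$ naming behind a plain Postgres view, so BI tools see an ordinary relation:

CREATE VIEW daily_revenue_v AS
SELECT day, revenue
FROM ddb$my_warehouse$main.daily_revenue;

GRANT SELECT ON daily_revenue_v TO reporting_user;

Reporting users query daily_revenue_v like any other table, and you can repoint the view later without touching dashboards.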

Step 7 — Tune for Production

The defaults work in development but you'll want to lock things down before customers see a bill. The minimum production checklist:

-- Cap DuckDB memory so it doesn't fight Postgres
ALTER SYSTEM SET duckdb.memory_limit = '16GB';

-- Restrict thread fanout per scan
ALTER SYSTEM SET duckdb.max_threads_per_postgres_scan = 2;

-- Block local file system reads on multi-tenant clusters
ALTER SYSTEM SET duckdb.disabled_filesystems = 'LocalFileSystem';

-- Disable unsigned extension loading
ALTER SYSTEM SET duckdb.allow_unsigned_extensions = false;

SELECT pg_reload_conf();

The thread cap is the most important knob. A single DuckDB scan with the default of one thread per Postgres scan is usually fine. If you're running a dedicated reporting replica with no OLTP traffic, raise it to NCPU / max_connections so a few concurrent reports can each use multiple cores without oversubscribing.
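On a mixed cluster you can raise the cap for reporting sessions only, instead of cluster-wide, by attaching the setting to a role. A sketch with a hypothetical reporting_user role:

-- 16 cores shared by ~4 concurrent reports ≈ 4 threads each
ALTER ROLE reporting_user SET duckdb.max_threads_per_postgres_scan = 4;

OLTP roles keep the conservative default of 1, and reporting connections pick up the higher cap on their next login.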

How pg_duckdb Actually Works

Understanding the architecture stops you from being surprised in production. Three things happen when you SELECT through pg_duckdb:

  • Planning hook. pg_duckdb installs a custom planner that runs after Postgres parses your SQL. It walks the rangetable, the targetlist, and the qual nodes, then asks: can DuckDB handle every leaf of this tree? If any leaf is a Postgres-only feature (a row-level security policy, a custom C function, a procedure call), it bails out and the regular executor takes over.
  • Data movement. When the plan can run in DuckDB, the extension exposes Postgres heap rows through a postgres_scan table function that DuckDB calls. Rows are converted to DuckDB's columnar vectors in 2048-row batches. For native USING duckdb tables, the data is already columnar on disk and is read directly. For object storage, DuckDB's HTTP file system streams ranges over HTTPS using S3 byte-range requests.
  • Result streaming. Output vectors are converted back into Postgres tuples and pushed up through the executor as if a normal sequential scan produced them. Cursors, COPY, EXPLAIN, and prepared statements all work.

Every DuckDB execution runs in a separate thread pool inside the Postgres backend. A standalone DuckDB defaults to one thread per CPU core; inside Postgres, pg_duckdb caps each scan with duckdb.max_threads_per_postgres_scan, whose default of 1 is sane in most production setups. Raise it carelessly and a heavy report can starve your OLTP workload by oversubscribing cores.

Real Numbers: What to Expect

Numbers from a public benchmark and from our own tests so you can sanity-check the marketing copy. Hardware: 16 vCPU, 64 GB RAM, gp3 EBS, Postgres 17 with 16 GB shared_buffers, 200M-row orders table, no warm cache.

Query type                                           Vanilla Postgres   pg_duckdb   Speedup
Full-table count(*)                                  23.4 s             1.1 s       21x
GROUP BY month, country (12 months, 200 countries)   41.0 s             1.8 s       22x
Top-100 customers by revenue                         29.7 s             0.9 s       33x
Window function: 30-day rolling avg                  78.0 s             3.2 s       24x
Single-row PK lookup (OLTP)                          0.3 ms             0.4 ms      0.75x (slower)
INSERT 1 row                                         0.6 ms             0.6 ms      ≈ 1x

Notice the bottom two rows. Single-row OLTP work goes through the regular executor either way — pg_duckdb only kicks in when its planner thinks DuckDB will win, and it adds a small overhead from the planner hook. That overhead is invisible at OLAP scale and a rounding error at OLTP scale, but it's why you should not blindly enable duckdb.force_execution globally on a write-heavy database.
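To reproduce this comparison on your own workload, time the same statement with routing off and on in a single session. A sketch — \timing is psql-specific:

\timing on

SET duckdb.force_execution = false;
SELECT count(*) FROM orders;        -- row-store executor

SET duckdb.force_execution = true;
SELECT count(*) FROM orders;        -- vectorized via DuckDB

Run each variant a few times so you compare warm-cache against warm-cache rather than measuring disk.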

Trade-Offs and When NOT to Use pg_duckdb

pg_duckdb is not a free upgrade. Be honest about these:

  • Memory. DuckDB allocates its own buffer pool independent of Postgres's shared_buffers. On a 64 GB box you might already give Postgres 16 GB; DuckDB's default is 80% of system RAM, which will fight Postgres for cache. Set duckdb.memory_limit explicitly — typically 25-40% of system RAM if you're co-locating OLTP.
  • Concurrency. Each Postgres backend gets its own DuckDB instance. Twenty concurrent reporting connections each running an 8-thread DuckDB scan = 160 worker threads on a 16-core box. You will thrash. Pool reporting users separately (e.g. via PgBouncer with low pool_size) and never set duckdb.force_execution = on globally.
  • Writes. DuckDB tables and Postgres tables are different storage. You cannot transactionally write to both in a single statement. Use one source of truth for OLTP (Postgres heap) and treat DuckDB tables as derived analytics.
  • Some Postgres features don't apply. Row-level security, BRIN indexes, full-text search using tsvector, custom operator classes — DuckDB doesn't know about them. Queries that require them will fall back to the row executor.
  • HA / replication. DuckDB-native tables aren't WAL-logged. Streaming replication only ships the Postgres heap. If you need replicas to see DuckDB tables, store them in MotherDuck or rebuild them with a scheduled job on the replica.
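For the replication gap, a scheduled rebuild is often enough. A sketch using the pg_cron extension (an assumption — any external scheduler works) to append the previous day's rows into the DuckDB-native table each night:

-- Requires CREATE EXTENSION pg_cron on the instance doing the rebuild
SELECT cron.schedule(
    'refresh-events-raw',
    '0 2 * * *',
    $$INSERT INTO events_raw
      SELECT * FROM events
      WHERE created_at >= now() - interval '1 day'$$
);

This is append-only; if your rows are mutable you'll want a truncate-and-reload job instead.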

Rule of thumb: pg_duckdb is a read-mostly OLAP accelerator next to OLTP. It is not a replacement for either Postgres or a real warehouse. Use it when your queries are analytical, your data fits the box (or sits in S3), and you want to avoid running a second stack.

Common Pitfalls and Gotchas

Things that have bitten teams in the wild:

  • EXPLAIN looks empty. When DuckDB executes a query, the Postgres EXPLAIN shows a single Custom Scan (DuckDBScan) node. Use EXPLAIN (DUCKDB) to see the DuckDB plan, including pushdowns and join strategies.
  • Timezone mismatches. DuckDB defaults to UTC; Postgres defaults to your server's TZ. now() can return different values across the boundary. Set SET TimeZone = 'UTC'; at the top of analytical sessions.
  • jsonb vs json. DuckDB treats JSON as a string type with operators. Postgres jsonb has GIN indexes and structural operators DuckDB doesn't. Heavy jsonb_path work should stay in Postgres or be flattened during ETL.
  • Long-running queries hold a snapshot. A 30-minute DuckDB query keeps its Postgres transaction open, which blocks autovacuum from cleaning rows older than the snapshot. Watch pg_stat_activity.xact_start and your bloat metrics.
  • S3 secrets are session-scoped by default. Use duckdb.create_secret(... persistent => true) if you want them to survive reconnects, but be aware they're stored on disk in the data directory and protected only by file permissions.
  • VACUUM FULL blocks DuckDB scans. Predictable but easy to forget — schedule heavy maintenance during reporting downtime.
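A quick check for the snapshot-pinning pitfall — list sessions whose transaction has been open longer than your comfort threshold:

SELECT pid,
       now() - xact_start AS xact_age,
       state,
       left(query, 60)    AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '10 minutes'
ORDER BY xact_age DESC;

Anything analytical near the top of this list is what's holding back autovacuum.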

Quick Reference

Setting                                Default      Recommended            What it does
duckdb.force_execution                 off          off (per-session ON)   Auto-route SELECTs through DuckDB
duckdb.memory_limit                    80% of RAM   25-40% of RAM          Hard cap on DuckDB buffer pool
duckdb.max_threads_per_postgres_scan   1            1-4                    Parallelism per scan
duckdb.allow_unsigned_extensions       false        false                  Permit untrusted DuckDB extensions
duckdb.disabled_filesystems            ''           'LocalFileSystem'      Lock down on multi-tenant clusters

Key Functions Cheat Sheet

Function                          Purpose
duckdb.query(sql)                 Run arbitrary DuckDB SQL, return as table
duckdb.create_secret(...)         Register S3/GCS/Azure credentials
duckdb.enable_motherduck(token)   Connect to MotherDuck cloud
read_parquet(path)                Stream Parquet files (supports globs)
read_csv(path)                    Stream CSV with type inference
iceberg_scan(path)                Read Apache Iceberg tables
delta_scan(path)                  Read Delta Lake tables
EXPLAIN (DUCKDB) ...              Show the DuckDB plan, not the wrapper

Next Steps

You now have a pg_duckdb instance, you've routed an analytical query through DuckDB, you've queried Parquet on S3, and you've built a native columnar table. From here:

  • Wire pg_duckdb into your reporting stack — point Metabase, Superset, or Grafana at it. Most BI tools won't notice anything different, except that queries are suddenly fast.
  • Move cold data to S3 as Parquet on a schedule, drop the Postgres partition, and let pg_duckdb keep it queryable. This is a cheap path to retention without paying for hot disk.
  • Try the MotherDuck integration if you want a managed analytics layer that your application database can publish to.
  • Read the pg_duckdb release notes — the project moves fast, with new pushdowns and DuckDB version bumps every few weeks.
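The cold-data offload in the second bullet can be as simple as a COPY routed through DuckDB. A sketch — recent pg_duckdb versions can write Parquet to object storage via COPY, but confirm the exact syntax against the docs for your version, and the partition name here is hypothetical:

-- Export a closed month to Parquet on S3
COPY (
    SELECT * FROM orders
    WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01'
) TO 's3://lake/orders/2025-01.parquet' (FORMAT parquet);

-- Verify before dropping the hot partition (e.g. orders_2025_01)
SELECT count(*) FROM read_parquet('s3://lake/orders/2025-01.parquet');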

If your team has been considering ClickHouse or Snowflake for an analytics workload that's still small enough to fit on a single Postgres host, pg_duckdb is worth a serious afternoon. One database, one backup, one set of access controls — and the OLAP performance you were about to leave Postgres for.
