
pg_duckdb Deep Dive: Postgres + DuckDB for Fast Analytics
Summary
Run DuckDB-powered OLAP queries directly inside Postgres for 10-100x faster analytics, including queries straight over Parquet in object storage.
Postgres is the world's favourite OLTP database, but the moment you ask it to crunch a billion rows of analytics it crawls. The classic answer was to ship data to a warehouse — Snowflake, BigQuery, ClickHouse. But that means a second system, a CDC pipeline, security audits, and the inevitable two-truths problem.
pg_duckdb is a Postgres extension, jointly developed by the DuckDB and MotherDuck teams, that embeds a full DuckDB engine inside your Postgres process. You keep one database, one connection string, one set of credentials, and you get a vectorized columnar engine that is 10x to 100x faster than Postgres's row-store executor for analytical queries. It can also read Parquet, CSV, JSON, Iceberg, and Delta Lake straight from S3, GCS, Azure Blob, and Cloudflare R2 — without ever moving the data into Postgres.
This guide walks you through installing pg_duckdb, running your first hybrid query, building columnar tables, querying object storage, and the architectural trade-offs you need to think about before putting it in production.
Prerequisites
- Postgres 14, 15, 16, or 17 (Postgres 18 support landed in 1.0.x — confirm against the release notes)
- Linux or macOS host with build tools (gcc, make, cmake) — or just use the official Docker image
- Postgres development headers (postgresql-server-dev-17 on Debian/Ubuntu)
- Optional: an S3 bucket with a few Parquet files for the object-storage section
- Familiarity with SQL, indexes, and EXPLAIN output
Step 1 — Install pg_duckdb
The fastest way to play is the official Docker image, which ships Postgres 17 with pg_duckdb pre-loaded:
docker run -d --name pgduck \
-e POSTGRES_PASSWORD=secret \
-p 5432:5432 \
pgduckdb/pgduckdb:17-main
psql postgres://postgres:secret@localhost:5432/postgres
If you prefer to bolt it onto an existing Postgres install, build from source:
# Debian / Ubuntu
sudo apt install -y postgresql-server-dev-17 build-essential cmake git
git clone https://github.com/duckdb/pg_duckdb.git
cd pg_duckdb
make install -j$(nproc)
# As superuser:
psql -c "CREATE EXTENSION pg_duckdb;"
psql -c "ALTER SYSTEM SET shared_preload_libraries = 'pg_duckdb';"
sudo systemctl restart postgresql
shared_preload_libraries is required because pg_duckdb hooks the planner, which is why the setting and the restart come before CREATE EXTENSION; without it, planner-side query routing won't engage.
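After the restart, a quick sanity check confirms the library is preloaded and the extension is registered; both statements are plain Postgres:
SHOW shared_preload_libraries;
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_duckdb';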
Step 2 — Your First DuckDB Query
Once the extension is loaded, every connection has access to DuckDB. The simplest way to send a query through DuckDB is the duckdb.query() table function — it accepts any SQL string DuckDB understands and streams results back as a Postgres table:
SELECT *
FROM duckdb.query($$
SELECT range AS n, range * range AS sq
FROM range(1, 11)
$$);
Example output:
| n | sq |
|---|---|
| 1 | 1 |
| 2 | 4 |
| 3 | 9 |
| ... | ... |
| 10 | 100 |
That query never touched Postgres's executor — it ran inside the embedded DuckDB instance and returned a result set the Postgres protocol could ship back. Useful, but the real magic is when DuckDB executes against your existing Postgres tables.
Step 3 — Run Existing Postgres Queries Through DuckDB
Set a single GUC and any SELECT in your session is automatically routed through DuckDB's vectorized engine when it's beneficial:
-- Toggle on for the session
SET duckdb.force_execution = true;
-- Now this analytical query runs vectorized
SELECT
date_trunc('month', created_at) AS month,
country_code,
count(*) AS orders,
sum(amount_cents)::bigint / 100 AS revenue,
avg(amount_cents) / 100.0 AS avg_order
FROM orders
WHERE created_at >= '2025-01-01'
GROUP BY 1, 2
ORDER BY 1, 2;
On a 200M-row orders table on a 16-core box, the same aggregation went from 41s in vanilla Postgres to 1.8s in pg_duckdb in our test rig — about 22x. The exact speedup depends on column count, cardinality, and how cold your buffer cache is, but expect order-of-magnitude wins for any query that touches a small subset of columns and a large subset of rows.
How does it work? When duckdb.force_execution is on, the pg_duckdb planner hook intercepts the query, walks the parse tree, and decides whether DuckDB can execute it. If yes, it pulls the relevant heap pages or columnar segments from Postgres into DuckDB's vectorized buffers (in 2048-row chunks) and runs the plan there. If no — for example, you used a Postgres-only function or wrote to a row-store table — it falls back to the regular executor with no error.
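A quick way to see which engine will handle a given query is to look at the plan: when DuckDB takes over, Postgres's EXPLAIN collapses to a single Custom Scan (DuckDBScan) node, and EXPLAIN (DUCKDB) shows the DuckDB-side plan (both are covered again in the gotchas section below). A minimal check against the orders table used above:
SET duckdb.force_execution = true;
-- A DuckDB-routed query shows one Custom Scan (DuckDBScan) node; a fallback shows the usual Postgres plan
EXPLAIN SELECT count(*) FROM orders WHERE created_at >= '2025-01-01';
-- Inspect the DuckDB-side plan instead of the wrapper node
EXPLAIN (DUCKDB) SELECT count(*) FROM orders WHERE created_at >= '2025-01-01';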
Step 4 — Build Native Columnar Tables
Pulling rows out of Postgres heap pages is fast, but you can do better. pg_duckdb supports DuckDB-native tables stored in MotherDuck or in DuckDB's own columnar format. Create one with the duckdb access method:
CREATE TABLE events_raw (
event_id bigint,
user_id bigint,
event_type text,
properties jsonb,
created_at timestamptz
) USING duckdb;
-- Bulk load straight from your row-store table
INSERT INTO events_raw
SELECT * FROM events
WHERE created_at >= now() - interval '90 days';
This data is now stored column-by-column with run-length encoding, dictionary compression, and bit-packing. You'll typically see a 5-15x reduction in size versus the equivalent Postgres heap table, and reads of a single column won't drag the others off disk.
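Because only the referenced columns are decoded, an aggregate like the sketch below never touches the properties blob at all; it runs against the events_raw table created above:
-- Reads only event_type and created_at; properties and the other columns stay on disk
SELECT event_type, count(*) AS events
FROM events_raw
WHERE created_at >= now() - interval '7 days'
GROUP BY 1
ORDER BY 2 DESC;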
Step 5 — Query Parquet, Iceberg, and Delta Lake on S3
This is where pg_duckdb earns its keep. You can join your live Postgres tables against Parquet sitting in S3 without ever moving the data:
-- Register S3 credentials once
SELECT duckdb.create_secret(
type => 'S3',
key_id => 'AKIA...',
secret => 'wJalr...',
region => 'us-east-1'
);
-- Read a single Parquet file
SELECT count(*)
FROM read_parquet('s3://my-bucket/events/2026-04-*.parquet');
-- Glob a partition tree and join with a live Postgres table
WITH s3_events AS (
SELECT *
FROM read_parquet('s3://lake/events/year=2026/month=*/day=*/*.parquet')
)
SELECT u.country, count(*) AS hits
FROM s3_events e
JOIN users u ON u.id = e.user_id -- this lives in Postgres
WHERE e.event_type = 'checkout'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;
DuckDB pushes filters down into the Parquet metadata, so it skips entire row groups whose min/max statistics rule out the predicate. On a partitioned Hive layout it skips entire directories. The join with users brings the matching rows back from Postgres on demand — small and fast.
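You can verify the pushdown rather than take it on faith: EXPLAIN (DUCKDB), described later in this guide, prints the DuckDB-side plan, where the event_type filter should appear inside the Parquet scan rather than as a separate filter step. The path is the same lake layout as the example above:
EXPLAIN (DUCKDB)
SELECT count(*)
FROM read_parquet('s3://lake/events/year=2026/month=*/day=*/*.parquet')
WHERE event_type = 'checkout';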
Iceberg and Delta Lake work the same way:
SELECT *
FROM iceberg_scan('s3://warehouse/db/orders', allow_moved_paths => true)
WHERE order_date >= '2026-01-01';
SELECT *
FROM delta_scan('s3://warehouse/db/clicks');
Step 6 — Sync to MotherDuck for Cloud Analytics
MotherDuck is a managed cloud DuckDB. Once you connect, you can use it as a transparent caching layer or as a separate analytics warehouse:
-- Authenticate (token from MotherDuck dashboard)
SELECT duckdb.enable_motherduck('eyJhbGciOi...');
-- Now MotherDuck databases appear as Postgres schemas
SELECT count(*) FROM ddb$my_warehouse$main.fact_sales;
-- Or write a derived table back into MotherDuck
CREATE TABLE ddb$my_warehouse$main.daily_revenue
USING duckdb
AS
SELECT date_trunc('day', created_at) AS day,
sum(amount_cents) / 100.0 AS revenue
FROM orders
GROUP BY 1;
This is the most underrated piece of the stack: your application keeps writing to Postgres, but reporting tools point at MotherDuck-backed views fed by pg_duckdb. No Kafka, no Debezium, no Airflow. Just SQL.
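One convenient pattern, assuming the MotherDuck-backed table behaves like any other relation once it appears as a Postgres schema, is to hide the ddb$ naming behind a plain view so BI tools see an ordinary relation. The names below come from the example above, except the reporting schema, which is illustrative:
CREATE SCHEMA IF NOT EXISTS reporting;
-- BI tools query reporting.daily_revenue; the data lives in MotherDuck
CREATE VIEW reporting.daily_revenue AS
SELECT day, revenue
FROM ddb$my_warehouse$main.daily_revenue;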
Step 7 — Tune for Production
The defaults work in development but you'll want to lock things down before customers see a bill. The minimum production checklist:
-- Cap DuckDB memory so it doesn't fight Postgres
ALTER SYSTEM SET duckdb.memory_limit = '16GB';
-- Restrict thread fanout per scan
ALTER SYSTEM SET duckdb.max_threads_per_postgres_scan = 2;
-- Block local file system reads on multi-tenant clusters
ALTER SYSTEM SET duckdb.disabled_filesystems = 'LocalFileSystem';
-- Disable unsigned extension loading
ALTER SYSTEM SET duckdb.allow_unsigned_extensions = false;
SELECT pg_reload_conf();
The thread cap is the most important knob. A single DuckDB scan with the default of one thread per Postgres scan is usually fine. If you're running a dedicated reporting replica with no OLTP traffic, raise it to NCPU / max_connections so a few concurrent reports can each use multiple cores without oversubscribing.
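On that kind of dedicated replica you can also raise the cap for a single heavy session rather than cluster-wide, assuming the GUC is settable at session level like the other duckdb.* settings; the value of 4 is only an illustration:
-- In the reporting session only: allow this session's scans to use more cores
SET duckdb.max_threads_per_postgres_scan = 4;

SELECT date_trunc('week', created_at) AS week,
       sum(amount_cents) / 100.0 AS revenue
FROM orders
GROUP BY 1
ORDER BY 1;

-- Back to the conservative default for anything else this session runs
RESET duckdb.max_threads_per_postgres_scan;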
How pg_duckdb Actually Works
Understanding the architecture stops you from being surprised in production. Three things happen when you SELECT through pg_duckdb:
- Planning hook. pg_duckdb installs a custom planner that runs after Postgres parses your SQL. It walks the rangetable, the targetlist, and the qual nodes, then asks: can DuckDB handle every leaf of this tree? If any leaf is a Postgres-only feature (a row-level security policy, a custom C function, a procedure call), it bails out and the regular executor takes over.
- Data movement. When the plan can run in DuckDB, the extension exposes Postgres heap rows through a postgres_scan table function that DuckDB calls. Rows are converted to DuckDB's columnar vectors in 2048-row batches. For native USING duckdb tables, the data is already columnar on disk and is read directly. For object storage, DuckDB's HTTP file system streams ranges over HTTPS using S3 byte-range requests.
- Result streaming. Output vectors are converted back into Postgres tuples and pushed up through the executor as if a normal sequential scan produced them. Cursors, COPY, EXPLAIN, and prepared statements all work.
Every DuckDB execution runs in a separate thread pool inside the Postgres backend. By default DuckDB uses one thread per CPU core, which can starve your OLTP workload if a heavy report runs at the same time. Cap it with duckdb.max_threads_per_postgres_scan (the default of 1 is sane in most production setups).
Real Numbers: What to Expect
Numbers from a public benchmark and from our own tests so you can sanity-check the marketing copy. Hardware: 16 vCPU, 64 GB RAM, gp3 EBS, Postgres 17 with 16 GB shared_buffers, 200M-row orders table, no warm cache.
| Query type | Vanilla Postgres | pg_duckdb | Speedup |
|---|---|---|---|
| Full-table count(*) | 23.4 s | 1.1 s | 21x |
| GROUP BY month, country (12 months, 200 countries) | 41.0 s | 1.8 s | 22x |
| Top-100 customers by revenue | 29.7 s | 0.9 s | 33x |
| Window function: 30-day rolling avg | 78.0 s | 3.2 s | 24x |
| Single-row PK lookup (OLTP) | 0.3 ms | 0.4 ms | 0.75x (slower) |
| INSERT 1 row | 0.6 ms | 0.6 ms | ≈ 1x |
Notice the bottom two rows. Single-row OLTP work goes through the regular executor either way — pg_duckdb only kicks in when its planner thinks DuckDB will win, and it adds a small overhead from the planner hook. That overhead is invisible at OLAP scale and a rounding error at OLTP scale, but it's why you should not blindly enable duckdb.force_execution globally on a write-heavy database.
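Instead of enabling it globally, scope the setting to the connections that need it. A sketch, assuming the GUC can be attached to a role like other session settings; the reporting role is hypothetical:
-- Hypothetical role used only by BI and reporting connections
CREATE ROLE reporting LOGIN;
-- Analytical routing for that role only; OLTP roles keep the default (off)
ALTER ROLE reporting SET duckdb.force_execution = true;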
Trade-Offs and When NOT to Use pg_duckdb
pg_duckdb is not a free upgrade. Be honest about these:
- Memory. DuckDB allocates its own buffer pool independent of Postgres's shared_buffers. On a 64 GB box you might already give Postgres 16 GB; DuckDB's default is 80% of system RAM, which will fight Postgres for cache. Set duckdb.memory_limit explicitly — typically 25-40% of system RAM if you're co-locating OLTP.
- Concurrency. Each Postgres backend gets its own DuckDB instance. Twenty concurrent reporting connections each running an 8-thread DuckDB scan = 160 worker threads on a 16-core box. You will thrash. Pool reporting users separately (e.g. via PgBouncer with a low pool_size) and never set duckdb.force_execution = on globally.
- Writes. DuckDB tables and Postgres tables are different storage. You cannot transactionally write to both in a single statement. Use one source of truth for OLTP (Postgres heap) and treat DuckDB tables as derived analytics.
- Some Postgres features don't apply. Row-level security, BRIN indexes, full-text search using tsvector, custom operator classes — DuckDB doesn't know about them. Queries that require them will fall back to the row executor.
- HA / replication. DuckDB-native tables aren't WAL-logged. Streaming replication only ships the Postgres heap. If you need replicas to see DuckDB tables, store them in MotherDuck or rebuild them with a scheduled job on the replica.
Rule of thumb: pg_duckdb is a read-mostly OLAP accelerator next to OLTP. It is not a replacement for either Postgres or a real warehouse. Use it when your queries are analytical, your data fits the box (or sits in S3), and you want to avoid running a second stack.
Common Pitfalls and Gotchas
Things that have bitten teams in the wild:
- EXPLAIN looks empty. When DuckDB executes a query, the Postgres EXPLAIN shows a single Custom Scan (DuckDBScan) node. Use EXPLAIN (DUCKDB) to see the DuckDB plan, including pushdowns and join strategies.
- Timezone mismatches. DuckDB defaults to UTC; Postgres defaults to your server's TZ. now() can return different values across the boundary. Run SET TimeZone = 'UTC'; at the top of analytical sessions.
- jsonb vs json. DuckDB treats JSON as a string type with operators. Postgres jsonb has GIN indexes and structural operators DuckDB doesn't. Heavy jsonb_path work should stay in Postgres or be flattened during ETL.
- Long-running queries hold a snapshot. A 30-minute DuckDB query keeps its Postgres transaction open, which blocks autovacuum from cleaning rows older than the snapshot. Watch pg_stat_activity.xact_start and your bloat metrics (see the monitoring query after this list).
- S3 secrets are session-scoped by default. Use duckdb.create_secret(... persistent => true) if you want them to survive reconnects, but be aware they're stored on disk in the data directory and protected only by file permissions.
- VACUUM FULL blocks DuckDB scans. Predictable but easy to forget — schedule heavy maintenance during reporting downtime.
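For the snapshot gotcha above, a stock-Postgres monitoring query is enough to catch offenders before autovacuum falls behind:
-- Sessions whose transaction has been open longer than 15 minutes
SELECT pid, usename, state, xact_start,
       now() - xact_start AS xact_age,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '15 minutes'
ORDER BY xact_start;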
Quick Reference
| Setting | Default | Recommended | What it does |
|---|---|---|---|
| duckdb.force_execution | off | off (per-session ON) | Auto-route SELECTs through DuckDB |
| duckdb.memory_limit | 80% of RAM | 25-40% of RAM | Hard cap on DuckDB buffer pool |
| duckdb.max_threads_per_postgres_scan | 1 | 1-4 | Parallelism per scan |
| duckdb.allow_unsigned_extensions | false | false | Permit untrusted DuckDB extensions |
| duckdb.disabled_filesystems | '' | 'LocalFileSystem' | Lock down on multi-tenant clusters |
Key Functions Cheat Sheet
| Function | Purpose |
|---|---|
| duckdb.query(sql) | Run arbitrary DuckDB SQL, return as table |
| duckdb.create_secret(...) | Register S3/GCS/Azure credentials |
| duckdb.enable_motherduck(token) | Connect to MotherDuck cloud |
| read_parquet(path) | Stream Parquet files (supports globs) |
| read_csv(path) | Stream CSV with type inference |
| iceberg_scan(path) | Read Apache Iceberg tables |
| delta_scan(path) | Read Delta Lake tables |
| EXPLAIN (DUCKDB) ... | Show the DuckDB plan, not the wrapper |
Next Steps
You now have a pg_duckdb instance, you've routed an analytical query through DuckDB, you've queried Parquet on S3, and you've built a native columnar table. From here:
- Wire pg_duckdb into your reporting stack — point Metabase, Superset, or Grafana at it. Most BI tools won't notice anything different, except that queries are suddenly fast.
- Move cold data to S3 as Parquet on a schedule, drop the Postgres partition, and let pg_duckdb keep it queryable (see the sketch after this list). This is a cheap path to retention without paying for hot disk.
- Try the MotherDuck integration if you want a managed analytics layer that your application database can publish to.
- Read the pg_duckdb release notes — the project moves fast, with new pushdowns and DuckDB version bumps every few weeks.
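For the cold-data idea above, here is a sketch of a monthly offload, assuming your pg_duckdb build supports COPY to an s3:// target (check the release notes for your version) and that orders is partitioned by month; the partition name is illustrative:
-- Write a closed month to Parquet in the lake
COPY (
    SELECT * FROM orders
    WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01'
) TO 's3://lake/orders/year=2025/month=01/orders.parquet';

-- After verifying the file, drop the hot-storage partition (name is illustrative)
-- DROP TABLE orders_2025_01;

-- The data stays queryable through read_parquet
SELECT count(*)
FROM read_parquet('s3://lake/orders/year=2025/month=*/orders.parquet');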
If your team has been considering ClickHouse or Snowflake for an analytics workload that's still small enough to fit on a single Postgres host, pg_duckdb is worth a serious afternoon. One database, one backup, one set of access controls — and the OLAP performance you were about to leave Postgres for.