
DuckDB: Analytics for “Not-So-Big” Data That Still Deserves Proper Engineering

In data teams, tooling choices often jump straight from Postgres to Spark. Once datasets stop fitting comfortably into a transactional database, the reflex is to introduce distributed systems, clusters, schedulers, and a lot of operational overhead. In practice, many analytical workloads don’t need any of that.

If your data fits on a laptop or a single VM, but your queries involve wide tables, heavy joins, window functions, and aggregations over tens or hundreds of millions of rows, you’re firmly in analytical territory — just not big data territory.

This is exactly where DuckDB fits.

DuckDB is best understood as a local, embedded OLAP engine that behaves like a serious analytical database, not a toy replacement for SQLite. For analytics engineers and SQL-first developers, it can simplify ETL pipelines, reduce dependency sprawl, and dramatically speed up local iteration.


What DuckDB Actually Is (and Isn’t)

DuckDB is an in-process, column-oriented relational database optimized for analytical queries. You link it to your application or script, and it runs in the same process — no daemon, no service, no cluster.

A few implications that matter in practice:

  • It is not designed for high-concurrency transactional workloads.
  • It is designed for scanning large datasets, doing joins, aggregations, and complex analytical SQL efficiently.
  • It assumes data locality (local disk, object storage, HTTP) rather than a constantly mutating OLTP store.

Key characteristics that are relevant for analytics engineering:

  • Columnar execution – only the columns referenced in the query are read.
  • Vectorized processing – operations are applied in batches, not row by row.
  • Single-file database – easy to version, move, or recreate.
  • First-class file access – CSV, Parquet, JSON, Excel, S3, HTTP.
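
To make the "in-process" and "single-file" points concrete, here is a minimal sketch (the file name analytics.duckdb and the table are illustrative): the entire database lives in one file next to your code, and there is nothing to start, configure, or administer beyond installing the library.

import duckdb

# The whole database is this one file; it is created on first connect
con = duckdb.connect("analytics.duckdb")

# A throwaway table just to show that state persists across runs
con.execute("CREATE OR REPLACE TABLE events AS SELECT range AS id FROM range(10)")
print(con.execute("SELECT COUNT(*) FROM events").fetchone())

con.close()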

If you’ve ever used Postgres for analytics and felt that “this should be faster,” DuckDB was built to answer that exact complaint.


Using DuckDB as a SQL Layer on Top of DataFrames

One of the most practical entry points is using DuckDB as a query engine for Pandas DataFrames. This is especially useful when Pandas logic starts to become unreadable or slow.

import duckdb
import pandas as pd

df = pd.DataFrame({
    "employee": ["Alice", "Bob", "Charlie"],
    "salary": [120000, 90000, 135000],
    "department": ["IT", "HR", "IT"]
})

# DuckDB resolves `df` by variable name via its replacement scans,
# so the DataFrame can be queried directly without registering it first
duckdb.query("""
    SELECT department,
           AVG(salary) AS avg_salary,
           COUNT(*)    AS headcount
    FROM df
    GROUP BY department
""").to_df()  # materialize the result back into a Pandas DataFrame

For exploratory analysis or notebook-based work, this immediately replaces chains of groupby, agg, and merge calls with readable SQL.
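
For a sense of what that replacement looks like, here is a rough Pandas equivalent of the query above (a sketch, not a line-for-line translation):

# Roughly the Pandas chain the SQL above replaces
summary = (
    df.groupby("department")
      .agg(avg_salary=("salary", "mean"), headcount=("salary", "size"))
      .reset_index()
)

With three columns and a single grouping this is still readable; the gap widens quickly once filters, joins, and window logic pile up.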

In production-grade scripts, explicit registration is cleaner and more predictable:

con = duckdb.connect()           # in-memory database by default
con.register("employees", df)    # expose the DataFrame under an explicit, stable name

result = con.execute("""
    SELECT *
    FROM employees
    WHERE salary > 100000
""").fetchdf()

This pattern is powerful because it lets you prototype transformations in Pandas and lock them down in SQL once they stabilize — a workflow that maps well to dbt-style thinking.


DuckDB for Serious Analytical Workloads

DuckDB’s performance story becomes more interesting once you stop thinking in “DataFrame size” and start thinking in scan volume.

In the NDC Oslo talk Analytics for not-so-big data with DuckDB, large fact tables (on the order of hundreds of millions to a billion rows) are queried on a single machine with response times in seconds. The key is that DuckDB assumes analytical access patterns and optimizes for them aggressively. (Reference: NDC Oslo talk by David Ostrovsky – https://www.youtube.com/watch?v=3iy9qaGopyw)

For example, pivot-style reporting queries that are painful in traditional databases are straightforward:

PIVOT lineitem
ON returnflag
USING SUM(extendedprice);

For analytics engineers, this matters because it enables local validation of logic that would otherwise require deploying to a warehouse or cluster just to test query semantics.
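
One way to do that validation locally is DuckDB's tpch extension, which can generate the benchmark dataset on your machine. A sketch, assuming the extension can be installed in your environment (note that the generated schema uses prefixed column names such as l_returnflag and l_extendedprice):

import duckdb

con = duckdb.connect()

# Generate a local TPC-H dataset to test query semantics against
con.execute("INSTALL tpch")
con.execute("LOAD tpch")
con.execute("CALL dbgen(sf = 1)")  # scale factor 1: lineitem has roughly 6 million rows

# Same pivot-style logic as above, against the generated schema
print(con.execute("""
    PIVOT lineitem
    ON l_returnflag
    USING SUM(l_extendedprice)
""").fetchdf())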


Querying Data Where It Lives (Without ETL First)

A common anti-pattern in analytics pipelines is “load everything first, then filter.” DuckDB encourages the opposite.

You can query Parquet files directly from disk, object storage, or even over HTTP:

SELECT
    COUNT(*) AS trips,
    AVG(distance) AS avg_distance
FROM 'https://raw.githubusercontent.com/.../trips.parquet'
WHERE pickup_year = 2023;
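
Remote access goes through the httpfs extension (recent DuckDB versions typically load it automatically when they see an http(s):// or s3:// path). A sketch for object storage, with placeholder bucket, region, and credentials:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # no-op if already installed
con.execute("LOAD httpfs")

# Placeholder credentials -- substitute your own
con.execute("SET s3_region = 'eu-west-1'")
con.execute("SET s3_access_key_id = 'YOUR_KEY_ID'")
con.execute("SET s3_secret_access_key = 'YOUR_SECRET'")

trips = con.execute("""
    SELECT COUNT(*) AS trips
    FROM 's3://your-bucket/trips/trips.parquet'
    WHERE pickup_year = 2023
""").fetchdf()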

Because Parquet is columnar and DuckDB understands its metadata, only the required columns and row groups are read. This makes DuckDB a strong choice for:

  • Profiling raw data drops
  • Validating upstream exports
  • Lightweight federation across files

In many cases, this replaces the need for a temporary staging database entirely.
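
The "lightweight federation" point deserves a small example: DuckDB will happily join files of different formats, including glob patterns, in a single query. A sketch with hypothetical file paths and columns:

import duckdb

# Join a set of Parquet drops against a CSV reference file in one pass
duckdb.sql("""
    SELECT c.segment,
           SUM(o.amount) AS revenue
    FROM 'exports/orders_*.parquet' AS o
    JOIN 'exports/customers.csv'    AS c USING (customer_id)
    GROUP BY c.segment
""").show()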


DuckDB as a Zero-Friction ETL Engine

For ETL tasks, DuckDB shines as a fast, declarative transformer. Instead of writing imperative scripts, you express transformations once in SQL and let the engine handle parallelism and memory management.

A simple but common example: converting large CSV files into Parquet.

duckdb -c "
COPY (
    SELECT *
    FROM read_csv_auto('events_raw.csv')
)
TO 'events.parquet'
(FORMAT PARQUET);
"

This approach has several advantages over Pandas-based ETL:

  • Automatic type inference
  • Parallel execution
  • Lower memory pressure
  • Reproducible transformations

For analytics engineers, this aligns well with ELT-style pipelines, where raw data is ingested as-is and refined through SQL transformations.
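
As a sketch of what that refinement can look like (the database file, table names, and columns are illustrative), the raw layer is landed as-is and the refined layer is pure SQL:

import duckdb

con = duckdb.connect("warehouse.duckdb")  # illustrative local database file

# Raw layer: land the export untouched
con.execute("""
    CREATE OR REPLACE TABLE raw_events AS
    SELECT * FROM read_parquet('events.parquet')
""")

# Refined layer: typed, filtered, deduplicated -- all declared in SQL
con.execute("""
    CREATE OR REPLACE TABLE fct_events AS
    SELECT DISTINCT
        CAST(event_time AS TIMESTAMP) AS event_time,
        user_id,
        event_type
    FROM raw_events
    WHERE event_type IS NOT NULL
""")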


Built-In Data Introspection

DuckDB includes practical inspection tools that reduce boilerplate:

  • DESCRIBE table – schema and types
  • SUMMARIZE table – distribution statistics for every column
  • EXPLAIN ANALYZE – physical query plan with timings

These are especially useful when optimizing transformations or validating assumptions about data distributions before modeling.
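
A quick sketch of what these look like from Python, run here against the events.parquet file from the ETL example (any table or Parquet file works):

import duckdb

# Schema and inferred types
duckdb.sql("DESCRIBE SELECT * FROM 'events.parquet'").show()

# Per-column statistics: min/max, approximate distinct counts, null percentage
duckdb.sql("SUMMARIZE SELECT * FROM 'events.parquet'").show()

# Physical plan with per-operator timings
duckdb.sql("EXPLAIN ANALYZE SELECT COUNT(*) FROM 'events.parquet'").show()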


Where DuckDB Fits in an Analytics Stack

DuckDB is not a replacement for a data warehouse. It is a strong complement to one.

Typical use cases where it fits naturally:

  • Local development and testing of analytical SQL
  • Lightweight ETL and format conversion
  • Ad-hoc analysis on large files
  • Reproducible analytical scripts without infrastructure overhead

If you think of it as “SQLite, but for analytics,” you’ll underuse it. If you think of it as a single-node analytical engine, it becomes much more interesting.


Further References

  • David Ostrovsky – Analytics for not-so-big data with DuckDB, NDC Oslo: https://www.youtube.com/watch?v=3iy9qaGopyw