Analytics for not-so-big data with DuckDB

7 minute read


In analytics engineering, tooling discussions are often presented as an either-or choice: use a transactional database like Postgres, or go all-in on a distributed engine like Spark. In practice, though, a huge share of analytical work lives somewhere in the middle. The data easily fits on a laptop or a single VM, but the queries themselves are anything but trivial—wide tables, joins across multiple fact datasets, window functions, and time-based aggregations are the norm.

This is exactly where DuckDB shines. It’s not a toy database, and it’s not trying to replace your data warehouse. Instead, it positions itself as a serious single-node analytical engine built for this middle ground.

You’ll often hear DuckDB described as “SQLite for analytics.” That analogy is helpful, but it only tells part of the story. What really matters is why this design works so well for modern ETL and analytics workflows.



What DuckDB Is (and Why It’s Built This Way)

At its core, DuckDB is an open-source, embeddable OLAP SQL database, written in C++ and first released in 2019. Like SQLite, it runs in-process as an embedded library: no server to manage, no daemon running in the background, and no external services to configure. That choice isn’t just about convenience; it fundamentally shapes how DuckDB is meant to be used.

From an analytics engineering point of view, this has some important implications:

  • DuckDB behaves like a library, not a piece of infrastructure.
  • Analytical logic can live right next to your code, tests, and version control.
  • Local development starts to resemble production logic, rather than drifting into one-off Pandas scripts that are hard to operationalize.

One of the most important consequences of this design shows up in how DuckDB stores and processes data.


Columnar Storage: Why DuckDB Behaves Like an Analytics Engine

Traditional relational databases such as Postgres, MySQL, or SQLite store data row by row. That layout is perfect for transactional workloads, where entire records are read or updated frequently—classic OLTP behavior.

Analytical workloads look very different. They tend to scan large datasets, touch only a subset of columns, and aggregate across many rows.

DuckDB is built around columnar storage, which means:

  • Only the columns referenced in a query are read from disk
  • Aggregations operate on tightly packed, contiguous memory
  • CPU caches are used far more efficiently

This is why DuckDB performs so well on time-series data, event logs, IoT streams, and large fact tables—the same types of data that typically sit at the heart of analytics and semantic layers.
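
A quick way to see this in practice is to ask DuckDB for the query plan: the Parquet scan should list only the columns the query actually touches. The file and column names below are placeholders for illustration:

-- hypothetical wide Parquet file; only device_id and temperature are scanned
EXPLAIN
SELECT device_id, AVG(temperature) AS avg_temp
FROM 'sensor_readings.parquet'
GROUP BY device_id;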


Vectorized Execution and Parallelism (Why It’s Fast on One Machine)

On top of its storage model, DuckDB uses a vectorized execution engine. Instead of processing one row at a time, it works on batches (vectors) of values. This design allows DuckDB to:

  • Minimize interpreter overhead
  • Take advantage of low-level CPU optimizations
  • Parallelize execution across cores by default

The result is surprisingly strong performance on a single machine. DuckDB can comfortably process hundreds of millions—or even billions—of rows locally, as shown in the NDC Oslo talk Analytics for not-so-big data with DuckDB.

For analytics engineers, this is especially valuable. It means you can validate transformations, query plans, and performance characteristics locally, without having to deploy everything to a data warehouse just to see how it behaves.
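
Parallelism is on by default, but you can inspect and cap it per session. A minimal sketch (the thread count below is just an example):

-- how many threads will DuckDB use?
SELECT current_setting('threads');

-- cap parallelism, e.g. on a shared machine
SET threads TO 4;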


SQL as the Primary Interface

DuckDB speaks standard SQL. There is no custom DSL, no hidden API layer. Creating tables, inserting data, querying results — all behave as expected:

CREATE TABLE users (
    id INTEGER,
    name VARCHAR
);

INSERT INTO users VALUES (1, 'Jeff');

SELECT name FROM users;

This makes DuckDB particularly attractive for SQL-first teams, especially those already using dbt or warehouse-native transformations. Logic written in DuckDB SQL is highly portable.

You can start interacting immediately via the CLI:

duckdb    # starts an interactive shell against an in-memory database

Querying Files Directly

One of DuckDB’s most impactful features for ETL workflows is its ability to query files directly, without an import step.

SELECT * FROM 'data.csv';
SELECT * FROM 'data.parquet';

This shifts how you think about extraction:

  • CSV and Parquet become first-class query sources
  • You can validate schema, distributions, and data quality before loading
  • Intermediate staging tables are often unnecessary

Because DuckDB understands Parquet metadata and uses columnar execution, it will only read the columns required to answer a query — even when the file lives remotely (HTTP or object storage).

This is a practical alternative to spinning up temporary databases just to inspect raw drops.
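
Remote reads go through the httpfs extension; for Parquet, DuckDB only fetches the byte ranges it needs. A minimal sketch with a placeholder URL:

INSTALL httpfs;
LOAD httpfs;

-- placeholder URL; only the required columns and row groups are fetched
SELECT count(*)
FROM 'https://example.com/exports/events.parquet';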


DuckDB as a SQL Layer on Top of Pandas

DuckDB integrates tightly with Python and Pandas, which makes it useful as a bridge between exploratory analysis and production SQL.

import duckdb
import pandas as pd

df = pd.DataFrame({
    "employee": ["Alice", "Bob", "Charlie"],
    "salary": [120000, 90000, 135000],
    "department": ["IT", "HR", "IT"]
})

# the query can reference the local DataFrame `df` directly by name
duckdb.query("""
    SELECT department,
           AVG(salary) AS avg_salary,
           COUNT(*)    AS headcount
    FROM df
    GROUP BY department
""").to_df()

For more structured codebases, explicit registration is preferable:

con = duckdb.connect()           # in-memory database by default
con.register("employees", df)    # register the DataFrame under an explicit name

con.execute("""
    SELECT *
    FROM employees
    WHERE salary > 100000
""").fetchdf()

This pattern works well when:

  • Prototyping transformations in notebooks
  • Gradually moving logic into SQL
  • Avoiding complex Pandas pipelines that are hard to test and review

(For a practical walkthrough, see ArjanCodes: Stop Struggling with DataFrames – Try DuckDB.)


DuckDB as a Lightweight ETL Engine

DuckDB is often faster and simpler than custom Python scripts for format conversion and transformation tasks.

Example: converting a large CSV into Parquet.

duckdb -c "
COPY (
    SELECT *
    FROM read_csv_auto('events_raw.csv')
)
TO 'events.parquet'
(FORMAT PARQUET);
"

Why this works well in practice:

  • Automatic schema inference
  • Parallel execution across CPU cores
  • Deterministic, SQL-defined transformations
  • Minimal memory overhead compared to Pandas

For analytics engineers, this fits naturally into ELT-style pipelines, where raw data is preserved and refined through SQL transformations.
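
The same COPY mechanism can also write Hive-style partitioned output, which is convenient when downstream tools expect partitioned Parquet. A sketch, assuming the raw file contains an event_time column to derive the partition key from:

COPY (
    SELECT *, CAST(event_time AS DATE) AS event_date
    FROM read_csv_auto('events_raw.csv')
)
TO 'events_partitioned'
(FORMAT PARQUET, PARTITION_BY (event_date));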


Analytical Aggregations and Time-Based Queries

DuckDB excels at analytical queries on time-series and event data:

SELECT
    symbol,
    AVG(price),
    MAX(price),
    MIN(price)
FROM 'stocks.parquet'
WHERE date > now() - interval '1 day'
GROUP BY symbol;

Under the hood, this query is executed using vectorized batches and multi-threaded execution, which explains why it scales well even without distributed infrastructure.
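
Window functions follow the same execution path. A sketch of a 7-row moving average over the same stocks.parquet columns:

SELECT
    symbol,
    date,
    price,
    AVG(price) OVER (
        PARTITION BY symbol
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7
FROM 'stocks.parquet';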


Built-In Introspection for Data and Queries

DuckDB includes several features that are especially useful during modeling and optimization:

  • DESCRIBE table – inspect schema and types
  • SUMMARIZE table – get column-level statistics
  • EXPLAIN ANALYZE – see the physical query plan and execution time

These tools reduce guesswork when tuning transformations or validating assumptions about data distributions.
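
All three work on files as well as tables, which makes them useful before any data is loaded. A short sketch reusing the files from the earlier examples:

DESCRIBE SELECT * FROM 'events.parquet';
SUMMARIZE SELECT * FROM 'events.parquet';

EXPLAIN ANALYZE
SELECT symbol, AVG(price) AS avg_price
FROM 'stocks.parquet'
GROUP BY symbol;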


Adoption and Real-World Usage

DuckDB is not an experimental project. It is used by organizations such as Meta, Google, and Airbnb, typically embedded inside data tools, notebooks, or internal pipelines rather than exposed as a user-facing database.

This usage pattern reinforces its core strength: analytical computation without operational overhead.

(Overview reference: Fireship – DuckDB in 100 Seconds: https://www.youtube.com/watch?v=uHm6FEb2Re4)


Where DuckDB Fits in an Analytics Stack

DuckDB does not replace a data warehouse. It complements one.

It is particularly well-suited for:

  • Local development of analytical SQL
  • Data profiling and validation
  • Lightweight ETL and file transformations
  • Reproducible analytics without infrastructure setup

If your current workflow oscillates between fragile Pandas scripts and heavyweight distributed systems, DuckDB often fills the gap more cleanly than either extreme.


Further References