DuckDB for Stream Processing: Sliding Windows Without Flink

Use DuckDB for stream processing with sliding windows, micro-batch ingestion, and SQL window functions — no Flink or heavy infrastructure required.

You don't always need Flink to answer a simple question like:

"What's the rolling error rate over the last 5 minutes?"

A lot of teams reach for heavyweight stream processing — Flink, Kafka Streams, ksqlDB — when what they actually want is:

A stream of events
Some sliding windows
A dashboard or alert

If that sounds familiar, DuckDB might be a surprisingly good fit. It's not a streaming engine in the strict sense, but with the right patterns you can get stream-like behavior and sliding windows using nothing more exotic than SQL and micro-batches.

Let's walk through how.

DuckDB Is Not Flink… and That's the Point

DuckDB brands itself as an "in-process OLAP database," not a streaming engine. It's:

Embedded (runs in your process, like SQLite)
Columnar and vectorized (very fast scans/aggregations)
Great at analytical queries over local files (Parquet, CSV, etc.)
Easy to call from Python, Node, or your backend

So why even talk about stream processing with DuckDB?

Because in many real systems, the "stream" is actually:

A Kafka topic consumed in small batches
A log file that's being appended to
An object store (S3, GCS) where data lands in minute-based Parquet files
A simple API endpoint that gets events and writes them somewhere

In those cases, you can often treat stream processing as fast, incremental batch processing, and DuckDB shines there.

Architecture: Sliding Windows as Fast Micro-Batches

Let's sketch a simple setup.

Imagine you have an event source (Kafka, Kinesis, or just your app), and you want recent metrics:

[ Event Source ]
      |
      v
[ Ingestion Worker ]  --appends-->  [ DuckDB Event Table ]
                                       |
                                       v
                             [ Windowed SQL Queries ]
                                       |
                                       v
                               [ Dashboard / Alerts ]

[ Event Source ]
      |
      v
[ Ingestion Worker ]  --appends-->  [ DuckDB Event Table ]
                                       |
                                       v
                             [ Windowed SQL Queries ]
                                       |
                                       v
                               [ Dashboard / Alerts ]

The pattern:

Ingestion worker pulls new events every N seconds.
Writes them into a DuckDB table (on disk or in-memory with periodic snapshots).
Window queries run on that table to compute sliding metrics.

No distributed cluster. No job manager. Just your app process and a DuckDB database.

Is it "real-time"? Not sub-second, but for a ton of business use cases, 5–30 seconds of latency is perfectly fine.

Step 1: Modeling the Stream as an Append-Only Table

Let's assume events look like this:

event_time (TIMESTAMP)
user_id (TEXT)
event_type (TEXT)
value (FLOAT or INT)

We'll create an append-only table in DuckDB:

CREATE TABLE events (
    event_time TIMESTAMP NOT NULL,
    user_id    TEXT NOT NULL,
    event_type TEXT NOT NULL,
    value      DOUBLE
);

CREATE TABLE events (
    event_time TIMESTAMP NOT NULL,
    user_id    TEXT NOT NULL,
    event_type TEXT NOT NULL,
    value      DOUBLE
);

Your ingestion layer (Python, Node, Go, whatever) simply inserts new rows as they arrive or in small batches.

In Python + DuckDB, a tiny ingestion sketch might look like:

import duckdb
import pandas as pd

con = duckdb.connect("events.duckdb")

def ingest_batch(df: pd.DataFrame):
    # df has columns: event_time, user_id, event_type, value
    con.execute("INSERT INTO events SELECT * FROM df", {"df": df})

import duckdb
import pandas as pd

con = duckdb.connect("events.duckdb")

def ingest_batch(df: pd.DataFrame):
    # df has columns: event_time, user_id, event_type, value
    con.execute("INSERT INTO events SELECT * FROM df", {"df": df})

You call ingest_batch every few seconds or every N messages. That's your "stream" in micro-batch form.

Step 2: Sliding Windows with DuckDB Window Functions

Now the fun part: sliding windows in pure SQL.

Let's say you want, for each user, a rolling count of error events over the last 5 minutes, evaluated at each event.

In DuckDB, you can use RANGE window frames on timestamps:

SELECT
    event_time,
    user_id,
    COUNT(*) FILTER (WHERE event_type = 'error') OVER (
        PARTITION BY user_id
        ORDER BY event_time
        RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
    ) AS error_count_5m
FROM events
WHERE event_time >= NOW() - INTERVAL 10 MINUTES
ORDER BY event_time;

SELECT
    event_time,
    user_id,
    COUNT(*) FILTER (WHERE event_type = 'error') OVER (
        PARTITION BY user_id
        ORDER BY event_time
        RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
    ) AS error_count_5m
FROM events
WHERE event_time >= NOW() - INTERVAL 10 MINUTES
ORDER BY event_time;

What's happening:

PARTITION BY user_id → windows are per user.
ORDER BY event_time → sliding over time.
RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING → 5-minute sliding window.
We filter recent events (>= NOW() - 10 MINUTES) so we don't scan the entire history.

This gives you the classic "rolling window" behavior you'd expect from Flink's sliding windows, but in a single query.

Step 3: Tumbling vs Sliding Windows in DuckDB

You might want both:

Sliding windows for per-event metrics
Tumbling windows for dashboard aggregates (e.g., per minute)

Tumbling windows (per minute, per event_type)

WITH bucketed AS (
    SELECT
        date_trunc('minute', event_time) AS minute_bucket,
        event_type,
        COUNT(*) AS event_count
    FROM events
    WHERE event_time >= NOW() - INTERVAL 1 HOUR
    GROUP BY 1, 2
)
SELECT *
FROM bucketed
ORDER BY minute_bucket, event_type;

WITH bucketed AS (
    SELECT
        date_trunc('minute', event_time) AS minute_bucket,
        event_type,
        COUNT(*) AS event_count
    FROM events
    WHERE event_time >= NOW() - INTERVAL 1 HOUR
    GROUP BY 1, 2
)
SELECT *
FROM bucketed
ORDER BY minute_bucket, event_type;

Here, date_trunc('minute', event_time) acts like a 1-minute tumbling window.

Sliding windows (e.g., 5-minute moving average of value)

SELECT
    event_time,
    AVG(value) OVER (
        ORDER BY event_time
        RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
    ) AS value_5m_avg
FROM events
WHERE event_time >= NOW() - INTERVAL 15 MINUTES
ORDER BY event_time;

SELECT
    event_time,
    AVG(value) OVER (
        ORDER BY event_time
        RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
    ) AS value_5m_avg
FROM events
WHERE event_time >= NOW() - INTERVAL 15 MINUTES
ORDER BY event_time;

You can mix and match these patterns to build near real-time analytics.

Example: Latency Dashboard Without a Streaming Stack

Let's make it concrete.

Say you're tracking API latencies:

Each event = one request
event_type = 'request'
value = latency_ms

You want:

Per-endpoint 1-minute p95 latency
Rolling 5-minute error rate

Schema tweak: add endpoint and status_code:

ALTER TABLE events
ADD COLUMN endpoint TEXT,
ADD COLUMN status_code INT;

ALTER TABLE events
ADD COLUMN endpoint TEXT,
ADD COLUMN status_code INT;

1-minute p95 latency per endpoint

WITH last_minute AS (
    SELECT *
    FROM events
    WHERE event_time >= NOW() - INTERVAL 1 MINUTE
      AND event_type = 'request'
)
SELECT
    endpoint,
    approx_quantile(value, 0.95) AS p95_latency_ms,
    COUNT(*) AS request_count
FROM last_minute
GROUP BY endpoint
ORDER BY p95_latency_ms DESC;

WITH last_minute AS (
    SELECT *
    FROM events
    WHERE event_time >= NOW() - INTERVAL 1 MINUTE
      AND event_type = 'request'
)
SELECT
    endpoint,
    approx_quantile(value, 0.95) AS p95_latency_ms,
    COUNT(*) AS request_count
FROM last_minute
GROUP BY endpoint
ORDER BY p95_latency_ms DESC;

Rolling 5-minute error rate per endpoint

We define "error" as status_code >= 500.

SELECT
    event_time,
    endpoint,
    SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) OVER (
        PARTITION BY endpoint
        ORDER BY event_time
        RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
    )::DOUBLE
    /
    COUNT(*) OVER (
        PARTITION BY endpoint
        ORDER BY event_time
        RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
    ) AS error_rate_5m
FROM events
WHERE event_time >= NOW() - INTERVAL 15 MINUTES
  AND event_type = 'request'
ORDER BY event_time;

SELECT
    event_time,
    endpoint,
    SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) OVER (
        PARTITION BY endpoint
        ORDER BY event_time
        RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
    )::DOUBLE
    /
    COUNT(*) OVER (
        PARTITION BY endpoint
        ORDER BY event_time
        RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
    ) AS error_rate_5m
FROM events
WHERE event_time >= NOW() - INTERVAL 15 MINUTES
  AND event_type = 'request'
ORDER BY event_time;

You can pipe these queries into:

A simple web dashboard (FastAPI / Flask + charts)
A CLI report
A scheduled job that pushes metrics to Prometheus/Influx

No Flink job. No YAML-heavy pipelines. Just micro-batches and SQL.

Handling Out-of-Order Events (Without Losing Your Mind)

Streaming engines like Flink have explicit watermarks to deal with out-of-order events.

With DuckDB, you don't get built-in watermarks, but you can approximate the idea:

Only consider events with event_time <= NOW() - grace_interval for "final" aggregates.
Keep a small "late events" buffer if you care about corrections.

For example, to compute stable 1-minute windows with 30 seconds of grace:

WITH horizon AS (
    SELECT NOW() - INTERVAL 30 SECONDS AS watermark
),
eligible AS (
    SELECT e.*
    FROM events e, horizon h
    WHERE e.event_time <= h.watermark
      AND e.event_time >= h.watermark - INTERVAL 1 MINUTE
)
SELECT
    date_trunc('minute', event_time) AS minute_bucket,
    COUNT(*) AS event_count
FROM eligible
GROUP BY 1
ORDER BY minute_bucket;

WITH horizon AS (
    SELECT NOW() - INTERVAL 30 SECONDS AS watermark
),
eligible AS (
    SELECT e.*
    FROM events e, horizon h
    WHERE e.event_time <= h.watermark
      AND e.event_time >= h.watermark - INTERVAL 1 MINUTE
)
SELECT
    date_trunc('minute', event_time) AS minute_bucket,
    COUNT(*) AS event_count
FROM eligible
GROUP BY 1
ORDER BY minute_bucket;

It's more manual than Flink's declarative watermarks, but for smaller systems it's usually good enough.

Where DuckDB Streaming Shines (and Where It Doesn't)

Great fits

DuckDB as a "stream processor" makes a lot of sense when:

You're building local or embedded analytics (per-tenant metrics, dashboards, internal tools).
You can tolerate seconds of latency instead of strict sub-second processing.
Your data volume fits comfortably on a single machine (or a small VM).
You want to iterate in SQL, not manage a distributed cluster.

Not so great

You probably don't want to rely solely on DuckDB if:

You need exactly-once semantics across multiple systems.
Your event volume is massive (billions of events per hour, across many partitions).
You need complex stream joins across high-velocity topics with tight SLAs.

In those worlds, something like Flink or Kafka Streams is still the right hammer.

But the point is: a huge chunk of "real-time analytics" problems live below that complexity line.

And DuckDB absolutely eats that space.

Wrapping Up: Sliding Windows Without the Heavy Machinery

Using DuckDB for stream processing is really about reframing:

Treat your stream as fast, append-only batches.
Let DuckDB's window functions and aggregations handle sliding and tumbling windows.
Keep the architecture boring: ingestion worker → DuckDB → SQL → metrics.

You get:

Minimal operational overhead
Familiar SQL-based windowing
A path from "prototype on laptop" to "small production instance" without changing tools

If this sparked ideas:

Comment with your current stack (Flink, Kafka, plain cron jobs?) and what you'd love to simplify.
Follow for more practical DuckDB patterns, from local analytics to lakehouse-style workflows.
Share this with the teammate who keeps saying "we need Flink" for every dashboard — maybe you don't.

Contents