November 24, 2025
DuckDB for Stream Processing: Sliding Windows Without Flink
How to get real-time-ish analytics and sliding windows with DuckDB, without dragging in a full-blown stream processing stack.
Nikulsinh Rajput
5 min read
Use DuckDB for stream processing with sliding windows, micro-batch ingestion, and SQL window functions — no Flink or heavy infrastructure required.
You don't always need Flink to answer a simple question like:
"What's the rolling error rate over the last 5 minutes?"
A lot of teams reach for heavyweight stream processing — Flink, Kafka Streams, ksqlDB — when what they actually want is:
- A stream of events
- Some sliding windows
- A dashboard or alert
If that sounds familiar, DuckDB might be a surprisingly good fit. It's not a streaming engine in the strict sense, but with the right patterns you can get stream-like behavior and sliding windows using nothing more exotic than SQL and micro-batches.
Let's walk through how.
DuckDB Is Not Flink… and That's the Point
DuckDB brands itself as an "in-process OLAP database," not a streaming engine. It's:
- Embedded (runs in your process, like SQLite)
- Columnar and vectorized (very fast scans/aggregations)
- Great at analytical queries over local files (Parquet, CSV, etc.)
- Easy to call from Python, Node, or your backend
So why even talk about stream processing with DuckDB?
Because in many real systems, the "stream" is actually:
- A Kafka topic consumed in small batches
- A log file that's being appended to
- An object store (S3, GCS) where data lands in minute-based Parquet files
- A simple API endpoint that gets events and writes them somewhere
In those cases, you can often treat stream processing as fast, incremental batch processing, and DuckDB shines there.
Architecture: Sliding Windows as Fast Micro-Batches
Let's sketch a simple setup.
Imagine you have an event source (Kafka, Kinesis, or just your app), and you want recent metrics:
[ Event Source ]
|
v
[ Ingestion Worker ] --appends--> [ DuckDB Event Table ]
|
v
[ Windowed SQL Queries ]
|
v
[ Dashboard / Alerts ][ Event Source ]
|
v
[ Ingestion Worker ] --appends--> [ DuckDB Event Table ]
|
v
[ Windowed SQL Queries ]
|
v
[ Dashboard / Alerts ]The pattern:
- Ingestion worker pulls new events every N seconds.
- Writes them into a DuckDB table (on disk or in-memory with periodic snapshots).
- Window queries run on that table to compute sliding metrics.
No distributed cluster. No job manager. Just your app process and a DuckDB database.
Is it "real-time"? Not sub-second, but for a ton of business use cases, 5–30 seconds of latency is perfectly fine.
Step 1: Modeling the Stream as an Append-Only Table
Let's assume events look like this:
event_time(TIMESTAMP)user_id(TEXT)event_type(TEXT)value(FLOAT or INT)
We'll create an append-only table in DuckDB:
CREATE TABLE events (
event_time TIMESTAMP NOT NULL,
user_id TEXT NOT NULL,
event_type TEXT NOT NULL,
value DOUBLE
);CREATE TABLE events (
event_time TIMESTAMP NOT NULL,
user_id TEXT NOT NULL,
event_type TEXT NOT NULL,
value DOUBLE
);Your ingestion layer (Python, Node, Go, whatever) simply inserts new rows as they arrive or in small batches.
In Python + DuckDB, a tiny ingestion sketch might look like:
import duckdb
import pandas as pd
con = duckdb.connect("events.duckdb")
def ingest_batch(df: pd.DataFrame):
# df has columns: event_time, user_id, event_type, value
con.execute("INSERT INTO events SELECT * FROM df", {"df": df})import duckdb
import pandas as pd
con = duckdb.connect("events.duckdb")
def ingest_batch(df: pd.DataFrame):
# df has columns: event_time, user_id, event_type, value
con.execute("INSERT INTO events SELECT * FROM df", {"df": df})You call ingest_batch every few seconds or every N messages. That's your "stream" in micro-batch form.
Step 2: Sliding Windows with DuckDB Window Functions
Now the fun part: sliding windows in pure SQL.
Let's say you want, for each user, a rolling count of error events over the last 5 minutes, evaluated at each event.
In DuckDB, you can use RANGE window frames on timestamps:
SELECT
event_time,
user_id,
COUNT(*) FILTER (WHERE event_type = 'error') OVER (
PARTITION BY user_id
ORDER BY event_time
RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
) AS error_count_5m
FROM events
WHERE event_time >= NOW() - INTERVAL 10 MINUTES
ORDER BY event_time;SELECT
event_time,
user_id,
COUNT(*) FILTER (WHERE event_type = 'error') OVER (
PARTITION BY user_id
ORDER BY event_time
RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
) AS error_count_5m
FROM events
WHERE event_time >= NOW() - INTERVAL 10 MINUTES
ORDER BY event_time;What's happening:
PARTITION BY user_id→ windows are per user.ORDER BY event_time→ sliding over time.RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING→ 5-minute sliding window.- We filter recent events (
>= NOW() - 10 MINUTES) so we don't scan the entire history.
This gives you the classic "rolling window" behavior you'd expect from Flink's sliding windows, but in a single query.
Step 3: Tumbling vs Sliding Windows in DuckDB
You might want both:
- Sliding windows for per-event metrics
- Tumbling windows for dashboard aggregates (e.g., per minute)
Tumbling windows (per minute, per event_type)
WITH bucketed AS (
SELECT
date_trunc('minute', event_time) AS minute_bucket,
event_type,
COUNT(*) AS event_count
FROM events
WHERE event_time >= NOW() - INTERVAL 1 HOUR
GROUP BY 1, 2
)
SELECT *
FROM bucketed
ORDER BY minute_bucket, event_type;WITH bucketed AS (
SELECT
date_trunc('minute', event_time) AS minute_bucket,
event_type,
COUNT(*) AS event_count
FROM events
WHERE event_time >= NOW() - INTERVAL 1 HOUR
GROUP BY 1, 2
)
SELECT *
FROM bucketed
ORDER BY minute_bucket, event_type;Here, date_trunc('minute', event_time) acts like a 1-minute tumbling window.
Sliding windows (e.g., 5-minute moving average of value)
SELECT
event_time,
AVG(value) OVER (
ORDER BY event_time
RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
) AS value_5m_avg
FROM events
WHERE event_time >= NOW() - INTERVAL 15 MINUTES
ORDER BY event_time;SELECT
event_time,
AVG(value) OVER (
ORDER BY event_time
RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
) AS value_5m_avg
FROM events
WHERE event_time >= NOW() - INTERVAL 15 MINUTES
ORDER BY event_time;You can mix and match these patterns to build near real-time analytics.
Example: Latency Dashboard Without a Streaming Stack
Let's make it concrete.
Say you're tracking API latencies:
- Each event = one request
event_type = 'request'value = latency_ms
You want:
- Per-endpoint 1-minute p95 latency
- Rolling 5-minute error rate
Schema tweak: add endpoint and status_code:
ALTER TABLE events
ADD COLUMN endpoint TEXT,
ADD COLUMN status_code INT;ALTER TABLE events
ADD COLUMN endpoint TEXT,
ADD COLUMN status_code INT;1-minute p95 latency per endpoint
WITH last_minute AS (
SELECT *
FROM events
WHERE event_time >= NOW() - INTERVAL 1 MINUTE
AND event_type = 'request'
)
SELECT
endpoint,
approx_quantile(value, 0.95) AS p95_latency_ms,
COUNT(*) AS request_count
FROM last_minute
GROUP BY endpoint
ORDER BY p95_latency_ms DESC;WITH last_minute AS (
SELECT *
FROM events
WHERE event_time >= NOW() - INTERVAL 1 MINUTE
AND event_type = 'request'
)
SELECT
endpoint,
approx_quantile(value, 0.95) AS p95_latency_ms,
COUNT(*) AS request_count
FROM last_minute
GROUP BY endpoint
ORDER BY p95_latency_ms DESC;Rolling 5-minute error rate per endpoint
We define "error" as status_code >= 500.
SELECT
event_time,
endpoint,
SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) OVER (
PARTITION BY endpoint
ORDER BY event_time
RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
)::DOUBLE
/
COUNT(*) OVER (
PARTITION BY endpoint
ORDER BY event_time
RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
) AS error_rate_5m
FROM events
WHERE event_time >= NOW() - INTERVAL 15 MINUTES
AND event_type = 'request'
ORDER BY event_time;SELECT
event_time,
endpoint,
SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) OVER (
PARTITION BY endpoint
ORDER BY event_time
RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
)::DOUBLE
/
COUNT(*) OVER (
PARTITION BY endpoint
ORDER BY event_time
RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
) AS error_rate_5m
FROM events
WHERE event_time >= NOW() - INTERVAL 15 MINUTES
AND event_type = 'request'
ORDER BY event_time;You can pipe these queries into:
- A simple web dashboard (FastAPI / Flask + charts)
- A CLI report
- A scheduled job that pushes metrics to Prometheus/Influx
No Flink job. No YAML-heavy pipelines. Just micro-batches and SQL.
Handling Out-of-Order Events (Without Losing Your Mind)
Streaming engines like Flink have explicit watermarks to deal with out-of-order events.
With DuckDB, you don't get built-in watermarks, but you can approximate the idea:
- Only consider events with
event_time <= NOW() - grace_intervalfor "final" aggregates. - Keep a small "late events" buffer if you care about corrections.
For example, to compute stable 1-minute windows with 30 seconds of grace:
WITH horizon AS (
SELECT NOW() - INTERVAL 30 SECONDS AS watermark
),
eligible AS (
SELECT e.*
FROM events e, horizon h
WHERE e.event_time <= h.watermark
AND e.event_time >= h.watermark - INTERVAL 1 MINUTE
)
SELECT
date_trunc('minute', event_time) AS minute_bucket,
COUNT(*) AS event_count
FROM eligible
GROUP BY 1
ORDER BY minute_bucket;WITH horizon AS (
SELECT NOW() - INTERVAL 30 SECONDS AS watermark
),
eligible AS (
SELECT e.*
FROM events e, horizon h
WHERE e.event_time <= h.watermark
AND e.event_time >= h.watermark - INTERVAL 1 MINUTE
)
SELECT
date_trunc('minute', event_time) AS minute_bucket,
COUNT(*) AS event_count
FROM eligible
GROUP BY 1
ORDER BY minute_bucket;It's more manual than Flink's declarative watermarks, but for smaller systems it's usually good enough.
Where DuckDB Streaming Shines (and Where It Doesn't)
Great fits
DuckDB as a "stream processor" makes a lot of sense when:
- You're building local or embedded analytics (per-tenant metrics, dashboards, internal tools).
- You can tolerate seconds of latency instead of strict sub-second processing.
- Your data volume fits comfortably on a single machine (or a small VM).
- You want to iterate in SQL, not manage a distributed cluster.
Not so great
You probably don't want to rely solely on DuckDB if:
- You need exactly-once semantics across multiple systems.
- Your event volume is massive (billions of events per hour, across many partitions).
- You need complex stream joins across high-velocity topics with tight SLAs.
In those worlds, something like Flink or Kafka Streams is still the right hammer.
But the point is: a huge chunk of "real-time analytics" problems live below that complexity line.
And DuckDB absolutely eats that space.
Wrapping Up: Sliding Windows Without the Heavy Machinery
Using DuckDB for stream processing is really about reframing:
- Treat your stream as fast, append-only batches.
- Let DuckDB's window functions and aggregations handle sliding and tumbling windows.
- Keep the architecture boring: ingestion worker → DuckDB → SQL → metrics.
You get:
- Minimal operational overhead
- Familiar SQL-based windowing
- A path from "prototype on laptop" to "small production instance" without changing tools
If this sparked ideas:
- Comment with your current stack (Flink, Kafka, plain cron jobs?) and what you'd love to simplify.
- Follow for more practical DuckDB patterns, from local analytics to lakehouse-style workflows.
- Share this with the teammate who keeps saying "we need Flink" for every dashboard — maybe you don't.