Data engineering often seems to require a credit card and a cloud subscription, but you can master modern data transformation entirely on your local machine.
In this article, we are going to build a functional ELT (Extract, Load, Transform) pipeline using DuckDB (our analytical engine) and dbt (our transformation layer).
Here is what we will cover:
- The Architecture: Understanding the "Local Stack" approach.
- The Ingestion: Using Python to load raw business data into DuckDB.
- The Configuration: Setting up dbt-duckdb to connect to our local database.
- The Transformation: Building dedicated data marts for the Sales and Finance teams.
Let's get started!
Why DuckDB & dbt?
Before we start coding, let's understand our tools. For datasets from roughly 100 GB up to a few terabytes, DuckDB gives you much of the power of a warehouse like Snowflake without any cloud configuration: it processes data remarkably fast, right on your laptop.
Using dbt can bring software engineering best practices — like version control, dependency management, and automated testing — to your SQL code, turning a folder of scripts into a reliable, documented, and professional data transformation layer.
Next, let's explore how we can utilize these two tools to assist a coffee shop.
Our ELT Data Pipeline
You have been hired as the Data Engineer for a growing chain of coffee shops. Right now, their data is scattered across disconnected files:
- Sales Data sits in CSV exports from their Point-of-Sale (POS) system.
- Budget Data sits in Excel sheets maintained by the finance department.
To report on the business, someone manually copies and pastes rows into a master spreadsheet. This is slow, error-prone, and does not scale.
Your job is to build an ELT pipeline that extracts data from these sources, loads it into a centralized database (DuckDB), and transforms the raw data into clean data marts for the Sales and Finance teams using dbt.
Step 1: Extract and Load
First, we need to get the data out of the files and into DuckDB.
1–1: Create a project folder and install the necessary libraries
pip install duckdb dbt-duckdb pandas
1–2: Create a file named load.py. This script will create our database and load the raw files into DuckDB.
import duckdb

def load_data():
    # Connect to a persistent DuckDB file
    # If the file doesn't exist, this creates it
    con = duckdb.connect('coffeeshop.duckdb')
    print("🚀 Starting ELT Process...")

    # 1. Create a schema for raw data
    con.sql("CREATE SCHEMA IF NOT EXISTS raw;")

    # 2. Load Sales Data (CSV)
    # read_csv_auto is a DuckDB magic function that infers types
    print("... Loading Sales Data")
    con.sql("""
        CREATE OR REPLACE TABLE raw.source_sales AS
        SELECT * FROM read_csv_auto('data/sales_data.csv');
    """)

    # 3. Load Budget Data (CSV/Excel converted to CSV for simplicity)
    print("... Loading Budget Data")
    con.sql("""
        CREATE OR REPLACE TABLE raw.source_budget AS
        SELECT * FROM read_csv_auto('data/budget_data.csv');
    """)

    # Validation check - confirm the data loaded into the database successfully
    count = con.sql("SELECT count(*) FROM raw.source_sales").fetchone()
    print(f"✅ Loaded {count[0]} sales records.")

    con.close()

if __name__ == "__main__":
    load_data()

- All source data is stored in the data folder at your project root.
- duckdb.connect('coffeeshop.duckdb'): After running this, you will see a file named coffeeshop.duckdb in your project folder, and this is your serverless database. Sounds easy, doesn't it?
- read_csv_auto: This is DuckDB's superpower. It scans your CSV, figures out whether columns are integers, dates, or strings, and creates the table definition automatically. DuckDB reads your file directly and turns it into a database table.
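If you want to see exactly what read_csv_auto inferred, a quick check like the sketch below works (the file name is just a suggestion, and it assumes load.py has already been run):

# inspect_schema.py -- optional helper, not part of the pipeline
import duckdb

con = duckdb.connect('coffeeshop.duckdb')

# DESCRIBE shows the column names and the types DuckDB inferred from the CSV
print(con.sql("DESCRIBE raw.source_sales"))

# Peek at a few of the loaded rows
print(con.sql("SELECT * FROM raw.source_sales LIMIT 5"))

con.close()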
Step 2: dbt Setup and Configuration
Now all of our data is stored in coffeeshop.duckdb, so it's time to bring in dbt. Let's set up the dbt project first.
2–1: Initialize your dbt project in the terminal:
dbt init coffee_analytics
Select duckdb when asked for the database adapter.
2–2: Set up profiles.yml
This usually lives in your home directory ~/.dbt/, but you can also configure it inside your project. It tells dbt to connect to our DuckDB database.
coffee_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      # This path must point to your database file from Step 1
      path: 'absolute/full/path/to/your/project/coffeeshop.duckdb'
      threads: 1

You can confirm that dbt can reach the database by running dbt debug from inside the project folder.
2–3: Define the sources
In your dbt project, you will see a folder structure like the one below.
coffee_analytics/
├── dbt_project.yml
├── profiles.yml
└── models/
    ├── marts/
    │   ├── finance_variance.sql   <-- Final output for Finance
    │   └── sales_performance.sql  <-- Final output for Sales
    ├── staging/
    │   ├── stg_budget.sql         <-- Cleaned raw budget data
    │   └── stg_sales.sql          <-- Cleaned raw sales data
    └── sources.yml                <-- Defines raw tables (source_sales, source_budget)

Next, go to the models folder and create a file called sources.yml. This tells dbt where the raw data lives (the schema and the tables).
version: 2

sources:
  - name: bean_raw
    # Our schema is raw - created in Step 1
    # This tells dbt the actual schema name in the database (DuckDB)
    schema: raw
    tables:
      # We have two tables - source_sales and source_budget
      - name: source_sales
      - name: source_budget

- bean_raw defines the source name, which will be used in the SQL models. Roughly speaking, {{ source('bean_raw', 'source_sales') }} in a model compiles to the fully qualified table "coffeeshop"."raw"."source_sales".

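Before moving on, it is worth double-checking that the schema and table names in sources.yml match what load.py actually created. A minimal sanity check (the file name is just a suggestion):

# check_sources.py -- optional; assumes load.py has already been run
import duckdb

con = duckdb.connect('coffeeshop.duckdb')

# List every table in the raw schema; source_sales and source_budget should appear
print(con.sql("""
    SELECT table_schema, table_name
    FROM information_schema.tables
    WHERE table_schema = 'raw'
"""))

con.close()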
Step 3: Transformation (The Data Marts)
This is the most exciting part: we are going to create data marts with dbt.
Generally, we will create two layers of models:
- Staging: Cleaning up column names and fixing types.
- Marts: Joining data to answer business questions.
3–1: Staging Layer
Inside the models folder, create a folder called staging with two files inside it.
models/staging/stg_sales.sql
select
    order_id,
    store_id,
    -- Convert date string to actual date object
    strptime(transaction_date, '%Y-%m-%d') as transaction_date,
    product_name as product,
    quantity,
    unit_price,
    (quantity * unit_price) as total_revenue
from {{ source('bean_raw', 'source_sales') }}

models/staging/stg_budget.sql
select
    store_id,
    date_part('month', date) as month,
    date_part('year', date) as year,
    target_revenue,
    regional_manager_name,
    is_promotional_month -- (T/F)
from {{ source('bean_raw', 'source_budget') }}

3–2: Data Marts layer
Now, let's answer the stakeholders' questions. Create a folder models/marts.
models/marts/sales_performance.sql
The first data mart is designed for the Sales team, enabling them to track revenue by store.
with sales as (
    select * from {{ ref('stg_sales') }}
)

select
    store_id,
    date_trunc('month', transaction_date) as sales_month,
    sum(quantity) as total_items_sold,
    sum(total_revenue) as gross_revenue
from sales
group by 1, 2
order by 1, 2

- This uses a CTE, but instead of selecting from a raw table, we select from stg_sales, the staging model created earlier. dbt's {{ ref() }} resolves to the relation it built for that model and records the dependency, so stg_sales is always built before this mart.
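Once dbt build has run (Step 4), this mart exists as a real relation inside coffeeshop.duckdb, so you can query it like any other table. Here is a quick, illustrative peek, assuming dbt materialized the model in its default schema:

# preview_mart.py -- illustrative only; run after `dbt build` (Step 4)
import duckdb

con = duckdb.connect('coffeeshop.duckdb')

# The mart is queryable by name once dbt has built it
print(con.sql("SELECT * FROM sales_performance ORDER BY store_id, sales_month LIMIT 5"))

con.close()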
models/marts/finance_variance.sql
The second data mart helps the Finance team compare actual revenue against their budget targets.
with actuals as (
    select * from {{ ref('sales_performance') }}
),

budgets as (
    select * from {{ ref('stg_budget') }}
)

select
    a.store_id,
    a.sales_month,
    a.gross_revenue as actual_revenue,
    b.target_revenue,
    -- Calculate variance
    (a.gross_revenue - b.target_revenue) as revenue_variance,
    round((a.gross_revenue / b.target_revenue) * 100, 2) as achievement_pct
from actuals a
left join budgets b
    on a.store_id = b.store_id
    and month(a.sales_month) = b.month
    and year(a.sales_month) = b.year  -- join on year too, so months from different years don't collide

For example, if a store's monthly target is 10,000 and its actual revenue is 8,500, revenue_variance is -1,500 and achievement_pct is 85.0.

You may ask:
Could we query from raw data directly rather than using dbt?
The reason we use dbt is that real projects are far more complex than the example above: a report may need to aggregate several different tables and surface different indicators.
What's more, the Finance team may need these results every week. A dedicated, reusable data mart addresses that business need cleanly.
Step 4: Executing the Pipeline
Everything is set, so it's time to run the whole flow. In your terminal, make sure you are in your project root folder.
- Run python load.py to load all source data into the database; you will see output similar to: ✅ Loaded 5000 sales records.
- Run dbt build from inside the coffee_analytics folder (or pass --project-dir coffee_analytics). This command runs your models and any tests you have configured; you will see dbt creating views and tables directly inside your DuckDB file.
- Verify the results. You can verify the data using the DuckDB CLI or by creating a Python script to query the final mart:
# quick_check.py
import duckdb

con = duckdb.connect('coffeeshop.duckdb')

# Query the dbt model (which is now a table/view in the DB)
df = con.sql("SELECT * FROM finance_variance WHERE achievement_pct < 100").df()
print(df)

If you get results back, it means you have successfully transformed your data with dbt.
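If you want to run the whole flow with a single command, a small wrapper script like the sketch below works. The file name and the --project-dir value are just suggestions, and it assumes dbt can find your profiles.yml (for example in ~/.dbt/ or inside the project).

# run_pipeline.py -- optional convenience wrapper (illustrative sketch)
import subprocess

from load import load_data

# Step 1: extract the CSVs and load them into coffeeshop.duckdb
load_data()

# Step 2: build the dbt models and run any configured tests
subprocess.run(["dbt", "build", "--project-dir", "coffee_analytics"], check=True)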
Conclusion
We just built an ELT pipeline without spinning up a single cloud resource. We ingested raw data with Python, stored it in DuckDB, and used dbt to engineer clean, reliable data marts.
Not every company handles data volumes that necessitate a cloud-based solution. In these cases, DuckDB can be utilized effectively in production.
By choosing lightweight tools like DuckDB and dbt, you can keep your data stack as agile as your code. This agility is crucial for empowering you to make data-driven decisions effectively.
If you like this article and want to show some love:
- Clap 50 times — each one helps more than you think! 👏
- Follow me, so you won't miss it when a new article is published
- You can Buy Me a Coffee to support me more.
- Reach out to me on LinkedIn or at lhungen@gmail.com to chat more about data!