
Data engineering often seems to require a credit card and a cloud subscription, but you can master modern data transformation entirely on your local machine.

In this article, we are going to build a functional ELT (Extract, Load, Transform) pipeline using DuckDB as our analytical engine and dbt for data transformation.

Here is what we will cover:

  • The Architecture: Understanding the "Local Stack" approach.
  • The Ingestion: Using Python to load raw business data into DuckDB.
  • The Configuration: Setting up dbt-duckdb to connect to our local database.
  • The Transformation: Building specific data marts for Sales and Finance teams.

Let's get started!

Why DuckDB & dbt?

Before we start coding, let's understand our tools. For datasets from roughly 100GB up to a few TB, DuckDB gives you power comparable to a warehouse like Snowflake: there is no cloud configuration to go through, and it processes your data incredibly fast right on your laptop.

Using dbt can bring software engineering best practices — like version control, dependency management, and automated testing — to your SQL code, turning a folder of scripts into a reliable, documented, and professional data transformation layer.

Next, let's explore how we can utilize these two tools to assist a coffee shop.

Photo by Nafinia Putra on Unsplash

Our ELT Data Pipeline

You have been hired as the Data Engineer for a growing chain of coffee shops. Currently, their data is scattered across disconnected files:

  1. Sales Data sits in CSV exports from their Point-of-Sale (POS) system.
  2. Budget Data sits in Excel sheets maintained by the finance department.

To build reports, someone manually copies and pastes rows into a master spreadsheet. This is slow, error-prone, and does not scale.

Therefore, you decide to build an ELT pipeline: extract data from these sources, load it into a centralized database (DuckDB), and transform the raw data into clean "Data Marts" for the Sales and Finance teams using dbt.

Step 1: Extract and Load

First, we need to get the data out of the files and into DuckDB.

1–1: Create a project folder and install the necessary libraries

pip install duckdb dbt-duckdb pandas

1–2: Create a file named load.py. This script will create our database and load the raw files into DuckDB.

import duckdb

def load_data():
    # Connect to a persistent DuckDB file
    # If the file doesn't exist, this creates it
    con = duckdb.connect('coffeeshop.duckdb')

    print("🚀 Starting ELT Process...")

    # 1. create a schema for raw data
    con.sql("CREATE SCHEMA IF NOT EXISTS raw;")

    # 2. Load Sales Data (CSV)
    # read_csv_auto is a DuckDB magic function that infers types
    print("... Loading Sales Data")
    con.sql("""
        CREATE OR REPLACE TABLE raw.source_sales AS 
        SELECT * FROM read_csv_auto('data/sales_data.csv');
    """)

    # 3. Load Budget Data (CSV/Excel converted to CSV for simplicity)
    print("... Loading Budget Data")
    con.sql("""
        CREATE OR REPLACE TABLE raw.source_budget AS 
        SELECT * FROM read_csv_auto('data/budget_data.csv');
    """)
    
    # Validation check - to see if data is loaded into database successfully
    count = con.sql("SELECT count(*) FROM raw.source_sales").fetchone()
    print(f"✅ Loaded {count[0]} sales records.")
    
    con.close()

if __name__ == "__main__":
    load_data()

  • All source data is stored in the data folder at the root of your project.
  • duckdb.connect('coffeeshop.duckdb'): after running this, you will see a file named coffeeshop.duckdb in your project folder. That file is your serverless database. Sounds easy, doesn't it?
  • read_csv_auto: this is DuckDB's superpower. It scans your CSV, figures out whether columns are integers, dates, or strings, and creates the table definition automatically, turning a plain file directly into queryable tables.
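
If you do not have real POS exports to follow along with, here is a minimal sketch that generates sample CSVs. The file names match the loader above and the column names match the staging models later in the article; the values themselves are made up:

# generate_sample_data.py - writes small example source files into data/
import os

import pandas as pd

# Make sure the data folder exists in the project root
os.makedirs('data', exist_ok=True)

# Sales export from the POS system (columns assumed by stg_sales.sql)
sales = pd.DataFrame({
    'order_id': [1, 2, 3],
    'store_id': [101, 101, 102],
    'transaction_date': ['2024-01-05', '2024-01-06', '2024-02-03'],  # ISO date strings
    'product_name': ['Latte', 'Espresso', 'Cold Brew'],
    'quantity': [2, 1, 3],
    'unit_price': [4.50, 3.00, 5.00],
})
sales.to_csv('data/sales_data.csv', index=False)

# Budget sheet from the finance team (columns assumed by stg_budget.sql)
budget = pd.DataFrame({
    'store_id': [101, 102],
    'date': ['2024-01-01', '2024-02-01'],
    'target_revenue': [5000.0, 7500.0],
    'regional_manager_name': ['Alex', 'Sam'],
    'is_promotional_month': [True, False],
})
budget.to_csv('data/budget_data.csv', index=False)

print("Sample CSVs written to data/")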

Step 2: dbt Setup and Configuration

Now, we have all of our data stored in coffeeshop.duckdb. Next, we are going to use dbt. Let's set up the dbt project first.

2–1: Initialize your dbt project in the terminal:

dbt init coffee_analytics

Select duckdb when asked for the database adapter.

2–2: Set up profiles.yml

This usually lives in your home directory ~/.dbt/, but you can also configure it inside your project. It tells dbt to connect to our DuckDB database.

coffee_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      # This path must point to the DuckDB file created in Step 1
      path: 'absolute/full/path/to/your/project/coffeeshop.duckdb'
      threads: 1
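
Before writing any models, it is worth checking that dbt can actually reach the database. dbt's built-in connection check works well here (assuming your project folder is coffee_analytics):

cd coffee_analytics
dbt debug

dbt debug validates your profiles.yml, dbt_project.yml, and the connection to the DuckDB file, so path typos surface immediately.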

2–3: Define the sources

In your dbt project, you will end up with a folder structure like this:

coffee_analytics/
├── dbt_project.yml 
├── profiles.yml 
└── models/
    ├── marts/
    │   ├── finance_variance.sql   <-- Final output for Finance
    │   └── sales_performance.sql  <-- Final output for Sales
    ├── staging/
    │   ├── stg_budget.sql         <-- Cleaned raw budget data
    │   └── stg_sales.sql          <-- Cleaned raw sales data
    └── sources.yml                <-- Defines raw tables (source_sales, source_budget)

Next, go to the models folder and create a file called sources.yml. This tells dbt where the raw data lives (the schema and the tables).

version: 2

sources:
  - name: bean_raw
    # "raw" is the actual schema name in DuckDB (created in Step 1)
    schema: raw
    tables:
       # We have two tables - source_sales and source_budget
      - name: source_sales
      - name: source_budget

  • bean_raw defines the source name, which we will reference in our SQL models via the source() function.

Step 3: Transformation (The Data Marts)

Now for the most exciting part: we are going to create data marts with dbt.

Generally, we will create two layers of models (see the materialization sketch right after this list):

  • Staging: Cleaning up column names and fixing types.
  • Marts: Joining data to answer business questions.
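
A common convention (not required, just a sensible default) is to keep staging models as lightweight views and persist marts as tables. Here is a minimal sketch of what that could look like in the models block of dbt_project.yml, using the project and folder names from this article:

models:
  coffee_analytics:
    staging:
      +materialized: view
    marts:
      +materialized: table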

3–1: Staging Layer

In our models folder, we will create a folder called staging, and we will have two files inside this newly created folder.

models/staging/stg_sales.sql

select
    order_id,
    store_id,
    -- Convert date string to actual date object
    strptime(transaction_date, '%Y-%m-%d') as transaction_date,
    product_name as product,
    quantity,
    unit_price,
    (quantity * unit_price) as total_revenue
from {{ source('bean_raw', 'source_sales') }}

models/staging/stg_budget.sql

select
    store_id,
    date_part('month', date) as month,  
    date_part('year', date) as year,    
    target_revenue,
    regional_manager_name,
    is_promotional_month -- (T/F)
from {{ source('bean_raw', 'source_budget') }}

3–2: Data Marts layer

Now, let's answer the stakeholders' questions. Create a folder models/marts.

models/marts/sales_performance.sql

The first data mart is designed for the Sales team, enabling them to track revenue by store.

with sales as (
    select * from {{ ref('stg_sales') }}
)

select
    store_id,
    date_trunc('month', transaction_date) as sales_month,
    sum(quantity) as total_items_sold,
    sum(total_revenue) as gross_revenue
from sales
group by 1, 2
order by 1, 2

  • The syntax uses a CTE; instead of selecting from a raw table, we select from the stg_sales model we created earlier via the ref() function.
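
While iterating on a single model, you do not need to rebuild the whole project. dbt's node selection can build just this mart together with its upstream staging models (the + prefix includes parents):

dbt run --select +sales_performance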

models/marts/finance_variance.sql

The second data mart helps the finance team compare actual revenue against their budget targets.

with actuals as (
    select * from {{ ref('sales_performance') }}
),

budgets as (
    select * from {{ ref('stg_budget') }}
)

select
    a.store_id,
    a.sales_month,
    a.gross_revenue as actual_revenue,
    b.target_revenue,
    -- Calculate variance
    (a.gross_revenue - b.target_revenue) as revenue_variance,
    round((a.gross_revenue / b.target_revenue) * 100, 2) as achievement_pct
from actuals a
left join budgets b 
    on a.store_id = b.store_id 
    and month(a.sales_month) = b.month
    -- matching on year as well avoids double-joining when data spans multiple years
    and year(a.sales_month) = b.year

You may ask:

Could we query from raw data directly rather than using dbt?

The reason we use dbt is that real applications are far more complex than the example above. A report may need to aggregate several tables and expose different metrics.

What's more, the finance team may need this result refreshed every week. Defining a data mart turns that recurring business need into a clear, repeatable, and testable model.
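
This is also where dbt's testing becomes valuable. Below is a minimal sketch of a models/marts/schema.yml; the specific columns and tests are assumptions about what these teams would care about, so adapt them to your data:

version: 2

models:
  - name: sales_performance
    columns:
      - name: store_id
        tests:
          - not_null
      - name: sales_month
        tests:
          - not_null
  - name: finance_variance
    columns:
      - name: store_id
        tests:
          - not_null

The dbt build command in the next step will execute these tests right after the models are created.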

Step 4: Executing the Pipeline

Everything is set up, so it is time to run the whole flow. In your terminal, make sure you are in your project root folder.

  1. Run python load.py to get all source data into the database. You should see output similar to: ✅ Loaded 5000 sales records.
  2. Run dbt build from inside the coffee_analytics folder (or pass --project-dir coffee_analytics). This command runs your models and any tests you have configured, and you will see dbt creating views and tables directly inside your DuckDB file.
  3. Verify the results. You can verify the data using the DuckDB CLI or by creating a Python script to query the final mart:
# quick_check.py
import duckdb

con = duckdb.connect('coffeeshop.duckdb')
# Querying the dbt model (which is now a table/view in the DB)
df = con.sql("SELECT * FROM finance_variance WHERE achievement_pct < 100").df()
print(df)
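
If the DuckDB CLI is installed, you can run a similar check without Python; the query assumes dbt built the models into the default main schema:

duckdb coffeeshop.duckdb -c "SELECT * FROM finance_variance WHERE achievement_pct < 100;"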

If you get results back, it means you have successfully transformed your data with dbt.

Conclusion

We just built an ELT pipeline without spinning up a single cloud resource. We ingested raw data with Python, stored it in DuckDB, and used dbt to engineer clean, reliable data marts.

Not every company handles data volumes that necessitate a cloud-based solution. In these cases, DuckDB can be utilized effectively in production.

By choosing lightweight tools like DuckDB and dbt, you can keep your data stack as agile as your code. This agility is crucial for empowering you to make data-driven decisions effectively.

If you like this article and want to show some love:

  • Clap 50 times — each one helps more than you think! 👏
  • Follow me, so you won't miss it when a new article is published
  • You can Buy Me a Coffee to support me more.
  • Feel free to reach out on LinkedIn or at lhungen@gmail.com to chat more about data!

I also want to share with you some excellent articles to improve your coding skills in SQL and build your knowledge in data engineering.