So far, we've explored: βœ… Medallion Architecture (Bronze, Silver, Gold) βœ… dbt Sources & Freshness Checks βœ… Building a Bronze Table (Raw Data Ingestion)

Today, we're moving to the Silver layer by creating a de-duplicated and standardized Account Dimension table in dbt.

By the end of this post, you'll understand: βœ” What the Account Dimension is and why it's important βœ” How to clean, standardize, and de-duplicate account data in dbt βœ” How to build and test a Silver model

Let's dive in! 🎯

1️⃣ What is the Account Dimension?

An Account Dimension is a core reference table that contains a single, clean version of each customer account. It is used for:

πŸ”Ή Consistent reporting β€” Ensures every department refers to the same account details πŸ”Ή Data quality β€” Removes duplicates and standardizes formats πŸ”Ή Efficient joins β€” Optimizes queries by providing a structured dataset

In our case, the raw accounts table (Bronze layer) might contain: ❌ Duplicate accounts (same email, different IDs) ❌ Inconsistent formatting (e.g., IBM, I.B.M, ibm) ❌ Null or missing values

In the Silver layer, we will clean this up!

2️⃣ Our Raw Data (Bronze Layer)

πŸ“‚ Bronze Table (stg_accounts.sql)

SELECT 
    id AS account_id,
    name,
    email,
    created_at
FROM {{ source('raw', 'accounts') }}

This contains raw, unclean data from our ingestion layer.

Example raw data:

None

We need to: βœ” Remove duplicates (same email, different names) βœ” Standardize names (convert to uppercase) βœ” Keep the latest record (by created_at)

πŸ“‚ Silver Table (dim_accounts.sql)

WITH base AS (
    SELECT 
        account_id, 
        UPPER(TRIM(name)) AS account_name,  -- Standardizing name format
        email, 
        created_at,
        ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) AS row_num
    FROM {{ ref('stg_accounts') }}
)
SELECT 
    account_id,
    account_name,
    email,
    created_at
FROM base
WHERE row_num = 1  -- Deduplicating by keeping the latest record

What's Happening Here?

βœ… UPPER(TRIM(name)) β€” Converts names to uppercase for consistency βœ… ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) β€” Identifies duplicate accounts by email and keeps the latest record βœ… WHERE row_num = 1 β€” Filters out duplicates, keeping only the most recent entry

Final Clean Data (Silver Layer)

None

πŸš€ Our data is now clean and de-duplicated!

4️⃣ Adding dbt Tests for Data Quality

Since this is a critical dimension table, we need to enforce data quality using dbt tests.

πŸ“‚ Add tests in dim_accounts.yml

version: 2
models:
  - name: dim_accounts
    description: "Standardized and de-duplicated account dimension"
    columns:
      - name: account_id
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - unique
          - not_null

What These Tests Do:

βœ… Unique account_id – Ensures no duplicate accounts exist βœ… Not Null account_id – Ensures every account has an ID βœ… Unique email – Ensures each email appears only once

Run dbt tests:

dbt test --select dim_accounts

🎯 This ensures our Silver model maintains high data quality.

5️⃣ Wrapping Up

Key Takeaways

βœ… The Account Dimension helps maintain clean, standardized, and de-duplicated accounts βœ… Silver models improve data quality by removing duplicates and formatting inconsistencies βœ… dbt tests ensure data integrity in critical business tables

Tomorrow in Day 11, we'll move to the Gold layer by creating a Business Revenue Model! Stay tuned! πŸš€

πŸ’¬ How do you handle de-duplication in your data pipelines? Any dbt best practices? Let's discuss! πŸ‘‡