When switching data processing libraries feels too risky but staying with Pandas costs you hours every day — here's the migration path that transformed our pipeline
So here's the thing — I was sitting there and watching this progress bar, right? And it's been THREE HOURS and we're at 15%. Fifteen percent. I remember thinking "this is my life now, just watching progress bars" and honestly feeling a bit dramatic about it but also… not wrong?
Our ETL pipeline was processing 50 million transaction records. Should've been 30 minutes. Instead? Six hours. Every. Single. Day.
And look, we'd tried everything. More RAM? Check. Bigger instances? Yep. We even had this whole conversation about splitting the dataset across multiple machines (which, sidebar — why is that always the solution everyone jumps to? "Just throw more hardware at it!" Like we're made of money or something).
Nothing worked though.
Then Someone Mentioned Polars
I don't even remember who brought it up first. Maybe on a Slack thread? But someone was like "hey have you heard about this Polars thing, it's written in Rust, supposedly 5–10x faster than pandas" and my immediate reaction was… skepticism. Deep skepticism.
Because come on — pandas has been THE library for what, fifteen years? It's like saying you found a better wheel. You know that feeling when something sounds too good to be true so you just… file it away mentally as "probably marketing bullshit"?
But I was desperate. Actually desperate. So I tried it.
Six months later — and I still get a little excited telling people this — that same pipeline runs in 36 minutes. THIRTY-SIX MINUTES. From six hours. And our hardware costs? Down 60%.
More importantly though (and this is the part that actually matters), we can now do analyses that we literally couldn't do before. Like, there were problems we'd look at and go "yeah we'd love to answer that but it would take three days to process" and now… we just do them. It's wild.
Okay But Why Was Pandas So Slow Anyway?
I need to backtrack here because understanding this helped me wrap my head around everything else.
Pandas has this fundamental problem — and it's not even really pandas' fault, it's more about when it was designed — where it only uses one CPU core. ONE CORE.
Think about it like this: you have eight brilliant people on your team, but only one of them is allowed to work while the other seven just… watch. That's pandas on a modern multi-core system. It drives me crazy thinking about all that wasted potential just sitting there idle.
And the memory thing! Oh god the memory thing. A 1GB CSV file? That'll be 4–6GB of RAM when you load it into pandas, thank you very much. Because strings and mixed-type columns get stored as full Python objects, and then there's index overhead, and intermediate copies of everything, and…
Actually I once had this operation that looked totally innocent — just a groupby with apply — and it used 3x the memory of my entire DataFrame. Just exploded. The process crashed and I lost like two hours of work because I hadn't saved intermediate results (learned that lesson the hard way).
# This innocent-looking thing? Memory bomb.
df_processed = df.groupby('category').apply(complex_transformation)
What really gets me though is the eager execution. Every operation runs immediately without thinking about what's coming next:
df_filtered = df[df['amount'] > 1000]         # scans everything, builds a full intermediate copy
df_grouped = df_filtered.groupby('category')  # sets up groups on that copy
df_final = df_grouped.sum()                   # walks the data again to aggregate
It's like if you were cooking and you got out every ingredient, used one, put everything away, then got everything out again for the next ingredient. Just… inefficient. But you don't really think about it until someone shows you a better way.
Enter Polars (Which Scared Me a Little)
So Polars is built on this Apache Arrow columnar format, which I'd heard about but never really understood. Let me try to explain how I finally wrapped my head around it.
Traditional storage (pandas) is like reading a book line by line:
- Row 1: John, 25, Engineer, 75000
- Row 2: Jane, 30, Manager, 85000
- Row 3: Bob, 28, Developer, 70000
Columnar storage is more like… imagine if you pulled out just the ages from everyone: [25, 30, 28]. Or just the salaries: [75000, 85000, 70000].
Why does this matter? Oh man, so many reasons. The CPU cache efficiency alone — when you're filtering by age, you only load age data into cache. And compression! Similar values compress way better when they're grouped together. Plus you can use these SIMD instructions that process entire columns at once…
(I'm getting excited about cache efficiency. What has my life become?)
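To make that concrete, here's a toy sketch in plain Rust, not Arrow or Polars, with made-up names (PersonRow, PeopleColumns): the same three people stored row-wise versus column-wise, and what a filter on age actually has to touch in each layout.
#[allow(dead_code)] // toy example: some fields are never read, and that's the point
struct PersonRow {
    name: String,
    age: u32,
    salary: u64,
}

#[allow(dead_code)]
struct PeopleColumns {
    names: Vec<String>, // one contiguous array per field
    ages: Vec<u32>,
    salaries: Vec<u64>,
}

fn main() {
    // row-oriented: every record drags all of its fields along with it
    let rows = vec![
        PersonRow { name: "John".into(), age: 25, salary: 75_000 },
        PersonRow { name: "Jane".into(), age: 30, salary: 85_000 },
        PersonRow { name: "Bob".into(), age: 28, salary: 70_000 },
    ];

    // column-oriented: each field lives in its own contiguous array
    let cols = PeopleColumns {
        names: vec!["John".into(), "Jane".into(), "Bob".into()],
        ages: vec![25, 30, 28],
        salaries: vec![75_000, 85_000, 70_000],
    };

    // filtering by age row-wise walks past names and salaries it never uses...
    let over_26_rows = rows.iter().filter(|p| p.age > 26).count();

    // ...column-wise it only touches the ages array: cache-friendly, and exactly
    // the kind of tight loop that SIMD instructions chew through
    let over_26_cols = cols.ages.iter().filter(|&&age| age > 26).count();

    assert_eq!(over_26_rows, over_26_cols);
    println!("{} people over 26", over_26_cols);
}
Arrow's real format layers validity bitmaps, chunking, and a well-defined memory spec on top of this, but the access pattern is the same story.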
The Lazy Evaluation Thing Blew My Mind
This was the concept that took me the longest to really get, but once I did it was like… everything clicked.
Polars doesn't execute operations immediately. Instead it builds this query plan and optimizes the ENTIRE pipeline before running anything:
use polars::prelude::*; // DataFrame brain for Rust — we'll stay in lazy mode so nothing runs early
// build a lazy query: describe *what* to do, not *when*
// (filter → group → aggregate → sort). no I/O yet.
let query_plan = LazyFrame::scan_csv("transactions.csv", ScanArgsCSV::default())? // declare a CSV scan (lazy source)
.filter(col("amount").gt(lit(1000))) // keep only rows where amount > 1000
.group_by([col("category")]) // bucket by category
.agg([col("amount").sum().alias("total_amount")]) // sum amounts per bucket
.sort("total_amount", SortOptions::default()); // order by the sum (ascending by default)
// nothing happened yet — this is just a plan (deferred execution)
let results = query_plan.collect()?; // NOW we execute: Polars optimizes + runs the whole pipeline
It can do things like:
- Push filters as early as possible (predicate pushdown — fancy term I learned)
- Only read the columns you actually need
- Reorder joins to be more efficient
- Plan memory usage so you don't spike and crash
It's like the difference between following GPS directions step by step versus looking at the whole route first and going "oh wait, I can take this shortcut."
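You don't have to take that on faith, either. Here's a minimal sketch of the kind of check I mean, reusing the same scan_csv and describe_optimized_plan calls as the other snippets in this post (and the same transactions.csv from earlier): write the filter at the end of the chain, and the optimized plan should still show it pushed down into the scan.
use polars::prelude::*; // same prelude as everywhere else in this post

fn main() -> PolarsResult<()> {
    // write the query "naively": scan, select, filter last, then aggregate
    let naive = LazyFrame::scan_csv("transactions.csv", ScanArgsCSV::default())?
        .select([col("category"), col("amount")]) // only two columns needed
        .filter(col("amount").gt(lit(1000)))      // filter written at the end...
        .group_by([col("category")])
        .agg([col("amount").sum().alias("total_amount")]);

    // ...but the optimized plan should show the predicate (and the column
    // selection) pushed into the CSV scan itself, before a single byte is read
    println!("{}", naive.describe_optimized_plan()?);
    Ok(())
}
And if a pushdown doesn't happen, say because some operation blocks it, the printed plan makes that obvious too, which is half the value of looking at it.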
Actually Making the Switch Was Terrifying
I'm not gonna lie — even after seeing the benchmarks, I was nervous. This meant:
- Learning Rust (I'd barely touched it before)
- Rewriting production code (scary!)
- Convincing my team this wasn't a terrible idea
We started small. Just one pipeline. The most painful one.
# Cargo.toml — lean data stack for a lazy DataFrame pipeline
[dependencies]
# Polars: the DataFrame engine. We enable lazy so queries are planned/optimized,
# and common I/O formats so we can read/write without extra crates.
polars = { version = "0.35", features = ["lazy", "csv", "json", "parquet"] }
# Tokio: async runtime. "full" because it's a dev box and we don't want to hunt features.
tokio = { version = "1.0", features = ["full"] }
# Anyhow: ergonomic error handling so main() can use Result and ? without ceremony.
anyhow = "1.0"First migration:
use polars::prelude::*; // DataFrame brain — we'll keep it lazy until the very end
use anyhow::Result; // ergonomic errors so `?` reads cleanly
#[tokio::main] // async main so we can scale later without refactors
async fn main() -> Result<()> {
let start = std::time::Instant::now(); // quick stopwatch — good enough for a sanity check
// build the whole query as a lazy plan: filter → filter → group → agg → sort → collect
// nothing executes until `collect()`, so Polars can optimize the pipeline for us
let df = LazyFrame::scan_csv("large_dataset.csv", ScanArgsCSV::default())? // declare the CSV scan (deferred)
.filter(col("transaction_date").gt(lit("2024-01-01"))) // keep only 2024+ transactions
.filter(col("amount").gt(lit(100))) // toss tiny transactions
.group_by([col("customer_id"), col("product_category")]) // bucket per customer & category
.agg([
col("amount").sum().alias("total_spent"), // how much in total
col("amount").count().alias("transaction_count"), // how many swipes
col("amount").mean().alias("avg_transaction"), // average ticket size
])
.sort("total_spent", SortOptions::default().with_order_descending(true)) // biggest spenders first
.collect()?; // okay, NOW run it
// tiny report so we know it actually did work and didn't eat the world
println!("Processed {} rows in {:?}", df.height(), start.elapsed()); // rows post-aggregation
println!("Memory usage: {:.2}MB", df.estimated_size() as f64 / 1_000_000.0); // rough footprint
Ok(()) // all good — leave the rest for the next stage
}
What took pandas 45 minutes ran in 4 minutes. FOUR MINUTES.
I literally ran it twice because I thought there was a bug. Then I just sat there grinning at my screen like an idiot.
The Mental Shift Was Harder Than The Code
The biggest challenge wasn't actually the syntax or learning Rust (though that had its moments, don't get me wrong). It was thinking differently about data processing.
Old way (pandas brain):
# load-everything-now version — simple, a bit heavy, but easy to read
# pull raw tables into memory (yes, all of it)
df1 = pd.read_csv("sales.csv") # transactions: customer_id, product_id, revenue, …
df2 = pd.read_csv("customers.csv") # customers: customer_id, … (joins on customer_id)
df3 = pd.read_csv("products.csv") # products: product_id, category, …
# join the world: sales → customers → products (order matters only for clarity here)
merged = (
df1
.merge(df2, on="customer_id") # tack on customer fields
.merge(df3, on="product_id") # tack on product/category fields
)
# roll up by product category — one number per category
result = merged.groupby("category").agg({"revenue": "sum"})  # total revenue per category
New way (Polars brain):
// same plan, just friendlier breadcrumbs — lazy all the way until collect()
let result = LazyFrame::scan_csv("sales.csv", ScanArgsCSV::default())? // declare sales as a lazy source (no I/O yet)
.join( // stitch in customer info
LazyFrame::scan_csv("customers.csv", ScanArgsCSV::default())?, // second lazy source
[col("customer_id")], // left key
[col("customer_id")], // right key
JoinArgs::new(JoinType::Inner), // inner join: only matching rows survive
)
.join( // then pull product/category data
LazyFrame::scan_csv("products.csv", ScanArgsCSV::default())?, // third lazy source
[col("product_id")], // left key
[col("product_id")], // right key
JoinArgs::new(JoinType::Inner), // inner again — keep it tight
)
.group_by([col("category")]) // bucket rows by category
.agg([col("revenue").sum()]) // total revenue per bucket
.collect()?; // NOW execute: the optimizer plans and runs the whole pipeline
See how our 50GB dataset never has to sit fully in memory? Between projection and predicate pushdown (and streaming mode for the really big stuff), Polars only materializes what the query actually needs and works through it in chunks. I remember the first time I watched the memory usage stay flat while processing a huge file and just thinking "this shouldn't be possible."
Some Advanced Stuff That Made Me Feel Smart
Once I got comfortable, I started playing with more advanced patterns. This is where it gets really fun.
Predicate pushdown (I love saying that, makes me sound like I know what I'm doing):
// lazy plan over parquet — filter early, move less, think faster
let optimized_query = LazyFrame::scan_parquet("huge_dataset.parquet", ScanArgsParquet::default())? // declare parquet source (no I/O yet)
.select([ // prune columns up front
col("timestamp"),
col("user_id"),
col("revenue"),
col("region"),
])
.filter( // pushdown-friendly filters
col("timestamp").gt(lit("2024-01-01")) // only 2024+
.and(col("revenue").gt(lit(1000))) // meaningful spend
.and(col("region").eq(lit("US"))) // keep US rows
)
.group_by([col("user_id")]) // bucket by user
.agg([col("revenue").sum()]) // total revenue per user
.collect()?; // NOW execute: optimizer + parquet predicate pushdown do their thing
And parallel processing! This was cool:
// fan out per-file work on the runtime, pull results back, then glue DataFrames together
// (this one also needs the `futures` crate in Cargo.toml for `join_all`)
async fn process_in_parallel(file_paths: Vec<String>) -> Result<DataFrame> {
    // spin up one task per path — each task builds a lazy plan and collects it
    // (owned Strings, because tokio::spawn needs 'static data)
    let tasks: Vec<_> = file_paths
        .into_iter()
        .map(|path| {
            tokio::spawn(async move {
                LazyFrame::scan_csv(path, ScanArgsCSV::default()) // declare the CSV scan (lazy)
                    .unwrap() // keep the original behavior (panic on scan error)
                    .filter(col("status").eq(lit("active"))) // keep only active rows
                    .select([col("*")]) // same columns through (explicit no-op select)
                    .collect() // NOW execute for this file
            })
        })
        .collect();

    // wait for all tasks to finish — order here matches the input order
    let results = futures::future::join_all(tasks).await;

    // unwrap JoinHandle results (propagate panics) and surface any Polars errors
    let dfs: Vec<DataFrame> = results
        .into_iter()
        .map(|res| res.unwrap()) // original intent: if a task panicked, crash here
        .collect::<PolarsResult<Vec<DataFrame>>>()?;

    // concat works on LazyFrames, so lift the per-file frames back to lazy and re-collect
    let lazy: Vec<LazyFrame> = dfs.into_iter().map(|df| df.lazy()).collect();
    Ok(concat(lazy, UnionArgs::default())?.collect()?)
}
The Numbers Are Almost Embarrassing
Okay so here's where I get to brag a little:
Processing times:
- Daily ETL: 6 hours → 36 minutes (90% drop!)
- Real-time analytics: 15 seconds → 1.2 seconds
- Monthly reports: 4 hours → 18 minutes
Memory:
- 50M row dataset: 24GB → 3.2GB (I mean… come on)
- Peak usage: 48GB → 6GB
- Crashes: Frequent → Literally none
Money stuff (this got management excited):
- Downgraded from r5.8xlarge to r5.2xlarge instances
- 75% cost reduction on compute
- Plus better compression meant cheaper storage
I showed these numbers to my manager and she just stared at me for a second and went "wait, are you sure these are right?" Yes. Yes they are.
It Wasn't All Smooth Though
Let me be honest about the pain points because they were real.
The API is similar but different enough to be annoying:
Pandas:
df.groupby('category').apply( # operate group-by-group
lambda g: g.nlargest(5, 'value') # within each group, pick the 5 largest by 'value'
) # returns a stacked DataFrame with a groupby index on top
Polars (had to figure this out):
let top5 = df.lazy()
    .with_columns([
        col("value")
            // rank within each category, descending so rank 1 = the largest value
            .rank(RankOptions { method: RankMethod::Ordinal, descending: true, ..Default::default() }, None)
            .over([col("category")])
            .alias("rank"),
    ])
    .filter(col("rank").le(lit(5))) // keep the top 5 per category
    .collect()?;
Took me like an hour to figure out that window functions were the way to go here.
Debugging lazy evaluation is weird:
You can't just print intermediate steps like you're used to. But you CAN inspect query plans:
// build the lazy plan — filter → group → sum (nothing executes yet)
let query = LazyFrame::scan_csv("data.csv", ScanArgsCSV::default())?
.filter(col("amount").gt(lit(1000))) // keep only meaningful rows
.group_by([col("category")]) // bucket by category
.agg([col("amount").sum()]); // total per bucket
// sanity check: peek at what Polars will actually run (optimizer's view)
// this has saved me from facepalm moments more times than I'll admit
println!("{}", query.describe_optimized_plan()?);Not everything has a Rust equivalent:
Sometimes you just need to go back to Python. And that's okay! We did a hybrid approach:
// Heavy lifting in Rust
let mut processed_data = polars_processing_pipeline()?;
let mut file = std::fs::File::create("results.parquet")?;
ParquetWriter::new(&mut file).finish(&mut processed_data)?; // hand the results to Python via Parquet
Then in Python:
import polars as pl # fast DataFrame engine; keep work in Polars as long as possible
import matplotlib.pyplot as plt # plotting (expects pandas/NumPy under the hood)
df = pl.read_parquet("results.parquet") # load once; Parquet is columnar so this is snappy
# only convert to pandas when the plotting library forces our hand
pandas_df = df.to_pandas() # careful: this materializes in RAM; do it at the edge, not earlier
# quick line plot — dates on X, revenue on Y (assumes 'date' is already parseable by matplotlib)
plt.plot(pandas_df['date'], pandas_df['revenue']) # tiny, direct, gets the picture on screen
plt.show() # actually render the figure
Some Patterns I Wish I'd Known Earlier
Streaming for massive datasets:
// same plan, just friendlier breadcrumbs — stream it so memory stays chill
let result = LazyFrame::scan_csv("massive_dataset.csv", ScanArgsCSV::default())? // declare the CSV source (lazy; no I/O yet)
.group_by([col("category")]) // bucket rows by category
.agg([col("sales").sum()]) // sum sales per bucket
.with_streaming(true) // opt in to the streaming engine (needs the "streaming" feature on the polars crate)
.collect()?; // the magic: execute in streaming chunks (low memory)
Complex business logic without getting messy:
use polars::prelude::*; // DataFrame brain
use anyhow::Result; // lightweight errors
// keep the logic exactly the same — just naming it and adding human breadcrumbs
fn complex_business_logic() -> Expr {
// tiered commission: >10k → 15%, >1k → 10%, else → 5%
when(col("revenue").gt(lit(10_000))) // top tier first
.then(col("revenue") * lit(0.15)) // 15%
.when(col("revenue").gt(lit(1_000))) // mid tier
.then(col("revenue") * lit(0.10)) // 10%
.otherwise(col("revenue") * lit(0.05)) // floor tier
.alias("commission") // name the result column
}
fn main() -> Result<()> {
// lazy plan: declare → compute later
let df = LazyFrame::scan_csv("sales.csv", ScanArgsCSV::default())? // no I/O yet
.with_columns([complex_business_logic()]) // add the computed column
.collect()?; // NOW run it
println!("{}", df.head(Some(5))); // quick sanity peek
Ok(())
}
Being smart about data types:
// pick your types on purpose — fewer bytes, faster scans, fewer surprises
let schema = Schema::from_iter([
("customer_id".to_string(), DataType::UInt32), // fits most IDs; don't pay for 64-bit if you don't need it
("transaction_date".to_string(), DataType::Date), // date-only is enough; skip full timestamp storage
("amount".to_string(), DataType::Float32), // Float32 is fine if you don't need sub-cent precision
("category".to_string(), DataType::Categorical(None, CategoricalOrdering::Physical)), // compress repeated labels
]);
// same plan, just nudging Polars with an explicit schema — it really helps memory/CPU
let df = LazyFrame::scan_csv(
"data.csv",
ScanArgsCSV::default().with_schema(Some(schema.into())) // tell the scanner what to expect
)?
.collect()?; // NOW execute the pipeline
The Unexpected Benefits
The speed was great, obviously. But there were other things I didn't anticipate:
We stopped getting paged at 3 AM for memory crashes. Like… just stopped. That alone was worth it.
Development got faster too — 10x faster feedback loops meant I could try ideas that would've taken too long before. That's huge for productivity.
And honestly? Working with fast tools just feels better. There's something about not waiting around that makes you more willing to experiment, to try things, to iterate.
Plus learning Rust turned out to be valuable for other projects. Who knew?
When You Should (and Shouldn't) Switch
Real talk: Polars isn't always the answer.
You should definitely consider it if:
- You're processing large datasets regularly (anything over 1GB)
- Performance actually matters for your use case
- You're memory constrained and every GB counts
- You do a lot of batch processing that could benefit from optimization
- Your team is willing to invest time learning something new
Maybe stick with pandas if:
- Your datasets are small (<100MB) and performance is fine
- You're heavily integrated with Python ML libraries that don't play nice with Rust
- Your team doesn't have bandwidth for a learning curve right now
- You're mostly prototyping and need to move fast
The decision really depends on your situation. For us, the pain of status quo outweighed the pain of migration. But that's not true for everyone.

What This Actually Meant For Us
Here's the thing — this wasn't just about swapping libraries. It changed how we think about data processing entirely.
We went from "can we even do this analysis?" to "how fast can we get results?"
We went from batch-only to real-time analytics being possible.
We went from constant infrastructure cost discussions to… not having those discussions anymore.
That 90% time reduction was amazing, sure. But the real win? Being able to solve problems we'd written off as impossible. That's the game changer.
And look, I'm not saying you should immediately rewrite everything in Rust. That would be insane. But if you're hitting the same walls we were — progress bars that never move, memory crashes at 2 AM, infrastructure costs that make your CFO cry — maybe it's worth looking at Polars.
Just start small. Pick one painful pipeline. See what happens.
Worst case? You learn something new. Best case? You get your evenings back and your manager thinks you're a wizard.
Enjoyed the read? Let's stay connected!
If you found value in this content, consider:
- Joining my free newsletter for weekly insights & behind-the-scenes updates
- ☕ Buying me a coffee to keep the creativity flowing
- 🔔 Follow me on Medium and X so you never miss an update!
Your support means the world and helps me create more content you'll love. ❤️