Our Docker image was 8.2GB.
Every deploy took 18 minutes just to transfer the image.
AWS charged us $47 per deploy for data transfer.
We deployed 6 times a day.
That's $282/day. $8,460/month. Just to move a Docker image.
Here's how I got it to 127MB in one afternoon.
The Dockerfile That Cost Us $8,460/Month
This was our Dockerfile in March 2024:
```dockerfile
FROM node:18
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y wget
RUN apt-get install -y git
RUN apt-get install -y python3
RUN apt-get install -y build-essential
COPY package.json .
RUN npm install
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]
```

Looks innocent, right?
Image size: 8.2GB
Layers: 47
Deploy time: 18 minutes
AWS transfer cost: $47/deploy
Every single RUN command created a new layer.
Every layer got cached. Every layer got shipped.
We were shipping apt-get update cache, npm install cache, build artifacts, source files, node_modules for development AND production.
Everything.
The Wake-Up Call
Friday afternoon. 4:47 PM.
Critical bug in production. Users couldn't checkout.
Me: "I'll push a fix in 5 minutes."
Git push. CI/CD triggered.
4:52 PM — Build started
4:58 PM — Build complete
4:59 PM — Pushing image to ECR…
5:01 PM — 12% uploaded
5:04 PM — 28% uploaded
5:08 PM — 51% uploaded
CEO on Slack: "How long until this is fixed?"
Me: "Image is still uploading. Maybe 10 more minutes."
CEO: "It takes 10 minutes to deploy a bug fix?"
Me: "The image is 8.2GB."
CEO: "What the hell is in that image?"
Good question.
What Was Actually In That 8.2GB
I pulled the image locally and analyzed it.
```shell
docker history our-app:latest --no-trunc --human
```

Layer breakdown:
- Node.js base image: 900MB
- apt-get packages: 1.2GB (including Python, build tools we never used)
- npm install (all dependencies): 2.1GB
- Source files: 180MB
- Build artifacts: 340MB
- Cached apt-get lists: 890MB
- Old node_modules from previous builds: 1.8GB
- Random stuff we forgot about: 0.8GB
We were shipping:
- Development dependencies in production
- Build tools we only needed during build
- Source TypeScript files alongside compiled JavaScript
- Three different versions of node_modules (Docker layer caching gone wrong)
- Python and build-essential (never used in production)
Production runtime actually needed:
- Node.js runtime
- Compiled JavaScript
- Production dependencies
That's it.
The First Attempt: Multi-Stage Build
I rewrote the Dockerfile with multi-stage builds.
```dockerfile
# Stage 1: Build
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
# Full install: the build step needs devDependencies (e.g. the compiler)
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies before copying node_modules to the runtime stage
RUN npm prune --omit=dev

# Stage 2: Runtime
FROM node:18-slim
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package*.json ./
EXPOSE 3000
CMD ["node", "dist/index.js"]
```

Result:
Image size: 1.2GB
Deploy time: 4 minutes
Better. Not good enough.
The Second Attempt: Alpine Linux
Switched to Alpine base image.
```dockerfile
# Stage 1: Build
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Full install: the build step needs devDependencies
RUN npm ci
COPY . .
RUN npm run build
# Keep only production dependencies for the runtime stage
RUN npm prune --omit=dev

# Stage 2: Runtime
FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/index.js"]
```

Result:
Image size: 420MB
Deploy time: 90 seconds
Getting closer.
The Final Version: 127MB
The breakthrough: node_modules was still huge (340MB).
Most of it? Unused dependencies.
I audited every package:
```shell
npm ls --all > deps.txt
```

Found:
- lodash: Used 3 functions. Entire library: 24MB.
- moment.js: Used for date formatting. 67MB. (Native Intl does this now)
- aws-sdk: Imported the entire SDK. Only needed the S3 client. 89MB wasted.
- 47 other packages: Pulled in as transitive dependencies. Never used.
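Swapping moment out for the built-in Intl API is usually mechanical. A minimal sketch (the exact format string from our code isn't shown here; `formatDate` is an illustrative name), assuming a typical `moment(d).format('MMM D, YYYY')` call site:

```javascript
// Hypothetical replacement for moment(d).format('MMM D, YYYY'), using the
// built-in Intl API. Locale and time zone are pinned so the output is
// deterministic regardless of where the server runs.
const formatDate = (date) =>
  new Intl.DateTimeFormat('en-US', {
    year: 'numeric',
    month: 'short',
    day: 'numeric',
    timeZone: 'UTC',
  }).format(date);

console.log(formatDate(new Date(Date.UTC(2024, 2, 15)))); // "Mar 15, 2024"
```

Zero dependencies, and Node has shipped with full ICU data since v13, so locale output is consistent across environments.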
I replaced:
- lodash → wrote 3 utility functions (12 lines)
- moment → native Date and Intl
- aws-sdk → @aws-sdk/client-s3 (only S3)
- Removed unused dependencies
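The article doesn't name the three lodash functions, so this is only a guess at what a ~12-line replacement looks like; `pick`, `chunk`, and `uniqBy` are assumed call sites, not the actual ones:

```javascript
// Hand-rolled replacements for three assumed lodash call sites.

// pick: copy only the listed keys from an object
const pick = (obj, keys) =>
  Object.fromEntries(keys.filter((k) => k in obj).map((k) => [k, obj[k]]));

// chunk: split an array into fixed-size slices
const chunk = (arr, size) =>
  Array.from({ length: Math.ceil(arr.length / size) }, (_, i) =>
    arr.slice(i * size, i * size + size));

// uniqBy: keep the first element for each distinct key
const uniqBy = (arr, fn) => {
  const seen = new Set();
  return arr.filter((x) => {
    const key = fn(x);
    return seen.has(key) ? false : (seen.add(key), true);
  });
};
```

Twelve-odd lines you own and can read, instead of 24MB you ship on every deploy.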
New package.json: 12 dependencies instead of 87.
Final Dockerfile:
```dockerfile
# Build stage
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Full install (lifecycle scripts skipped); the build needs devDependencies
RUN npm ci --ignore-scripts
COPY . .
RUN npm run build
# Keep only production dependencies for the runtime stage
RUN npm prune --omit=dev

# Production stage
FROM node:18-alpine
RUN apk add --no-cache tini
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package*.json ./
USER node
EXPOSE 3000
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "dist/index.js"]
```

Final result:
Image size: 127MB
Deploy time: 40 seconds
AWS transfer cost: $0.80/deploy
The Numbers
Before:
- Image size: 8.2GB
- Deploy time: 18 minutes
- Cost per deploy: $47
- Daily deploys: 6
- Monthly cost: $8,460
After:
- Image size: 127MB
- Deploy time: 40 seconds
- Cost per deploy: $0.80
- Daily deploys: 6
- Monthly cost: $144
Savings: $8,316/month
Same application. Same functionality. 64x smaller. 27x faster deploys.
What Actually Made the Difference
1. Multi-stage builds
Don't ship build tools to production.
Build stage: Install everything. Compile everything. Runtime stage: Copy only what's needed to run.
2. Alpine base image
node:18 = 900MB
node:18-alpine = 110MB
Same Node.js. Way smaller base.
3. Dependency audit
Most projects have 50–200 npm packages.
You probably use 10–20.
The rest? Transitive dependencies you never imported.
Run `npm ls --all`. Be horrified. Start removing.
4. Layer optimization
Every `RUN` command = new layer.
Bad:

```dockerfile
RUN apt-get update
RUN apt-get install curl
RUN apt-get install wget
```

Good:

```dockerfile
RUN apt-get update && apt-get install -y \
    curl \
    wget \
    && rm -rf /var/lib/apt/lists/*
```

One layer. Cleaned up in the same command.
5. .dockerignore
We were copying everything.
```dockerfile
COPY . .
```

This included:
- node_modules (reinstalled anyway)
- .git (900MB of history)
- tests
- documentation
- .env files
- Random files
.dockerignore:

```
node_modules
.git
*.md
.env*
tests
coverage
.DS_Store
```

Saved 1.2GB right there.
The Mistakes I Made
Mistake 1: Chasing the wrong metric
First attempt optimized for "fewer Dockerfile lines."
Result: Unreadable. Still 3GB.
What mattered: Final image size. Not Dockerfile elegance.
Mistake 2: Not profiling first
I guessed what was big.
Should have run `docker history` first.
Wasted 2 hours optimizing the wrong things.
Mistake 3: Keeping "just in case" packages
"We might need Python later." "Build-essential could be useful." "Let's keep wget."
We didn't need any of it.
Add it when you need it. Not before.
What Happened After
Week 1: Deployed 47 times. Previous record: 12 times/week.
Why? Because deploys were fast now. Developers weren't afraid to deploy.
Month 1: Found 3 bugs we'd been living with for months.
Why? Fast deploys = faster iteration = faster debugging.
Month 3: Junior dev asked, "Why is our image so small?"
I showed him the old Dockerfile.
Him: "8.2GB?! How did this ever work?"
Me: "It didn't. That's why I fixed it."
For Your Dockerfile
Here's the checklist I use now:
1. Use Alpine base images
- `node:18-alpine`, not `node:18`
- `python:3.11-alpine`, not `python:3.11`
- Size difference: 700–900MB
2. Multi-stage builds
- Build stage: Install everything
- Runtime stage: Copy only runtime needs
- Don't ship compilers to production
3. Audit dependencies
- Run `npm ls --all` or `pip list`
- Remove unused packages
- Replace heavy packages with lightweight alternatives
4. Combine RUN commands
- One `RUN` command for related operations
- Clean up in the same layer
- `&& rm -rf /var/lib/apt/lists/*` after apt-get
5. .dockerignore
- Exclude: node_modules, .git, tests, docs
- Only copy what's needed to build
6. Order matters
- Copy package files first
- Install dependencies
- Copy source code last
- (Layers cache better this way)
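Item 6 in a minimal sketch (my illustration, not a Dockerfile from the article): the layers that change least come first, so a source edit only invalidates the layers below it.

```dockerfile
# Changes rarely: these two layers stay cached between builds
COPY package*.json ./
RUN npm ci
# Changes on every commit: keep it last so only these layers rebuild
COPY . .
RUN npm run build
```

With this ordering, `npm ci` re-runs only when package files change, not on every source edit.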
7. Don't install what you don't need
- "Just in case" packages cost GB
- Add them when needed, not before
📬 What I'm Building
I'm building ProdRescue AI — turns messy incident logs into clear postmortem reports in minutes.
Because I was tired of spending 8 hours writing reports for incidents that took 20 minutes to fix.
👉 Join the waitlist (2-min form)
Want to see real incident analysis? Check out this Black Friday payment system meltdown — actual production logs, AI-generated report:
📊 Black Friday SRE Case Study — Free case study: $360K revenue recovery breakdown with ultra-complex multi-region logs
Docker Production Resources
If you're wrestling with Docker in production, these helped me:
📚 Free Guides (Start Here):
🐳 Docker in Production Pack — Complete cheatsheet & troubleshooting guide. Covers image optimization, layer caching, multi-stage builds, and the exact techniques I used to shrink our image from 8.2GB to 127MB.
🔧 Kubernetes in Production Pack — Deployment, scaling & troubleshooting. Because after you fix your Docker image, you'll deploy it to K8s and discover a whole new set of problems.
📚 Paid Resources (Production Reality):
🚀 Backend Performance Rescue Kit — Find and fix the 20 bottlenecks killing your app. Includes container performance profiling, image size optimization strategies, and AWS cost reduction techniques.
🎯 Production Engineering Toolkit — Real production failures and how to prevent them. Features 7 Docker-related incidents including the "8GB image that cost $8K/month" case study.
Everything I've learned from production: devrimozcay.gumroad.com
Weekly: Real Production Engineering Stories
I write about Docker disasters, AWS cost explosions, and the messy reality of production systems every week.
Not the clean conference talk version. The 3 AM debugging version.
— That CEO question haunted me: "What the hell is in that image?" I didn't know. That's the problem. Most developers don't know what's in their Docker images. Run docker history on your image right now. You'll be surprised.
— After we fixed this, I checked our other services. Found 4 more images over 5GB. All the same mistakes. Fixed all of them in one day. Total savings: $23K/month.
— The junior dev who asked "Why is our image so small?" now maintains our Dockerfile standards. His first PR: a 200-line document on Docker best practices. It's pinned in our engineering channel. Sometimes asking "why" is the most valuable thing you can do.