Our Docker image was 8.2GB.

Every deploy took 18 minutes just to transfer the image.

AWS charged us $47 per deploy for data transfer.

We deployed 6 times a day.

That's $282/day. $8,460/month. Just to move a Docker image.

Here's how I got it to 127MB in one afternoon.

The Dockerfile That Cost Us $8,460/Month

This was our Dockerfile in March 2024:

FROM node:18
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y wget
RUN apt-get install -y git
RUN apt-get install -y python3
RUN apt-get install -y build-essential
COPY package.json .
RUN npm install
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]

Looks innocent, right?

Image size: 8.2GB
Layers: 47
Deploy time: 18 minutes
AWS transfer cost: $47/deploy

Every single RUN command created a new layer.

Every layer got cached. Every layer got shipped.

We were shipping apt-get update cache, npm install cache, build artifacts, source files, node_modules for development AND production.

Everything.

The Wake-Up Call

Friday afternoon. 4:47 PM.

Critical bug in production. Users couldn't checkout.

Me: "I'll push a fix in 5 minutes."

Git push. CI/CD triggered.

4:52 PM — Build started
4:58 PM — Build complete
4:59 PM — Pushing image to ECR…
5:01 PM — 12% uploaded
5:04 PM — 28% uploaded
5:08 PM — 51% uploaded

CEO on Slack: "How long until this is fixed?"

Me: "Image is still uploading. Maybe 10 more minutes."

CEO: "It takes 10 minutes to deploy a bug fix?"

Me: "The image is 8.2GB."

CEO: "What the hell is in that image?"

Good question.

What Was Actually In That 8.2GB

I pulled the image locally and analyzed it.

docker history our-app:latest --no-trunc --human

Layer breakdown:

  • Node.js base image: 900MB
  • apt-get packages: 1.2GB (including Python, build tools we never used)
  • npm install (all dependencies): 2.1GB
  • Source files: 180MB
  • Build artifacts: 340MB
  • Cached apt-get lists: 890MB
  • Old node_modules from previous builds: 1.8GB
  • Random stuff we forgot about: 0.8GB
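
To get a breakdown like this, you can ask docker history for just sizes and creating commands, then sort largest-first (a sketch; the image tag is the one from this article, substitute your own):

```shell
# Show the biggest layers first: human-readable size, then the
# Dockerfile command that created that layer.
docker history our-app:latest --no-trunc \
  --format '{{.Size}}\t{{.CreatedBy}}' | sort -hr | head -15
```

sort -h understands human-readable sizes (GB/MB), so the worst offenders float straight to the top.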

We were shipping:

  • Development dependencies in production
  • Build tools we only needed during build
  • Source TypeScript files alongside compiled JavaScript
  • Three different versions of node_modules (Docker layer caching gone wrong)
  • Python and build-essential (never used in production)

Production runtime actually needed:

  • Node.js runtime
  • Compiled JavaScript
  • Production dependencies

That's it.

The First Attempt: Multi-Stage Build

I rewrote the Dockerfile with multi-stage builds.

# Stage 1: Build
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so the runtime stage copies a lean node_modules
RUN npm prune --omit=dev

# Stage 2: Runtime
FROM node:18-slim
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package*.json ./
EXPOSE 3000
CMD ["node", "dist/index.js"]

Result:

  • Image size: 1.2GB
  • Deploy time: 4 minutes

Better. Not good enough.

The Second Attempt: Alpine Linux

Switched to Alpine base image.

# Stage 1: Build
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so the runtime stage copies a lean node_modules
RUN npm prune --omit=dev

# Stage 2: Runtime
FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/index.js"]

Result:

  • Image size: 420MB
  • Deploy time: 90 seconds

Getting closer.

The Final Version: 127MB

The breakthrough: node_modules was still huge (340MB).

Most of it? Unused dependencies.

I audited every package:

npm ls --all > deps.txt

Found:

  • lodash: Used 3 functions. Entire library: 24MB.
  • moment.js: Used for date formatting. 67MB. (Native Intl does this now)
  • aws-sdk: Imported entire SDK. Only needed S3 client. 89MB waste.
  • 47 other packages: Pulled in as transitive dependencies. Never used.
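
npm ls --all shows the tree, but it won't tell you what's unused. A tool like depcheck can flag packages nothing imports (its report depends entirely on your project, and the du line assumes those packages are installed):

```shell
# Flag dependencies that no source file actually imports.
npx depcheck

# Measure what the heavy packages cost on disk before replacing them.
du -sh node_modules/lodash node_modules/moment node_modules/aws-sdk
</shell>
```

Treat depcheck's output as a starting list, not a verdict — it can miss packages loaded dynamically or used only in config files.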

I replaced:

  • lodash → wrote 3 utility functions (12 lines)
  • moment → native Date and Intl
  • aws-sdk → @aws-sdk/client-s3 (only S3)
  • Removed unused dependencies

New package.json: 12 dependencies instead of 87.

Final Dockerfile:

# Build stage
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --ignore-scripts
COPY . .
RUN npm run build
# Drop devDependencies so the production stage copies a lean node_modules
RUN npm prune --omit=dev

# Production stage
FROM node:18-alpine
RUN apk add --no-cache tini
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package*.json ./
USER node
EXPOSE 3000
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "dist/index.js"]

Final result:

  • Image size: 127MB
  • Deploy time: 40 seconds
  • AWS transfer cost: $0.80/deploy

The Numbers

Before:

  • Image size: 8.2GB
  • Deploy time: 18 minutes
  • Cost per deploy: $47
  • Daily deploys: 6
  • Monthly cost: $8,460

After:

  • Image size: 127MB
  • Deploy time: 40 seconds
  • Cost per deploy: $0.80
  • Daily deploys: 6
  • Monthly cost: $144

Savings: $8,316/month

Same application. Same functionality. 64x smaller. 27x faster deploys.

What Actually Made the Difference

1. Multi-stage builds

Don't ship build tools to production.

Build stage: Install everything. Compile everything.
Runtime stage: Copy only what's needed to run.

2. Alpine base image

node:18 = 900MB
node:18-alpine = 110MB

Same Node.js. Way smaller base.

3. Dependency audit

Most projects have 50–200 npm packages.

You probably use 10–20.

The rest? Transitive dependencies you never imported.

Run npm ls --all. Be horrified. Start removing.

4. Layer optimization

Every RUN command = new layer.

Bad:

RUN apt-get update
RUN apt-get install curl
RUN apt-get install wget

Good:

RUN apt-get update && apt-get install -y \
    curl \
    wget \
    && rm -rf /var/lib/apt/lists/*

One layer. Cleaned up in the same command.

5. .dockerignore

We were copying everything.

COPY . .

This included:

  • node_modules (reinstalled anyway)
  • .git (900MB of history)
  • tests
  • documentation
  • .env files
  • Random files

.dockerignore:

node_modules
.git
*.md
.env*
tests
coverage
.DS_Store

Saved 1.2GB right there.

The Mistakes I Made

Mistake 1: Chasing the wrong metric

First attempt optimized for "fewer Dockerfile lines."

Result: Unreadable. Still 3GB.

What mattered: Final image size. Not Dockerfile elegance.

Mistake 2: Not profiling first

I guessed what was big.

Should have run docker history first.

Wasted 2 hours optimizing the wrong things.

Mistake 3: Keeping "just in case" packages

"We might need Python later." "Build-essential could be useful." "Let's keep wget."

We didn't need any of it.

Add it when you need it. Not before.

What Happened After

Week 1: Deployed 47 times. Previous record: 12 times/week.

Why? Because deploys were fast now. Developers weren't afraid to deploy.

Month 1: Found 3 bugs we'd been living with for months.

Why? Fast deploys = faster iteration = faster debugging.

Month 3: Junior dev asked, "Why is our image so small?"

I showed him the old Dockerfile.

Him: "8.2GB?! How did this ever work?"

Me: "It didn't. That's why I fixed it."

For Your Dockerfile

Here's the checklist I use now:

1. Use Alpine base images

  • node:18-alpine not node:18
  • python:3.11-alpine not python:3.11
  • Size difference: 700–900MB

2. Multi-stage builds

  • Build stage: Install everything
  • Runtime stage: Copy only runtime needs
  • Don't ship compilers to production

3. Audit dependencies

  • Run npm ls --all or pip list
  • Remove unused packages
  • Replace heavy packages with lightweight alternatives

4. Combine RUN commands

  • One RUN command for related operations
  • Clean up in the same layer
  • && rm -rf /var/lib/apt/lists/* after apt-get

5. .dockerignore

  • Exclude: node_modules, .git, tests, docs
  • Only copy what's needed to build

6. Order matters

  • Copy package files first
  • Install dependencies
  • Copy source code last
  • (Layers cache better this way)
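
In Dockerfile terms, that ordering looks like this — a source edit invalidates only the last two layers, while the npm ci layer stays cached (a sketch; the src/ path is an assumption about your layout):

```dockerfile
COPY package*.json ./   # changes rarely → layer stays cached
RUN npm ci              # re-runs only when package files change
COPY src/ ./src/        # changes often → invalidates from here down
RUN npm run build
```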

7. Don't install what you don't need

  • "Just in case" packages cost GB
  • Add them when needed, not before

📬 What I'm Building

I'm building ProdRescue AI — turns messy incident logs into clear postmortem reports in minutes.

Because I was tired of spending 8 hours writing reports for incidents that took 20 minutes to fix.

👉 Join the waitlist (2-min form)

Want to see real incident analysis? Check out this Black Friday payment system meltdown — actual production logs, AI-generated report:

📊 Black Friday SRE Case Study — Free case study: $360K revenue recovery breakdown with ultra-complex multi-region logs

Docker Production Resources

If you're wrestling with Docker in production, these helped me:

📚 Free Guides (Start Here):

🐳 Docker in Production Pack — Complete cheatsheet & troubleshooting guide. Covers image optimization, layer caching, multi-stage builds, and the exact techniques I used to shrink our image from 8.2GB to 127MB.

🔧 Kubernetes in Production Pack — Deployment, scaling & troubleshooting. Because after you fix your Docker image, you'll deploy it to K8s and discover a whole new set of problems.

📚 Paid Resources (Production Reality):

🚀 Backend Performance Rescue Kit — Find and fix the 20 bottlenecks killing your app. Includes container performance profiling, image size optimization strategies, and AWS cost reduction techniques.

🎯 Production Engineering Toolkit — Real production failures and how to prevent them. Features 7 Docker-related incidents including the "8GB image that cost $8K/month" case study.

Everything I've learned from production: devrimozcay.gumroad.com

Weekly: Real Production Engineering Stories

I write about Docker disasters, AWS cost explosions, and the messy reality of production systems every week.

Not the clean conference talk version. The 3 AM debugging version.

👉 Subscribe on Substack

— That CEO question haunted me: "What the hell is in that image?" I didn't know. That's the problem. Most developers don't know what's in their Docker images. Run docker history on your image right now. You'll be surprised.

— After we fixed this, I checked our other services. Found 4 more images over 5GB. All the same mistakes. Fixed all of them in one day. Total savings: $23K/month.

— The junior dev who asked "Why is our image so small?" now maintains our Dockerfile standards. His first PR: a 200-line document on Docker best practices. It's pinned in our engineering channel. Sometimes asking "why" is the most valuable thing you can do.