If you've ever worked with a massive dataset of images, you know how quickly it can get out of control. Duplicate files, wrong formats, corrupt images, and inconsistent sizes will ruin your pipeline fast.

Last month, I inherited a dataset of over 10,000 user-uploaded images for a machine learning project. They were a mess. Some were screenshots, some were photographs, some had no extension, some were 10x too large, and many were corrupted. Instead of cleaning them manually, I built a Python tool that fixed all of it.

Here's the full breakdown of how I automated the cleanup.

1. Scanning and Verifying Image Files

The first step was just figuring out what I had. I needed to scan the directory, confirm file extensions, and detect any corrupt files.

I used the standard-library os module together with Pillow:

pip install pillow

import os
from PIL import Image

def scan_images(root_dir):
    valid_images = []
    for filename in os.listdir(root_dir):
        filepath = os.path.join(root_dir, filename)
        if not os.path.isfile(filepath):
            continue  # skip subdirectories instead of reporting them as corrupt
        try:
            with Image.open(filepath) as img:
                img.verify()  # raises if image is corrupt
            valid_images.append(filepath)
        except Exception as e:
            print(f"Corrupt or unreadable: {filename} → {e}")
    return valid_images

valid_files = scan_images("dataset/images")
print(f"Found {len(valid_files)} valid images")

This caught over 400 broken images I had no idea were corrupted.
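
Pillow identifies an image by its content, not its name, so a PNG saved as photo.jpg (or a file with no extension at all) passes verify() without complaint. To see which files are mislabeled, one option is to compare each extension against the file's leading "magic bytes". The sketch below is a minimal illustration using only the standard library. The names sniff_format and find_mislabeled are hypothetical helpers, not part of the original tool, and the signature table covers only the formats mentioned in this post.

```python
import os

# File signatures ("magic bytes") for the formats mentioned in this post.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]

def sniff_format(path):
    """Guess an image format from its first bytes; return None if unrecognized."""
    with open(path, "rb") as fh:
        header = fh.read(12)
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    # WebP is a RIFF container: "RIFF", four size bytes, then "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None

def find_mislabeled(root_dir):
    """Yield (filename, extension, sniffed_format) where the two disagree."""
    aliases = {"jpg": "jpeg", "tif": "tiff"}
    for filename in sorted(os.listdir(root_dir)):
        path = os.path.join(root_dir, filename)
        if not os.path.isfile(path):
            continue
        ext = os.path.splitext(filename)[1].lower().lstrip(".")
        ext = aliases.get(ext, ext)
        sniffed = sniff_format(path)
        if sniffed is not None and sniffed != ext:
            yield filename, ext, sniffed
```

A header check says nothing about whether the rest of the file is intact, so it complements verify() rather than replacing it.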

2. Converting All Formats to JPG

Next up: some files were .png, some .webp, some even .tiff. I wanted everything in .jpg for consistency and compatibility.

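def convert_to_jpg(file_path):
    base, ext = os.path.splitext(file_path)
    new_path = base + ".jpg"
    if ext.lower() == ".jpg":
        if ext != ".jpg":
            os.rename(file_path, new_path)  # e.g. photo.JPG: only the extension's case changes
        return  # already a JPG by name; saving over it and then deleting would lose the file
    try:
        with Image.open(file_path) as img:
            # JPEG has no alpha channel, so convert('RGB') discards any transparency
            rgb = img.convert('RGB')
            rgb.save(new_path, "JPEG")
        os.remove(file_path)
        print(f"Converted {file_path} to JPG")
    except Exception as e:
        print(f"Failed to convert {file_path}: {e}")

for file in valid_files:
    convert_to_jpg(file)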

This reduced storage by 30% and made my later pipeline steps a lot simpler.
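
One gap worth knowing about in the conversion step: cat.png and cat.webp both map to cat.jpg, so whichever is converted second overwrites the first. A minimal sketch of one way around that, assuming a numeric suffix is an acceptable naming scheme, is to pick a free target name before saving. unique_jpg_path is a hypothetical helper, not something from the original tool.

```python
import os

def unique_jpg_path(file_path):
    """Return a .jpg path for file_path that does not collide with an existing file."""
    base, _ = os.path.splitext(file_path)
    candidate = base + ".jpg"
    counter = 1
    # Keep adding a numeric suffix until the name is free.
    while os.path.exists(candidate):
        candidate = f"{base}_{counter}.jpg"
        counter += 1
    return candidate
```

Inside convert_to_jpg, the result of unique_jpg_path(file_path) would take the place of base + ".jpg" as the save target. Since every file gets renamed in a later step anyway, the suffix only has to keep files apart temporarily.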

3. Resizing Massive Images

Some files were over 5000x5000 pixels. For training, I needed everything scaled to 512x512.

def resize_image(file_path, size=(512, 512)):
    try:
        with Image.open(file_path) as img:
            img = img.resize(size, Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
            img.save(file_path)
            print(f"Resized {file_path}")
    except Exception as e:
        print(f"Failed to resize {file_path}: {e}")

for file in os.listdir("dataset/images"):
    if file.endswith(".jpg"):
        resize_image(os.path.join("dataset/images", file))

Now every image had the same dimensions — no surprises during training.
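
Resizing straight to 512x512 stretches anything that isn't square. If distortion matters for your model, a common alternative is to scale the image to fit inside the box and pad the rest ("letterboxing"). The sketch below shows the arithmetic and is not what the original tool does. letterbox_geometry and letterbox are hypothetical names, and the black fill color is an arbitrary assumption.

```python
def letterbox_geometry(src_size, box=(512, 512)):
    """Return (new_size, offset) for fitting src_size inside box without distortion."""
    src_w, src_h = src_size
    box_w, box_h = box
    # One scale factor for both axes keeps the aspect ratio intact.
    scale = min(box_w / src_w, box_h / src_h)
    new_w = max(1, round(src_w * scale))
    new_h = max(1, round(src_h * scale))
    # Center the scaled image inside the box.
    offset = ((box_w - new_w) // 2, (box_h - new_h) // 2)
    return (new_w, new_h), offset


def letterbox(img, box=(512, 512), fill=(0, 0, 0)):
    """Resize a Pillow image to fit box and pad the remainder with fill."""
    from PIL import Image  # imported here so the geometry helper stays dependency-free

    new_size, offset = letterbox_geometry(img.size, box)
    canvas = Image.new("RGB", box, fill)
    canvas.paste(img.convert("RGB").resize(new_size, Image.LANCZOS), offset)
    return canvas
```

For a 5000x2500 source, letterbox_geometry returns a 512x256 image placed 128 pixels from the top, leaving equal padding above and below. Note that this version also scales small images up to fill the box.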

4. Detecting and Removing Duplicate Images

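Duplicates hurt your model's generalization. I used perceptual hashing to find them: imagehash's average_hash reduces each image to a small grayscale grid and records which pixels are brighter than the mean, so images that look the same hash to the same value even if their bytes differ.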

pip install imagehash

import os
import imagehash
from PIL import Image
from collections import defaultdict

def find_duplicates(image_folder):
    hashes = defaultdict(list)  # built per call so repeated runs start clean
    for filename in sorted(os.listdir(image_folder)):
        path = os.path.join(image_folder, filename)
        try:
            with Image.open(path) as img:
                h = str(imagehash.average_hash(img))
                hashes[h].append(path)
        except Exception:
            continue  # skip anything Pillow cannot open

    duplicates = []
    for hash_val, paths in hashes.items():
        if len(paths) > 1:
            duplicates.extend(paths[1:])  # keep the first, mark the rest
    return duplicates

dupes = find_duplicates("dataset/images")
print(f"Found {len(dupes)} duplicates")

for file in dupes:
    os.remove(file)

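I removed over 600 duplicates with this block alone. Because average_hash is a perceptual hash, it can also flag images that are merely very similar, so it's worth spot-checking the list before deleting anything.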
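
If you only want to delete files that are guaranteed to be the same, a stricter check is to hash the raw bytes of each file. The sketch below uses SHA-256 from the standard library. It is an alternative technique, not the approach used above, and file_digest and find_exact_duplicates are hypothetical names.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_exact_duplicates(image_folder):
    """Return paths whose bytes match an earlier file (first in sorted order is kept)."""
    by_digest = defaultdict(list)
    for filename in sorted(os.listdir(image_folder)):
        path = os.path.join(image_folder, filename)
        if os.path.isfile(path):
            by_digest[file_digest(path)].append(path)
    duplicates = []
    for paths in by_digest.values():
        duplicates.extend(paths[1:])
    return duplicates
```

The trade-off is that byte hashing only matches files that are identical byte for byte. The same picture saved in two formats, or at two sizes, will not match, which is exactly the case a perceptual hash is meant to catch.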

5. Renaming Files to Match Dataset Conventions

Image file names were chaotic — some included emojis, others had spaces or strange characters. I needed them renamed as img_00001.jpg, img_00002.jpg, and so on.

def rename_images(folder):
    files = sorted([f for f in os.listdir(folder) if f.endswith(".jpg")])
    for idx, filename in enumerate(files, start=1):  # numbering starts at img_00001
        new_name = f"img_{idx:05d}.jpg"
        os.rename(os.path.join(folder, filename), os.path.join(folder, new_name))
    print(f"Renamed {len(files)} images")

rename_images("dataset/images")

Now my dataset looked clean and professional — exactly what I wanted.
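
One caveat with renaming in place: if the folder already contains a file named like img_00001.jpg (say, from an earlier run), a rename can land on top of a file that hasn't been processed yet. On POSIX systems os.rename silently replaces the destination, while on Windows it raises an error. A minimal sketch of a two-pass rename that avoids this, assuming all files live in one folder, is shown below. rename_images_safely is a hypothetical name, and the temporary-name scheme is just one possible choice.

```python
import os
import uuid

def rename_images_safely(folder, prefix="img_", ext=".jpg"):
    """Rename every .jpg in folder to img_00001.jpg, img_00002.jpg, ... without overwriting."""
    files = sorted(f for f in os.listdir(folder) if f.endswith(ext))
    tag = uuid.uuid4().hex  # unique marker so temporary names cannot clash with inputs
    staged = []
    # Pass 1: move each file to a temporary name that no input file can have.
    for idx, filename in enumerate(files, start=1):
        tmp_name = f".tmp_{tag}_{idx:05d}{ext}"
        os.rename(os.path.join(folder, filename), os.path.join(folder, tmp_name))
        staged.append(tmp_name)
    # Pass 2: move the temporary names to their final sequential names.
    for idx, tmp_name in enumerate(staged, start=1):
        final_name = f"{prefix}{idx:05d}{ext}"
        os.rename(os.path.join(folder, tmp_name), os.path.join(folder, final_name))
    return len(staged)
```

With a folder holding a.jpg and img_00001.jpg, a single-pass rename would move a.jpg onto img_00001.jpg first. The two-pass version keeps both files.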

6. Packaging Everything Into a Command-Line Tool

Once the individual parts worked, I merged them into a CLI tool using argparse.

import argparse

parser = argparse.ArgumentParser(description="Clean and normalize image dataset")
parser.add_argument('--path', type=str, required=True, help='Path to image folder')
args = parser.parse_args()

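# reuse functions: scan, convert, resize, de-dupe, rename
files = scan_images(args.path)
for f in files:
    if not f.endswith(".jpg"):
        convert_to_jpg(f)  # writes <name>.jpg and deletes the original
# Converted files have new paths, so list the folder again before resizing
for name in os.listdir(args.path):
    if name.endswith(".jpg"):
        resize_image(os.path.join(args.path, name))
dupes = find_duplicates(args.path)
for d in dupes:
    os.remove(d)
rename_images(args.path)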

I could now run:

python clean_images.py --path dataset/images

…and walk away.
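
Since the tool deletes and renames files, it can help to make the destructive parts optional. The sketch below extends the parser with two flags, --size and --dry-run. Both are hypothetical additions, not options the original script has, and the pipeline calls themselves are left as a comment.

```python
import argparse

def build_parser():
    """Build the command-line parser; --size and --dry-run are illustrative extras."""
    parser = argparse.ArgumentParser(description="Clean and normalize image dataset")
    parser.add_argument("--path", required=True, help="Path to image folder")
    parser.add_argument("--size", type=int, default=512,
                        help="Target width and height in pixels (default: 512)")
    parser.add_argument("--dry-run", action="store_true",
                        help="Report what would change without deleting or renaming")
    return parser

def main(argv=None):
    """Parse arguments (sys.argv by default) and report the chosen settings."""
    args = build_parser().parse_args(argv)
    mode = "dry run" if args.dry_run else "live run"
    print(f"Cleaning {args.path} at {args.size}x{args.size} ({mode})")
    # The pipeline steps from this post would be called here, each one
    # checking args.dry_run before it deletes or renames a file.
    return args
```

In a script, main() would be called under an if __name__ == "__main__": guard, so the functions can also be imported without parsing arguments or touching any files.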

7. Bonus: Building a Progress Bar for Long Tasks

When processing 10k images, you want feedback. So I added tqdm:

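pip install tqdm

from tqdm import tqdm

# valid_files predates the conversion step, so list the current .jpg files instead
jpg_files = [os.path.join("dataset/images", f)
             for f in os.listdir("dataset/images") if f.endswith(".jpg")]
for file in tqdm(jpg_files, desc="Resizing images"):
    resize_image(file)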

This gave a smooth progress bar — essential for long-running jobs.
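
Print statements inside the loop tend to break up the bar, so one pattern is to collect failures quietly and report them once the run finishes. The sketch below is an assumption-laden illustration. run_with_progress is a hypothetical helper, and it expects the step function to raise on failure, which resize_image as written does not (it prints and swallows errors). The fallback lets the script run without a bar when tqdm isn't installed.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the script still runs (without a bar) when tqdm is missing.
    def tqdm(iterable, **kwargs):
        return iterable

def run_with_progress(items, step, desc="Processing"):
    """Apply step to every item under a progress bar; return a list of (item, error)."""
    failures = []
    for item in tqdm(items, desc=desc):
        try:
            step(item)
        except Exception as exc:  # record the failure and keep going
            failures.append((item, exc))
    return failures
```

After the loop, the failures list can be printed or written to a log file in one go.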

Conclusion

This project started as a headache and ended up being one of the most useful automation tools I've built this year. The full pipeline now handles corruption, format conversion, resizing, deduplication, and renaming — and can be reused for any dataset I get in the future.

I could've done all this manually… but it would've taken me a month.
