If you've ever worked with a massive dataset of images, you know how quickly it can get out of control. Duplicate files, wrong formats, corrupt images, and inconsistent sizes will ruin your pipeline fast.
Last month, I inherited a dataset of over 10,000 user-uploaded images for a machine learning project. They were a mess. Some were screenshots, some were photographs, some had no extension, some were 10x too large, and many were corrupted. Instead of cleaning them manually, I built a Python tool that fixed all of it.
Here's the full breakdown of how I automated the cleanup.
1. Scanning and Verifying Image Files
The first step was just figuring out what I had. I needed to scan the directory, confirm file extensions, and detect any corrupt files.
I used the os and Pillow libraries:
pip install pillow
import os
from PIL import Image

def scan_images(root_dir):
    valid_images = []
    for filename in os.listdir(root_dir):
        filepath = os.path.join(root_dir, filename)
        try:
            with Image.open(filepath) as img:
                img.verify()  # raises if image is corrupt
            valid_images.append(filepath)
        except Exception as e:
            print(f"Corrupt or unreadable: {filename} → {e}")
    return valid_images

valid_files = scan_images("dataset/images")
print(f"Found {len(valid_files)} valid images")

This caught over 400 broken images I had no idea were corrupted.
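Since some of the uploads had no extension at all, it can help to guess the real format from a file's first bytes rather than trusting its name. The following is a minimal sketch using only the standard library. The `sniff_format` helper and its `SIGNATURES` table are hypothetical names introduced for illustration, not part of the tool above, and they cover only a handful of common formats.

```python
# Leading "magic" bytes for a few common image formats.
# Hypothetical table for illustration; extend it for other formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]


def sniff_format(path):
    """Return a format name guessed from the file header, or None."""
    with open(path, "rb") as fh:
        header = fh.read(12)
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    # WebP is a RIFF container: "RIFF", four size bytes, then "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None
```

A header match only says what the file claims to be. It does not replace the `verify()` check, which is still needed to catch damaged files.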
2. Converting All Formats to JPG
Next up: some files were .png, some .webp, some even .tiff. I wanted everything in .jpg for consistency and compatibility.
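def convert_to_jpg(file_path):
    base, _ = os.path.splitext(file_path)
    new_path = base + ".jpg"
    # On case-insensitive filesystems, photo.JPG and photo.jpg are the same file
    if os.path.exists(new_path) and os.path.samefile(file_path, new_path):
        return
    try:
        with Image.open(file_path) as img:
            rgb = img.convert("RGB")  # JPEG cannot store an alpha channel
            rgb.save(new_path, "JPEG")
        os.remove(file_path)  # delete the original only after it is closed
        print(f"Converted {file_path} to JPG")
    except Exception as e:
        print(f"Failed to convert {file_path}: {e}")

for file in valid_files:
    if not file.endswith(".jpg"):
        convert_to_jpg(file)

This reduced storage by 30% and made my later pipeline steps a lot simpler.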
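One detail worth knowing about the conversion step: `convert("RGB")` on an image with transparency simply drops the alpha channel, so transparent areas take on whatever color was stored underneath, which is often black. A possible workaround, sketched here with a hypothetical `flatten_to_rgb` helper that is not part of the tool above, is to composite the image onto a solid background first. The white default is an assumption; pick whatever suits the dataset.

```python
from PIL import Image


def flatten_to_rgb(img, background=(255, 255, 255)):
    """Composite an image onto a solid background and return an RGB image."""
    rgba = img.convert("RGBA")
    # Opaque canvas of the same size, filled with the background color
    canvas = Image.new("RGBA", rgba.size, background + (255,))
    return Image.alpha_composite(canvas, rgba).convert("RGB")
```

Inside `convert_to_jpg`, this would stand in for the plain `img.convert('RGB')` call before saving.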
3. Resizing Massive Images
Some files were over 5000x5000 pixels. For training, I needed everything scaled to 512x512.
def resize_image(file_path, size=(512, 512)):
    try:
        with Image.open(file_path) as img:
            # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
            resized = img.resize(size, Image.LANCZOS)
        resized.save(file_path)
        print(f"Resized {file_path}")
    except Exception as e:
        print(f"Failed to resize {file_path}: {e}")

for file in os.listdir("dataset/images"):
    if file.endswith(".jpg"):
        resize_image(os.path.join("dataset/images", file))

Now every image had the same dimensions, so there were no surprises during training.
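Resizing straight to 512x512 stretches any image that is not already square. If distortion matters for the task, one alternative is to scale the image to fit inside the target and pad the rest. The sketch below uses Pillow's `ImageOps.pad` for that. The `resize_with_padding` name and the black fill color are assumptions for illustration, not something the original tool does.

```python
from PIL import Image, ImageOps


def resize_with_padding(img, size=(512, 512), fill=(0, 0, 0)):
    """Scale img to fit inside size, then pad the remainder with fill."""
    # ImageOps.pad keeps the aspect ratio and centers the result by default
    return ImageOps.pad(img.convert("RGB"), size, method=Image.LANCZOS, color=fill)
```

Whether padding, cropping, or stretching is the right choice depends on the model, so treat this as one option rather than a drop-in replacement.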
4. Detecting and Removing Duplicate Images
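Duplicates hurt your model's generalization. I used perceptual hashing to find images that look the same: average_hash compares a small grayscale version of each image, so it also flags near-identical copies such as resized or re-encoded files, not only pixel-for-pixel matches.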
pip install imagehash
import os
import imagehash
from PIL import Image
from collections import defaultdict

def find_duplicates(image_folder):
    hashes = defaultdict(list)  # local, so repeated calls start fresh
    for filename in sorted(os.listdir(image_folder)):
        path = os.path.join(image_folder, filename)
        try:
            with Image.open(path) as img:
                h = str(imagehash.average_hash(img))
            hashes[h].append(path)
        except Exception:
            continue  # skip anything Pillow cannot open
    duplicates = []
    for hash_val, paths in hashes.items():
        if len(paths) > 1:
            duplicates.extend(paths[1:])  # keep the first, mark the rest
    return duplicates

dupes = find_duplicates("dataset/images")
print(f"Found {len(dupes)} duplicates")
for file in dupes:
    os.remove(file)

I removed over 600 duplicates with this block alone.
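Because `average_hash` is a perceptual hash, two different pictures can occasionally share a hash value, and deleting on that basis alone carries some risk. For a stricter pass that only catches byte-for-byte copies, a cryptographic digest of the file contents is one option. This is a standard-library sketch. `file_digest` and `find_exact_duplicates` are hypothetical helper names, not part of the tool above.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(folder):
    """Return paths whose bytes match an earlier file in sorted order."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            groups[file_digest(path)].append(path)
    # Keep the first file of each group, report the rest
    return [p for paths in groups.values() for p in paths[1:]]
```

This compares file bytes, not decoded pixels, so the same picture saved twice with different settings will not match. The two approaches complement each other.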
5. Renaming Files to Match Dataset Conventions
Image file names were chaotic — some included emojis, others had spaces or strange characters. I needed them renamed as img_00001.jpg, img_00002.jpg, and so on.
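def rename_images(folder):
    files = sorted(f for f in os.listdir(folder) if f.endswith(".jpg"))
    # Pass 1: move to temporary names so a new name never overwrites a file
    # that has not been renamed yet
    for idx, filename in enumerate(files, start=1):
        os.rename(os.path.join(folder, filename), os.path.join(folder, f"tmp_{idx:05d}.rename"))
    # Pass 2: assign the final names, starting at img_00001.jpg
    for idx in range(1, len(files) + 1):
        os.rename(os.path.join(folder, f"tmp_{idx:05d}.rename"), os.path.join(folder, f"img_{idx:05d}.jpg"))
    print(f"Renamed {len(files)} images")

rename_images("dataset/images")

Now my dataset looked clean and professional, exactly what I wanted.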
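Bulk renaming is hard to undo once the original names are gone. One way to keep a trail is to compute the mapping first and save it before touching any files. The sketch below assumes that approach. `plan_renames` and `write_manifest` are hypothetical helpers, and the CSV layout is just one reasonable choice.

```python
import csv
import os


def plan_renames(folder, prefix="img_", ext=".jpg"):
    """Return (old_name, new_name) pairs without touching the disk."""
    files = sorted(f for f in os.listdir(folder) if f.endswith(ext))
    return [(old, f"{prefix}{idx:05d}{ext}") for idx, old in enumerate(files, start=1)]


def write_manifest(pairs, manifest_path):
    """Save the old-to-new mapping as CSV so renames can be traced back."""
    with open(manifest_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["old_name", "new_name"])
        writer.writerows(pairs)
```

Printing the planned pairs also works as a dry run before committing to the rename.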
6. Packaging Everything Into a Command-Line Tool
Once the individual parts worked, I merged them into a CLI tool using argparse.
import argparse

parser = argparse.ArgumentParser(description="Clean and normalize image dataset")
parser.add_argument('--path', type=str, required=True, help='Path to image folder')
args = parser.parse_args()

# reuse functions: scan, convert, resize, de-dupe, rename
files = scan_images(args.path)
for f in files:
    if not f.endswith(".jpg"):
        convert_to_jpg(f)  # writes <name>.jpg and deletes the original
        f = os.path.splitext(f)[0] + ".jpg"
    if os.path.exists(f):  # conversion may have failed
        resize_image(f)
dupes = find_duplicates(args.path)
for d in dupes:
    os.remove(d)
rename_images(args.path)

I could now run:

python clean_images.py --path dataset/images

…and walk away.
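If the tool grows, it may be worth exposing a few more options on the command line. Here is a rough sketch of how the parser could be extended. The `--size` and `--dry-run` flags and the `build_parser` function are hypothetical additions, and the pipeline functions would have to be adapted to honor them. Only the parsing is shown.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Clean and normalize image dataset")
    parser.add_argument("--path", required=True, help="Path to image folder")
    # Hypothetical extras, not in the original tool:
    parser.add_argument("--size", type=int, default=512,
                        help="Target width and height in pixels")
    parser.add_argument("--dry-run", action="store_true",
                        help="Report planned changes without modifying files")
    return parser


# Passing an explicit argument list makes the parser easy to try out and test;
# a real script would call parse_args() with no arguments to read sys.argv.
args = build_parser().parse_args(["--path", "dataset/images", "--size", "256", "--dry-run"])
print(args.path, args.size, args.dry_run)
```

Wrapping the parser in a function keeps it importable, so the argument handling can be checked without running the whole cleanup.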
7. Bonus: Building a Progress Bar for Long Tasks
When processing 10k images, you want feedback. So I added tqdm:
pip install tqdm
from tqdm import tqdm

for file in tqdm(valid_files, desc="Resizing images"):
    resize_image(file)

This gave a smooth progress bar, which is essential for long-running jobs.
Conclusion
This project started as a headache and ended up being one of the most useful automation tools I've built this year. The full pipeline now handles corruption, format conversion, resizing, deduplication, and renaming — and can be reused for any dataset I get in the future.
I could've done all this manually… but it would've taken me a month.