Detecting Fingerprint Deepfakes Using Ensemble Deep Learning: When GANs Learn to Fake Your…

There is something unsettling about the idea that the same generative AI technology used to create convincing fake faces and voices can also produce fake fingerprints. Fingerprint authentication has been considered one of the most reliable forms of biometric verification for decades. Unlike passwords, you cannot change your fingerprint if it gets compromised. Unlike tokens, you always have it with you. The stability and uniqueness of fingerprints are precisely what makes them valuable as authentication factors.

But GANs have changed the calculus. Tools like PrintsGAN can now synthesize fingerprint images that look, at the pixel level, essentially identical to real ones. Physical spoofs like silicone molds and gelatin prints have been around for years, and fingerprint liveness detection has improved significantly in response. The new problem is that you no longer need physical materials. An attacker with access to a fingerprint image (from a database breach, a surface you touched, or a high-resolution photograph) can potentially generate a synthetic fingerprint that fools a scanner without ever creating a physical object.

This project addresses that threat head-on. We built an ensemble deep learning system combining three convolutional neural networks using a weighted voting scheme, trained and tested on a dataset that includes not just traditional spoof types but GAN-generated synthetic fingerprints. The ensemble reached 96.7% accuracy and, more importantly, achieved an equal error rate of just 1.3%, which is the number that matters most when you are designing a system where false acceptances have serious security consequences.

Why a Single CNN Is Not Enough

Before getting into the architecture, it is worth explaining why we used an ensemble rather than just training one large model.

The fingerprint spoofing problem has an interesting structure. Different types of spoof have different visual characteristics. Silicone molds tend to have slightly different ridge texture than real fingers. Gelatin prints often show air bubbles or surface irregularities. 3D-printed replicas have a characteristic layer texture from the printing process. GAN-generated synthetic fingerprints are different again: they do not have the physical artifacts of material-based spoofs, but they have their own distinctive patterns that come from how generative models learn to produce plausible-looking minutiae.

No single architecture is equally good at detecting all of these. EfficientNet-B0, which uses compound scaling to balance depth, width, and resolution, tends to excel at capturing fine-grained texture differences across the whole image. ResNet-18, with its residual connections, is better at learning deep hierarchical features that distinguish overall structural patterns. The third member is a lightweight custom CNN we called DIET. The name borrows from "Data-Efficient Image Transformer", but the model described here is a plain convolutional network, not a transformer. It is optimized for fast extraction of local patterns and brings architectural diversity that reduces the correlation between errors across models.

When you combine classifiers that make different kinds of errors, the ensemble is more robust than any individual member. An input that fools EfficientNet might not fool ResNet-18. A spoof type that ResNet-18 misclassifies might be correctly identified by DIET. Taking a weighted average of the probability outputs spreads the risk across all three models.

The Dataset

We built a custom dataset containing 6,080 genuine fingerprint images from 250 different individuals, with approximately 50 samples per person, and 7,460 spoof images spanning five fabrication methods: silicone molds, gelatin prints, 3D-printed replicas, GAN-generated synthetic fingerprints, and composite spoofs that combine multiple materials.

The dataset is balanced to give roughly equal representation of live and spoof classes overall, though within the spoof class there is intentional diversity in the fabrication methods. This is important because a model trained on mostly one type of spoof, say silicone molds, may not generalize to GAN-generated fingerprints, which have a completely different visual signature.

All images were captured at standard scanner resolution. The GAN-generated fingerprints were produced using established generative models trained on fingerprint images, specifically to simulate the kind of synthetic fingerprints an attacker might produce using publicly available tools.

Preprocessing Pipeline

Before feeding images to the models, each one passes through a four-step preprocessing pipeline.

Grayscale conversion removes color channel redundancy. Real fingerprints and their spoof counterparts have almost no meaningful color variation, and processing three channels would just add unnecessary computation without contributing useful information.

Resizing to 224 by 224 pixels standardizes input dimensions across all three CNN architectures. EfficientNet-B0 and ResNet-18 both expect 224x224 inputs by default, and standardizing here means the preprocessing is consistent across the ensemble.

Normalization using ImageNet statistics (the per-channel mean and standard deviation) stabilizes gradient-based learning during fine-tuning. Because two of the three models start from ImageNet pretrained weights, using the same statistics that were used during their original training helps preserve the representations they have already learned.
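As a minimal sketch of the normalization arithmetic: the ImageNet statistics conventionally used with torchvision's pretrained models are a mean of (0.485, 0.456, 0.406) and a standard deviation of (0.229, 0.224, 0.225). The sketch assumes the single grayscale channel is copied into three channels so it matches the input the pretrained networks expect. The write-up does not say how this was actually handled, and normalize_gray_pixel is a hypothetical helper name.

```python
# ImageNet per-channel statistics (RGB order) used with torchvision pretrained models.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_gray_pixel(value):
    """Normalize one grayscale intensity in [0, 1] into three channels.

    Assumption: the grayscale value is replicated across all three channels
    before the ImageNet mean and standard deviation are applied.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("expected an intensity scaled to [0, 1]")
    return tuple((value - m) / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD))


# A mid-gray pixel ends up slightly different in each channel because the
# per-channel statistics differ.
mid_gray = normalize_gray_pixel(0.5)
```

The same mapping applied to every pixel is what a per-channel normalization transform does to a whole image tensor.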

Conversion to PyTorch tensors prepares the data for GPU-accelerated forward passes. The dataset is split 80/20 between training and validation with stratified sampling to ensure each spoof type is proportionally represented in both splits.
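One way the stratified 80/20 split could look is sketched below using only the standard library. The function name stratified_split, the seed, and the toy label counts are all illustrative assumptions and do not come from the project code; the point is only that each class (live, plus each spoof type) is split separately so the proportions carry over to both sides.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_indices, val_indices), splitting each label proportionally."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val.extend(indices[:n_val])    # validation share of this class
        train.extend(indices[n_val:])  # the rest goes to training
    return sorted(train), sorted(val)


# Toy labels only; these counts are illustrative, not the real dataset.
labels = ["live"] * 50 + ["silicone"] * 20 + ["gelatin"] * 20 + ["gan"] * 10
train_idx, val_idx = stratified_split(labels)
```

With these toy counts, the validation side receives 10 live, 4 silicone, 4 gelatin, and 2 GAN samples, mirroring the class mix of the full list.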

Data augmentation during training included random horizontal flips and slight rotation, which helps the models generalize to fingerprints captured at slightly different orientations without overfitting to the specific alignment in the training data.

Architecture Details

EfficientNet-B0 is the workhorse of the ensemble. Its compound scaling approach jointly optimizes network depth, width, and image resolution rather than scaling one dimension at a time. For fingerprint analysis, this means it simultaneously develops deep feature hierarchies for structural analysis and wide feature maps for capturing texture across the full image at high resolution. The pretrained ImageNet weights provide a strong initialization that transfers well to fingerprint texture classification.

ResNet-18 uses skip connections that allow gradients to flow directly through multiple layers during backpropagation, which helps with the vanishing gradient problem that plagues deep networks trained from scratch. For this task, ResNet-18's residual architecture is particularly good at learning the difference between genuine ridge patterns and the subtle structural artifacts that appear in physical spoofs.

DIET is our custom lightweight CNN. It has two convolutional layers followed by batch normalization and pooling, then three fully connected layers before the final binary output. It is designed to be fast, with an inference time of 8.9 milliseconds compared to EfficientNet-B0's 18.7 milliseconds, while still contributing meaningfully to the ensemble through diverse feature extraction. Its architectural simplicity also makes it less likely to overfit to the specific textures in the training set, which helps ensemble diversity.
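To make the layer description concrete, here is a rough PyTorch sketch of a network with that shape. It is not the actual DIET implementation: the class name DietSketch, the channel widths (16 and 32), the fully connected sizes, the adaptive pooling step, and the three-channel input are all assumptions chosen only to produce a small, runnable model.

```python
import torch
from torch import nn


class DietSketch(nn.Module):
    """Hypothetical stand-in: 2 conv blocks, 3 FC layers, then a 2-way output."""

    def __init__(self, in_channels=3, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),          # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),          # 112 -> 56
            nn.AdaptiveAvgPool2d(7),  # assumed extra pooling to keep FC layers small
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 16), nn.ReLU(inplace=True),
            nn.Linear(16, num_classes),  # logits for live vs. spoof
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

A batch of 224x224 inputs passed through this sketch yields one pair of logits per image, which a softmax turns into the per-model probability used by the ensemble.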

The final ensemble prediction is computed as:

P_final = 0.35 * P_ResNet + 0.40 * P_EfficientNet + 0.25 * P_DIET

The weights were determined empirically based on each model's validation performance. EfficientNet-B0 receives the highest weight because it consistently achieved the highest individual accuracy (97.1%) during validation. ResNet-18 receives the second-highest weight at 35%, and DIET at 25% contributes its diversity benefit without diluting the better-performing models too much.

The final class label is assigned to whichever class has the higher probability in P_final.
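The weighted vote can be sketched in a few lines of plain Python. The weights are the ones given above; the dictionary keys, the function names, and the convention that each probability is the probability of the spoof class are assumptions made for illustration.

```python
# Ensemble weights from the formula above.
WEIGHTS = {"resnet18": 0.35, "efficientnet_b0": 0.40, "diet": 0.25}


def ensemble_spoof_probability(p_spoof):
    """Weighted average of per-model spoof probabilities (soft voting)."""
    return sum(WEIGHTS[name] * p_spoof[name] for name in WEIGHTS)


def ensemble_label(p_spoof):
    """With two classes, picking the higher probability is a 0.5 cut on P_final."""
    return "spoof" if ensemble_spoof_probability(p_spoof) > 0.5 else "live"


# Hypothetical per-model outputs for one image.
example = {"resnet18": 0.90, "efficientnet_b0": 0.80, "diet": 0.40}
p_final = ensemble_spoof_probability(example)  # 0.35*0.90 + 0.40*0.80 + 0.25*0.40
```

In this made-up example DIET leans toward "live" on its own, but the two higher-weighted models pull P_final to 0.735, so the ensemble still labels the image as a spoof.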

Training Procedure and Results

All three networks were jointly trained for 10 epochs using the Adam optimizer with a learning rate of 0.0001 and a batch size of 32. Cross-entropy loss was computed on the averaged predictions across all three models before backpropagation, which means each network is updated in a way that contributes to the collective ensemble goal rather than just optimizing its own individual output.

This joint training approach, where loss is computed on averaged predictions, encourages complementary feature learning. If two of the three models are already confident about a particular type of spoof, the gradient signal pushing the third model to agree is weaker, which allows the third model to learn slightly different features. The result is more diversity in the learned features of the final ensemble than you would get from training each model independently and then combining them.
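The weaker-gradient argument can be checked with a little arithmetic. The sketch below assumes the loss is the negative log of the equally averaged true-class probability, which is one reading of "averaged predictions"; averaging logits instead would give different numbers. It looks only at the gradient with respect to the third model's output probability, not its weights, and the function names are hypothetical.

```python
import math


def joint_loss(p1, p2, p3):
    """Cross-entropy on the equally averaged true-class probability."""
    return -math.log((p1 + p2 + p3) / 3.0)


def joint_grad_wrt_p3(p1, p2, p3):
    """d/dp3 of -log((p1 + p2 + p3) / 3), which simplifies to -1 / (p1 + p2 + p3)."""
    return -1.0 / (p1 + p2 + p3)


def independent_grad(p3):
    """d/dp3 of -log(p3): the gradient if model 3 were trained on its own."""
    return -1.0 / p3


p3 = 0.5  # third model is undecided about the true class
confident_peers = joint_grad_wrt_p3(0.99, 0.99, p3)  # about -0.40
unsure_peers = joint_grad_wrt_p3(0.50, 0.50, p3)     # about -0.67
alone = independent_grad(p3)                         # -2.0
```

Under these assumptions, the pull on the undecided model is smallest when its two peers are already confident, and in every case smaller than it would be under independent training.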

The ensemble reached 96.7% accuracy and an F1-score of 0.961. The equal error rate of 0.013, or 1.3%, is the metric we consider most important here. Equal error rate is measured at the operating point where false acceptance rate equals false rejection rate. In a biometric security context, false acceptances are catastrophic because they let attackers in. A 1.3% EER means that at this operating point roughly 1.3% of spoofed fingerprints are accepted and roughly 1.3% of genuine users are rejected, so the system rarely accepts a spoof while keeping false rejections for genuine users low.
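For readers who have not computed an EER before, here is a minimal sketch of the idea: sweep the decision threshold, compute both error rates at each setting, and report the point where they are closest. The scores are toy values, the function names are hypothetical, and scores are assumed to be spoof probabilities. This is not the evaluation code used for the numbers above.

```python
def far_frr(spoof_scores, live_scores, threshold):
    """Scores are spoof probabilities; a sample is rejected when score >= threshold."""
    far = sum(s < threshold for s in spoof_scores) / len(spoof_scores)  # spoofs accepted
    frr = sum(s >= threshold for s in live_scores) / len(live_scores)   # live rejected
    return far, frr


def equal_error_rate(spoof_scores, live_scores):
    """Approximate the EER by trying every observed score as a threshold."""
    best = None
    for threshold in sorted(set(spoof_scores) | set(live_scores)):
        far, frr = far_frr(spoof_scores, live_scores, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, threshold)
    return best[1], best[2]


# Toy scores, chosen so that one spoof and one live sample overlap.
toy_spoof = [0.9, 0.8, 0.7, 0.3]
toy_live = [0.1, 0.2, 0.4, 0.6]
eer, eer_threshold = equal_error_rate(toy_spoof, toy_live)
```

On the toy data the two error rates meet at a threshold of 0.6, where one of four spoofs is accepted and one of four live samples is rejected, giving an EER of 0.25.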

To put the ensemble's performance in context, compare it against the baselines. SVM with handcrafted features reached 81.4% accuracy and an EER of 18.6%. XGBoost with handcrafted features reached 83.2% accuracy and an EER of 16.8%. A simple CNN reached 90.1% accuracy and an EER of 9.9%. Individual CNN architectures performed better than these baselines, with EfficientNet-B0 at 97.1% accuracy and ResNet-18 at 96.3%. EfficientNet-B0 alone therefore edges out the ensemble's 96.7% on raw accuracy, but the ensemble's EER of 1.3% is dramatically lower than any individual model's EER, which ranged from 2.9% to 4.1%.

The tradeoff is inference time. The ensemble takes 43.8 milliseconds to process a single image, compared to 6.8 milliseconds for a simple CNN and 18.7 milliseconds for EfficientNet-B0 alone. For most fingerprint authentication scenarios this is acceptable. Unlocking a phone or accessing a secure system can accommodate a 44-millisecond wait. For very high-throughput applications like border control with thousands of verifications per hour, optimizing this would be a priority.

Visualizing What the Models Learn

Understanding what features drive the classification decisions turned out to be as interesting as the accuracy numbers themselves.

We extracted embeddings from the EfficientNet-B0 component and visualized them using scatter plots in a reduced feature space. The genuine and spoof classes form distinct clusters with clear separation, which suggests that the model is learning genuinely discriminative features rather than spurious correlations. The correlation coefficient matrix across the top five features shows that the learned features are largely uncorrelated with each other, meaning the embedding carries several distinct signals rather than five copies of the same one.

The confusion matrix for the ensemble shows 1,480 true positives (correctly identified spoof fingerprints), 1,200 true negatives (correctly identified genuine fingerprints), and only 6 false acceptances in the validation set. Those 6 false acceptances are the cases that warrant the most study, because in a security system they represent the failures that matter.
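As a quick worked calculation, the spoof-side counts above translate into a false acceptance rate at the default decision threshold. The function name is hypothetical, and because the number of false rejections is not reported here, only the spoof side is computed.

```python
def false_acceptance_rate(spoofs_detected, spoofs_accepted):
    """Share of spoof presentations that were accepted as genuine."""
    return spoofs_accepted / (spoofs_detected + spoofs_accepted)


# Counts taken from the confusion matrix described above.
far = false_acceptance_rate(spoofs_detected=1480, spoofs_accepted=6)
print(f"{far:.2%}")  # 6 of 1,486 spoof samples
```

That works out to about 0.40% of the spoof images in the validation set.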

When we examined those 6 cases, they were all GAN-generated synthetic fingerprints that happened to share visual characteristics with genuine fingerprints from the training distribution. This suggests that improving the training data diversity, specifically adding more varied GAN-generated fingerprints produced by different generative models, would likely close this gap further.

Robustness to GAN-Generated Spoofs Specifically

The most interesting finding is the ensemble's performance specifically on GAN-generated synthetic fingerprints, as distinct from physical spoof types. The ROC curves broken down by spoof type show that the ensemble maintains high discriminatory ability (AUC close to 1.0) across all spoof types, including GAN-generated ones.

This was not guaranteed. A model trained primarily on silicone and gelatin spoofs might learn texture features that help it detect those specific materials but fail to transfer to AI-generated fingerprints. The fact that all three models together handle GAN-generated spoofs as well as physical ones is a product of the dataset design, which explicitly includes GAN fingerprints in both training and validation, and the ensemble's architectural diversity.

As generative fingerprint models improve over time, this will need to be re-evaluated. The arms race between fingerprint generators and detectors is ongoing. But the ensemble approach means you can update one component model to improve on newer spoof types without redesigning the entire system.

Limitations and Future Work

The inference time is the most pressing practical limitation. At 43.8 milliseconds per verification, the system would need optimization for deployment on embedded hardware like fingerprint scanners with constrained processors. Knowledge distillation, where a smaller student model learns to mimic the ensemble's predictions, could potentially bring inference time down significantly while preserving most of the accuracy.

Adversarial robustness is another open question. The GAN-generated fingerprints in our dataset were produced with standard generative models, not adversarially optimized to evade our specific detector. A motivated attacker who can query the detection system and iteratively refine their generated fingerprints could potentially find inputs that fool the ensemble. Adversarial training, where the training set is augmented with examples specifically crafted to evade the current model, would help address this.

Explainability is missing from the current implementation. Adding Grad-CAM or SHAP visualizations would let you see exactly which regions of the fingerprint image are driving the ensemble's decision, which would be valuable both for debugging failure cases and for understanding what the models have learned.

Finally, we want to deploy this on actual fingerprint scanner hardware rather than processing pre-captured images. The full pipeline from scanner capture to liveness decision is where real-world performance will be tested.

The Broader Point

Fingerprint deepfakes are not a theoretical future threat. The tools to generate them exist and are improving. Physical fingerprint scanning systems designed before GAN-based synthesis was feasible are not adequate for the current threat model. Building detection systems that explicitly include AI-generated spoofs in their training and evaluation pipeline is no longer optional. It is necessary for any biometric authentication system deployed in a high-security context.

The ensemble approach described here is not the final answer, but it represents the right direction: architecturally diverse models, trained on diverse spoof types including synthetic ones, evaluated on the metrics that matter for security applications rather than just overall accuracy.

Technologies: PyTorch, EfficientNet-B0, ResNet-18, CNN, Adam Optimizer, CUDA

Dataset: Custom dataset (6,080 genuine + 7,460 spoof images including GAN-generated synthetic fingerprints)
