Overview

CLIP (Contrastive Language–Image Pre-training) is a neural network introduced by OpenAI in early 2021. There is a detailed paper describing it, which is worth reading if you are interested. OpenAI claims that CLIP's performance is much more representative of how a model will fare on datasets that measure accuracy in different, non-ImageNet settings.

Image from CLIP

ResNext is a simple, highly modularized network architecture for image classification. The network is constructed by repeating a building block that aggregates a set of transformations with the same topology.

Without going into any theory, I am going to use both models and run some image classification tests in a Jupyter notebook on Google Colab.

Google Colab

From Google Colab, open the notebook available in this repository.

Google Colab — Open Notebook from GitHub

For better performance, ensure that you change the runtime type to GPU or TPU under Runtime -> Change runtime type in Google Colab.

Google Colab — GPU Runtime
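To confirm that the notebook actually sees the accelerator after switching the runtime type, a quick check with PyTorch (pre-installed on Colab) can be run in a cell; this is a minimal sketch, not part of the original notebook:

```python
import torch

# True on a Colab GPU runtime; False on the default CPU runtime
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Name of the GPU assigned to the VM, e.g. "Tesla T4"
    print("Device:", torch.cuda.get_device_name(0))
```

If this prints `False` on a GPU runtime, the runtime usually just needs to be restarted after changing the type.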

Project and Library Setup

Let's install the Python libraries and clone the repository to download additional Python files and the images that will be used for the testing.

Since the Colab virtual machine comes with PyTorch and cudatoolkit pre-installed, I will not be installing them again.

Project and Library Setup

As you can see from the screenshot above, the current CUDA version is 10.1.
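The setup cell itself is shown only as a screenshot. Assuming the standard installation instructions from OpenAI's CLIP repository, it likely resembles the following; the repository URL for the article's own files is a placeholder, not the actual value:

```shell
# Install CLIP's dependencies and the CLIP package itself
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git

# Clone the article's repository for the helper files and test images
# (replace <repo-url> with the actual repository URL)
# git clone <repo-url>
```

PyTorch and cudatoolkit are skipped here because, as noted above, Colab ships with them pre-installed.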

Import Libraries and Pre-trained Models

Let's import the required libraries used by the notebook.

For the CLIP pre-trained model, I download it using the code snippet provided in OpenAI's CLIP repository.

For the ResNext pre-trained model, I use the model from PyTorch Hub.

Import Libraries and Pre-trained Models
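The loading code appears only in the screenshot, but a sketch using the standard `clip.load` and `torch.hub.load` APIs might look like this; the specific model variants (`ViT-B/32`, `resnext101_32x8d`) are assumptions based on common choices, not confirmed from the notebook:

```python
def load_models(device=None):
    """Download the CLIP and ResNext pre-trained models.

    Imports are kept inside the function so merely defining it
    does not trigger the (large) weight downloads.
    """
    import torch
    import clip  # installed via: pip install git+https://github.com/openai/CLIP.git

    device = device or ("cuda" if torch.cuda.is_available() else "cpu")

    # CLIP model plus its matching image preprocessing transform
    clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

    # ResNext from PyTorch Hub, pre-trained on ImageNet
    resnext = torch.hub.load("pytorch/vision:v0.6.0",
                             "resnext101_32x8d", pretrained=True)
    resnext.to(device).eval()

    return clip_model, clip_preprocess, resnext
```

Calling `load_models()` once at the top of the notebook keeps both models in memory for the prediction methods that follow.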

Prediction using CLIP and ResNext on ImageNet Classes

I implemented two methods, predict_clip and predict_resnext, which classify an image against the 1,000 ImageNet classes. Both methods return the top 5 most probable classes.

Prediction Methods in CLIP and ResNext
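The two methods are only visible as a screenshot, but a sketch of how they might work, assuming the standard CLIP and torchvision APIs, is shown below; `top5` is a hypothetical helper introduced here for illustration:

```python
import torch

def top5(probs, class_names):
    """Return the 5 most probable (class, probability) pairs."""
    values, indices = torch.as_tensor(probs).topk(5)
    return [(class_names[i], float(v)) for v, i in zip(values, indices)]

def predict_resnext(model, preprocess, image, class_names):
    """Classify a PIL image with ResNext over the 1000 ImageNet classes."""
    x = preprocess(image).unsqueeze(0)              # add batch dimension
    with torch.no_grad():
        probs = model(x).softmax(dim=-1)[0]
    return top5(probs, class_names)

def predict_clip(model, preprocess, image, class_names, device="cpu"):
    """Zero-shot classification: score the image against a text prompt
    for each ImageNet class name."""
    import clip  # lazy import; see the setup section
    text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    x = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(x, text)
        probs = logits_per_image.softmax(dim=-1)[0]
    return top5(probs, class_names)
```

The key difference is visible here: ResNext has a fixed 1,000-way output layer, while CLIP scores the image against free-form text prompts, one per class name.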

Image Classification Testing

Using a variety of images, I tested both prediction methods.

Using a simple panda image, both models predict the class correctly.

Prediction made by CLIP and ResNext

And here are the results for the other images.

CLIP and ResNext Image Classification Test

From this quick test and the observed results, CLIP appears to make better predictions for unseen object categories. However, CLIP takes much longer to produce a prediction.

Also check out the following articles to see how we can host machine learning models, including ResNext, using Streamlit and FastAPI.

And more articles below on practical usages of machine learning.