Intro

Good machine learning models require massive amounts of data and many GPUs or TPUs to train. And most of the time they can only perform a specific task.

Universities and large companies sometimes release their trained models. But it may well be that you want to develop a machine learning application and there is no available model suited to your task.

But don't worry, you don't have to gather massive amounts of data and spend tons of cash to develop your own model. You can use transfer learning instead. This decreases the training time, and you can achieve good performance with much less data.

What is transfer learning?

In transfer learning, we use the knowledge a model has gathered while training on a specific task to solve a different but related task. The model can profit from what it has learned on the previous task and learn the new one faster.

Let's look at an example and say you want to detect dogs in images. On the internet you find a model that can detect cats. Since this is a similar enough task, you shoot a few pictures of your dog and retrain the model to detect dogs. Well, maybe it will then be biased and only recognize your pet, but I think you get the point😉.

Maybe the model has learned to recognize cats by their fur, or by the fact that they have eyes, and this would also be very helpful in recognizing a dog.

There are actually two types of transfer learning: feature extraction and fine-tuning.

In general both of these methods follow the same procedure:

  • Initialize the pre-trained model (the model from which we want to learn)
  • Reshape the final layers to have the same number of outputs as the number of classes in the new dataset
  • Define which layers we want to update
  • Train on new dataset

Feature extraction

Let's consider a convolutional neural network architecture with convolutional filters, a dense layer, and one output neuron.

(Image by author)

The network gets trained to predict the probability that there is a cat in an image. We need a big dataset (images with and without cats), and training takes a long time. This step is called "pre-training".

(Image by author)

Then comes the fun part. We train the network again, but this time with a small image dataset containing dogs. During training, all layers except the output layer get "frozen", meaning we do not update them. After training, the network outputs the probability that a dog is visible in the image. This training procedure takes much less time than the pre-training.

(Image by author)

Optionally, we can also "unfreeze" the last two layers, namely the output layer and the dense layer. This depends on the amount of data we have: with less data, we might consider training only the last layer.

Fine-tuning

In fine-tuning, we start from a pre-trained model and update all of the weights during training.

(Image by author)

Transfer learning example in PyTorch

I will use the Cats vs. Dogs dataset from Kaggle. The dataset can be found here. You can always use a different dataset.

The task here is a little bit different from my example above. The model has to recognize which images show dogs and which show cats. For the code to work, you will have to organize your data in the following structure:

(Image by author)

You can find a more detailed walkthrough of Cats vs. Dogs here.

Setup

We start by importing the needed libraries.

We check for a CUDA-compatible GPU; if none is available, we will use the CPU.
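A sketch of the usual device check:

```python
import torch

# Select a CUDA-capable GPU if one is available, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```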

Then we load the pretrained ResNet50 from torchvision.

Data augmentation is done by applying random transformations to the images, which helps prevent overfitting.

We create the data loaders, which will load the images from disk in batches.

We create the learning rate scheduler, which will modify the learning rate during training. Alternatively, you can use the Adam optimizer, which adapts the learning rates automatically and often does not require a scheduler.

Optimizer for feature extraction

Here gradients will be calculated only for the last layer and therefore only the last layer will be trained.

Optimizer for fine tuning

Here all layers will be trained.

Training

Let's define the training loop.
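A condensed version of the usual PyTorch train/validation loop (the function name `train_model` and its signature are my choice):

```python
import torch


def train_model(model, dataloaders, dataset_sizes, criterion, optimizer,
                scheduler=None, num_epochs=5, device="cpu"):
    """Train on dataloaders['train'], evaluate on dataloaders['val']."""
    model = model.to(device)
    for epoch in range(num_epochs):
        for phase in ["train", "val"]:
            model.train() if phase == "train" else model.eval()
            running_loss, running_corrects = 0.0, 0
            for inputs, labels in dataloaders[phase]:
                inputs, labels = inputs.to(device), labels.to(device)
                optimizer.zero_grad()
                # Compute gradients only in the training phase
                with torch.set_grad_enabled(phase == "train"):
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)
                    if phase == "train":
                        loss.backward()
                        optimizer.step()
                running_loss += loss.item() * inputs.size(0)
                running_corrects += (outputs.argmax(dim=1) == labels).sum().item()
            if phase == "train" and scheduler is not None:
                scheduler.step()
            print(f"epoch {epoch} {phase}: "
                  f"loss {running_loss / dataset_sizes[phase]:.4f}, "
                  f"acc {running_corrects / dataset_sizes[phase]:.4f}")
    return model
```

The same loop works for both variants; only the optimizer you pass in differs (head-only parameters for feature extraction, all parameters for fine-tuning).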

Finally, we can train our model.

Either using feature extraction:

Or using fine tuning:

Don't be too proud to use transfer learning

When I recommend transfer learning to people for their ML projects, they sometimes refuse and would rather train a model from scratch. But no one should be ashamed of using transfer learning, because:

  • Training neural networks uses energy and therefore increases global carbon emissions. Transfer learning saves our planet by reducing training time.
  • When training data is scarce, transfer learning might be the only option for your model to perform well. In computer vision there is often a lack of training data.

Conclusion

Transfer learning is a handy tool for the modern data scientist. You can use models others have pre-trained and perform transfer learning on them to save time and compute resources and to reduce the amount of data required for training.

Datasets

Cats vs. Dogs dataset

The dataset can be found at:

https://www.tensorflow.org/datasets/catalog/cats_vs_dogs

It is licensed under the Creative Commons Attribution 4.0 License.

Sources

What Is Deep Transfer Learning and Why Is It Becoming So Popular?

Andrew Ng on transfer learning

Lambda Labs: Demystifying GPT-3

Carbon emissions of deep learning

DALL·E: Creating Images from Text

Want to connect and support me?

  • LinkedIn: https://www.linkedin.com/in/vincent-m%C3%BCller-6b3542214/
  • Facebook: https://www.facebook.com/profile.php?id=100072095823739
  • Twitter: https://twitter.com/Vincent02770108
  • Medium: https://medium.com/@Vincent.Mueller
  • Become a Medium member and support me (part of your membership fees go directly to me): https://medium.com/@Vincent.Mueller/membership