Natural Language Processing
Continuing from the previous part, in this part we will look at two different techniques for training a Self-Attention Transformer network to classify a piece of text (a question, in our case) into two different categories, each category containing some number of classes. We will reuse the Encoder and Decoder modules from the previous part to build the transformer network and then train it.
Technique — 1: N Class Head Classification Transformer
Our end goal is to assign two different class names to a given question. Here, we can pass the features extracted from the Encoder-Decoder layers of the Self-Attention Transformer to two fully connected linear layers: one predicts the main class and the other predicts the subclass.
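As a rough sketch of this two-head idea (hypothetical names and dimensions; the real backbone is the Encoder-Decoder stack from the previous part, stubbed here with a plain `TransformerEncoder`):

```python
import torch
import torch.nn as nn

class TwoHeadClassifier(nn.Module):
    """Sketch: shared transformer features feed two linear classification heads."""
    def __init__(self, d_model=128, n_classes=6, n_subclasses=47):
        super().__init__()
        # Stand-in for the Encoder-Decoder stack from the previous part.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.class_head = nn.Linear(d_model, n_classes)        # main-class logits
        self.subclass_head = nn.Linear(d_model, n_subclasses)  # subclass logits

    def forward(self, x):
        feats = self.backbone(x)[:, 0]  # features at the [CLS] position
        return self.class_head(feats), self.subclass_head(feats)
```

Both heads share the same extracted features, so the backbone is trained by the sum of both losses.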


Training the Model
- Dataset class for model training
The data we pre-processed in the first part is a list of dictionaries, with question-tokens, question-class, and question-subclass as the keys of each dictionary, representing the tokenized question, the class of the question, and its subclass. In the Dataset object, we will pad the question-tokens to the maximum token length of a question, which in our case is 100. We will return the padded question-tokens under the key source_seq, and the class and subclass of that padded sequence under the labels class and subclass.
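A minimal sketch of such a Dataset, assuming the dictionary keys described above and a pad id of 0:

```python
import torch
from torch.utils.data import Dataset

MAX_LEN = 100  # max question-token length in our data

class QuestionDataset(Dataset):
    """Pads question-tokens to MAX_LEN and returns the labels alongside."""
    def __init__(self, records, pad_id=0):
        self.records = records
        self.pad_id = pad_id

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        tokens = rec["question-tokens"][:MAX_LEN]
        padded = tokens + [self.pad_id] * (MAX_LEN - len(tokens))
        return {
            "source_seq": torch.tensor(padded, dtype=torch.long),
            "class": torch.tensor(rec["question-class"], dtype=torch.long),
            "subclass": torch.tensor(rec["question-subclass"], dtype=torch.long),
        }
```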

Training Steps
1. Imports, Seeding, and Logging

2. Utilities
The functions under the utility section include a function to select a device on which to load the model and data, a function to count trainable model parameters, a function to compute the performance (accuracy) of the model on a batch, and a function to load some pickle files.
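These utilities might be sketched as follows (names are illustrative):

```python
import pickle
import torch

def get_device():
    """Select the GPU if one is available, otherwise fall back to the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

def count_parameters(model):
    """Count the trainable parameters of a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def batch_accuracy(logits, labels):
    """Fraction of correct predictions in a batch."""
    preds = logits.argmax(dim=-1)
    return (preds == labels).float().mean().item()

def load_pickle(path):
    """Load a pickled object (e.g. the label-to-index mappings)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```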

3. Data and Tokenizer loading

4. Model initialization
We will initialize the model parameters such as the vocabulary size, the padding id, the class label id ([CLS]), the number of classes in each category, the maximum sequence length in the data, the batch size to train with, and the number of workers to use during training.
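Concretely, that configuration might look like this (the values below are placeholders; the real ones come from our tokenizer and pre-processed data):

```python
# Hypothetical configuration values for model initialization.
VOCAB_SIZE = 30000   # size of the tokenizer vocabulary
PAD_ID = 0           # padding token id
CLS_ID = 101         # [CLS] token id
NUM_CLASSES = 6      # number of main classes
NUM_SUBCLASSES = 47  # number of subclasses
MAX_SEQ_LEN = 100    # maximum padded sequence length
BATCH_SIZE = 32      # batch size for training
NUM_WORKERS = 2      # dataloader workers
```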

5. Dataloader and Optimizer

6. Train and Save at best accuracy
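The training loop can be sketched as below: both heads are trained jointly by summing their cross-entropy losses, and the model is checkpointed whenever accuracy improves. The batch keys and the two-logit model output are assumptions matching the Dataset described earlier.

```python
import torch
import torch.nn as nn

def train(model, loader, optimizer, device, epochs=5, save_path="best_model.pt"):
    """Train the two classification heads jointly; save at best accuracy."""
    criterion = nn.CrossEntropyLoss()
    best_acc = 0.0
    for epoch in range(epochs):
        model.train()
        correct, total = 0, 0
        for batch in loader:
            seq = batch["source_seq"].to(device)
            cls = batch["class"].to(device)
            sub = batch["subclass"].to(device)
            cls_logits, sub_logits = model(seq)
            # Sum the losses of both heads so the backbone learns both tasks.
            loss = criterion(cls_logits, cls) + criterion(sub_logits, sub)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            correct += (cls_logits.argmax(-1) == cls).sum().item()
            total += cls.size(0)
        acc = correct / total
        if acc > best_acc:  # checkpoint only when accuracy improves
            best_acc = acc
            torch.save(model.state_dict(), save_path)
    return best_acc
```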

Training Logs

Inference using the trained model
As the subclass-name-to-index mapping has around 47 values, I have saved it in a pickle file; everything will be available in the GitHub repo.
- Load classname details

- Load subclass details, and the index-to-class and index-to-subclass mappings

- Prediction Function
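A prediction function for this technique might look like the following sketch (`tokenizer.encode` and the mapping dicts are assumed helper names based on the artifacts loaded above):

```python
import torch

def predict(model, tokenizer, question, idx_to_class, idx_to_subclass,
            device, max_len=100, pad_id=0):
    """Tokenize, pad, run the model once, and map both argmaxes to names."""
    model.eval()
    tokens = tokenizer.encode(question)[:max_len]
    tokens = tokens + [pad_id] * (max_len - len(tokens))
    seq = torch.tensor([tokens], dtype=torch.long, device=device)
    with torch.no_grad():
        cls_logits, sub_logits = model(seq)
    return (idx_to_class[cls_logits.argmax(-1).item()],
            idx_to_subclass[sub_logits.argmax(-1).item()])
```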

With more data about the questions and their class and subclass, or with a pre-trained transformer model, one can quickly get better performance by fine-tuning for a few epochs. But for certain use cases it's hard to find any pre-trained model, and sometimes one needs to train a transformer on the task from scratch using a bulk of data, when there isn't enough time to go through pre-training followed by fine-tuning.
Technique — 2: One Encoder N Decoder Classification Transformer
Here, we will look at one more way to leverage the Self-Attention Transformer model to classify a piece of text into two different categories, each category containing some number of classes. We will have one encoder and two decoders: the encoder extracts the question features and passes them to the first decoder along with the [CLS] index, and then passes the same encoder features to the second decoder along with the class index for that question. The features extracted from both decoders are then passed through two fully connected layers; one produces the class logits and the other the subclass logits.
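A sketch of this one-encoder, two-decoder design follows; dimensions, layer counts, and names are illustrative rather than the article's exact values:

```python
import torch
import torch.nn as nn

class OneEncoderTwoDecoder(nn.Module):
    """One shared encoder; Decoder-1 is queried with [CLS], Decoder-2 with
    the class index, and each decoder feeds its own classification head."""
    def __init__(self, vocab_size=1000, d_model=128, n_classes=6, n_subclasses=47):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), 1)
        self.decoder1 = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 1)
        self.decoder2 = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 1)
        self.class_head = nn.Linear(d_model, n_classes)
        self.subclass_head = nn.Linear(d_model, n_subclasses)

    def forward(self, question_ids, cls_ids, class_ids):
        mem = self.encoder(self.emb(question_ids))          # question features
        f1 = self.decoder1(self.emb(cls_ids), mem)[:, 0]    # [CLS] query only
        f2 = self.decoder2(self.emb(class_ids), mem)[:, 0]  # class-label query
        return self.class_head(f1), self.subclass_head(f2)
```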
The Classification Transformer model will look something like this,

One can call this a One Encoder, N-Decoder approach, useful when a sub-category falls under a parent class category. While the previous model (Technique 1) had only one Decoder layer and the input to the model was just the question, in this approach one needs to provide both the question text and the main class label. The main class label is not shown to the first decoder, only to the second decoder.
Let's take an example. Suppose the question is When was Google founded?
For this question the class is NUMERIC and the subclass is date.
The inputs to our model will be the question text (When was Google founded?) and the class label (NUMERIC). The class label (NUMERIC) won't be passed to Decoder-1, as the Decoder-1 features are used to predict the class name. The class label (NUMERIC) will be passed only to Decoder-2, so that the second decoder has some knowledge about which subclass categories can fall under each class category. It's quite convoluted, I know 😅.

The Encoder and Decoder modules remain the same, but as seen above, the Classification Transformer will contain two Decoders. You can check the previous part (Part 1) for the Encoder and Decoder code.
Classification Transformer

The Dataset object will also remain the same from the first technique above.
Training the Model
- Import, Seeding, and Logging

- Utilities

- Data and Tokenizer

- Model Initialization

- Training

The training logs would look something like this:

Inference
As discussed above, the model inputs are the question text and the class label. At inference time, we won't have the class label for the question text, so we first predict it: the features obtained from Decoder-1 are passed to the fully connected layer for Decoder-1 to obtain the class label. After obtaining the class label, the question text and class label are passed to Decoder-2, and the features obtained from it are passed to the fully connected layer for Decoder-2 to obtain the subclass label. Let's look at that in code.
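This two-stage procedure might be sketched as follows; the helper names (`tokenizer.encode`, the label mapping dicts) are assumptions based on the artifacts described earlier:

```python
import torch

def predict_two_stage(model, tokenizer, question, idx_to_class, idx_to_subclass,
                      class_to_idx, device, cls_id=1, max_len=100, pad_id=0):
    """Stage 1 predicts the class from Decoder-1; stage 2 feeds that class
    label to Decoder-2 to predict the subclass."""
    model.eval()
    tokens = tokenizer.encode(question)[:max_len]
    tokens = tokens + [pad_id] * (max_len - len(tokens))
    q = torch.tensor([tokens], dtype=torch.long, device=device)
    cls_in = torch.tensor([[cls_id]], dtype=torch.long, device=device)
    with torch.no_grad():
        # Stage 1: the class label is unknown, so feed a placeholder to
        # Decoder-2 and use only the Decoder-1 (class) logits.
        class_logits, _ = model(q, cls_in, cls_in)
        class_name = idx_to_class[class_logits.argmax(-1).item()]
        # Stage 2: pass the predicted class label to Decoder-2 for the subclass.
        class_in = torch.tensor([[class_to_idx[class_name]]],
                                dtype=torch.long, device=device)
        _, sub_logits = model(q, cls_in, class_in)
    return class_name, idx_to_subclass[sub_logits.argmax(-1).item()]
```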

Some inference examples



With this part, we come to the end of this blog series on text classification using Self-Attention Transformers, covering different techniques for classification when there are N categories, each with several classes.
The code for all the parts is available in this GitHub repo.
You can check out a live version for the models here.