Natural Language Processing
Continuing from the previous part, in this part we will look at two different techniques for training a Self-Attention Transformer network to classify a piece of text (a question, in our case) into two different categories, each category containing some number of classes. We will reuse the Encoder and Decoder modules from the previous part to build the transformer network and then train it.
Technique — 1: N Class Head Classification Transformer
Our end goal is to assign two different class names to a given question. Here, we can pass the features extracted from the Encoder-Decoder layers of the Self-Attention Transformer to two fully connected linear layers: one predicts the main class and the other predicts the subclass.
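As a rough sketch of this two-head idea (hypothetical names and dimensions; the real backbone is the Encoder-Decoder stack from the previous part, stubbed here with a plain `TransformerEncoder`):

```python
import torch
import torch.nn as nn

class TwoHeadClassifier(nn.Module):
    """Sketch: shared transformer features feed two linear classification heads."""
    def __init__(self, d_model=128, n_classes=6, n_subclasses=47):
        super().__init__()
        # Stand-in for the Encoder-Decoder stack from the previous part.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.class_head = nn.Linear(d_model, n_classes)        # main-class logits
        self.subclass_head = nn.Linear(d_model, n_subclasses)  # subclass logits

    def forward(self, x):
        feats = self.backbone(x)[:, 0]  # features at the [CLS] position
        return self.class_head(feats), self.subclass_head(feats)
```

Both heads share the same extracted features, so the backbone is trained by the sum of both losses.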


Training the Model
- Dataset class for model training
The data we pre-processed in the first part is a list of dictionaries, with question-tokens, question-class, and question-subclass as the keys of each dictionary, representing the tokenized question, the class of the question, and its subclass. In the Dataset object, we will pad the question-tokens to the maximum token length of a question, which in our case is 100. We will return the padded question-tokens under the key source_seq, and the class and subclass of that padded sequence under the labels class and subclass.
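A minimal sketch of such a Dataset, assuming the dictionary keys described above and a pad id of 0:

```python
import torch
from torch.utils.data import Dataset

MAX_LEN = 100  # max question-token length in our data

class QuestionDataset(Dataset):
    """Pads question-tokens to MAX_LEN and returns the labels alongside."""
    def __init__(self, records, pad_id=0):
        self.records = records
        self.pad_id = pad_id

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        tokens = rec["question-tokens"][:MAX_LEN]
        padded = tokens + [self.pad_id] * (MAX_LEN - len(tokens))
        return {
            "source_seq": torch.tensor(padded, dtype=torch.long),
            "class": torch.tensor(rec["question-class"], dtype=torch.long),
            "subclass": torch.tensor(rec["question-subclass"], dtype=torch.long),
        }
```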

Training Steps
1. Imports, Seeding, and Logging

2. Utilities
The functions under the utility section include a function to select a device on which to load the model and data, a function to count trainable model parameters, a function to compute the performance (accuracy) of the model on a batch, and a function to load some pickle files.
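These utilities might be sketched as follows (names are illustrative):

```python
import pickle
import torch

def get_device():
    """Select the GPU if one is available, otherwise fall back to the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

def count_parameters(model):
    """Count the trainable parameters of a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def batch_accuracy(logits, labels):
    """Fraction of correct predictions in a batch."""
    preds = logits.argmax(dim=-1)
    return (preds == labels).float().mean().item()

def load_pickle(path):
    """Load a pickled object (e.g. the label-to-index mappings)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```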

3. Data and Tokenizer loading

4. Model initialization
We will initialize the model parameters such as the vocabulary size, the padding id, the class label id ([CLS]), the number of classes in each category, the maximum sequence length in the data, the batch size to train with, and the number of workers to use during training.
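Concretely, that configuration might look like this (the values below are placeholders; the real ones come from our tokenizer and pre-processed data):

```python
# Hypothetical configuration values for model initialization.
VOCAB_SIZE = 30000   # size of the tokenizer vocabulary
PAD_ID = 0           # padding token id
CLS_ID = 101         # [CLS] token id
NUM_CLASSES = 6      # number of main classes
NUM_SUBCLASSES = 47  # number of subclasses
MAX_SEQ_LEN = 100    # maximum padded sequence length
BATCH_SIZE = 32      # batch size for training
NUM_WORKERS = 2      # dataloader workers
```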

5. Dataloader and Optimizer

6. Train and Save at best accuracy
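The training loop can be sketched as below: both heads are trained jointly by summing their cross-entropy losses, and the model is checkpointed whenever accuracy improves. The batch keys and the two-logit model output are assumptions matching the Dataset described earlier.

```python
import torch
import torch.nn as nn

def train(model, loader, optimizer, device, epochs=5, save_path="best_model.pt"):
    """Train the two classification heads jointly; save at best accuracy."""
    criterion = nn.CrossEntropyLoss()
    best_acc = 0.0
    for epoch in range(epochs):
        model.train()
        correct, total = 0, 0
        for batch in loader:
            seq = batch["source_seq"].to(device)
            cls = batch["class"].to(device)
            sub = batch["subclass"].to(device)
            cls_logits, sub_logits = model(seq)
            # Sum the losses of both heads so the backbone learns both tasks.
            loss = criterion(cls_logits, cls) + criterion(sub_logits, sub)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            correct += (cls_logits.argmax(-1) == cls).sum().item()
            total += cls.size(0)
        acc = correct / total
        if acc > best_acc:  # checkpoint only when accuracy improves
            best_acc = acc
            torch.save(model.state_dict(), save_path)
    return best_acc
```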

Training Logs

Inference using the trained model
As the subclass-name-to-index mapping has around 47 values, I have saved it in a pickle file; everything will be available in the GitHub repo.
- Load classname details

- Load subclass details, and the index-to-class and index-to-subclass mappings

- Prediction Function
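A prediction function for this technique might look like the following sketch (`tokenizer.encode` and the mapping dicts are assumed helper names based on the artifacts loaded above):

```python
import torch

def predict(model, tokenizer, question, idx_to_class, idx_to_subclass,
            device, max_len=100, pad_id=0):
    """Tokenize, pad, run the model once, and map both argmaxes to names."""
    model.eval()
    tokens = tokenizer.encode(question)[:max_len]
    tokens = tokens + [pad_id] * (max_len - len(tokens))
    seq = torch.tensor([tokens], dtype=torch.long, device=device)
    with torch.no_grad():
        cls_logits, sub_logits = model(seq)
    return (idx_to_class[cls_logits.argmax(-1).item()],
            idx_to_subclass[sub_logits.argmax(-1).item()])
```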

With more data about the questions and their class and subclass, or with a pre-trained transformer model, one can quickly get better performance by fine-tuning for a few epochs. But for certain use cases it's hard to find any pre-trained model, and sometimes one needs to train a transformer on the task from scratch using a bulk of data, when there isn't enough time to go through pre-training followed by fine-tuning.
Technique — 2: One Encoder N Decoder Classification Transformer
Here, we will look at one more way to leverage the Self-Attention Transformer model to classify a piece of text into two different categories, each category containing some number of classes. We will have one encoder and two decoders: the encoder extracts the question features and passes them to the first decoder along with the [CLS] index, and then passes the same encoder features to the second decoder along with the class index for that question. The features extracted from both decoders are then passed through two fully connected layers; one produces the class logits and the other the subclass logits.
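A sketch of this one-encoder, two-decoder design follows; dimensions, layer counts, and names are illustrative rather than the article's exact values:

```python
import torch
import torch.nn as nn

class OneEncoderTwoDecoder(nn.Module):
    """One shared encoder; Decoder-1 is queried with [CLS], Decoder-2 with
    the class index, and each decoder feeds its own classification head."""
    def __init__(self, vocab_size=1000, d_model=128, n_classes=6, n_subclasses=47):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), 1)
        self.decoder1 = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 1)
        self.decoder2 = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 1)
        self.class_head = nn.Linear(d_model, n_classes)
        self.subclass_head = nn.Linear(d_model, n_subclasses)

    def forward(self, question_ids, cls_ids, class_ids):
        mem = self.encoder(self.emb(question_ids))          # question features
        f1 = self.decoder1(self.emb(cls_ids), mem)[:, 0]    # [CLS] query only
        f2 = self.decoder2(self.emb(class_ids), mem)[:, 0]  # class-label query
        return self.class_head(f1), self.subclass_head(f2)
```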
The Classification Transformer model will look something like this,

One can call this a One Encoder, N-Decoder approach, useful when a sub-category falls under a parent class category. While the previous model (Technique 1) had only one Decoder layer and the input to the model was just the question, in this approach one needs to provide both the question text and the main class label. The main class label is not shown to the first decoder, only to the second decoder.
Let's take an example. Suppose the question is When was Google founded?
For this question the class is NUMERIC and the subclass is date.
The inputs to our model will be the question text (When was Google founded?) and the class label (NUMERIC). The class label (NUMERIC) won't be passed to Decoder-1, as the Decoder-1 features are used to predict the class name. The class label (NUMERIC) will be passed only to Decoder-2, so that the second decoder has some knowledge about which subclass categories can fall under each class category. It's quite convoluted, I know 😅.

The Encoder and Decoder modules remain the same, but as seen above, the Classification Transformer will contain two Decoders. You can check the previous part (Part 1) for the Encoder and Decoder code.
Classification Transformer

The Dataset object will also remain the same from the first technique above.
Training the Model
- Import, Seeding, and Logging

- Utilities

- Data and Tokenizer

- Model Initialization

- Training

The training logs would look something like this:

Inference
As discussed above, the model inputs are the question text and the class label. At inference time, we won't have the class label for the question text, so we first predict it: the features obtained from Decoder-1 are passed to the fully connected layer for Decoder-1 to obtain the class label. After obtaining the class label, the question text and class label are passed to Decoder-2, and the features obtained from it are passed to the fully connected layer for Decoder-2 to obtain the subclass label. Let's look at that in code.
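This two-stage procedure might be sketched as follows; the helper names (`tokenizer.encode`, the label mapping dicts) are assumptions based on the artifacts described earlier:

```python
import torch

def predict_two_stage(model, tokenizer, question, idx_to_class, idx_to_subclass,
                      class_to_idx, device, cls_id=1, max_len=100, pad_id=0):
    """Stage 1 predicts the class from Decoder-1; stage 2 feeds that class
    label to Decoder-2 to predict the subclass."""
    model.eval()
    tokens = tokenizer.encode(question)[:max_len]
    tokens = tokens + [pad_id] * (max_len - len(tokens))
    q = torch.tensor([tokens], dtype=torch.long, device=device)
    cls_in = torch.tensor([[cls_id]], dtype=torch.long, device=device)
    with torch.no_grad():
        # Stage 1: the class label is unknown, so feed a placeholder to
        # Decoder-2 and use only the Decoder-1 (class) logits.
        class_logits, _ = model(q, cls_in, cls_in)
        class_name = idx_to_class[class_logits.argmax(-1).item()]
        # Stage 2: pass the predicted class label to Decoder-2 for the subclass.
        class_in = torch.tensor([[class_to_idx[class_name]]],
                                dtype=torch.long, device=device)
        _, sub_logits = model(q, cls_in, class_in)
    return class_name, idx_to_subclass[sub_logits.argmax(-1).item()]
```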

Some inference examples



With this part, we come to the end of this blog series on text classification using Self-Attention Transformers, covering different techniques for classification when there are N categories, each with several classes.
The code for all the parts is available in this GitHub repo.
You can check out a live version for the models here.