As part of my final project for my Master's in Data Science, I trained, tested, and deployed my own Natural Language Processing (NLP) model for sentiment analysis.
You can find the project's results showcased on the following Shiny dashboard, along with the code and research paper on GitHub. This article will focus on creating the NLP model that analyzes the sentiment.
What you'll need
- An IDE that supports Python; I will be using VS Code
- Install and import the necessary libraries
import pandas as pd
import numpy as np
import joblib
import spacy
from spacy.training.example import Example
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

- Text training data with categories
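Here is a minimal sketch of what that training data might look like. The example rows are invented for illustration, but the 'text' and 'cats' column names match what the training function below expects.

```python
import pandas as pd

# Illustrative training data: each row pairs a text with a dict of
# category scores. The rows here are made up for the example.
train_data = pd.DataFrame({
    "text": [
        "I absolutely loved this product!",
        "Terrible experience, would not recommend.",
        "It arrived on time.",
    ],
    "cats": [
        {"positive": 1.0, "negative": 0.0, "neutral": 0.0},
        {"positive": 0.0, "negative": 1.0, "neutral": 0.0},
        {"positive": 0.0, "negative": 0.0, "neutral": 1.0},
    ],
})

print(list(train_data.columns))  # ['text', 'cats']
```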

Creating the Model and Preparing Training Data
def train_and_evaluate_model(nlp, train_data, num_epochs=10):
    # Check that the training data is a DataFrame
    if not isinstance(train_data, pd.DataFrame):
        raise TypeError("train_data must be a pandas DataFrame")
    # Check that the required columns exist
    if 'text' not in train_data.columns or 'cats' not in train_data.columns:
        raise ValueError("train_data must have column labels 'text' and 'cats'")

To ensure that the model is reusable, set up some error catching at the beginning. The function only accepts training data in the form of a DataFrame, and it checks that the 'text' and 'cats' columns exist. This ensures the data is properly formatted before continuing.
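To illustrate these guards, here is a standalone helper (hypothetical, mirroring the checks above) and what happens when the input is not a DataFrame:

```python
import pandas as pd

def validate_train_data(train_data):
    # Standalone mirror of the guards at the top of
    # train_and_evaluate_model, for illustration only.
    if not isinstance(train_data, pd.DataFrame):
        raise TypeError("train_data must be a pandas DataFrame")
    if 'text' not in train_data.columns or 'cats' not in train_data.columns:
        raise ValueError("train_data must have column labels 'text' and 'cats'")

# A plain list fails the type check before any training happens
try:
    validate_train_data(["not", "a", "dataframe"])
except TypeError as e:
    print(e)  # train_data must be a pandas DataFrame

# A properly shaped frame passes silently
validate_train_data(pd.DataFrame({"text": ["hi"], "cats": [{}]}))
```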
    # Split the data
    x_train, x_test = train_test_split(train_data, test_size=0.3, random_state=42)
    x_hold, x_val = train_test_split(x_test, test_size=0.5)

First, the training function takes in the training data and splits it into three different sets: a training set, a validation set, and a final testing set.
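To see the resulting proportions, here is a small sketch on a 100-row dummy frame: 30% is held out, and that holdout is then halved, giving a 70/15/15 split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Dummy 100-row frame purely to show the split sizes
df = pd.DataFrame({"text": [f"sample {i}" for i in range(100)],
                   "cats": [{} for _ in range(100)]})

x_train, x_test = train_test_split(df, test_size=0.3, random_state=42)
x_hold, x_val = train_test_split(x_test, test_size=0.5, random_state=42)

print(len(x_train), len(x_hold), len(x_val))  # 70 15 15
```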
    if 'textcat' not in nlp.pipe_names:
        textcat = nlp.add_pipe("textcat")
        for label in ["positive", "negative", "neutral"]:
            textcat.add_label(label)
    else:
        textcat = nlp.get_pipe("textcat")

Next, the function checks whether a text categorization component already exists in the NLP pipeline. If it does not, a new "textcat" component is added to the pipeline along with the positive, negative, and neutral labels. If the component already exists, it is simply retrieved from the pipeline.
    optimizer = nlp.initialize()

With the training sets prepared and the text categorization component in place, we prepare the spaCy model by setting up an optimizer and initializing the weights of all the components in the NLP pipeline. This is a crucial step because it ensures that every part of the model is properly initialized and ready to be updated during the training iterations.
Training the Model
    for epoch in range(num_epochs):
        # Reshuffle each epoch, seeding with the epoch number so the
        # order actually changes from one epoch to the next
        shuffled_data = x_train.sample(frac=1, random_state=epoch).reset_index(drop=True)
        for i, sample in shuffled_data.iterrows():
            doc = nlp.make_doc(sample['text'])
            # 'cats' may be stored as a dict or as its string representation
            cats = sample['cats'] if isinstance(sample['cats'], dict) else eval(sample['cats'])
            gold = Example.from_dict(doc, {"cats": cats})
            nlp.update([gold], sgd=optimizer, drop=0.5)

The code iterates for the number of epochs specified above. For each epoch, the loop shuffles the training data randomly and prepares each sample so that it is in the correct format. After those steps, the function creates the gold standard for training. In machine learning, the gold standard refers to high-quality, manually labeled data that serves as the ground truth for training and evaluation.
This provides the correct answers the model should learn to predict, and it is used to calculate the loss (error) during training, which guides the model's learning process. The gold standard is structured using Example.from_dict(): it takes the input doc created from the text, combines it with the expected output cats, and produces a training example.
The gold standard allows the spaCy model to compare its predictions with the correct answers during training. From there, nlp.update() is called to compare the model's predictions with the gold standard; the difference between the two is used to calculate the loss and update the model's parameters.
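The "cats" annotation handed to Example.from_dict() is just a dict mapping each label to a score. When the DataFrame was loaded from a CSV, that dict may arrive as a string, which is what the dict-or-string check in the loop handles. Here is a small sketch of that parsing step, using ast.literal_eval as a safer stand-in for the eval() call in the code:

```python
import ast

def parse_cats(raw):
    # The 'cats' cell may already be a dict, or a string like
    # "{'positive': 1.0, ...}" if the data came from a CSV.
    # ast.literal_eval only accepts literals, unlike eval().
    return raw if isinstance(raw, dict) else ast.literal_eval(raw)

cats = parse_cats("{'positive': 1.0, 'negative': 0.0, 'neutral': 0.0}")
print(cats["positive"])  # 1.0
```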
Reporting the Model Results and Deploying the Trained Model
def report_of_model(nlp, data, stage):
    y_true = []
    y_pred = []
    for i, sample in data.iterrows():
        doc = nlp(sample['text'])
        true_cats = sample['cats'] if isinstance(sample['cats'], dict) else eval(sample['cats'])
        pred_cats = doc.cats
        true_label = max(true_cats, key=true_cats.get)
        pred_label = max(pred_cats, key=pred_cats.get)
        y_true.append(true_label)
        y_pred.append(pred_label)
    print(f"\nClassification Report for {stage} Data:")
    print(classification_report(y_true, y_pred))

To report the progress of the model, we create a new function. It creates a document from each text, extracts the true category labels (handling both dict and string formats), and then gets the predicted categories from the model.
By finding the category with the highest score, it derives both the true label and the predicted label, and appends each to its respective list. As a result, you end up with the y_true and y_pred lists, which are passed into a classification report whose results are printed to the console.
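The label extraction boils down to a dictionary argmax; a minimal sketch with invented scores:

```python
# doc.cats maps each label to a score; max(..., key=...) returns the
# label with the highest score. The scores below are made up.
pred_cats = {"positive": 0.12, "negative": 0.71, "neutral": 0.17}
pred_label = max(pred_cats, key=pred_cats.get)
print(pred_label)  # negative
```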
Deployment
    joblib.dump(nlp, "sentiment_model.joblib")
    return nlp

After training, the function saves the model to a joblib file so we can deploy it, and it also returns the trained model in case we want to start using it within the session it was trained in.
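The save-and-reload round trip works like any other joblib persistence. Here is a minimal sketch using a plain dict as a stand-in for the trained pipeline (note that spaCy also offers nlp.to_disk() for saving pipelines):

```python
import os
import tempfile

import joblib

# A plain dict stands in for the trained nlp object in this sketch
model_stub = {"labels": ["positive", "negative", "neutral"]}

# Dump to a temp path and load it back, exactly as the deployment
# function does with the real pipeline
path = os.path.join(tempfile.gettempdir(), "sentiment_model.joblib")
joblib.dump(model_stub, path)
reloaded = joblib.load(path)
print(reloaded["labels"])  # ['positive', 'negative', 'neutral']
```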
def predict_sentiment_df(joblib_file_path, dataframe):
    if not isinstance(dataframe, pd.DataFrame):
        raise TypeError("dataframe must be a pandas DataFrame")
    # Check that the required column exists
    if 'text' not in dataframe.columns:
        raise ValueError("dataframe must have a column labeled 'text'")
    try:
        trained_model = joblib.load(joblib_file_path)
    except Exception:
        raise ValueError('Error loading joblib file; please make sure you have the correct file path')
    results = []
    texts = dataframe['text']
    docs = list(trained_model.pipe(texts))
    for doc in docs:
        scores = doc.cats
        predicted_class = max(scores, key=scores.get)
        results.append({
            'text': doc.text,
            'predicted_class': predicted_class,
            'positive': scores.get('positive', 0.0),
            'negative': scores.get('negative', 0.0),
            'neutral': scores.get('neutral', 0.0),
        })
    results_df = pd.DataFrame(results)
    return results_df

Now, as an example of how you would use the model once it is deployed, you can create a function similar to the one above in your existing product.
Just as before, the function takes in a DataFrame and checks that the proper column is present. Next, it checks that the path to the joblib file is correct. Once those steps have passed and everything looks correct, the function uses the model to make predictions on the given texts and returns the results in a DataFrame.
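A hypothetical call might look like the commented lines below (the file path and texts are placeholders, so they are not executed here); the constructed frame shows the shape of the result you get back:

```python
import pandas as pd

# Hypothetical usage once the model file exists:
# new_texts = pd.DataFrame({"text": ["Great service!", "Not worth it."]})
# results = predict_sentiment_df("sentiment_model.joblib", new_texts)

# The returned frame has one row per input text, with the winning
# label plus the raw score for each category (scores invented here):
results = pd.DataFrame([
    {"text": "Great service!", "predicted_class": "positive",
     "positive": 0.91, "negative": 0.03, "neutral": 0.06},
])
print(list(results.columns))
```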