Natural Language Processing (NLP) is all the rage in the current AI and ML space. NLP finds applications in many domains, including sentiment analysis, chatbots, language translation, and more. Many advanced techniques and algorithms have been developed at the frontier of the field, but sometimes it's better to revisit the basics. In this edition, we'll cover some fundamental ideas on how to preprocess textual data and perform feature engineering using TF-IDF for unigrams, bigrams, and trigrams.
TLDR in code format: https://colab.research.google.com/drive/1A53XA6PPVJ7UH--TMpc1VyYqOho66uOa?usp=sharing
Preprocessing
Before diving into feature engineering, it's crucial to preprocess the text data. Preprocessing involves several steps:
1. Lowercasing
The first step is to convert all text to lowercase to ensure uniformity and prevent the model from treating words with different cases as distinct.
2. Removing Punctuation
Punctuation marks typically carry little information for a task like essay scoring (the running example in this post). Removing them helps reduce noise in the text data.
3. Tokenization
Tokenization is the process of splitting the text into individual words or tokens. In this project, we use the NLTK library for tokenization.
4. Removing Stopwords
Stopwords are common words like "the," "and," "in," etc., that don't carry significant information. We remove these words to focus on more meaningful content.
import string
import nltk
from nltk.corpus import stopwords

# One-time downloads for the NLTK tokenizer and stopword list
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once; a set lookup is much faster than
# re-reading the stopword list for every word
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize the text
    words = nltk.word_tokenize(text)
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Join the words back into a cleaned sentence
    cleaned_text = ' '.join(words)
    return cleaned_text

# Apply the preprocess_text function
train['essay_text'] = train['essay_text'].apply(preprocess_text)
test['essay_text'] = test['essay_text'].apply(preprocess_text)

Feature Engineering
1. Number of Words and Characters
Two simple features are created: the number of words (WC) and the number of characters (CharC) in each essay. These features capture the length of each essay, which may be related to its score.
# Create a new column for the number of words
train['WC'] = train['essay_text'].apply(lambda x: len(x.split()))
test['WC'] = test['essay_text'].apply(lambda x: len(x.split()))
# Create a new column for the number of characters
train['CharC'] = train['essay_text'].apply(lambda x: len(x))
test['CharC'] = test['essay_text'].apply(lambda x: len(x))

2. TF-IDF Vectorization for Unigrams, Bigrams, and Trigrams
TF-IDF (Term Frequency-Inverse Document Frequency) is a technique used to represent text data in a numerical format. It measures the importance of a word within a document relative to a collection of documents. Here, we use TF-IDF vectorization to convert the essays into numerical feature vectors.
- Unigrams: Single words are considered as features.
- Bigrams: Pairs of consecutive words are treated as features.
- Trigrams: Triplets of consecutive words are treated as features.
For each of these three cases, we create separate TF-IDF matrices to capture different aspects of the essays.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = train['essay_text'].values.tolist()
# Initialize TF-IDF vectorizers for unigrams, bigrams, and trigrams
tfidf_vectorizer_unigram = TfidfVectorizer(ngram_range=(1, 1))
tfidf_vectorizer_bigram = TfidfVectorizer(ngram_range=(2, 2))
tfidf_vectorizer_trigram = TfidfVectorizer(ngram_range=(3, 3))
# Fit and transform the training data corpus with the vectorizers
tfidf_matrix_unigram = tfidf_vectorizer_unigram.fit_transform(corpus)
tfidf_matrix_bigram = tfidf_vectorizer_bigram.fit_transform(corpus)
tfidf_matrix_trigram = tfidf_vectorizer_trigram.fit_transform(corpus)
# Get feature names for unigrams, bigrams, and trigrams
unigram_feature_names = tfidf_vectorizer_unigram.get_feature_names_out()
bigram_feature_names = tfidf_vectorizer_bigram.get_feature_names_out()
trigram_feature_names = tfidf_vectorizer_trigram.get_feature_names_out()
# (Optional) Add Columns back into train
unigram_df = pd.DataFrame(data=tfidf_matrix_unigram.toarray(), columns=unigram_feature_names)
bigram_df = pd.DataFrame(data=tfidf_matrix_bigram.toarray(), columns=bigram_feature_names)
trigram_df = pd.DataFrame(data=tfidf_matrix_trigram.toarray(), columns=trigram_feature_names)
train = pd.concat([train, unigram_df, bigram_df, trigram_df], axis=1)

XGBoost Model with Grid Search
Yes, XGBoost can be used for NLP tasks (classification or regression, depending on your use case) once the text has been converted into numerical features such as the TF-IDF vectors above. Since essay scores are continuous, we use a regressor here.
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

# Split the data into training and validation sets
X = train.drop(['score', 'essay_text'], axis=1)
y = train['score']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
# Model
xgb_model = xgb.XGBRegressor()
# Define grid search
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
}
# Create GridSearchCV
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(X_train, y_train)
# Get the best model
best_xgb_model = grid_search.best_estimator_
# Make predictions on the validation set
y_pred = best_xgb_model.predict(X_valid)
# Calculate the mean squared error (MSE) on the validation set
mse = mean_squared_error(y_valid, y_pred)
print(f"Validation MSE: {mse}")
print("Best Parameters:", grid_search.best_params_)

- Parts of this article were written using Generative AI
- Subscribe/leave a comment if you want to stay up-to-date with the latest AI trends.
Plug: Check out all my digital products on Gumroad here. Please purchase ONLY if you have the means to do so. Use code: MEDSUB to get a 10% discount!