A content-based recommender is a system that relies on the similarity of items when it recommends items to users. For example, when a user likes a movie, the system finds and recommends movies whose features are most similar to those of that movie. (Feature 1)

In this article, I will walk through how to build a content-based recommender using the MovieLens dataset.
Read the Data
Let's read the data. There are multiple versions of the MovieLens dataset. The version used in this practice contains 9,742 movies.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
movies = pd.read_csv('movies.csv')

The file has 3 columns: movieId, title, and genres. The column title has the format title (year). For example, the title in the first row is Toy Story (1995). Since it is more convenient to separate the title from the year for later use, I will separate these two in advance.
# the function to extract titles
def extract_title(title):
    year = title[len(title)-5:len(title)-1]
    # some movies do not have the year in the title column, so handle that case as well
    if year.isnumeric():
        title_no_year = title[:len(title)-7]
        return title_no_year
    else:
        return title
# the function to extract years
def extract_year(title):
    year = title[len(title)-5:len(title)-1]
    # some movies do not have the year in the title column, so handle that case as well
    if year.isnumeric():
        return int(year)
    else:
        return np.nan
# change the column name from title to title_year
movies.rename(columns={'title':'title_year'}, inplace=True)
# remove leading and ending whitespaces in title_year
movies['title_year'] = movies['title_year'].apply(lambda x: x.strip())
# create the columns for title and year
movies['title'] = movies['title_year'].apply(extract_title)
movies['year'] = movies['title_year'].apply(extract_year)
Explore the Feature (genres)
The column genres is the only feature used for this recommendation engine. Since movies without genre information are not useful in this practice, I will drop them from the data.
r,c = movies[movies['genres']=='(no genres listed)'].shape
print('The number of movies which do not have info about genres:',r)
[Out] The number of movies which do not have info about genres: 34
# remove the movies without genre information and reset the index
movies = movies[~(movies['genres']=='(no genres listed)')].reset_index(drop=True)

Each movie contains multiple genres, as follows:

Let's see how many times each genre appears in the data.
# remove '|' in the genres column ('|' is a regex special character, so turn regex off)
movies['genres'] = movies['genres'].str.replace('|', ' ', regex=False)
# count the number of occurrences for each genre in the data set
counts = dict()
for i in movies.index:
    for g in movies.loc[i, 'genres'].split(' '):
        if g not in counts:
            counts[g] = 1
        else:
            counts[g] = counts[g] + 1
# create a bar chart
plt.bar(list(counts.keys()), counts.values(), color='g')
plt.xticks(rotation=45)
plt.xlabel('Genres')
plt.ylabel('Counts')
tf-idf (Term Frequency and Inverse Document Frequency) and Cosine Similarity

Feature 3 simply illustrates the process of the recommendation engine. Since Movie 1 and Movie 2 are considered similar to each other, and neither is similar to Movie 3, if a user likes Movie 1, the system should recommend Movie 2 to the user.
In order to implement this process, two steps are required.
- Step 1: Quantify the features for each movie (tf-idf)
- Step 2: Calculate the similarity between movies (Cosine Similarity)
Term Frequency and Inverse Document Frequency (tf-idf)
tf-idf is a numerical statistic which is used to calculate the importance of a word to a document in a collection of documents.
There are several ways to define and normalize tf and idf in practice. But the basic formula is as follows:
tf-idf(i, j) = tf(i, j) × idf(i, N)
- tf(i, j) = f(i, j) / ∑ₖ f(k, j)
- idf(i, N) = log(N / df(i))
- f(i, j): the number of times word i occurs in document j
- ∑ₖ f(k, j): the total number of words in document j
- df(i): the number of documents in which word i appears
- N: the total number of documents

Look at Feature 4 for a better understanding of the formula. According to the formula above:
- i = {Action, Adventure, Animation, Comedy}
- j = {Document 1, Document 2, Document 3} (N = 3)
- When a document contains fewer words, each word in it becomes more important. In this sense, the tf-idf for 'Adventure' in Document 2 should be greater than the tf-idf for 'Adventure' in Document 1.
- When a word appears in fewer documents, it becomes more important in the documents where it does appear. The tf-idf for 'Comedy' in each document should be very low because every document contains the word.
tf-idf is used here to quantify the importance of a genre (word) to a movie (document) in the data set (a collection of documents). From the bar chart in Feature 2, we can expect that Drama and Comedy will have low tf-idf scores because about half of the movies contain these genres.
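To make the basic formula concrete before turning to scikit-learn, here is a minimal from-scratch sketch on a made-up three-document corpus (my own illustration, not part of the dataset):

```python
import math

# toy corpus: each "document" is a list of genre words
docs = [
    ['Action', 'Adventure', 'Comedy'],  # Document 1
    ['Adventure', 'Comedy'],            # Document 2
    ['Animation', 'Comedy'],            # Document 3
]
N = len(docs)

def tf(word, doc):
    # f(i, j) / sum_k f(k, j): relative frequency of the word in the document
    return doc.count(word) / len(doc)

def idf(word):
    # log(N / df(i)): words appearing in fewer documents get higher weight
    df = sum(1 for doc in docs if word in doc)
    return math.log(N / df)

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

# 'Adventure' weighs more in the shorter Document 2 than in Document 1
print(tfidf('Adventure', docs[0]))
print(tfidf('Adventure', docs[1]))
# 'Comedy' appears in every document, so its idf (and tf-idf) is 0 here
print(tfidf('Comedy', docs[0]))  # → 0.0
```

Note that TfidfVectorizer uses a smoothed variant of the formula, idf = log((1 + N) / (1 + df)) + 1, followed by L2 normalization, so its scores differ from this basic version; in particular, a genre present in every movie still gets a nonzero weight there.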
The TfidfVectorizer class from the sklearn.feature_extraction.text module can be used to calculate and vectorize the tf-idf scores for each movie.
from sklearn.feature_extraction.text import TfidfVectorizer
# change 'Sci-Fi' to 'SciFi' and 'Film-Noir' to 'Noir'
movies['genres'] = movies['genres'].str.replace('Sci-Fi','SciFi')
movies['genres'] = movies['genres'].str.replace('Film-Noir','Noir')
# create an object for TfidfVectorizer
tfidf_vector = TfidfVectorizer(stop_words='english')
# apply the object to the genres column
tfidf_matrix = tfidf_vector.fit_transform(movies['genres'])

What does tfidf_matrix look like?
tfidf_matrix.shape
[Out] (9708, 19)
- tfidf_matrix is a matrix with 9708 rows (movies) and 19 columns (genres).
- The row number of the matrix corresponds to the index of the movies data frame.
# list which genre each column represents (on newer scikit-learn, use get_feature_names_out() instead)
print(list(enumerate(tfidf_vector.get_feature_names())))
[Out] [(0, 'action'), (1, 'adventure'), (2, 'animation'), (3, 'children'), (4, 'comedy'), (5, 'crime'), (6, 'documentary'), (7, 'drama'), (8, 'fantasy'), (9, 'horror'), (10, 'imax'), (11, 'musical'), (12, 'mystery'), (13, 'noir'), (14, 'romance'), (15, 'scifi'), (16, 'thriller'), (17, 'war'), (18, 'western')]
- The columns of the tfidf_matrix are in this order: ('action', 'adventure', 'animation', 'children', 'comedy', 'crime', 'documentary', 'drama', 'fantasy', 'horror', 'imax', 'musical', 'mystery', 'noir', 'romance', 'scifi', 'thriller', 'war', 'western')
- The first row vector of the matrix is Toy Story: (0, 0.4168, 0.5163, 0.5049, 0.2674, 0, 0, 0, 0.483, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0). According to this vector, tf-idf(animation) = 0.5163 and tf-idf(children) = 0.5049, which means that Animation is the most significant genre for Toy Story, and Children is the next.
Cosine Similarity
The next step is to calculate the similarity between movies. In this step, we can use cosine similarity, which measures how close two vectors are by the angle between them.
The formula of the cosine similarity is as follows:
Similarity = cos(θ) = (A⋅B)/(∥A∥×∥B∥)
- A & B: non-zero vectors
- θ: the measure of the angle between A and B
- A⋅B: dot product
- ∥A∥ or ∥B∥: the length of the vector A or B

In Feature 5, consider two different movies (Movie 1 and Movie 2) as vectors in a two-dimensional genre space. As cos(θ) decreases (θ increases), the two vectors grow farther apart, and as cos(θ) increases (θ decreases), they grow closer.
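As a quick check of the formula, cosine similarity can be computed directly with NumPy; the two-dimensional genre vectors below are made-up numbers for illustration:

```python
import numpy as np

def cosine_sim(a, b):
    # cos(theta) = (A . B) / (||A|| * ||B||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

movie_1 = np.array([0.8, 0.6])  # hypothetical (action, romance) weights
movie_2 = np.array([0.9, 0.4])
movie_3 = np.array([0.0, 1.0])

print(cosine_sim(movie_1, movie_2))  # close to 1: small angle, similar movies
print(cosine_sim(movie_1, movie_3))  # smaller: large angle, dissimilar movies
```

Because TfidfVectorizer returns L2-normalized rows (so ∥A∥ = ∥B∥ = 1), the denominator drops out, and cosine similarity reduces to a plain dot product; that is why linear_kernel(), which computes dot products, can be used in the next step.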
The linear_kernel() function in sklearn.metrics.pairwise can be used to calculate the cosine similarity.
from sklearn.metrics.pairwise import linear_kernel
# create the cosine similarity matrix
sim_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)
print(sim_matrix)
- sim_matrix is a 9708×9708 matrix.
- sim_matrix[i][j] is the similarity value between movie i and movie j.
- The diagonal elements represent the similarity of a movie with itself, so they are all 1.
- sim_matrix[i][j] = sim_matrix[j][i], i.e. the matrix is symmetric.
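These properties are easy to verify on a toy example. The sketch below uses three made-up rows, L2-normalized like the output of TfidfVectorizer, so the dot-product matrix stands in for sim_matrix:

```python
import numpy as np

# three made-up tf-idf row vectors, normalized to unit length
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [3.0, 0.0, 1.0]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)

# for unit-length rows, the dot-product matrix is the cosine similarity matrix
sim = X @ X.T  # what linear_kernel(X, X) computes

print(np.allclose(np.diag(sim), 1.0))  # True: each movie is perfectly similar to itself
print(np.allclose(sim, sim.T))         # True: sim[i][j] == sim[j][i]
```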
Create a Movie Recommender
We have finished calculating the similarity values for all the pairs of the movies in the dataset.
'Did you mean…?' Trick
We often misspell movie titles. When we make a misspelling while searching on Google, Google asks, 'Did you mean…?' to help our search. I apply the Levenshtein Distance, a technique for calculating the distance between words, to implement this trick in the recommendation engine. The fuzz module in the fuzzywuzzy library can be used to work with the Levenshtein Distance in Python.
from fuzzywuzzy import fuzz

# create a function to score how closely two titles match
def matching_score(a, b):
    return fuzz.ratio(a, b)

fuzz.ratio(a, b) calculates the Levenshtein Distance between a and b, and returns a score for the distance. If the two words are exactly the same, the score is 100. As the distance between the words increases, the score falls.
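For intuition about what this score measures, here is a from-scratch sketch of the idea: a classic dynamic-programming Levenshtein distance turned into a 0–100 score. (fuzzywuzzy's actual scoring differs in details, so these numbers will not match fuzz.ratio() exactly.)

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def simple_ratio(a, b):
    # crude 0-100 score: identical strings score 100, and the score falls with distance
    total = len(a) + len(b)
    return max(0, round(100 * (total - 2 * levenshtein(a, b)) / total))

print(simple_ratio('Toy Story', 'Toy Story'))  # → 100
print(simple_ratio('Toy Story', 'Toy Stori'))  # → 89
```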
# a function to convert index to title_year
def get_title_year_from_index(index):
    return movies[movies.index == index]['title_year'].values[0]
# a function to convert index to title
def get_title_from_index(index):
    return movies[movies.index == index]['title'].values[0]
# a function to convert title to index
def get_index_from_title(title):
    return movies[movies.title == title].index.values[0]
# a function to return the most similar title to the words a user types
def find_closest_title(title):
    leven_scores = list(enumerate(movies['title'].apply(matching_score, b=title)))
    sorted_leven_scores = sorted(leven_scores, key=lambda x: x[1], reverse=True)
    closest_title = get_title_from_index(sorted_leven_scores[0][0])
    distance_score = sorted_leven_scores[0][1]
    return closest_title, distance_score

The function find_closest_title() returns the title in the data most similar to the words a user types. Without this, the recommender only works when a user enters a title exactly as it appears in the data.
Make the Recommendation Engine
def contents_based_recommender(movie_user_likes, how_many):
    closest_title, distance_score = find_closest_title(movie_user_likes)
    # when the user has not misspelled the title
    if distance_score == 100:
        movie_index = get_index_from_title(closest_title)
        movie_list = list(enumerate(sim_matrix[int(movie_index)]))
        # remove the typed movie itself
        similar_movies = list(filter(lambda x: x[0] != int(movie_index), sorted(movie_list, key=lambda x: x[1], reverse=True)))
        print('Here\'s the list of movies similar to '+'\033[1m'+str(closest_title)+'\033[0m'+'.\n')
        for i, s in similar_movies[:how_many]:
            print(get_title_year_from_index(i))
    # when the user has misspelled the title
    else:
        print('Did you mean '+'\033[1m'+str(closest_title)+'\033[0m'+'?', '\n')
        movie_index = get_index_from_title(closest_title)
        movie_list = list(enumerate(sim_matrix[int(movie_index)]))
        similar_movies = list(filter(lambda x: x[0] != int(movie_index), sorted(movie_list, key=lambda x: x[1], reverse=True)))
        print('Here\'s the list of movies similar to '+'\033[1m'+str(closest_title)+'\033[0m'+'.\n')
        for i, s in similar_movies[:how_many]:
            print(get_title_year_from_index(i))

Test the Recommender
It's time to test the engine. Let's find movies similar to 'Monsters, Inc.' and set the engine to recommend 20 movies.
contents_based_recommender('Monsters, Inc.', 20)
The recommender seems to find movies quite similar to the one I chose. Now, let's see what happens if I misspell the title.
contents_based_recommender('Monster Incorporation.', 20)
The system understood what I meant and still produced a good list of recommendations.
The data and code for this practice can be found here.