Building a Content-based Recommender using a Cosine-Similarity Algorithm

Ahmed Elkhattam
9 min read · Dec 5, 2020

In this three-part series, we will teach you everything you need to build and deploy your own Chatbot. This first part covers our movie recommendation algorithm; the later parts will cover deploying the model on an AWS serverless platform and building the interactive chatbot web application.

We use a Jupyter Python 3 notebook as a collaborative coding environment for this project, and other bits of code for the web app development and deployment. All the code for this series is available in this GitHub repository.

Before we explain our building process, try and test our fully deployed Chatbot. Maybe it can recommend your next movie :)

Movie Recommendation Algorithms

A recommender system is a subclass of information filtering system that seeks to predict the “rating” or “preference” a user would give to an item. Recommender systems are heavily used in commercial applications such as Netflix, YouTube, and Amazon Prime. They help users find relevant items in a fraction of the time, without the need to search the entire catalog.

There are different approaches to build a movie recommender system:

  1. Simple Recommender: this approach ranks all movies based on specific criteria: popularity, awards, and/or genre, then it suggests top movies to users without considering their individual preferences. An example could be Netflix’s “Top 10 in the U.S. Today.”
  2. Collaborative Filtering Recommender: this approach utilizes a user’s past behavior to predict items that the user might be interested in. It considers the user’s previously watched movies, the numerical ratings given to those movies, and the movies previously watched by similar users.
  3. Content-based Filtering Recommender: this approach utilizes the properties and the metadata of a particular item to suggest other items with similar characteristics. For example, a recommender can analyze a movie’s genre and director to recommend additional movies with similar properties.

Each recommender has its advantages and limitations. The simple recommender provides general recommendations that may or may not match a particular user’s taste. Collaborative filtering can provide accurate recommendations, but it requires a large amount of information about a user’s previous history and preferences. On the other hand, content-based filtering needs little information about the user’s history to provide good recommendations, but it is limited in scope, as it always needs items with pre-tagged characteristics to build its recommendations.

As we don’t have access to a user’s past behavior, we will be integrating a content-based filtering mechanism in our Movie bot.

Data Collection and Cleaning

We train our chatbot’s recommendation algorithm on a dataset of movies released on or before July 2017, hosted on Kaggle. The dataset contains metadata for over 45,000 movies listed in the Full MovieLens Dataset. Data points include cast, crew, plot, languages, genres, TMDB vote counts, vote averages, and other details.

The files used in our project are:

movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset.

keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

ratings_small.csv: A subset of 100,000 ratings from 700 users on 9,000 movies. Ratings are on a scale of 1–5 and were obtained from the official GroupLens website.

We start by downloading the data from the Kaggle website and storing it in Google Drive. To load the data into the notebook, we mount Google Drive and use the pd.read_csv function to import and load the data.

Listing 1. Importing data from Google Drive
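
The original listing is an embedded notebook cell; a minimal sketch of this step in a Colab notebook might look like the following (the DATA_DIR path is an assumption; point it at wherever you stored the Kaggle files):

import pandas as pd
from google.colab import drive

# Mount Google Drive so the notebook can read the downloaded Kaggle files.
drive.mount('/content/drive')

# Assumed location of the dataset inside Drive; adjust to your own folder.
DATA_DIR = '/content/drive/My Drive/the-movies-dataset/'

metadata = pd.read_csv(DATA_DIR + 'movies_metadata.csv', low_memory=False)
keywords = pd.read_csv(DATA_DIR + 'keywords.csv')
credits = pd.read_csv(DATA_DIR + 'credits.csv')
ratings = pd.read_csv(DATA_DIR + 'ratings_small.csv')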

Then, we explore the shape and column names of each CSV file to gain a better understanding of the data:

  • movies_metadata.csv has a shape of (45466 rows, 24 columns) and columns as [‘adult’, ‘belongs_to_collection’, ‘budget’, ‘genres’, ‘homepage’, ‘id’, ‘imdb_id’, ‘original_language’, ‘original_title’, ‘overview’, ‘popularity’, ‘poster_path’, ‘production_companies’, ‘production_countries’, ‘release_date’, ‘revenue’, ‘runtime’, ‘spoken_languages’, ‘status’, ‘tagline’, ‘title’, ‘video’, ‘vote_average’, ‘vote_count’]
  • ratings.csv has a shape of (26024289 rows, 4 columns) and columns as [‘userId’, ‘movieId’, ‘rating’, ‘timestamp’]
  • credits.csv has a shape of (45476 rows, 3 columns) and columns as [‘cast’, ‘crew’, ‘id’]
  • keywords.csv has a shape of (46419 rows, 2 columns) and columns as [‘id’, ‘keywords’]

As we can see from above, the data is not yet organized to be used together. The tables have different dimensions, and it could get quite confusing to merge them for our recommendation engine. Luckily, all of them share a movie ID column, a unique identifier for each movie, which will allow us to merge them effectively.

Listing 2. Data Exploration
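
A quick sketch of that exploration step, assuming the data frames loaded in Listing 1:

for name, df in [('movies_metadata', metadata), ('ratings', ratings),
                 ('credits', credits), ('keywords', keywords)]:
    print(name, df.shape)    # (rows, columns)
    print(list(df.columns))  # column names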

Content-based Filtering Recommender

Our goal in this section is to build a recommender system that applies natural language processing to understand a user’s input and suggest similar movies. We will calculate pairwise cosine similarity scores for all movies based on their cast, keywords, directors, and genres, then recommend the movies with the highest similarity scores to the user’s input.

Choosing the right features

We combine all the significant features in one data frame. We use astype(‘int’) to change the movie ID datatype to an integer, then use it as a reference key to merge the ‘credits’ and ‘keywords’ data frames into the ‘metadata’ data frame. As a result, the ‘metadata’ data frame ends up with 27 columns.

Listing 3. Combining keywords and credits into the main metadata
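
A sketch of the merge, assuming the data frames from Listing 1 (this version of movies_metadata.csv is known to contain a few malformed rows in the id column, so the coercion step here is a defensive assumption):

# Coerce ids to numbers, dropping the few malformed rows, then cast to int.
metadata['id'] = pd.to_numeric(metadata['id'], errors='coerce')
metadata = metadata.dropna(subset=['id'])
metadata['id'] = metadata['id'].astype('int')
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')

# Merge credits and keywords into the main metadata frame on the movie id.
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')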

So let’s have a look at metadata’s important features:

Table 1. The Metadata Table after combining keywords and credits

As some movies can have tens of keywords and cast members, we will select only the most relevant information in each feature. We will write functions to extract only the director’s name from the “crew” feature and the top three elements from the other features.

Before doing that, we need to convert these features from “stringified” lists back into their original Python data types to avoid errors in our functions.

Listing 4. Converting data types from stringified lists to their original Python objects
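
A minimal sketch of this conversion, using Python’s ast.literal_eval to parse each stringified list back into real Python objects:

from ast import literal_eval

# These columns are stored as stringified lists of dicts in the CSVs.
for feature in ['cast', 'crew', 'keywords', 'genres']:
    metadata[feature] = metadata[feature].apply(literal_eval)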

The get_director(x) function extracts the director’s name and returns NaN if it is not found. The get_lists(x) function returns the top three elements from each feature.

Listing 5. Extracting relevant features from the metadata
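
A sketch of the two helper functions as described; the 'job' and 'name' keys follow the structure of the parsed crew and cast dictionaries in this dataset:

import numpy as np

def get_director(x):
    # Walk the parsed crew list and return the director's name, or NaN.
    for member in x:
        if member['job'] == 'Director':
            return member['name']
    return np.nan

def get_lists(x):
    # Return up to the first three 'name' entries of a parsed list feature.
    if isinstance(x, list):
        return [item['name'] for item in x][:3]
    return []

metadata['director'] = metadata['crew'].apply(get_director)
for feature in ['cast', 'keywords', 'genres']:
    metadata[feature] = metadata[feature].apply(get_lists)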

By applying these two functions to our features, we extract and only work with relevant data on our recommendation system:

Table 2. The Metadata Table after extracting relevant information

Creating a Word Soup

As the recommender system requires understanding features from text, we need to apply Natural Language Processing before building the main recommender.

Our objective is to end up with one big word soup for each movie, so that we can vectorize these soups and then compute cosine similarity between them. We start by converting all words to lowercase and removing the spaces inside multi-word names. Removing these spaces helps the model differentiate between recurring names: we don’t want the “Robert” in “Robert De Niro” stored in the same variable as the “Robert” in “Robert Downey Junior”, because it would be arbitrary to say these actors are similar just based on their first name. When we store the names as “robertdeniro” and “robertdowneyjunior” instead, we differentiate between the actors by creating separate vectorizations.

Listing 6. Cleaning data for the vectorization process
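
A sketch of the cleaning step, lowercasing every token and stripping internal spaces:

def clean_data(x):
    # Lowercase and remove spaces, e.g. 'Robert De Niro' -> 'robertdeniro'.
    if isinstance(x, list):
        return [str.lower(i.replace(' ', '')) for i in x]
    if isinstance(x, str):
        return str.lower(x.replace(' ', ''))
    return ''  # e.g. a missing director stored as NaN

for feature in ['cast', 'keywords', 'director', 'genres']:
    metadata[feature] = metadata[feature].apply(clean_data)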

Then, we combine all the features into one “metadata soup”, as they all carry the same weight in choosing the recommended movie. The function iterates over the rows of our metadata and joins the keywords, cast, director, and genres columns into one big word soup. Each element is separated by a space “ ”, which signals to our vectorization function that it is a distinct word to be encoded separately and uniquely.

Listing 7. Creating a word soup
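
A sketch of the soup-building function described above:

def create_soup(x):
    # Join keywords, cast, director, and genres into one space-separated string.
    return (' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' '
            + x['director'] + ' ' + ' '.join(x['genres']))

metadata['soup'] = metadata.apply(create_soup, axis=1)
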
Table 3. The Metadata Table after adding the soup column

Now that we have a soup for each movie, we want to create a new soup every time our recommender runs, capturing the user’s input. We collect the user’s input so we can vectorize it. Later, we will compute the pairwise cosine similarity between that input and each movie in our database, and rank the movies most similar to that input.

Listing 8. Functions for getting the user’s input
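
The exact input-collection code is in the repository; a hypothetical minimal version, reusing clean_data from Listing 6, could look like this:

def get_user_soup():
    # Hypothetical helper: ask for a genre, an actor, and a director,
    # then clean them exactly like the movie features were cleaned.
    genre = input('Enter a genre: ')
    actor = input('Enter an actor: ')
    director = input('Enter a director: ')
    return ' '.join(clean_data(s) for s in [genre, actor, director])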

Recommendation Model Based on Count Vectorizer and Cosine Similarity

Now that we have the features clean, let’s teach our model how to read them.

As mentioned above, our recommendation model takes both the pre-processed movie features and the user’s preferences as input. It then “soupifies” the user’s input and adds it as a row to the metadata. The model then vectorizes the word soups using the CountVectorizer function from the scikit-learn Python library. CountVectorizer takes documents (distinct strings) and returns a tokenized matrix: each word soup is encoded into the frequencies of the words it contains. For example, consider the following sentences, stored in a list:

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

If we apply CountVectorizer to this corpus, we get a table of word counts: each row is a sentence and each column is a word from the corpus vocabulary. For example, the word ‘document’ appears one time in the first sentence and two times in the second sentence.
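
Here is the same example end to end; the printed output is what scikit-learn returns for this corpus (newer versions replace get_feature_names() with get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
#  [0 2 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]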

We vectorize the user’s input by adding the inputted word soup to the metadata table as the last entry, and then running the vectorization process over all the data. While this isn’t the most efficient way to go about it, CountVectorizer is very quick to run and uses few resources. The bigger problem we have to face is the cosine similarity calculations.

Cosine Similarity

Our recommendation model utilizes all of the movies’ properties and metadata to find the movies most similar to the user’s input. We use the cosine function to compute a similarity score between movies, where each movie gets a similarity score with every other movie in our dataset.

Cosine similarity is a mathematical computation that tells us how similar two vectors A and B are. In effect, we are calculating the cosine of the angle theta between these two vectors. The function returns a value between -1, indicating completely opposite vectors, and 1, indicating identical vectors; 0 indicates a lack of correlation between the vectors, and intermediate values indicate intermediate levels of similarity. Fortunately, we don’t need to write a new function to compute the equation, as scikit-learn already has a pre-built one called cosine_similarity(). The formula we use to compute cosine similarity is the following:

cos(θ) = (A · B) / (‖A‖ ‖B‖) = Σᵢ AᵢBᵢ / (√(Σᵢ Aᵢ²) √(Σᵢ Bᵢ²))

The cosine similarity computation grows linearly with the size of A and B (note that A and B have the same size, n). If we add t more values to each vector, the dot product of A and B requires only t more multiply-add operations, and computing each magnitude also grows linearly. So far, no trouble in computational complexity.

However, our algorithm performs a cosine similarity computation between each possible pair of movies. If we have k movies, then we need to perform k² computations. This is why we had to reduce the number of movies in our dataset from 45,000 to 10,000: going from 45,000² to 10,000² pairs saves roughly 2.025 × 10⁹ − 0.1 × 10⁹ ≈ 1.925 × 10⁹ computations.

Of course, there are methods to decrease the number of computations required and therefore allow us to use the entire dataset, but we decided to leave these as backlogged potential improvements for now.

Finally, we define our recommendation system as a function that takes the user’s input (genre, actor, and director), computes a cosine similarity score between the search terms and all movies, and then recommends the 10 movies most similar to the user’s preferences.

Listing 9. The final recommendation function with cosine similarity scores
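
A sketch of the full pipeline as described, assuming the metadata frame with its soup column and a user soup from Listing 8 (the repository version may differ in details):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(user_soup, metadata, n=10):
    # Append the user's soup as the last document, then vectorize everything.
    soups = metadata['soup'].tolist() + [user_soup]
    count_matrix = CountVectorizer().fit_transform(soups)

    # Pairwise cosine similarity; the last row compares the user's input
    # against every movie (and itself, which we slice off).
    cosine_sim = cosine_similarity(count_matrix, count_matrix)
    sim_scores = list(enumerate(cosine_sim[-1][:-1]))

    # Rank movies by similarity and return the top n titles.
    sim_scores = sorted(sim_scores, key=lambda s: s[1], reverse=True)
    movie_indices = [i for i, _ in sim_scores[:n]]
    return metadata['title'].iloc[movie_indices]

Computing only the last row of the similarity matrix would be cheaper, but this sketch mirrors the full pairwise computation described above; something like recommend(get_user_soup(), metadata) would then return the 10 closest titles.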

Let’s try out our recommender:

Conclusion

In this part of the series, we discussed different kinds of recommendation algorithms and selected the content-based filtering approach to be our main algorithm for the chatbot. We explained in detail the mechanism of building a recommendation model using count vectorizer and cosine similarity scores.

In the next part of this series, we will illustrate the process of deploying our recommendation model on an AWS serverless platform and building an interactive chatbot web application.

Resources

(Tutorial) Recommender Systems in Python. (n.d.). Retrieved December 06, 2020, from https://www.datacamp.com/community/tutorials/recommender-systems-python

Recommender system. (2020, November 29). Retrieved December 06, 2020, from https://en.wikipedia.org/wiki/Recommender_system
