Building a Content-based Recommender using a Cosine-Similarity Algorithm

Movie Recommendation Algorithms

A recommender system is a subclass of information filtering systems that seeks to predict the “rating” or “preference” a user would give to an item. Recommender systems are heavily used in commercial applications such as Netflix, YouTube, and Amazon Prime. They help users find relevant items quickly, without having to search the entire dataset.

  1. Simple Recommender: this approach ranks all movies by specific criteria, such as popularity, awards, or genre, then suggests the top movies to every user without considering individual preferences. An example is Netflix’s “Top 10 in the U.S. Today.”
  2. Collaborative Filtering Recommender: this approach uses a user’s past behavior to predict items that the user might be interested in. It considers the movies the user has previously watched, the numerical ratings given to them, and the movies watched by similar users.
  3. Content-based Filtering Recommender: this approach utilizes the properties and the metadata of a particular item to suggest other items with similar characteristics. For example, a recommender can analyze a movie’s genre and director to recommend additional movies with similar properties.
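
To make the first approach concrete, here is a minimal sketch of a simple recommender. The catalogue, its titles, and the `top_n` helper are all made up for illustration; the only idea taken from the description above is that every user gets the same ranking.

```python
# Hypothetical toy catalogue: titles and popularity scores are invented.
movies = [
    {"title": "Movie A", "popularity": 12.5},
    {"title": "Movie B", "popularity": 3.1},
    {"title": "Movie C", "popularity": 27.8},
]


def top_n(catalogue, key="popularity", n=2):
    """Return the n titles with the highest value for `key`.

    The user never enters the calculation, which is what makes this a
    "simple" recommender.
    """
    ranked = sorted(catalogue, key=lambda movie: movie[key], reverse=True)
    return [movie["title"] for movie in ranked[:n]]


print(top_n(movies))  # ['Movie C', 'Movie A']
```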

Data Collection and Cleaning

We train our chatbot’s recommendation algorithm on a Kaggle-hosted dataset of movies released on or before July 2017. The dataset contains metadata for over 45,000 movies listed in the Full MovieLens Dataset. Data points include cast, crew, plot, languages, genres, TMDB vote counts, vote averages, and other details.

Listing 1. Importing data from Google Drive
  • movies_metadata.csv has a shape of (45466 rows, 24 columns) and columns as [‘adult’, ‘belongs_to_collection’, ‘budget’, ‘genres’, ‘homepage’, ‘id’, ‘imdb_id’, ‘original_language’, ‘original_title’, ‘overview’, ‘popularity’, ‘poster_path’, ‘production_companies’, ‘production_countries’, ‘release_date’, ‘revenue’, ‘runtime’, ‘spoken_languages’, ‘status’, ‘tagline’, ‘title’, ‘video’, ‘vote_average’, ‘vote_count’]
  • ratings.csv has a shape of (26024289 rows, 4 columns) and columns as [‘userId’, ‘movieId’, ‘rating’, ‘timestamp’]
  • credits.csv has a shape of (45476 rows, 3 columns) and columns as [‘cast’, ‘crew’, ‘id’]
  • keywords.csv has a shape of (46419 rows, 2 columns) and columns as [‘id’, ‘keywords’]
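
As a rough sketch of how the four files above could be loaded and inspected with pandas: the file names come from the list above, while the `load_tables` and `describe` helpers and the Google Drive path in the comment are assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd

# File names as listed above, keyed by a short table name.
FILES = {
    "metadata": "movies_metadata.csv",
    "ratings": "ratings.csv",
    "credits": "credits.csv",
    "keywords": "keywords.csv",
}


def load_tables(data_dir, names=FILES):
    """Read each CSV in `data_dir` into a DataFrame keyed by table name."""
    data_dir = Path(data_dir)
    return {
        name: pd.read_csv(data_dir / fname, low_memory=False)
        for name, fname in names.items()
    }


def describe(tables):
    """Return ((rows, columns), column names) for every table."""
    return {name: (df.shape, list(df.columns)) for name, df in tables.items()}


# Example usage (the path is hypothetical; point it at your own copy):
# tables = load_tables("/content/drive/MyDrive/the-movies-dataset")
# for name, (shape, columns) in describe(tables).items():
#     print(name, shape, columns)
```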
Listing 2. Data Exploration

Content-based Filtering Recommender

Our goal in this section is to build a recommender system that uses Natural Language Processing to interpret a user’s input and suggest similar movies. We calculate pairwise cosine similarity scores for all movies based on their cast, keywords, directors, and genres, then recommend the movies with the highest similarity scores to the user’s input.
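
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so two movies that share most of their tokens score close to 1 and two movies that share none score 0. The sketch below computes it by hand on word counts; the function name and the sample strings are illustrative only.

```python
import math
from collections import Counter


def cosine_similarity_counts(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    # Dot product over the tokens that appear in both texts.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Two of three tokens are shared: 2 / (sqrt(3) * sqrt(3)) = 2/3.
score = cosine_similarity_counts(
    "action tomhanks spielberg", "action tomhanks nolan"
)
print(round(score, 3))  # 0.667
```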

Choosing the right features

We combine all the significant features into one data frame. We use astype(‘int’) to convert the movie ID column to integers, then use it as the key to merge the ‘credits’ and ‘keywords’ data frames into the ‘metadata’ data frame. As a result, the ‘metadata’ data frame has 27 columns.
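
A minimal sketch of this merge on tiny stand-in tables. The column names follow the file descriptions earlier in the article; the cell values are placeholders, and the real ‘metadata’ table may need extra cleaning before its IDs convert cleanly.

```python
import pandas as pd

# Tiny stand-ins for the real tables; cell contents are placeholders.
metadata = pd.DataFrame({"id": ["862", "8844"], "title": ["Toy Story", "Jumanji"]})
credits = pd.DataFrame({"id": [862, 8844], "cast": ["[]", "[]"], "crew": ["[]", "[]"]})
keywords = pd.DataFrame({"id": [862, 8844], "keywords": ["[]", "[]"]})

# The key must have the same dtype in every table before merging.
for df in (metadata, credits, keywords):
    df["id"] = df["id"].astype("int")

# Merging adds cast and crew from credits, then keywords.
metadata = metadata.merge(credits, on="id").merge(keywords, on="id")

print(metadata.shape)  # (2, 5)
```

On the full dataset the same arithmetic gives 24 original columns plus ‘cast’, ‘crew’, and ‘keywords’, which is where the 27 columns come from.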

Listing 3. Combining keywords and credits to the main metadata
Table-1. The Metadata Table after combining keywords and credits
Listing 4. Converting data type from strings to its original objects
Listing 4. Extracting relevant features from metadata
Table-2. The Metadata Table after extracting relevant information
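
The conversion and extraction steps captioned above might look roughly like the sketch below. It assumes each cell holds a stringified Python list of dictionaries with ‘name’ and ‘job’ fields; the helper names and the three-item limit are illustrative choices, not necessarily what the original listings used.

```python
from ast import literal_eval


def parse(cell):
    """Turn a stringified Python literal back into a real object."""
    return literal_eval(cell) if isinstance(cell, str) else cell


def get_director(crew):
    """Return the name of the first crew member whose job is Director."""
    for member in crew:
        if member.get("job") == "Director":
            return member.get("name")
    return None


def get_top_names(items, limit=3):
    """Return up to `limit` names from a list of {'name': ...} dictionaries."""
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items[:limit]]


# Sample cells in the assumed format.
crew_cell = "[{'job': 'Director', 'name': 'John Lasseter'}, {'job': 'Screenplay', 'name': 'Joss Whedon'}]"
cast_cell = "[{'name': 'Tom Hanks'}, {'name': 'Tim Allen'}, {'name': 'Don Rickles'}, {'name': 'Jim Varney'}]"

print(get_director(parse(crew_cell)))   # John Lasseter
print(get_top_names(parse(cast_cell)))  # ['Tom Hanks', 'Tim Allen', 'Don Rickles']
```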

Creating a Word Soup

Because the recommender compares movies by their text features, we need to apply Natural Language Processing techniques to that text before building the main recommender.
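
One plausible way to do the cleaning and build the soup is sketched below. Lowercasing and removing spaces turns a name such as “Tom Hanks” into the single token “tomhanks”, so the vectorizer does not match two different people who share a first name. The function names and the sample row are assumptions for illustration.

```python
def clean(value):
    """Lowercase text and strip spaces; works on a string or a list of strings."""
    if isinstance(value, list):
        return [str(item).lower().replace(" ", "") for item in value]
    if isinstance(value, str):
        return value.lower().replace(" ", "")
    return ""  # missing values become an empty token


def create_soup(row):
    """Join keywords, cast, director, and genres into one space-separated string."""
    tokens = row["keywords"] + row["cast"] + [row["director"]] + row["genres"]
    return " ".join(token for token in tokens if token)


# Hypothetical row after feature extraction.
row = {
    "keywords": ["toy", "friendship"],
    "cast": ["Tom Hanks", "Tim Allen"],
    "director": "John Lasseter",
    "genres": ["Animation", "Comedy"],
}
cleaned = {feature: clean(value) for feature, value in row.items()}

print(create_soup(cleaned))
# toy friendship tomhanks timallen johnlasseter animation comedy
```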

Listing 5. Cleaning data for the vectorization process
Listing 6. Creating a word soup
Table-3. The Metadata Table after adding the soup column
Listing 7. Getting the user’s input functions

Recommendation Model Based on Count Vectorizer and Cosine Similarity

Now that the features are clean, let’s teach our model how to read them.

Listing 8. The final recommendation function with cosine similarity scores
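
A minimal sketch of such a recommendation function, using scikit-learn’s `CountVectorizer` and `cosine_similarity` on a handful of made-up soups. The titles, soups, and the `recommend` helper are hypothetical; on the real data the soups would come from the ‘soup’ column.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up titles and soups for illustration.
titles = ["Alpha", "Beta", "Gamma", "Delta"]
soups = [
    "space crew janedoe scifi adventure",
    "space crew janedoe scifi thriller",
    "wedding family johnroe comedy romance",
    "space johnroe comedy",
]

# Each soup becomes a row of token counts; each column is one token.
count_matrix = CountVectorizer().fit_transform(soups)

# similarity[i][j] is the cosine similarity between movie i and movie j.
similarity = cosine_similarity(count_matrix, count_matrix)


def recommend(title, n=2):
    """Return the n titles most similar to `title`, excluding the title itself."""
    idx = titles.index(title)
    ranked = sorted(enumerate(similarity[idx]), key=lambda pair: pair[1], reverse=True)
    top = [i for i, _ in ranked if i != idx][:n]
    return [titles[i] for i in top]


print(recommend("Alpha"))  # ['Beta', 'Delta']
```

“Beta” shares four of five tokens with “Alpha”, “Delta” shares only one, and “Gamma” shares none, which explains the order of the result.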

Conclusion

In this part of the series, we discussed different kinds of recommendation algorithms and selected content-based filtering as the main algorithm for the chatbot. We then explained in detail how to build a recommendation model using a count vectorizer and cosine similarity scores.

Resources

(Tutorial) Recommender Systems in Python. (n.d.). Retrieved December 06, 2020, from https://www.datacamp.com/community/tutorials/recommender-systems-python
