Project Report submitted
for Artificial Intelligence (UCS411)
Submitted by:
Garvish Manan (102303083)
Rachit Mahajan (102303495)
Lakshya Swar (102303086)
Submitted to
Dr. Simranjeet Kaur
DEPARTMENT OF COMPUTER ENGINEERING
THAPAR INSTITUTE OF ENGINEERING AND TECHNOLOGY,
PATIALA, PUNJAB
INDIA
Jan-May 2025
MOVIE RECOMMENDATION SYSTEM USING K-NEAREST NEIGHBOURS ALGORITHM (KNN)
Table of Contents
1. Abstract
2. Introduction
3. Problem Statement
4. Objective
5. Methodology
5.1 Dataset Collection
5.2 Data Preprocessing and Merging
5.3 Creating a unified ‘Tags’ Field
5.4 Text Vectorization – CountVectorizer()
Settings Used
5.5 Similarity Computation – Cosine Similarity
5.6 Recommendation Logic – K-Nearest Neighbors (KNN)
5.7 Enhancing Results with Poster Display using IMDb API
5.8 Building the web app with Streamlit
6. Results
7. Conclusion
8. References
1. Abstract
In today’s digital era, users are flooded with content choices across platforms like Netflix,
Prime Video, and Disney+. While the availability of thousands of movies sounds exciting,
it often leads to choice overload, where users struggle to decide what to watch next. To
tackle this issue, recommender systems have become essential. These systems aim to
suggest relevant content, saving time and improving the user experience.
This project presents a Content-Based Movie Recommendation System built using
classical Machine Learning techniques, specifically the K-Nearest Neighbors (KNN)
algorithm. It utilizes basic, interpretable ML tools and focuses solely on movie metadata
like genres, keywords, plot summaries, cast, and crew information.
The data comes from the TMDB 5000 Movies and Credits dataset, which includes
structured and semi-structured text data. After cleaning the data and extracting key
features, a text-based representation (called tags) is created for each movie by combining
the different attributes. These tags are then converted into numerical form using
CountVectorizer, a simple frequency-based text feature extractor.
Once vectorized, the system uses cosine similarity to compare movies and identify those
that are textually closest to the selected title. The KNN algorithm then retrieves the top 5
most similar movies. Finally, the recommended results are presented to the user through
a Streamlit web UI, which also displays posters of the suggested movies using the
IMDb API.
This project demonstrates that even with simple and interpretable machine learning
models like KNN, user-friendly applications can be built. One can understand core ML
concepts like data preprocessing, vectorization and similarity measurement without
entering the black box complexity of deep learning.
The project follows a clear, stepwise approach to building a working application. The
simplicity of the methods makes the system easy to understand, efficient to implement,
and suitable for both learning purposes and real-world applications.
2. Introduction
In the modern digital world, we are surrounded by a massive amount of content, whether
it’s movies, music, books, or shopping products. As the number of available choices
grows, it becomes harder for users to decide what to pick next. This is where
recommender systems come into play. These systems help narrow down the choices by
suggesting relevant items based on user preferences or data patterns.
There are mainly three types of recommender systems used today:
1. Collaborative Filtering – This method recommends items by analyzing user
behavior, such as ratings, watch history, or likes. It looks for users with similar
tastes and suggests what they liked. While powerful, this method suffers from the
cold-start problem: it cannot recommend well when a user or item has little or no data.
2. Content-Based Filtering – Instead of relying on user interactions, this method
focuses on the properties of the items themselves. For example, if a user likes a
sci-fi movie with space and robots, the system will recommend other movies with
similar features. This makes content-based systems reliable even with limited user
data.
3. Hybrid Systems – These systems combine collaborative and content-based
methods to get the best of both worlds. However, given this project's small
dataset and its restriction to classical ML techniques, a hybrid system would have
been too complex to implement.
For this project, we chose to use a content-based recommendation system. This method
works best when we have enough useful details about each movie like its genre, story
summary, keywords, main actors, and director. By looking at this kind of content, the
system can find movies that are similar in topic or style and suggest them to users.
This is different from other systems that depend on user reviews, watch history, or ratings
(like collaborative filtering). Content-based systems do not need any user data. This is
helpful in cases where such data doesn’t exist, for example when a movie is new and hasn’t
been watched or rated much yet. We can still recommend it if it is similar to other
known movies.
Another advantage of content-based systems is that they are easy to understand. If a
user asks why a certain movie was recommended, we can explain it clearly, for example: “These
movies are both thrillers with time-travel themes,” or “They have the same cast.” This
makes the system more trustworthy and easier to explain in both learning and real-life
situations.
We used Python to build the system, because it’s easy to work with and has many useful
tools. We used:
● Pandas to handle and clean the movie data,
● Scikit-learn to convert movie information into numbers and use the K-Nearest
Neighbors (KNN) model, and
● Streamlit to make a simple web app where users can get recommendations by
selecting a movie they like.
This introduction gives a clear picture of what the project is about. In the next parts, we’ll
talk more about the actual problem we tried to solve, the steps we followed, and how well
the system performed.
3. Problem Statement
The overwhelming volume of digital content, especially movies, has made it increasingly
difficult for users to discover relevant titles efficiently. Platforms today provide thousands
of options, but the very scale of this offering leads to choice paralysis. Users are left
confused about what to watch next, leading to disengagement.
While industry-grade recommendation engines used by platforms like Netflix or YouTube
provide effective suggestions, they are often powered by deep learning, collaborative
filtering, or hybrid systems that require extensive user data, high computational resources,
and complex infrastructure. We limited this project to content-based filtering because no
user interaction data was available.
The core issue addressed by this project is:
Can we design an effective movie recommendation system that works even without user
interaction data, and still provides meaningful suggestions using only content metadata?
This becomes especially important when:
● The dataset does not contain user ratings, reviews, or watch history
● The goal is to understand and apply interpretable ML methods rather than black
box models
● The system must be lightweight, modular, and explainable
● New items (movies) need to be recommended despite having no user data (the
cold-start problem)
Another part of the problem lies in the nature of movie metadata itself, which is
semi-structured, textual, and varied. Features like plot summaries, keywords, genres, cast,
and crew contain rich information, but extracting and meaningfully combining them into a
usable format for ML models poses a technical challenge. Vectorizing such data while
preserving context and meaning, especially without deep NLP models, is non-trivial.
Hence, this project sets out to solve:
● How to design a system that can convert heterogeneous movie metadata into a
unified, machine understandable form
● How to calculate similarity between movies based purely on content data
● How to recommend top similar movies in a simple yet attractive UI
● How to do all this using only classical ML techniques and lightweight tools,
so that the system fits within the provided guidelines
This project tackles the problem through the lens of practical ML education, building an
application that balances functionality, interpretability, and scalability using modest
resources.
4. Objective
The objective of this project is to develop a practical and efficient content-based movie
recommendation system that utilizes movie metadata such as genres, keywords, plot
summaries, cast, and crew details. By using classical machine learning techniques, the
goal is to create a lightweight and easy to understand system that effectively recommends
movies.
The following objectives were pursued:
● Efficient Text Representation Using Feature Extraction Techniques:
We utilized the CountVectorizer method to convert movie metadata (like plot
summaries and genres) into numerical features. This vectorization technique
captures word frequency patterns and helps in representing textual data in a
structured format that can be processed by the ML model.
● Employ Cosine Similarity for Measuring Movie Similarity:
By calculating cosine similarity between movie vectors, the system identifies films
with the closest content-based features, so that recommendations reflect
similarities in movie content such as genre, cast, and themes.
● Enhance User Experience with Real-Time API Integration:
We incorporated the IMDb API to fetch and display real-time movie posters for
recommended titles. This not only improves the aesthetics of the user interface but
also provides an engaging, interactive experience for users.
● Create a Simple and Intuitive User Interface:
The frontend interface was built using Streamlit, allowing users to interact with the
recommendation system easily. Users can input a movie they like, and the system
will generate a list of similar movies, providing a smooth and user-friendly
browsing experience.
● Scalability for Broader Applications:
The architecture allows easy adaptation to other domains like books, music, or
products. This was achieved by structuring the backend to handle various types of
metadata input, which can be modified to suit different recommendation needs.
● Integrate Software Engineering Best Practices:
In addition to focusing on machine learning, we adhered to software engineering
best practices, including version control using Git, clean code principles, and
documentation (a README file alongside [Link]). This ensured that the
system is not only functional but also easy for a beginner to set up.
5. Methodology
The movie recommendation system was developed using a clear, step-by-step pipeline
designed to be modular, explainable, and lightweight. This methodology focuses on
content-based filtering, a classical recommendation technique that uses item features
rather than user behavior. The system recommends similar movies by analyzing
descriptive content like plot, cast, and genre using the K-Nearest Neighbors (KNN)
algorithm and cosine similarity for comparison.
Below are the stages followed:
5.1 Dataset Collection
We used the TMDB 5000 Movies and Credits Dataset, which is publicly available on
Kaggle. This dataset includes:
● tmdb_5000_movies.csv: Contains details like title, genres, overview (plot
summary), keywords, vote average, and popularity.
● tmdb_5000_credits.csv: Contains the cast and crew data for each movie.
These two datasets were chosen because they provide the structured and semi-structured
metadata necessary for content-based recommendations. Importantly, the dataset does not
contain user ratings or interaction data, so collaborative filtering was not applicable.
5.2 Data Preprocessing and Merging
To prepare the data for analysis, we performed the following steps:
● Merging the two datasets using the common movie ID to combine metadata
information.
● Dropping unnecessary columns such as budget, homepage, vote_count, and other
fields not relevant to content similarity.
● Handling missing values by dropping or filling them where appropriate.
● Parsing JSON-like strings (present in columns like genres, keywords, cast, and
crew) to extract meaningful fields using Python’s ast.literal_eval and list
comprehensions.
○ From cast, only the top 3 actors were selected.
○ From crew, we kept only the director, identified by checking the job role.
This step converted complex fields into simple lists of strings, making them suitable for
text-based processing.
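The parsing step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names (parse_names, parse_directors) and the sample rows are made up for the example, and the "name" and "job" keys are assumed to match the layout of the dataset's JSON-like columns.

```python
import ast


def parse_names(raw, limit=None):
    """Parse a JSON-like string of dicts and return the 'name' values.

    `limit` optionally keeps only the first few entries (e.g. top 3 cast).
    """
    items = ast.literal_eval(raw)
    names = [item["name"] for item in items]
    return names if limit is None else names[:limit]


def parse_directors(raw_crew):
    """Return the names of crew members whose job is 'Director'."""
    crew = ast.literal_eval(raw_crew)
    return [member["name"] for member in crew if member.get("job") == "Director"]


# Made-up sample values shaped like the genres, cast, and crew columns (assumption).
genres_raw = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'
cast_raw = '[{"name": "Actor A"}, {"name": "Actor B"}, {"name": "Actor C"}, {"name": "Actor D"}]'
crew_raw = '[{"job": "Editor", "name": "Person X"}, {"job": "Director", "name": "Person Y"}]'

genres = parse_names(genres_raw)
top_cast = parse_names(cast_raw, limit=3)
directors = parse_directors(crew_raw)
```

With pandas, helpers like these would typically be applied column by column, for example through the `apply` method of a DataFrame column.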
5.3 Creating a unified ‘Tags’ Field
To capture the core essence of each movie, we created a new feature called tags, which
consolidates several key metadata attributes into a single textual representation. This
unified field serves as the foundation for similarity comparison in the recommendation
engine.
The tags field was formed by combining the following movie details:
● Overview (plot summary)
● Genres
● Keywords
● Names of the top 3 cast members
● Director’s name
To make the data consistent and ready for further processing:
● All text was converted to lowercase to avoid mismatches due to case sensitivity.
● Multi-word terms were joined (e.g., “Science Fiction” → “sciencefiction”) so that
such phrases are treated as single meaningful units.
This approach allowed us to turn semi-structured and varied movie metadata into a
uniform text string that summarizes the thematic and stylistic identity of each film.
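As a rough sketch of how such a tags string could be assembled (the function names collapse and build_tags and the sample movie are hypothetical, chosen only for illustration):

```python
def collapse(terms):
    """Remove internal spaces so each multi-word term becomes a single token."""
    return [term.replace(" ", "") for term in terms]


def build_tags(overview, genres, keywords, cast, directors):
    """Combine all metadata fields into one lowercase, space-separated string."""
    parts = (
        overview.split()
        + collapse(genres)
        + collapse(keywords)
        + collapse(cast)
        + collapse(directors)
    )
    return " ".join(parts).lower()


# Made-up example movie (assumption, not taken from the dataset).
tags = build_tags(
    overview="A crew travels through space",
    genres=["Science Fiction", "Adventure"],
    keywords=["space travel"],
    cast=["Actor A"],
    directors=["Person Y"],
)
```

Joining names such as "Actor A" into "actora" keeps a first name shared by two unrelated people from counting as a match during vectorization.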
5.4 Text Vectorization – CountVectorizer()
To enable similarity comparison between movies, we needed a way to convert the textual
tags field into numerical form. For this, we applied CountVectorizer from Scikit-learn,
a frequency-based vectorization method.
CountVectorizer transforms a collection of text documents into a matrix where:
● Each row represents a movie,
● Each column corresponds to a word in the retained vocabulary, and
● The values indicate how frequently that word appears in that movie’s tags.
Settings Used:
● max_features = 5000: Limits the vocabulary to the most frequent 5000 terms.
● stop_words='english': Removes common words like “the”, “and”, “is”, etc., which
carry little meaning in determining movie similarity.
Note:
This method is based on the Bag of Words (BoW) model, which assumes that the meaning
or context of a document can be approximated by the frequency of its words. While BoW
ignores grammar and word order, it works effectively for tasks involving content
comparison.
No deep semantic or syntactic processing was involved, only straightforward word
frequency analysis using classical vectorization.
5.5 Similarity Computation – Cosine Similarity
Once we had vector representations of all movies, we needed a way to compare them. We
used cosine similarity, a widely used similarity measure for text data.
Cosine similarity measures the cosine of the angle between two vectors rather than their
absolute distance. It is defined as:
\[ \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} \]
Where:
● A and B are the vector representations of two movies, and θ is the angle between them.
● A · B is the dot product, and ||A||, ||B|| are the magnitudes.
Why cosine similarity? Because two movies can be similar in content even when their
vectors differ in magnitude (e.g., one has a longer description and therefore higher word
counts), as long as their direction (theme and content) is the same.
We computed the cosine similarity matrix for all movie pairs and stored it in memory for
fast lookups during recommendation.
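The computation can be illustrated with a short NumPy sketch. The function name cosine_similarity_matrix and the toy vectors are assumptions made for this example; Scikit-learn's cosine_similarity function in sklearn.metrics.pairwise computes the same quantity.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between row vectors."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = vectors / norms   # scale every row to length 1
    return unit @ unit.T     # dot products of unit vectors are cosines


# Toy count vectors: the second row is the first one doubled (same direction,
# larger magnitude); the third shares no terms with the other two.
vectors = [[1, 1, 0], [2, 2, 0], [0, 0, 3]]
similarity = cosine_similarity_matrix(vectors)
```

The first two rows get a similarity of 1 despite their different magnitudes, and the third row gets 0 against both, which is the behaviour the paragraph above relies on.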
5.6 Recommendation Logic – K-Nearest Neighbors (KNN)
With the similarity matrix available, we implemented a simple K-Nearest Neighbors style
algorithm to generate recommendations.
Steps followed:
● Given a movie title input by the user, we locate its index in the dataset.
● We retrieve its row from the similarity matrix.
● We sort the scores in descending order and return the top 5 most similar movies
(excluding the selected one itself).
Fig.1 KNN Recommendation illustration
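The steps above can be sketched as a small lookup function. The name recommend, the placeholder titles, and the 4 x 4 similarity matrix are illustrative assumptions; the real system would use the full precomputed matrix and k = 5.

```python
import numpy as np


def recommend(title, titles, similarity, k=5):
    """Return the k titles most similar to `title`, excluding the title itself."""
    index = titles.index(title)            # locate the selected movie
    scores = similarity[index]             # its row in the similarity matrix
    ranked = np.argsort(scores)[::-1]      # indices sorted by descending score
    neighbours = [i for i in ranked if i != index][:k]
    return [titles[i] for i in neighbours]


# Toy data (assumption): four placeholder titles and a symmetric similarity matrix.
titles = ["A", "B", "C", "D"]
similarity = np.array([
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

top = recommend("A", titles, similarity, k=2)
```

An alternative under the same idea would be Scikit-learn's NearestNeighbors class with metric="cosine", which finds neighbours without storing the full similarity matrix.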
5.7 Enhancing Results with Poster Display using IMDb API
To make the recommendations visually appealing, we fetched movie posters using the
IMDb API. Each recommended movie’s ID is used to construct a URL to retrieve its
poster image. This was done through a helper function using HTTP requests.
This added a professional and polished touch to the system, making it more than just a
text-based tool.
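A helper of this kind might look like the sketch below. The report does not give the endpoint, so the URL templates, the "poster_path" field name, and the function names are placeholders invented for illustration and must be replaced with the values from the actual API documentation. The sketch uses the standard library's urllib rather than a third-party HTTP client.

```python
import json
from urllib.request import urlopen

# Hypothetical placeholders (assumption): not the real API's URLs.
POSTER_ENDPOINT = "https://api.example.invalid/movie/{movie_id}?api_key={api_key}"
IMAGE_BASE_URL = "https://images.example.invalid/w500"


def http_get_json(url):
    """Fetch a URL and decode its JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def fetch_poster_url(movie_id, api_key, get_json=http_get_json):
    """Build the request URL from the movie ID and return the poster URL, or None."""
    url = POSTER_ENDPOINT.format(movie_id=movie_id, api_key=api_key)
    try:
        payload = get_json(url)
    except OSError:
        return None  # network failure: the caller can fall back to a title-only card
    poster_path = payload.get("poster_path")  # hypothetical field name
    return IMAGE_BASE_URL + poster_path if poster_path else None
```

Passing the fetch function in as a parameter keeps the helper testable offline, and returning None lets the UI show the title alone when a poster is unavailable.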
5.8 Building the web app with Streamlit
We used Streamlit, a Python-based web framework designed for ML and data science
apps, to create a clean and interactive user interface.
Features of the UI:
● Dropdown to select a movie title
● Button to trigger recommendation
● Side-by-side display of recommended movies with poster and title
● Instant loading and responsiveness using Streamlit’s caching features
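The UI features listed above could be laid out roughly as follows. This is a hedged sketch rather than the project's actual app: render_app and its three parameters are hypothetical names, and the recommendation and poster functions are passed in so the layout code stays independent of them.

```python
def render_app(titles, recommend, fetch_poster_url):
    """Draw the page; intended to be called from a script run with `streamlit run`."""
    import streamlit as st  # imported here so the module loads even without Streamlit

    st.title("Movie Recommendation System")

    # Dropdown to select a movie title.
    selected = st.selectbox("Select a movie you like", titles)

    # Button to trigger the recommendation.
    if st.button("Recommend"):
        picks = recommend(selected)
        if picks:
            # One column per recommended movie, shown side by side.
            columns = st.columns(len(picks))
            for column, title in zip(columns, picks):
                with column:
                    poster = fetch_poster_url(title)
                    if poster:
                        st.image(poster)
                    st.caption(title)
```

In a real script, render_app would be called at the bottom of the file, and expensive loaders (such as reading the similarity matrix from disk) could be wrapped with Streamlit's st.cache_data decorator so they run only once per session.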
Fig.2 Workflow of the Content-Based Movie Recommendation System using KNN
Stage                | Tool / Technique           | Purpose
Data Cleaning        | Pandas                     | Read, merge, and preprocess CSVs
Feature Extraction   | Python string manipulation | Extract relevant metadata
Text Representation  | CountVectorizer            | Convert tags into vectors
Similarity Measure   | Cosine Similarity          | Compare movies by content
Recommendation Logic | KNN                        | Retrieve top similar movies
Poster Fetching      | IMDb API                   | Enhance output with visuals
UI Framework         | Streamlit                  | Build interactive web interface
6. Results
When a user selects a movie, the system retrieves five similar titles by computing similarity
scores based on structured metadata such as genres, keywords, plot overviews, lead actors, and
director names. For instance, when tested with the input movie The Dark Knight, the
recommended movies included Batman Begins, The Dark Knight Rises, Man of Steel, Iron Man,
and Watchmen. These suggestions align closely with the superhero genre and share common
stylistic and thematic elements like dark narratives and action sequences.
The system processes user queries efficiently, returning recommendations within a fraction of a
second due to the precomputed and stored cosine similarity matrix. This ensures high
responsiveness without relying on complex or computationally heavy algorithms. Since the
recommendations are based on straightforward metadata fields and word frequency vectors, the
underlying logic remains easy to interpret and transparent, making it reliable for analysis and
debugging.
However, the system is not without limitations. Since it relies on a basic word count approach, it
may not fully capture deeper semantics or the tone of the content. Movies with richer metadata
may also influence results more heavily, introducing bias. Additionally, the system lacks
personalization and delivers the same set of recommendations to all users for a given movie.
Despite these limitations, the model achieves its goal effectively by producing coherent and
contextually accurate recommendations using lightweight, interpretable techniques. This validates
the design choice to focus on metadata driven feature extraction and basic vector operations for
similarity computation.
7. Conclusion
This project gave us hands-on experience in building a complete machine learning system using
simple and interpretable techniques. Through this, we understood the importance of data
preprocessing: how cleaning, transforming, and structuring movie metadata can directly impact the
quality of recommendations. Creating the unified tags field taught us how combining multiple
text sources can help define the unique identity of each item.
Beyond technical skills, this project emphasized design thinking: how to keep things modular,
explainable, and visually engaging through tools like Streamlit and API integration. It taught us
that a good user experience is as important as a working algorithm.
Key Takeaways:
● Clean data is the foundation of any ML project.
● Cosine similarity is a powerful way to find related content without complex logic.
● User-friendly interfaces make ML tools more accessible and fun to use.
● You don’t need deep learning to solve real-world problems: start simple, think smart.
Overall, this project was not just about coding a solution; it was about learning how to turn data
into decisions and building a bridge between raw information and real user needs.
8. References
1. Scikit-learn Documentation – CountVectorizer:
[Link]
[Link]
2. Scikit-learn Documentation – Cosine Similarity:
[Link]
[Link]
3. Movie Dataset (TMDb 5000 Movie Dataset):
[Link]
4. Bag of Words Model – Towards Data Science:
[Link]
5. Cosine Similarity Explained – Medium:
[Link]
6. K-Nearest Neighbors Algorithm – GeeksforGeeks:
[Link]
7. Introduction to Recommendation Systems – Analytics Vidhya:
[Link]