Sentiment Analysis of Movie Reviews

This document is a mini project report on sentiment analysis of movie reviews submitted by 4 students to fulfill requirements for their AI and ML degree. It analyzes movie reviews to determine the overall sentiment (positive or negative) expressed in the text using natural language processing and machine learning techniques. The project involves collecting a dataset of movie reviews, preprocessing the text, applying algorithms like Naive Bayes and SVM to train models that can accurately classify reviews by sentiment. The results and performance of different models are evaluated to understand how reviewers felt about the movies.


VISVESVARAYA TECHNOLOGICAL UNIVERSITY

BELAGAVI-590018

A Mini Project Report On


“SENTIMENT ANALYSIS ON MOVIE REVIEWS”
Submitted for the requirement of The VII Semester AI and ML Application Development Laboratory
(18AIL76)
Bachelor Of Engineering in Artificial Intelligence & Machine Learning

Submitted by
V SHASHANK (1AM20AI048)
DANESH S (1AM20AI011)
ANAND SHIYANI (1AM20AI005)
GOUTAM SHARMA (1AM20AI014)

Under the Support and Guidance of

Mrs. C. Subhashri
Assistant Professor,
Dept. of AIML

AMC ENGINEERING COLLEGE


DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
18th K.M. Bannerghatta Main Road, Bengaluru-560083

2023-2024
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
AMC ENGINEERING COLLEGE
18th K.M. Bannerghatta Main Road Bengaluru – 560083

CERTIFICATE

Certified that the Mini Project work entitled “SENTIMENT ANALYSIS ON MOVIE REVIEWS”
was carried out by V Shashank (1AM20AI048), Danesh S (1AM20AI011), Anand Shiyani
(1AM20AI005) and Goutam Sharma (1AM20AI014), bonafide students of AMC Engineering
College, in partial fulfilment of the requirements of the VII semester (AI and ML Application
Development Laboratory (18AIL76)) of the Bachelor of Engineering in Artificial Intelligence and
Machine Learning of Visvesvaraya Technological University, Belagavi, during the year 2023 – 2024.
It is certified that all corrections/suggestions indicated for Internal Assessment have been
incorporated in the Report deposited in the departmental library. The Mini Project report has been
approved as it satisfies the academic requirements.

Signature of the Guide: Mrs. C. Subhashri, Assistant Professor, Dept. of AIML

Signature of the HOD: Dr. Rajesh Eswarawaka, Professor & HOD, Dept. of AIML

Signature of the Principal: Dr. R Nagaraja, Principal

Name of the examiners Signature with date

1.

2.
DECLARATION

We, V Shashank (1AM20AI048), Danesh S (1AM20AI011), Anand Shiyani
(1AM20AI005) and Goutam Sharma (1AM20AI014), students of VII semester of BE, Artificial
Intelligence & Machine Learning, AMC Engineering College, hereby declare that the Mini
project work entitled “SENTIMENT ANALYSIS ON MOVIE REVIEWS” has been carried
out by us at AMC Engineering College, Bengaluru and submitted in partial fulfilment of the
course requirements of Bachelor of Engineering in ARTIFICIAL INTELLIGENCE and
MACHINE LEARNING of Visvesvaraya Technological University, Belagavi, during the
academic year 2023-2024. We also declare that, to the best of our knowledge and belief, the
work reported here does not form part of any other dissertation on the basis of which a degree
or an award was conferred on an earlier occasion by any other student.

Date:
Place: Bengaluru

V SHASHANK (1AM20AI048)

DANESH S (1AM20AI011)

ANAND SHIYANI (1AM20AI005)

GOUTAM SHARMA (1AM20AI014)


ACKNOWLEDGEMENT

It gives us immense pleasure to present before you our project titled “SENTIMENT
ANALYSIS ON MOVIE REVIEWS”. The joy and satisfaction that accompany the
successful completion of any task would be incomplete without the mention of those
who made it possible. We are glad to express our gratitude towards our prestigious
institution AMC ENGINEERING COLLEGE for providing us with utmost
knowledge, encouragement and the maximum facilities in undertaking this mini
project.
First of all, we would like to thank the Management of AMC Engineering College
for providing such a healthy environment for the successful completion of the Mini
project work.
In this regard, we express our sincere gratitude to Dr. R Nagaraja, Principal, AMCEC,
for providing us all the facilities in this college.
We express our deepest gratitude and special thanks to Dr. Rajesh Eswarawaka,
Professor & H.O.D, Dept. of Artificial Intelligence and Machine Learning, for all
his guidance and encouragement.
We sincerely acknowledge the guidance and constant encouragement of our
mini-project guide, Mrs. C. Subhashri, Assistant Prof., Dept. of Artificial Intelligence &
Machine Learning.

V SHASHANK (1AM20AI048)
DANESH S (1AM20AI011)
ANAND SHIYANI (1AM20AI005)
GOUTAM SHARMA (1AM20AI014)
ABSTRACT

Movie reviews are an essential tool for determining a film's impact. A number
or star rating tells us about the quantitative success or failure of a movie,
while a collection of movie reviews offers a deeper qualitative perspective on
many elements of the film. Online text material is now so abundant and
fragmented that it overwhelms consumers, and it is simple for someone to
submit remarks about a film without actually seeing it. As a result, it is
necessary to analyze the tone of movie reviews. This can be achieved through
sentiment analysis of these reviews. Sentiment analysis, also referred to as
opinion mining, is a method used in natural language processing (NLP) for
assessing the emotional undertone of a document. It analyzes the sentiment
and contextual information of the text using ML, AI, computational
linguistics, and data mining to predict whether the text conveys positive,
negative, or neutral emotions.
Sentiment analysis can identify not only the sentiment but also the subject,
the opinion holder, and the polarity (the balance of positivity and negativity)
of a text. This study compares the sentiment analysis capabilities of machine
learning models in order to assist viewers in understanding the proper
appraisal of films as well as assist producers in predicting box office success.
TABLE OF CONTENTS

Chapter No    Title

1    INTRODUCTION

2    SYSTEM REQUIREMENTS
     SOFTWARE CONFIGURATION
     HARDWARE CONFIGURATION

3    DATASET AND LIBRARIES
     LIBRARIES AND MODULES
     DATA PREPROCESSING
     WORD CLOUD FOR SENTIMENT ANALYSIS
     LEMMATIZATION WITH WORDNET
     DATA SPLITTING
     FEATURE ENGINEERING

4    IMPLEMENTATION
     METHODOLOGY
     ALGORITHMS
     MODEL TRAINING

5    RESULT WITH SCREENSHOTS

     CONCLUSION
     FUTURE SCOPE
     REFERENCES
Sentiment Analysis on Movie Reviews

Chapter 1
INTRODUCTION

Sentiment Analysis on Movie Reviews


Assessing a movie’s performance through reviews is crucial. While assigning a
numerical rating quantifies its success, a collection of reviews offers qualitative
insights. These reviews delve into a movie’s strengths, weaknesses, and whether it
meets expectations. Sentiment Analysis, a facet of machine learning, extracts subjective
information from these texts, revealing the reviewer’s attitude towards various aspects.
By deciphering sentiments like happiness, sadness, or anger, it unveils the reviewer’s
state of mind. Our project aims to employ Sentiment Analysis on movie reviews to
comprehend the overall sentiment, whether positive or negative, and predict how the
reviewers felt about the film.
The project focuses on performing sentiment analysis on movie reviews using
natural language processing (NLP) techniques. The primary objective is to develop
a model that can categorize reviews as positive or negative based on the sentiments
expressed within the text. The project involves collecting a dataset of movie reviews,
preprocessing the text, applying machine learning algorithms and training a model
to accurately predict sentiment.

Sentiment Analysis: Concept, Analysis and Applications

• Sentiment analysis, also known as opinion mining, is a branch of natural
language processing (NLP) that involves identifying, extracting, and
categorizing emotions, attitudes, or opinions expressed within textual data.
Its primary goal is to understand the subjective information present in the text
and categorize it as positive, negative, or neutral.

• Sentiment analysis allows organizations to gain insights into the vast volumes
of unstructured data from various online sources. Fundamentally, sentiment
analysis algorithms take one of three approaches to data processing:
rule-based, automatic, or hybrid.

• As the name suggests, the rule-based approach uses predefined, manually
crafted, lexicon-based rules to classify sentiments. The automatic approach
employs machine learning methods, while the hybrid approach combines
the two.

Department of AIML, AMCEC 2023-24 2


• The assessment of sentiment in movie reviews is a nuanced process,
involving deciphering the subtleties of the text to grasp the emotions and
subjective viewpoints conveyed by reviewers. Its goal is to reveal the
collective sentiment of the audience towards a movie, offering crucial
insights into its reception, public perception, and the elements that resonate
with or disappoint viewers.

• The process of sentiment analysis involves analysing written or spoken
language to determine the underlying sentiment conveyed. It aims to discern
the emotional tone, subjective opinions, or attitudes expressed in a piece of
text, whether it is a social media post, product review, survey response, or
any other form of written communication.

• Sentiment analysis has many applications for businesses in a variety of
sectors. The following are some of the most common real-world uses of
sentiment analysis: market research, movie reviews, improving services,
customer support, and finance.

• The benefits of sentiment analysis include inferring meaning from
unstructured data, taking quick action against poor customer experience,
boosting business performance and strategy, news trend analysis, real-time
sentiment insights, etc.
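
To make the rule-based approach mentioned in this chapter concrete, here is a minimal sketch of a lexicon-based classifier. The word lists and the function name rule_based_sentiment are illustrative assumptions and are not part of the project code, which uses the automatic (machine learning) approach.

```python
# Tiny hand-crafted lexicons (assumed for illustration only).
POSITIVE = {"good", "great", "excellent", "loved"}
NEGATIVE = {"bad", "boring", "terrible", "hated"}


def rule_based_sentiment(text):
    """Classify text by counting lexicon hits: a simple rule-based scheme."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(rule_based_sentiment("A great film with excellent acting"))
print(rule_based_sentiment("Boring and terrible"))
```

A scheme like this needs no training data, but it misses negation and context ("not good"), which is one reason a learned model may be preferred.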


Chapter-2
SYSTEM REQUIREMENTS

To perform sentiment analysis on movie reviews, both software and hardware
requirements need to be considered. Here is a comprehensive list of system
requirements.

Software Requirements:

• Operating System: Windows, macOS
• Python: Install the latest version of Python (preferably Python 3.x) from the
official Python website.
• Web Browser: Google Chrome
• Google Colab
• Python Libraries: Install the necessary libraries and frameworks for
sentiment analysis.

Hardware Requirements:

• CPU: A modern CPU with multiple cores is recommended for faster
processing.
• Memory (RAM): At least 8GB of RAM is recommended.
• Storage: Ensure that you have enough disk space to store your dataset,
pre-trained models, and any intermediate or output files generated during the
process.
• Display: A monitor with a suitable resolution is required for visualizing the
results and analysing the output.
• Network Connection: A high-speed router or a stable network connection is
necessary for training the model.


Chapter 3
DATASET AND LIBRARIES
DATASET

The dataset consists of 50,000 movie reviews, one half of which is positive reviews
and the other half negative. It includes three columns – index, review and sentiment.
The review column consists of reviews as typed by multiple users. The sentiment
column says whether the corresponding review is positive (1) or negative (0).

Source: The dataset was obtained from Kaggle, which in turn was obtained from
IMDB. IMDB is an online database of information related to visual entertainment
such as movies and TV series. It is a popular medium for people to review and rate
movies, and IMDB ratings are often treated as a benchmark for a movie's success.

Data Overview:

• Number of Samples: 50,000 reviews
• Features: Apart from the index, the dataset consists of two main columns:
   o Review Text: The textual content of the reviews.
   o Sentiment Label: Binary labels indicating positive (1) or negative (0)
     sentiment.
• Distribution of Sentiments:
   o Positive Reviews (Label 1): 25,000 samples
   o Negative Reviews (Label 0): 25,000 samples

The dataset's balance between positive and negative sentiments provides an equal
representation for training and testing sentiment analysis models.
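
As a minimal sketch of how the class balance could be checked with Pandas, the example below builds a four-row stand-in for the 50,000-row dataset. The column names follow the description above; the toy rows and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset (assumed rows; same column names).
df = pd.DataFrame({
    "review": ["Loved it", "Terrible plot", "A fine film", "Boring and slow"],
    "sentiment": [1, 0, 1, 0],
})

# Count reviews per label: 1 = positive, 0 = negative.
counts = df["sentiment"].value_counts().sort_index()

# The dataset is balanced when every label has the same count.
is_balanced = counts.nunique() == 1
print(counts.to_dict(), "balanced:", is_balanced)
```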


Python Libraries used

Core Libraries:
1. Pandas (import pandas as pd): Used for data manipulation, handling
datasets, and performing data analysis.

2. NumPy (import numpy as np): Essential for numerical computations, array
operations, and handling mathematical operations efficiently.

3. re (import re): Provides support for working with regular expressions, often
used for text processing tasks.

4. NLTK (import nltk): The Natural Language Toolkit offers various tools and
resources for natural language processing tasks, including tokenization,
stemming, stopwords, etc.

5. Scikit-learn (from sklearn.*): Scikit-learn is a powerful machine learning
library that includes tools for data preprocessing, feature extraction, model
building, and evaluation. It encompasses various classification, regression,
clustering, and model evaluation algorithms.

6. Seaborn (import seaborn as sns) and Matplotlib (import matplotlib.pyplot
as plt): Visualization libraries used to create plots, charts, and visual
representations of the data and model evaluation metrics.

7. Wordcloud (from wordcloud import WordCloud): A library specifically
used for generating word clouds from textual data, visually representing the
frequency of words.

8. PIL (from PIL import Image): The Python Imaging Library, used for
handling images and supporting image operations.

Specific Modules:

• nltk.tokenize.word_tokenize: Tokenizes text into words.

• nltk.corpus.stopwords: Provides a list of common stopwords for text data.

• [Link]: Implements stemming to reduce words to their root form.

• sklearn.feature_extraction.text.CountVectorizer and TfidfVectorizer: Used
for converting text data into numerical features (Bag-of-Words and TF-IDF
representations).

• sklearn.model_selection.train_test_split: Splits data into training and testing
sets.

• sklearn.linear_model.LogisticRegressionCV, [Link],
[Link]: Machine learning models used
for sentiment analysis.

• sklearn.metrics.classification_report, sklearn.metrics.roc_auc_score,
sklearn.metrics.roc_curve: Functions for evaluating classification model
performance.

• pickle: Used for serializing and deserializing Python objects, here
specifically for saving trained models.


Data Preprocessing

Text Cleaning

The preprocessing() function includes several steps to clean the text data:

• HTML Tag Removal: Using a regular expression (re.sub('<[^>]*>', '',
text)), it removes any HTML tags present in the text.
• Special Character Removal: re.sub('[\W]+', ' ', text.lower()) converts the
text to lowercase and replaces runs of non-alphanumeric characters with a
space.
• Emoticon Handling: Emoticons are extracted and included as part of the
processed text.

These steps ensure that the text is standardized, devoid of unwanted characters, and
prepared for further analysis.

Tokenization & Normalization

• Tokenization: word_tokenize() from NLTK tokenizes the cleaned text into
individual words or tokens (words = word_tokenize(text)).
• Lowercasing: Text is converted to lowercase (text.lower()) for uniformity
in analysis.

Stop words Removal:

• Stop words: scikit-learn's ENGLISH_STOP_WORDS and NLTK's stopwords
corpus are utilized to filter out common English stop words. The line
text = ' '.join([word for word in words if word not in ENGLISH_STOP_WORDS])
removes these stop words from the tokenized words (words) before rejoining
them into a cleaned sentence.


These preprocessing steps ensure that the text data is cleaned of noise, tokenized into
meaningful units, and prepared by removing common words that don't carry
significant meaning for sentiment analysis.

The function is then applied to each review in the dataset using the Pandas
DataFrame .apply() method.

This process ensures that the text data is transformed into a format suitable for
further analysis and modeling.
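
The cleaning steps described in this section can be sketched as a single function. This is a simplified version under stated assumptions, not the project's exact code: the emoticon pattern is assumed, str.split() stands in for NLTK's word_tokenize() so that no corpus download is needed, and a small hand-written stop-word set stands in for ENGLISH_STOP_WORDS.

```python
import re

# Small illustrative stop-word set (assumption); the report filters with
# the much larger ENGLISH_STOP_WORDS list.
STOP_WORDS = {"a", "an", "the", "this", "is", "was", "it", "and", "of", "to"}

# Assumed emoticon pattern: eyes, optional nose, mouth, e.g. :) ;-( =D
EMOTICON_RE = re.compile(r"[:;=]-?[()DP]")


def preprocessing(text):
    # 1. Remove HTML tags.
    text = re.sub(r"<[^>]*>", "", text)
    # 2. Extract emoticons before punctuation is stripped.
    emoticons = EMOTICON_RE.findall(text)
    # 3. Lowercase and replace runs of non-word characters with a space.
    text = re.sub(r"[\W]+", " ", text.lower())
    # 4. Tokenize (whitespace split as a stand-in) and drop stop words.
    words = [w for w in text.split() if w not in STOP_WORDS]
    # 5. Re-append the emoticons, without the nose character.
    words += [e.replace("-", "") for e in emoticons]
    return " ".join(words)


print(preprocessing("<br />This movie was GREAT :-)"))  # movie great :)
```

On a DataFrame, such a function would be applied column-wise, for example with df["review"].apply(preprocessing), where df is a hypothetical DataFrame holding the reviews.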

Word Cloud for Sentiment Analysis:

• Insightful Visualization: The creation of word clouds aims to visually
represent the most frequent words used in reviews categorized by sentiment.

• Identifying Common Words: Word clouds help identify and display the
words that appear most frequently in either positive or negative sentiment
reviews.

• The creation of word clouds facilitates the visual exploration of the most
frequent words within specific sentiment categories. It offers an intuitive
representation of commonly used terms, aiding in the understanding of
prevalent themes or aspects associated with positive or negative sentiment in
the reviews.

• The addition of word clouds enhances the understanding of prevalent words
within each sentiment class, offering a visual representation of frequently
used terms in positive and negative reviews. This aids in discerning key
sentiment-carrying words and potentially significant themes within each
sentiment category.


Positive Sentiment Word Cloud: A word cloud is created using a collection of
words from reviews with positive sentiment.

• Text Collection (Sentiment): The code aggregates text from reviews of a
specific sentiment into the text variable using the get_all_text() function.

• Word Cloud Generation (Sentiment): Using the WordCloud class
from the wordcloud library, a word cloud is generated with specific
parameters defining its appearance (size, stopwords, background_color).

• Visualization: Matplotlib functions are used to display the word cloud,
showcasing frequently occurring words from reviews of the chosen sentiment.

This code encapsulates the process of creating a word cloud for positive sentiment
reviews, allowing visual insights into the most common words found within that
sentiment category.


Positive Sentiment Word Cloud Visualization

To gain insight into the kind of words used for both the sentiments, word clouds are
created for each class of sentiment. The word cloud highlights the most common
words, showing what words usually appear in a review of that kind of sentiment.

Negative Sentiment Word Cloud Visualization


The two plots show interesting results. Word clouds show the most frequent words
with the largest font and the font size decreases with frequency. Both the plots show
similar words - film, movie, like, time, character etc. This goes to show that a word
that appears so frequently in the reviews does not help in the classification of the
review. This will be dealt with later on.
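
The observation above can also be checked numerically, since a word cloud is a picture of word counts. The sketch below uses four made-up reviews (an assumption for illustration) and shows that words shared by both classes, such as "film", cannot separate the classes, while class-specific words can.

```python
from collections import Counter

# Made-up example reviews (assumed data, already lowercased).
positive = ["great film with a great cast", "a moving film"]
negative = ["dull film with a weak plot", "a boring movie and a dull film"]


def word_counts(reviews):
    """Count how often each word occurs across a list of reviews."""
    return Counter(word for review in reviews for word in review.split())


pos_counts = word_counts(positive)
neg_counts = word_counts(negative)

# Words present in both classes carry little information about sentiment.
shared = set(pos_counts) & set(neg_counts)
# Words seen only in positive reviews are candidate sentiment indicators.
only_positive = set(pos_counts) - set(neg_counts)

print(sorted(shared))
print(sorted(only_positive))
```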

In reviews, various words might share a common lemma; for instance, words like
‘do, doing, done’ convey the same concept and stem from a single root form or
lemma – ‘do’. Treating these words individually could lead to higher dimensionality
in the dataset. Hence, lemmatization is employed to transform words into their base
or dictionary form (lemmas), ensuring coherence in meaning, which may not be
preserved when considering stems separately.

Lemmatization focuses on transforming words to their base or dictionary form
(lemmas), maintaining meaningfulness in language, which can be more appropriate
than stemming when retaining semantic context is crucial for analysis.

Lemmatization with WordNet

Lemmatization: Lemmatization is a text normalization technique that reduces
words to their base or root form (lemma).

The code initializes the WordNetLemmatizer from NLTK (wnl =
WordNetLemmatizer()).

Within the lemm_text() function:

• The text is tokenized into words using word_tokenize().

• Each word is lemmatized (wnl.lemmatize(word)) to transform it into its base
form.

• The lemmatized words are then joined back together to reconstruct the text
(return " ".join(...)).

Application of Lemmatization:

• The lemm_text() function is applied to the 'review' column in the DataFrame
using .apply(lemm_text).
• This process lemmatizes the text within the 'review' column, transforming
words to their base forms.

This code demonstrates the application of lemmatization to the text data in the
‘review’ column of the DataFrame. Lemmatization helps in standardizing words to
their base forms, aiding in reducing the vocabulary size and potentially improving
the performance of sentiment analysis.
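
The shape of lemm_text() can be sketched as follows. NLTK's WordNetLemmatizer needs the WordNet corpus to be downloaded, so this sketch uses a toy lookup table as a stand-in lemmatizer; the table, the toy_lemmatize helper, and the whitespace tokenization are assumptions for illustration only.

```python
# Toy lemma table standing in for WordNet (illustrative assumption).
TOY_LEMMAS = {"doing": "do", "done": "do", "did": "do", "movies": "movie"}


def toy_lemmatize(word):
    """Return the lemma if the toy table knows the word, else the word itself."""
    return TOY_LEMMAS.get(word, word)


def lemm_text(text, lemmatize=toy_lemmatize):
    # Tokenize (whitespace split as a stand-in for word_tokenize),
    # lemmatize each token, then join the tokens back into one string.
    return " ".join(lemmatize(word) for word in text.split())


print(lemm_text("he done doing movies"))  # he do do movie
```

With NLTK installed and the WordNet corpus available, the lemmatize method of a WordNetLemmatizer instance could be passed as the lemmatize argument instead of the toy function.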

TF-IDF (Term Frequency-Inverse Document Frequency):

TF-IDF is a numerical statistic used to evaluate the importance of a word in a
document relative to a collection of documents (corpus).

Term Frequency (TF): Represents the frequency of a word within a specific
document. It measures how often a word occurs in a document.

Inverse Document Frequency (IDF): Measures the rarity of a word across
documents in the corpus. It diminishes the weight of common words and amplifies
the weight of rare words.

Calculation: For each word in a review, TF-IDF multiplies the two quantities, so
the weight is high for words that occur frequently within the review but rarely
across the rest of the corpus.

Role of TF-IDF in Sentiment Analysis:

Importance Weighting: TF-IDF assigns weights to words, highlighting words that
are important for a particular document (review) while diminishing the influence of
commonly occurring words.

Feature Creation: TF-IDF creates numerical features representing the importance
of words in each review, which can be utilized in machine learning models for
sentiment analysis.

This code showcases the use of TF-IDF to transform text data into weighted
numerical features:

• TfidfVectorizer(use_idf=True, norm='l2', smooth_idf=True): Initializes a
TF-IDF vectorizer that calculates word importance based on rarity across
reviews for sentiment analysis. The norm='l2' setting scales each review's
feature vector to unit Euclidean length.

• X = tfidf.fit_transform([Link]): Transforms text data from the 'review'
column into TF-IDF-weighted numerical features, enhancing sentiment
analysis by emphasizing rare and meaningful words.
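
A minimal sketch of the vectorizer on a three-review toy corpus (the reviews are assumed for illustration) shows both effects described above: the word "film", which appears in every review, gets the lowest IDF, and each row is normalized to unit length. With smooth_idf=True, scikit-learn computes the IDF as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the word.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the 'review' column (assumed data).
reviews = ["great film great cast", "dull film", "boring film weak plot"]

tfidf = TfidfVectorizer(use_idf=True, norm="l2", smooth_idf=True)
X = tfidf.fit_transform(reviews)  # sparse matrix: one row per review

vocab = tfidf.vocabulary_  # maps each word to its column index
idf = tfidf.idf_           # one IDF weight per column

# "film" occurs in all 3 reviews, "great" in only 1, so "film" weighs less.
print(idf[vocab["film"]], idf[vocab["great"]])

# With norm="l2", every row has Euclidean length 1.
row_norms = np.sqrt(X.multiply(X).sum(axis=1))
print(row_norms.ravel())
```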

Data Splitting: Training and Testing for Sentiment Analysis

1. Import Libraries: The code imports the train_test_split function
from sklearn.model_selection to perform the data split.
2. Data Preparation:
   • X represents the vectorized features derived from movie reviews.
   • y contains the corresponding sentiment labels (positive, negative) for
     each review in X.
3. Splitting the Data:
   • The train_test_split function returns four subsets: X_train,
     X_test, y_train, and y_test.
   • test_size=0.30 specifies that 30% of the data will be allocated for
     testing, and 70% will be used for training.
   • random_state=1 sets a random seed, ensuring that a shuffled split is
     the same if the code is run multiple times (providing reproducibility).
   • shuffle=False means the data won't be shuffled before splitting, so the
     last 30% of the rows form the test set. In that case the split is already
     deterministic and random_state has no effect. Shuffling can help prevent
     any inherent ordering bias in the dataset, for example when reviews are
     grouped by sentiment.
4. Result:
   • X_train and y_train will contain 70% of the original data, used for
     training the sentiment analysis model.
   • X_test and y_test will contain 30% of the original data, used to
     evaluate the model's performance.
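
The behaviour of the split can be sketched on a ten-row toy array (the data is assumed for illustration). The first call mirrors the unshuffled 70/30 split described above; the second shows a shuffled alternative in which random_state does matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and labels (assumed data).
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Unshuffled 70/30 split: the first 7 rows train, the last 3 rows test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=False
)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)

# Shuffled alternative: the same seed reproduces the same split.
first = train_test_split(X, y, test_size=0.30, random_state=1)
second = train_test_split(X, y, test_size=0.30, random_state=1)
same_split = all((a == b).all() for a, b in zip(first, second))
print(same_split)  # True
```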

Feature Engineering

1. Importing Libraries:
   • The code imports the MaxAbsScaler class from
     sklearn.preprocessing, which is used for scaling features.
2. Data Scaling:
   • MaxAbsScaler scales each feature by its maximum absolute value,
     bringing all training features within the range [-1, 1]. It does not
     shift or center the data, so signs are unchanged and sparse matrices
     stay sparse.
3. Fitting and Transforming:
   • scaling = MaxAbsScaler().fit(X_train) initializes the scaler and fits it
     to the training data (X_train). This step calculates the maximum
     absolute values of each feature in the training set.
   • X_train = scaling.transform(X_train) applies the scaling
     transformation to the training data. It scales the features in X_train
     based on the maximum absolute values calculated earlier.
   • X_test = scaling.transform(X_test) applies the same transformation to
     the testing data (X_test). It ensures that the scaling is consistent across
     both training and testing datasets, maintaining the same scaling factors
     applied to X_train.
4. Purpose in Sentiment Analysis of Movie Reviews:
   • In this project on movie reviews, scaling the features can be
     beneficial, especially when the original features (word frequencies,
     TF-IDF values, etc.) have varying scales or ranges.
   • Scaling ensures that the features' magnitudes don't affect the learning
     process disproportionately, especially in models sensitive to feature
     scales (e.g., SVMs, neural networks). It aids in improving
     convergence and model stability.
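
The fit-on-train, transform-both pattern can be sketched on a tiny sparse matrix (the values are assumed for illustration). Note that a test value larger than the training maximum is scaled beyond 1, because the scaling factors come from the training set only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Tiny sparse stand-ins for the TF-IDF matrices (assumed values).
X_train = csr_matrix(np.array([[0.0, 2.0], [4.0, 0.0], [1.0, 1.0]]))
X_test = csr_matrix(np.array([[2.0, 4.0]]))

# Learn per-column maximum absolute values from the training data only.
scaling = MaxAbsScaler().fit(X_train)

# Apply the same factors (4.0 and 2.0 here) to both sets.
X_train_scaled = scaling.transform(X_train)
X_test_scaled = scaling.transform(X_test)

print(X_train_scaled.toarray())
print(X_test_scaled.toarray())  # [[0.5 2. ]]
```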


Chapter 4

IMPLEMENTATION

The implementation of the project involves the stages of data training and
validation, plotting of graph and predicting the accuracy and loss values.

Model Selection and Training Methodology


Model Selection: In the context of our project focused on analyzing sentiments in
movie reviews, our model selection process revolves around crucial factors.
Primarily, we prioritize accuracy to comprehensively capture the diverse spectrum
of sentiments expressed. Additionally, efficient handling of textual data is
paramount given the nature of movie reviews abundant in phrases and expressions.

To ensure the reliability of our models, robustness against overfitting is vital,
especially considering the wide-ranging types of reviews encountered.
Interpretability becomes a key aspect, aiding in understanding the pivotal features
influencing sentiment assessments. Moreover, scalability holds significance in
effectively managing extensive datasets typically encountered in movie review
analyses.

Methodology: Social media sentiment analysis is the process of collecting and
analyzing information on how people talk about a brand on social media. Rather
than a simple count of mentions or comments, sentiment analysis considers
emotions and opinions. The same idea applies to movie reviews.

Training methodology for sentiment analysis in movie reviews embraces a
systematic approach to model development. Initially, we preprocess the textual data,
performing tasks like tokenization, removing stop words, and utilizing techniques
like lemmatization to streamline text features. The pivotal steps in our methodology
involve extensive experimentation and validation.

Next, we partition our dataset into training, validation, and testing sets, ensuring a
robust evaluation of model performance. During the training phase, we employ
diverse algorithms, experimenting with various models known for their efficacy in
sentiment analysis. This includes algorithms like Support Vector Machines, Logistic
Regression, Ensemble methods such as Random Forests or Bagging, and Naive
Bayes classifiers.


Algorithms
Logistic Regression: Logistic regression, a statistical technique, excels in
predicting binary outcomes, making it fitting for sentiment analysis tasks. By
analysing relationships between independent variables—like review features—and
a binary sentiment outcome, it facilitates predictions, enabling straightforward
categorization of reviews into positive or negative sentiments.

Logistic regression can also play a role in data preparation activities by allowing
data sets to be put into specifically predefined buckets during the extract, transform,
load (ETL) process in order to stage the information for analysis. Logistic regression
is important because it transforms complex calculations around probability into a
straightforward arithmetic problem. This dramatically simplifies analyzing the
impact of multiple variables and helps to minimize the effect of confounding factors.
As a result, statisticians can quickly model and explore the contribution of various
factors to a given outcome.
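
For reference, the model behind this description can be written as a formula. The symbols are our notation and do not appear in the project code: x is the feature vector of a review (for example its TF-IDF values), w the learned weights, and b the bias.

```latex
P(y = 1 \mid x) = \sigma\left(w^{\top} x + b\right)
                = \frac{1}{1 + e^{-\left(w^{\top} x + b\right)}},
\qquad
\hat{y} =
\begin{cases}
1 \ (\text{positive}) & \text{if } P(y = 1 \mid x) \ge 0.5 \\
0 \ (\text{negative}) & \text{otherwise}
\end{cases}
```

Each weight shows how strongly a word pushes the prediction towards the positive or negative class, which is what makes the model comparatively easy to interpret.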

Support vector machines (SVM): Support vector machines are a set of supervised
learning methods used for classification, regression, and outlier detection. All of
these are common tasks in machine learning.

A simple linear SVM classifier works by drawing a straight line between two classes.
All of the data points on one side of the line are assigned to one category,
and the data points on the other side of the line to a different category.
There can be an infinite number of such lines to choose from. What distinguishes the
linear SVM algorithm from some of the other algorithms, like k-nearest
neighbours, is that it chooses the best line to classify your data points: the
line that separates the data and is as far away as possible from the closest
data points.
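
A minimal sketch of a linear SVM on four hand-picked 2-D points (assumed data) illustrates the separating line. It uses scikit-learn's SVC with a linear kernel, which is an assumption; the report does not state which SVM class was used.

```python
from sklearn.svm import SVC

# Two clearly separated groups of 2-D points (assumed toy data).
X = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0]]
y = [0, 0, 1, 1]

# A linear kernel fits the maximum-margin separating line.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Points on either side of the line receive different labels.
pred = clf.predict([[0.5, 0.5], [3.5, 3.5]])
print(pred)  # [0 1]

# The line is w[0]*x1 + w[1]*x2 + b = 0.
w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)
```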

Bagging Classifier: Bootstrap Aggregation (bagging) is an ensemble method that
attempts to resolve overfitting for classification or regression problems. Bagging
aims to improve the accuracy and performance of machine learning algorithms. It
does this by taking random subsets of an original dataset, with replacement, and
fitting either a classifier (for classification) or regressor (for regression) to
each subset. The predictions for each subset are then aggregated through majority
vote for classification or averaging for regression, increasing prediction accuracy.

The main advantage of bagging is that it can reduce the variance of the predictions
made by a supervised learning algorithm without significantly compromising its
accuracy.
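
A minimal sketch of bagging with scikit-learn follows. The synthetic dataset, the decision-tree base learner, and the number of estimators are assumptions for illustration; the report does not list the settings it used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumed stand-in for review features).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 25 decision trees, each fitted on a bootstrap sample drawn with replacement.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
bag.fit(X, y)

# Predictions aggregate the outputs of all fitted trees.
print(len(bag.estimators_), bag.score(X, y))
```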


Random Forest Classifier: Random Forest is a popular supervised machine learning
algorithm. It can be used for both classification and regression problems.
As the name suggests, a Random Forest is a classifier that builds a number of
decision trees on various subsets of the given dataset and combines their outputs to
improve predictive accuracy. Instead of relying on one decision tree, the random
forest takes the prediction from each tree and predicts the final output based on
the majority vote of those predictions.
A greater number of trees in the forest generally makes the predictions more stable
and reduces the risk of overfitting, although the gains level off as more trees are
added.
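
The effect of adding more voters can be illustrated with a short calculation. The sketch below assumes an idealized case in which every tree is right with the same probability (0.6, an arbitrary value) and the trees err independently; real trees in a forest are correlated, so the actual gain is smaller.

```python
from math import comb


def majority_vote_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are correct."""
    need = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(need, n_voters + 1)
    )


# Odd ensemble sizes avoid tied votes.
sizes = (1, 11, 51, 201)
accuracies = [majority_vote_accuracy(n, 0.6) for n in sizes]
for n, acc in zip(sizes, accuracies):
    print(f"{n:>3} voters: {acc:.3f}")
```

Under these assumptions the accuracy of the vote rises with the number of voters but approaches 1 ever more slowly, which is why very large forests bring diminishing returns.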

Naive Bayes: Naive Bayes, which is computationally very efficient and easy to
implement, is a learning algorithm frequently used in text classification problems.
Naive Bayes treats language as a bag of words, disregarding word order and
assigning class labels to new documents based on their word features.
Its efficiency makes it particularly suited to high-dimensional text classification
problems. Two event models are in common use: Multivariate Bernoulli and
Multinomial, with the latter commonly referred to as Multinomial Naive Bayes in
Natural Language Processing (NLP).
The algorithm estimates the likelihood of each tag for a text sample using Bayes'
theorem, assuming that each feature is independent of the others given the class:
a feature's presence or absence does not influence the others.
While simple and efficient, Naive Bayes may have lower prediction accuracy than
other algorithms. It is also a classifier, so it assigns labels rather than
estimating numerical values and is not suitable for regression.
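
As a rough illustration of the Multivariate Bernoulli event model, the sketch below scores a review by multiplying (in log space) the probability of each vocabulary word being present or absent. The four training reviews are made up, and add-one (Laplace) smoothing is an assumption of this sketch.

```python
import math

# Toy training set (made up): each review is reduced to its set of words.
train = [
    ({"great", "fun"}, "pos"),
    ({"great", "acting"}, "pos"),
    ({"boring", "slow"}, "neg"),
    ({"boring", "acting"}, "neg"),
]
vocab = sorted(set().union(*(words for words, _ in train)))
labels = sorted({label for _, label in train})


def fit(train):
    """Estimate log P(class) and P(word present | class) with add-one smoothing."""
    priors, likelihoods = {}, {}
    for label in labels:
        docs = [words for words, y in train if y == label]
        priors[label] = math.log(len(docs) / len(train))
        likelihoods[label] = {
            w: (sum(w in d for d in docs) + 1) / (len(docs) + 2) for w in vocab
        }
    return priors, likelihoods


def predict(words, priors, likelihoods):
    """Pick the class with the highest log posterior, treating words as independent."""
    def score(label):
        total = priors[label]
        for w in vocab:
            p = likelihoods[label][w]
            total += math.log(p if w in words else 1 - p)
        return total

    return max(labels, key=score)


priors, likelihoods = fit(train)
print(predict({"great"}, priors, likelihoods))
print(predict({"boring", "slow"}, priors, likelihoods))
```

Note that absent words also contribute to the score, which is what distinguishes the Bernoulli model from the Multinomial one.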

Stacked Classifier: A stacked classifier, also known as a stacked ensemble, is a
powerful machine learning technique that combines multiple individual classifiers
or models to improve predictive performance. It operates by training a new model,
often referred to as a meta-learner, on the predictions or outputs of the base
classifiers.
In a stacked classifier, the base models can be diverse, for example decision trees,
support vector machines, or neural networks, each trained on the same dataset but
potentially capturing different aspects of the data.
The key idea behind stacking is to leverage the diverse perspectives of multiple
models, allowing the meta-learner to learn from their collective outputs and
potentially create a more accurate and robust final prediction.

Model Training and Implementation


Logistic Regression
A Logistic Regression model is trained for this classification. To tune the C
parameter, LogisticRegressionCV is used, which performs k-fold cross-validation
over a grid of candidate values and selects the one with the best accuracy.

• lr_model = LogisticRegressionCV : Instantiating the Logistic Regression
model with cross-validation.
• cv=5 : 5-fold cross-validation.
• scoring='accuracy' : Metric for evaluation.
• max_iter=300 : Maximum number of iterations.
• n_jobs=-1 : Utilize all available CPU cores.
• verbose=3 : Verbosity level for logging.
• random_state=0 : Setting a random seed for reproducibility.
• lr_model.fit(X_train, y_train) : Fitting the model with training data.
• pred = lr_model.predict(X_test) : Predicting sentiments on the test dataset.
• print(classification_report(y_test, pred)) : Generating a classification report
for performance evaluation.
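
The steps listed above can be put together as a short runnable sketch. The data here is a synthetic stand-in (an assumption for illustration, not the vectorized IMDB reviews), and verbose is left at its default to keep the output short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized reviews (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 5-fold cross-validation over the default grid of C values, scored by accuracy.
lr_model = LogisticRegressionCV(
    cv=5, scoring="accuracy", max_iter=300, n_jobs=-1, random_state=0
)
lr_model.fit(X_train, y_train)
pred = lr_model.predict(X_test)

print("Selected C:", lr_model.C_)
# scikit-learn expects the true labels first, then the predictions.
print(classification_report(y_test, pred))
```

The argument order of classification_report matters: swapping the two arguments swaps precision and recall in the printed table, although the accuracy is unchanged.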

Support Vector Machine (SVM)


A Linear Support Vector Machine (SVM) model (LinearSVC) is used for sentiment
analysis of movie reviews. The code initializes and trains the model, then evaluates
its performance metrics.

• from sklearn.svm import LinearSVC : Importing the Linear Support Vector
Machine class.
• l2_norm = 25 : Setting the L2 norm regularization value.
• l2_norm_inverse = 1 / l2_norm : Calculating the inverse of the L2 norm value for
the model parameter C.
• maximum_iterations = 4000 : Defining the maximum number of iterations
for the model.
• model_svm = LinearSVC(C=l2_norm_inverse,
max_iter=maximum_iterations) : Creating a Linear SVM model with the
specified parameters.
• model_svm.fit(X_train, y_train) : Training the model using the training data.
• y_pred_svm = model_svm.predict(X_test) : Generating predictions on the test
data using the SVM model.
• print(classification_report(y_test, y_pred_svm)) : Printing a classification
report comparing predicted labels against true test labels.

In short, the model is fitted with the specified regularization strength (C = 1/25)
and iteration limit, and its predictions on the test data are compared against the
true test labels in a classification report.
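
For reference, the same steps can be written as one runnable sketch. The data is a synthetic stand-in (an assumption for illustration). Note that LinearSVC exposes decision_function scores rather than probability estimates, which matters later when ROC curves are plotted.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the vectorized reviews (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

l2_norm = 25
l2_norm_inverse = 1 / l2_norm  # smaller C means stronger regularization
maximum_iterations = 4000

model_svm = LinearSVC(C=l2_norm_inverse, max_iter=maximum_iterations)
model_svm.fit(X_train, y_train)
y_pred_svm = model_svm.predict(X_test)

# Signed distances to the separating hyperplane; usable as ranking scores.
scores = model_svm.decision_function(X_test)

print(classification_report(y_test, y_pred_svm))
```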

Bagging Classifier
This step uses a Bagging Classifier with Linear Support Vector Machines
(LinearSVC) as base estimators. It is trained on the provided data and then
evaluated for accuracy on both the training and test sets. The processing time is
also recorded, giving insight into computational efficiency.

• Imported the necessary modules: time for time tracking, and BaggingClassifier
and LinearSVC from sklearn.ensemble and sklearn.svm respectively.
• Defined the regularization values (l2_norm, l2_norm_inverse) and the maximum
number of iterations (maximum_iterations) for the Linear Support Vector Machine.
• Created a Bagging Classifier with LinearSVC as the base estimator, utilizing
30 estimators and a specified random state.
• Trained the Bagging Classifier using the provided training data (X_train,
y_train).
• Recorded the start time, generated predictions on the test data (X_test) using
the trained model, and recorded the end time to measure the duration.
• Calculated accuracy scores on both the training and test sets.
• Printed the accuracy scores for the training and test sets and displayed the time
taken in milliseconds.
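
A minimal sketch of these steps follows. Several details are assumptions, since the report does not state them for this model: the regularization value and iteration limit are borrowed from the SVM section, the random state is set to 0, time.perf_counter is used as the timer, and the data is a synthetic stand-in.

```python
import time

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the vectorized reviews (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

l2_norm = 25  # assumed: same value as in the SVM section
l2_norm_inverse = 1 / l2_norm
maximum_iterations = 4000  # assumed: same value as in the SVM section

base = LinearSVC(C=l2_norm_inverse, max_iter=maximum_iterations)
# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
model_bc = BaggingClassifier(base, n_estimators=30, random_state=0)
model_bc.fit(X_train, y_train)

start = time.perf_counter()
y_pred_bc = model_bc.predict(X_test)
elapsed_ms = (time.perf_counter() - start) * 1000

train_acc = model_bc.score(X_train, y_train)
test_acc = model_bc.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
print(f"Prediction time: {elapsed_ms:.1f} ms")
```

Comparing the training and test accuracy side by side is a quick check for overfitting: a large gap suggests the ensemble has memorized the training data.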

Random Forest Classifier


A Random Forest Classifier model is trained for this classification with 200 decision
trees, using parallel processing (n_jobs = -1) and displaying training progress
(verbose = 1). It then generates predictions on the test data and prints a
classification report comparing these predictions against the true test labels.

• from sklearn.ensemble import RandomForestClassifier : Importing the
Random Forest Classifier class.
• rf_model = RandomForestClassifier(n_estimators=200, n_jobs=-1,
verbose=1) : Initiating a Random Forest Classifier with 200 decision trees,
utilizing parallel processing for improved efficiency, and displaying the
training progress during the process.
• rf_model.fit(X_train, y_train) : Training the Random Forest Classifier using
the provided training data.
• pred = rf_model.predict(X_test) : Generating predictions on the test data
using the trained model.
• print(classification_report(y_test, pred)) : Printing a detailed classification
report that compares the predicted labels against the true labels in the test
dataset.

Using 200 decision trees averages out the errors of individual trees, and parallel
processing keeps the training time manageable. Because a Random Forest is an
ensemble of trees, it also provides an estimate of feature importance. The resulting
classification report allows for a comprehensive evaluation of the model's
performance in sentiment analysis for movie reviews.
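
The sketch below repeats these steps on synthetic data (an assumption for illustration) and also reads the feature importances mentioned above. The random_state value is added here for repeatability and verbose is left at its default; neither is taken from the report.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in: only the first two of ten features determine the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rf_model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf_model.fit(X_train, y_train)
pred = rf_model.predict(X_test)
print(classification_report(y_test, pred))

# Impurity-based importances, one value per feature, summing to 1.
importances = rf_model.feature_importances_
top_two = np.argsort(importances)[::-1][:2]
print("Most important feature indices:", top_two)
```

With text features, the same attribute can be mapped back to vocabulary terms to see which words the forest relies on most.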

Bernoulli Naive Bayes model


This step creates a Bernoulli Naive Bayes model for classification and measures the
time taken for training and prediction.

• Records the start time using [Link]() to measure the execution time.
• Imports the Bernoulli Naive Bayes class from sklearn.naive_bayes.
• Creates a Bernoulli Naive Bayes model (nb_model).
• Trains the Naive Bayes model using the provided training data (X_train,
y_train) via the fit() function.
• Records the end time after model training to calculate the duration of
the process.
• Generates predictions (y_pred_nb) on the test data (X_test) using the trained
model.
• Prints a classification report, comparing the predicted labels (y_pred_nb)
against the true test labels (y_test), using the classification_report() function
from the relevant library.
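
These steps can be sketched as follows. The binary features are synthetic (an assumption standing in for word present/absent indicators), and time.perf_counter is an assumed choice of timer.

```python
import time

import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary "word present / absent" features (assumption for illustration).
rng = np.random.default_rng(0)
X = (rng.random((400, 10)) < 0.5).astype(int)
y = (X[:, :3].sum(axis=1) >= 2).astype(int)  # label depends on three features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

start = time.perf_counter()  # assumed timer
nb_model = BernoulliNB()
nb_model.fit(X_train, y_train)
training_seconds = time.perf_counter() - start

y_pred_nb = nb_model.predict(X_test)
print(f"Training took {training_seconds:.4f} s")
print(classification_report(y_test, y_pred_nb))
```

Training amounts to counting feature occurrences per class, which is why Naive Bayes is usually the fastest of the models compared here.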

Stacked Classifier
A stacked classifier is used to combine the predictions of multiple base classifiers,
which, in this case, are a Support Vector Machine (SVM), a Bagging Classifier, and
a Random Forest Classifier.
Stacking involves training a meta-learner on the predictions made by these base
classifiers, aiming to improve overall predictive performance. The combined model
leverages the diverse strengths of each base classifier to enhance its accuracy and
robustness.
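
One way to assemble such a stack in scikit-learn is with StackingClassifier, sketched below. The report does not name the meta-learner, so LogisticRegression is an assumption here; the ensemble sizes are reduced to keep the example quick, and the data is a synthetic stand-in.

```python
import numpy as np
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the vectorized reviews (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Base classifiers named in the text; sizes are reduced for speed.
base_models = [
    ("svm", LinearSVC(C=1 / 25, max_iter=4000)),
    (
        "bagging",
        BaggingClassifier(
            LinearSVC(C=1 / 25, max_iter=4000), n_estimators=10, random_state=0
        ),
    ),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# The meta-learner is trained on out-of-fold predictions of the base models.
stacked_model = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),  # assumed meta-learner
    cv=5,
)
stacked_model.fit(X_train, y_train)
stacked_acc = stacked_model.score(X_test, y_test)
print(f"Stacked test accuracy: {stacked_acc:.3f}")
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for data the base models have already seen, keeps it from simply trusting whichever base model overfits the most.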

Chapter 5

Result, Screenshots and Conclusion


The result stage showcases the outputs that are displayed at different stages of the
project. Some outputs appear immediately after a cell is executed, while others
that depend on a stable, fast network connection take longer to appear.

Logistic Regression model

The image displayed above shows the result of executing the cell which contains the
particular code involved with training the Logistic Regression model.

Linear Support Vector Classifier model

The image displayed above shows the result of executing the cell which contains the
particular code involved with training the Linear Support Vector Classifier model.

Random Forest Classifier model

The image displayed above shows the result of executing the cell which contains the
particular code involved with training the Random Forest Classifier model.

Bagging Classifier model

The image displayed above shows the result of executing the cell which contains the
particular code involved with training the Bagging Classifier model.

Naive Bayes model

The image displayed above shows the result of executing the cell which contains the
particular code involved with training the Naive Bayes model.

Stacked Classifier

The image displayed above shows the result of executing the cell which contains the
particular code involved with training the stacked classifier. It is used to combine
the predictions of multiple base classifiers, which, in this case, are a Support Vector
Machine (SVM), a Bagging Classifier, and a Random Forest Classifier.

Plot of ROC-AUC curve


The classification reports show that all the models perform quite well and close to
each other. The ROC curve for each model is plotted below. It plots the true positive
rate (TPR) against the false positive rate (FPR) of the model, and the area under the
curve (AUC) summarizes it in a single number. The higher the AUC, the better the
model performs.

The Logistic Regression and Support Vector models have very close areas under
the curve. The Random Forest and Naïve Bayes models underperform in comparison.
The final model chosen is the Support Vector Machine, as it slightly outperforms
the Logistic Regression model.
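
To make the relationship between TPR, FPR and AUC concrete, here is a small plain-Python sketch. The labels and scores are made up, and the sketch assumes there are no tied scores; for a model such as LinearSVC the scores would typically come from its decision function.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Made-up example: 1 = positive review, 0 = negative review.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

curve = roc_points(y_true, scores)
area = auc(curve)
print(curve)
print(f"AUC = {area:.3f}")
```

The AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, so it measures how well a model ranks reviews rather than how well a single threshold classifies them.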

Conclusion

The Sentiment Analysis on Movie Reviews project has been successfully
implemented using the IMDB movie reviews dataset for training and Python
as the programming language. The project involves multiple machine learning
models, which are used to extract the emotional features and predict the
sentiment of each of the reviews.

The project is saved as a Python notebook file (.ipynb) and can be accessed
through Google Colab.

The models are able to predict the sentiment of the reviews with varying
accuracy. Some models fare worse than others: for example, "nb_model", the
Bernoulli Naïve Bayes model, predicts the sentiments at an accuracy of 85%.
Its performance is often degraded because it does not model text well, and it
is further affected by inappropriate feature selection and the lack of
confidence scores. On the other hand, it can be observed that both the logistic
regression and SVC models perform better at analyzing the sentiment of movie
reviews, each with an accuracy of 89%. It can also be observed that only
the stacked model predicts the sentiments with an accuracy higher than both
of these models.

With the theoretical inclination of our syllabus, it becomes essential to
take the utmost advantage of any opportunity to gain practical experience
that comes along. Building this mini project, "Sentiment Analysis on Movie
Reviews", was one of those opportunities. It gave us the requisite practical
knowledge to supplement the theoretical concepts already taught, thus making
us more competent as computer engineers.

Future Scope

• Although there have been significant developments in technology that
help in better implementation of sentiment analysis, the field will still
be transformed by the rapid advancements in AI and NLP technologies.

• The future of sentiment analysis is going to continue to dig deeper, far
past the surface of the number of likes, comments and shares, and aim
to reach, and truly understand, the significance of social media
interactions and what they tell us about the consumers behind the
screens.

• A lot of research on detecting sentiment from text is present in the
literature. Still, there is huge scope for improving the existing
sentiment analysis models, which can be improved further with more
semantic and commonsense knowledge.

• Sentiment analysis will also delve deeper in the future, beyond the
concept of positive, negative, or neutral, to reach and comprehend the
significance of understanding conversations and what they reveal about
consumers.

• As a result, sentiment analysis is becoming more important for
businesses as the data underlying those interactions grows larger and
more complex.

• Improved accuracy and consistency in text mining techniques can help
overcome some current problems faced in sentiment analysis. Looking
ahead, sentiment analysis could help create a true social democracy,
where we can harness the wisdom of the crowd rather than that of a
select few "experts".

References

• Sentiment Analysis: Mining Opinions, Sentiments, and Emotions, Bing Liu

• IMDB movie reviews dataset, Kaggle

• Google Colab

• [Link]
analysis-trends/
