Text Preprocessing and Analysis in Python

The document outlines a series of Python scripts for text preprocessing, extractive summarization, and data analysis using a YouTube dataset. It includes steps for removing special characters, stopwords, tokenizing text, calculating word frequency, and visualizing data with plots and word clouds. Additionally, it demonstrates how to clean a dataset and compute total views, likes, dislikes, and comments from YouTube video data.

Uploaded by

jagtapvinay160


Q2) Consider any text paragraph. Preprocess the text to remove any special characters and digits. Generate the summary using the extractive summarization process.

Ans:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from heapq import nlargest

# Download the tokenizer models and the stopword list (needed once;
# recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')

# Sample text paragraph (any text can be used)
text = ("Natural language processing (NLP) is a subfield of linguistics, computer science, information "
        "engineering, and artificial intelligence concerned with the interactions between computers and human "
        "languages, in particular how to program computers to process and analyze large amounts of natural "
        "language data. Challenges in natural language processing frequently involve speech recognition, natural "
        "language understanding, and natural language generation. The history of natural language processing "
        "generally started in the 1950s, although work can be found from earlier periods.")

# Tokenize the text into sentences first, because sent_tokenize relies on
# the punctuation that the cleaning step removes
sentences = sent_tokenize(text)

# Remove special characters and digits from each sentence
sentences = [re.sub(r'[^a-zA-Z]', ' ', sentence) for sentence in sentences]

# Tokenize each sentence into words and remove stop words
stop_words = set(stopwords.words('english'))
words = []
for sentence in sentences:
    words.extend(word_tokenize(sentence))
words = [word.lower() for word in words if word.lower() not in stop_words]

# Calculate word frequency
word_freq = nltk.FreqDist(words)

# Calculate sentence scores based on word frequency
sentence_scores = {}
for sentence in sentences:
    # Skip very long sentences (30 words or more)
    if len(sentence.split()) >= 30:
        continue
    for word in word_tokenize(sentence.lower()):
        if word in word_freq:
            sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_freq[word]

# Generate summary by selecting top 3 sentences with highest scores
summary_sentences = nlargest(3, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
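The scoring idea in the answer above can be tried without NLTK. The following is a minimal sketch using only the standard library. The regex-based sentence splitter, the tiny stopword set, and the function name summarize are illustrative assumptions, not part of the original answer, and they are cruder than sent_tokenize and the NLTK stopword list.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stopword set (assumption; NLTK's English list is much longer)
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "to", "with", "it"}


def summarize(text, n=1):
    """Return the n highest-scoring sentences, scored by summed word frequency."""
    # Naive split after ., ! or ? followed by whitespace (stand-in for sent_tokenize)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def tokens(sentence):
        # Keep alphabetic runs only, which also drops digits and special characters
        return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

    # Frequency of every non-stopword across the whole text
    word_freq = Counter(w for s in sentences for w in tokens(s))
    # A sentence's score is the sum of the frequencies of its words
    scores = {s: sum(word_freq[w] for w in tokens(s)) for s in sentences}
    return nlargest(n, scores, key=scores.get)


sample = "Cats chase mice. Dogs chase cats and mice. Birds sing."
top = summarize(sample, n=1)
print(top)
```

Because the score is a plain sum, longer sentences tend to win, which is presumably why the answer above skips sentences of 30 words or more.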

Q. 2) Consider any text paragraph. Remove the stopwords. Tokenize the paragraph to extract words and sentences. Calculate the word frequency distribution and plot the frequencies. Plot the wordcloud of the text.
Ans:
# Install the libraries (notebook syntax; in a terminal, drop the leading '!')
!pip install nltk matplotlib wordcloud

# Import the necessary modules
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Download the stopwords corpus and the tokenizer models
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')

# Define the text paragraph
text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed tristique ante et velit vestibulum, vel "
        "pharetra orci iaculis. Nullam mattis risus quis augue tincidunt rhoncus. Morbi varius, arcu vitae "
        "scelerisque laoreet, magna est imperdiet quam, sit amet ultrices lectus justo id enim. Sed dictum "
        "suscipit commodo. Sed maximus consequat risus, nec pharetra nibh interdum quis. Etiam eget quam vel "
        "augue dictum dignissim sit amet nec elit. Nunc at sapien dolor. Nulla vitae iaculis lorem. Suspendisse "
        "potenti. Sed non ante turpis. Morbi consectetur, arcu a vestibulum suscipit, mauris eros convallis nibh, "
        "nec feugiat orci enim sit amet enim. Aliquam erat volutpat. Etiam vel nisi id neque viverra dapibus non "
        "non lectus.")

# Tokenize the paragraph to extract words and sentences
words = word_tokenize(text.lower())
sentences = sent_tokenize(text)

# Remove the stopwords from the extracted words
# (isalpha() also drops punctuation tokens such as ',' and '.')
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.isalpha() and word not in stop_words]

# Calculate the word frequency distribution and plot the 30 most frequent words
fdist = FreqDist(filtered_words)
fdist.plot(30, cumulative=False)
plt.show()

# Build the wordcloud of the text
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stop_words,
                      min_font_size=10).generate(text)

# Plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
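To see what a word frequency distribution contains before plotting it, the counting step can be sketched with collections.Counter from the standard library. This is an illustration only: the short sample sentence and the small stopword set are made up, and Counter stands in for NLTK's FreqDist (which is itself a Counter subclass).

```python
import re
from collections import Counter

# Small illustrative stopword set (assumption; stands in for stopwords.words('english'))
STOP_WORDS = {"the", "a", "on", "and", "is"}

text = "The cat sat on the mat. The dog sat on the log, and the cat ran."

# Lowercase, then keep alphabetic tokens only so punctuation is not counted
words = re.findall(r"[a-z]+", text.lower())
filtered = [w for w in words if w not in STOP_WORDS]

# Count how often each remaining word occurs
freq = Counter(filtered)

# Print a plain-text bar chart, most frequent word first
for word, count in freq.most_common():
    print(f"{word:<5} {'#' * count}")
```

The same (word, count) pairs are what the frequency plot draws, with the words on the x-axis and the counts on the y-axis.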

Q. 2) Consider the following dataset :

[Link]
Write a Python script for the following:
i. Read the dataset and perform data cleaning operations on it.
ii. Find the total views, total likes, total dislikes and comment count.
Ans:
import pandas as pd

# Read the dataset ('[Link]' is the file name as it appears in the source;
# replace it with the path to the dataset CSV)
df = pd.read_csv('[Link]')

# Drop the columns that are not required
df = df.drop(['video_id', 'trending_date', 'channel_title', 'category_id', 'publish_time', 'tags',
              'thumbnail_link', 'comments_disabled', 'ratings_disabled', 'video_error_or_removed'], axis=1)

# Data cleaning: remove duplicate rows and rows with missing counts,
# since astype(int) raises an error on missing values
count_cols = ['views', 'likes', 'dislikes', 'comment_count']
df = df.drop_duplicates().dropna(subset=count_cols)

# Convert the datatype of 'views', 'likes', 'dislikes', and 'comment_count' to integer
df[count_cols] = df[count_cols].astype(int)

# Find the total views, likes, dislikes, and comment count
total_views = df['views'].sum()
total_likes = df['likes'].sum()
total_dislikes = df['dislikes'].sum()
total_comments = df['comment_count'].sum()
print('Total Views:', total_views)
print('Total Likes:', total_likes)
print('Total Dislikes:', total_dislikes)
print('Total Comments:', total_comments)
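To see what the cleaning and totals steps do without the real dataset file, here is a small sketch on an in-memory CSV. The rows and values are made up for illustration, and it uses the standard csv module rather than pandas so that it runs with no extra installs. Only the four count column names are taken from the answer above.

```python
import csv
import io

# Made-up sample rows (assumption); the second row has a missing 'likes' value
raw = """title,views,likes,dislikes,comment_count
Video A,100,10,1,5
Video B,250,,2,7
Video C,50,5,0,1
"""

COUNT_COLUMNS = ["views", "likes", "dislikes", "comment_count"]

rows = list(csv.DictReader(io.StringIO(raw)))

# Cleaning: keep only rows where every count column holds a whole number
clean_rows = [r for r in rows if all(r[c].strip().isdigit() for c in COUNT_COLUMNS)]

# Totals: convert each count to int and sum per column
totals = {c: sum(int(r[c]) for r in clean_rows) for c in COUNT_COLUMNS}

print("Rows kept:", len(clean_rows), "of", len(rows))
for column, total in totals.items():
    print(f"Total {column}: {total}")
```

Dropping incomplete rows is only one possible cleaning choice. Filling missing counts with 0 would keep the row's other values in the totals, at the cost of treating unknown counts as zero.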
