
Text Preprocessing and Analysis in Python

The document outlines a series of Python scripts for text preprocessing, extractive summarization, and data analysis using a YouTube dataset. It includes steps for removing special characters, stopwords, tokenizing text, calculating word frequency, and visualizing data with plots and word clouds. Additionally, it demonstrates how to clean a dataset and compute total views, likes, dislikes, and comments from YouTube video data.

Uploaded by

jagtapvinay160
Copyright © All Rights Reserved

Q. 2) Consider any text paragraph. Preprocess the text to remove any special characters and digits.
Generate the summary using the extractive summarization process.


Ans:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from heapq import nlargest

# Download the tokenizer models and the stopword list (only needed once).
# Newer NLTK releases may also need nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')

# Sample text paragraph; any text can be used here
text = (
    "Natural language processing (NLP) is a subfield of linguistics, computer science, information "
    "engineering, and artificial intelligence concerned with the interactions between computers and human "
    "languages, in particular how to program computers to process and analyze large amounts of natural "
    "language data. Challenges in natural language processing frequently involve speech recognition, natural "
    "language understanding, and natural language generation. The history of natural language processing "
    "generally started in the 1950s, although work can be found from earlier periods."
)

# Tokenize the text into sentences first; stripping punctuation before this step
# would remove the full stops that sent_tokenize relies on
raw_sentences = sent_tokenize(text)

# Remove special characters and digits from each sentence and collapse extra spaces
sentences = [' '.join(re.sub(r'[^a-zA-Z]', ' ', s).split()) for s in raw_sentences]

# Tokenize each sentence into words and remove stop words
stop_words = set(stopwords.words('english'))
words = []
for sentence in sentences:
    words.extend(word_tokenize(sentence))
words = [word.lower() for word in words if word.lower() not in stop_words]

# Calculate word frequency
word_freq = nltk.FreqDist(words)

# Calculate sentence scores based on word frequency,
# skipping sentences of 30 or more words
sentence_scores = {}
for sentence in sentences:
    if len(sentence.split()) >= 30:
        continue
    for word in word_tokenize(sentence.lower()):
        if word in word_freq:
            sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_freq[word]

# Generate summary by selecting top 3 sentences with highest scores
summary_sentences = nlargest(3, sentence_scores, key=sentence_scores.get)
summary = '. '.join(summary_sentences) + '.'
print(summary)
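
The frequency-based scoring used in the script above can also be tried without NLTK. The following is a minimal sketch using only the standard library; the stopword set, the naive sentence splitter, and the function names (split_sentences, content_words, summarize) are assumptions made for illustration and are not part of the original script.

```python
import re
from collections import Counter
from heapq import nlargest

# Hypothetical, deliberately tiny stopword list for illustration only;
# the script above uses NLTK's full English list instead.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercase alphabetic words with stopwords removed (digits and symbols are dropped)."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, n=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Word frequency over the whole text
    freq = Counter(w for s in sentences for w in content_words(s))
    # A sentence's score is the sum of the frequencies of its content words
    scores = {i: sum(freq[w] for w in content_words(s)) for i, s in enumerate(sentences)}
    top = sorted(nlargest(n, scores, key=scores.get))
    return " ".join(sentences[i] for i in top)


sample = "Cats chase mice. Dogs chase cats and mice. Birds sing."
print(summarize(sample, n=1))
```

Unlike the script above, this sketch returns the selected sentences in document order rather than in score order, which usually reads more naturally in a summary.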

Q. 2) Consider any text paragraph. Remove the stopwords. Tokenize the paragraph to extract words and sentences.
Calculate the word frequency distribution and plot the frequencies. Plot the wordcloud of the text.
Ans:
# Install the libraries (notebook syntax; in a terminal, run the command without the leading "!")
!pip install nltk matplotlib wordcloud

# Import the necessary modules
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Download the stopwords corpus and the tokenizer models
# (newer NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('stopwords')
nltk.download('punkt')

# Define the text paragraph
text = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed tristique ante et velit vestibulum, vel "
    "pharetra orci iaculis. Nullam mattis risus quis augue tincidunt rhoncus. Morbi varius, arcu vitae "
    "scelerisque laoreet, magna est imperdiet quam, sit amet ultrices lectus justo id enim. Sed dictum "
    "suscipit commodo. Sed maximus consequat risus, nec pharetra nibh interdum quis. Etiam eget quam vel "
    "augue dictum dignissim sit amet nec elit. Nunc at sapien dolor. Nulla vitae iaculis lorem. Suspendisse "
    "potenti. Sed non ante turpis. Morbi consectetur, arcu a vestibulum suscipit, mauris eros convallis nibh, "
    "nec feugiat orci enim sit amet enim. Aliquam erat volutpat. Etiam vel nisi id neque viverra dapibus non "
    "non lectus."
)

# Tokenize the paragraph to extract words and sentences
words = word_tokenize(text.lower())
sentences = sent_tokenize(text)

# Remove the stopwords from the extracted words; punctuation tokens are
# dropped as well so that they do not dominate the frequency plot
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.isalpha() and word not in stop_words]

# Calculate the word frequency distribution and plot the frequencies using matplotlib
fdist = FreqDist(filtered_words)
fdist.plot(30, cumulative=False)
plt.show()

# Plot the wordcloud of the text using wordcloud
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stop_words,
                      min_font_size=10).generate(text)

# Plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
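
NLTK's FreqDist is a subclass of collections.Counter, so the counting step can be checked on a small input with the standard library alone. The sketch below assumes a tiny hand-written stopword set and a hypothetical helper name (word_frequencies); it prints a rough text bar chart in place of the matplotlib plot.

```python
import re
from collections import Counter

# Hypothetical, minimal stopword set for illustration; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "a", "and", "of"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lowercase alphabetic words, skipping stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)


freq = word_frequencies("The cat and the hat. The cat sat.")

# Text-only bar chart of the most frequent words
for word, count in freq.most_common(3):
    print(f"{word:<5} {'#' * count}")
```

Because the result is a Counter, methods such as most_common work the same way on it as on the FreqDist object in the script above.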

Q. 2) Consider the following dataset:

[Link]

Write a Python script for the following:
i. Read the dataset and perform data cleaning operations on it.
ii. Find the total views, total likes, total dislikes and comment count.
Ans:
import pandas as pd

# Read the dataset ('[Link]' stands for the dataset path, which is elided in the source)
df = pd.read_csv('[Link]')

# Drop the columns that are not required
df = df.drop(['video_id', 'trending_date', 'channel_title', 'category_id', 'publish_time', 'tags',
              'thumbnail_link', 'comments_disabled', 'ratings_disabled', 'video_error_or_removed'], axis=1)

# Remove duplicate rows and rows with missing counts, since astype(int) fails on NaN values
count_cols = ['views', 'likes', 'dislikes', 'comment_count']
df = df.drop_duplicates().dropna(subset=count_cols)

# Convert the datatype of 'views', 'likes', 'dislikes', and 'comment_count' to integer
df[count_cols] = df[count_cols].astype(int)

# Find the total views, likes, dislikes, and comment count
total_views = df['views'].sum()
total_likes = df['likes'].sum()
total_dislikes = df['dislikes'].sum()
total_comments = df['comment_count'].sum()
print('Total Views:', total_views)
print('Total Likes:', total_likes)
print('Total Dislikes:', total_dislikes)
print('Total Comments:', total_comments)
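
To make the cleaning rules explicit, and to cross-check the pandas totals on a small sample, the same computation can be sketched with only the standard library. The sample rows, the COUNT_COLUMNS list, and the function name clean_and_total are invented for illustration and are not taken from the real dataset. Here "cleaning" is assumed to mean dropping exact duplicate rows and rows whose counts are missing or non-numeric.

```python
import csv
import io

# Invented sample rows for illustration; not taken from the real dataset.
SAMPLE_CSV = """title,views,likes,dislikes,comment_count
Video A,100,10,1,5
Video B,250,20,2,
Video A,100,10,1,5
Video C,abc,3,0,1
Video D,300,30,3,7
"""

COUNT_COLUMNS = ["views", "likes", "dislikes", "comment_count"]


def clean_and_total(csv_text):
    """Skip duplicate rows and rows with missing or non-numeric counts, then sum each column."""
    totals = dict.fromkeys(COUNT_COLUMNS, 0)
    seen = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = tuple(row.items())
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        try:
            values = {col: int(row[col]) for col in COUNT_COLUMNS}
        except (TypeError, ValueError):
            continue  # a count is missing or not a whole number
        for col, value in values.items():
            totals[col] += value
    return totals


print(clean_and_total(SAMPLE_CSV))
```

In this sample only the first "Video A" row and the "Video D" row survive cleaning, so the totals cover those two rows.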
