Web Scraping with Python Guide

This document provides an introduction to web scraping using Python. It discusses how web scraping works by sending requests to servers and extracting specific data elements from pages. The steps involved in web scraping include sending HTTP requests, parsing HTML responses, and traversing the parse trees. It also covers installing and importing the BeautifulSoup and Requests libraries for scraping and making requests. As an example, it describes scraping and analyzing COVID-19 case data from the Worldometer website.

Uploaded by

Anand Sharma

Web Scraping using Python

Topics Covered:
● Introduction to Web Scraping
● How Does Web Scraping Work?
● Steps involved in web scraping
● Installing BeautifulSoup
● Installing Requests
● Scraping and Analyzing data from Worldometer website

Introduction to Web Scraping

What is Web Scraping? Why do we use Web Scraping?

Web scraping, also called web harvesting or web data extraction, is an
automated process for collecting large amounts of (often unstructured)
data from websites. Even copying and pasting the lyrics of your
favorite song is a form of web scraping! However, the words “web
scraping” usually refer to a process that involves automation. Some
websites don’t like it when automatic scrapers gather their data, while
others don’t mind. The data collected can be stored in a structured
format for further analysis.

If you’re scraping a page respectfully for educational purposes, then
you’re unlikely to have any problems. Still, it’s a good idea to do some
research on your own and make sure that you’re not violating any
Terms of Service before you start a large-scale project.

How Does Web Scraping Work?

When we scrape the web, we write code that sends a request to the
server that’s hosting the page we specified. The server will return the
source code (mostly HTML) for the page (or pages) we requested.
So far, we’re essentially doing the same thing a web browser does:
sending a server request with a specific URL and asking the server to
return the code for that page.

But unlike a web browser, our web scraping code won’t interpret the
page’s source code and display the page visually. Instead, we’ll write
some custom code that filters through the page’s source code looking
for specific elements we’ve specified, and extracting whatever content
we’ve instructed it to extract.

For example, if we wanted to get all of the data from inside a table that
was displayed on a web page, our code would be written to go
through these steps in sequence:

● Request the content (source code) of a specific URL from the
server
● Download the content that is returned
● Identify the elements of the page that are part of the table we
want
● Extract and (if necessary) reformat those elements into a dataset
we can analyze or use in whatever way we require.
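
The last two steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the HTML string, the table id "demo-table", and the helper name extract_table are all made up for the example, and the inline string stands in for content that would normally be downloaded in the first two steps.

```python
from bs4 import BeautifulSoup

# Placeholder HTML standing in for a page that has already been
# requested and downloaded (steps 1 and 2). Not taken from any real site.
SAMPLE_HTML = """
<html><body>
  <h1>Example page</h1>
  <table id="demo-table">
    <tr><th>Name</th><th>Value</th></tr>
    <tr><td>alpha</td><td>1</td></tr>
    <tr><td>beta</td><td>2</td></tr>
  </table>
</body></html>
"""


def extract_table(html, table_id):
    """Return the rows of the table with the given id as lists of strings."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 3: identify the table element we want.
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    # Step 4: extract the text of every header and data cell, row by row.
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        rows.append(cells)
    return rows


rows = extract_table(SAMPLE_HTML, "demo-table")
header, data = rows[0], rows[1:]
# Reformat into a list of dictionaries, one per data row.
records = [dict(zip(header, row)) for row in data]
print(records)
```

Every cell comes back as a string, so numeric columns still need converting before analysis.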

Steps involved in web scraping

● Send an HTTP request to the URL of the webpage you want to
access. The server responds to the request by returning the
HTML content of the webpage.
● Once we have the HTML content, we are left with the
task of parsing it. Since most HTML data is nested,
we cannot reliably extract data through simple string processing. We
need a parser that can create a nested/tree structure of the
HTML data.
● Now, all we need to do is navigate and search the parse tree that
we created, i.e. tree traversal. For this task, we will be using
a third-party Python library, Beautiful Soup. It is a Python
library for pulling data out of HTML and XML files.
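
To make the idea of searching and navigating a parse tree more concrete, here is a small sketch. The HTML snippet, the "card" class, and the variable names are invented for illustration. It uses Beautiful Soup's find_all, find, and find_parent methods to move down and up the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup: two "cards", each with a title and links.
HTML = """
<div class="card">
  <h2>First</h2>
  <ul>
    <li><a href="/a">Link A</a></li>
    <li><a href="/b">Link B</a></li>
  </ul>
</div>
<div class="card">
  <h2>Second</h2>
  <ul>
    <li><a href="/c">Link C</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Walk down the tree: for each card, collect only the links nested inside it.
links_by_card = {}
for card in soup.find_all("div", class_="card"):
    title = card.find("h2").get_text(strip=True)
    links_by_card[title] = [a["href"] for a in card.find_all("a")]

# Walk up the tree: starting from one link, find the card that contains it.
link_c = soup.find("a", href="/c")
owner = link_c.find_parent("div", class_="card").h2.get_text(strip=True)

print(links_by_card)
print(owner)
```

Because each search is scoped to a node in the tree, the links stay grouped under the card they belong to, which would be awkward to get right with plain string matching.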

Installing BeautifulSoup

BeautifulSoup is one of the most widely used Python libraries for
parsing HTML. It is a lightweight, easy-to-learn, and highly effective
way to programmatically isolate information on a single webpage at a
time. It's common to use BeautifulSoup in conjunction with the requests
library, where requests will fetch a page, and BeautifulSoup will extract
data from the resulting HTML.

● To install BeautifulSoup, type pip install beautifulsoup4 in the
Command Prompt/terminal.
● Or type !pip install beautifulsoup4 or %pip install beautifulsoup4
in a Jupyter notebook cell.
● Then type from bs4 import BeautifulSoup to import BeautifulSoup.
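
One quick way to check that the installation worked is to parse a tiny HTML string. The snippet below is only a sanity check, and the sample string is made up. It uses Python's built-in "html.parser", so no additional parser needs to be installed.

```python
import bs4
from bs4 import BeautifulSoup

# Show which version of the package was installed.
print(bs4.__version__)

# Parse a one-line document and read the text back out.
soup = BeautifulSoup("<p>Hello, <b>soup</b>!</p>", "html.parser")
text = soup.p.get_text()
print(text)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the code.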

For more information on BeautifulSoup, please refer to the official
Beautiful Soup Documentation.

Installing Requests

The first thing we’ll need to do to scrape a web page is to download
the page. We can download pages using the Python requests library.
The requests library will make a GET request to a web server, which
will download the HTML contents of a given web page for us. There are
several different types of requests we can make using requests, of
which GET is just one.
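
A minimal sketch of what this could look like in code is shown below. The helper names fetch_html and preview_request, the 10-second timeout, and the example.com URL are assumptions made for illustration. The second helper builds a request without sending it, which makes it easy to see how the HTTP method and URL are put together.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Send a GET request and return the response body as text."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError if the server answered with a 4xx/5xx status.
    response.raise_for_status()
    return response.text


def preview_request(method, url, params=None):
    """Build a request without sending it, so it can be inspected."""
    return requests.Request(method, url, params=params).prepare()


# GET is only one of the available methods; the method is just a parameter.
prepared = preview_request("GET", "https://example.com/search", {"q": "scraping"})
print(prepared.method, prepared.url)
```

Calling fetch_html("https://example.com") would perform the actual download. It is not called here because it needs network access.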

● To install Requests, type pip install requests in the Command
Prompt/terminal.
● Or type !pip install requests or %pip install requests in a Jupyter
notebook cell.
● Then type import requests to import Requests.

For more information on Requests, please refer to the official Requests
Documentation.

Scraping and Analyzing data from Worldometer website

We have scraped COVID-19 confirmed cases and deaths by
country and continent from the Worldometer website, purely for
educational purposes. Links to the website, the reference
notebook, and the scraped dataset with analysis are given below.

● Worldometer website link
● Jupyter Notebook Download Link
● Scraped COVID-19 Dataset Download Link
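
The referenced notebook is not reproduced here. As a rough idea of what this kind of scrape-then-analyze workflow can look like, the sketch below parses a small table, converts the counts to integers, totals them by continent, and writes the rows out as CSV. Everything in it is an assumption for illustration: the markup, the table id "cases-table", the column order, and the numbers are placeholders, not real Worldometer structure or data. Treating blank cells as 0 is a simplification too.

```python
import csv
import io
from collections import defaultdict

from bs4 import BeautifulSoup

# Placeholder markup and numbers: NOT real Worldometer data or page structure.
SAMPLE_HTML = """
<table id="cases-table">
  <tr><th>Country</th><th>Continent</th><th>Cases</th><th>Deaths</th></tr>
  <tr><td>Country A</td><td>Continent X</td><td>1,200</td><td>30</td></tr>
  <tr><td>Country B</td><td>Continent X</td><td>800</td><td>12</td></tr>
  <tr><td>Country C</td><td>Continent Y</td><td>2,500</td><td></td></tr>
</table>
"""

FIELDS = ["country", "continent", "cases", "deaths"]


def to_int(text):
    """Convert a string such as '1,200' to 1200; blank cells become 0."""
    cleaned = text.replace(",", "").strip()
    return int(cleaned) if cleaned else 0


def parse_rows(html, table_id):
    """Turn the table's data rows into a list of dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError(f"no table with id {table_id!r}")
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        country, continent, cases, deaths = [
            td.get_text(strip=True) for td in tr.find_all("td")
        ]
        rows.append({"country": country, "continent": continent,
                     "cases": to_int(cases), "deaths": to_int(deaths)})
    return rows


def totals_by_continent(rows):
    """Sum cases and deaths for each continent."""
    totals = defaultdict(lambda: {"cases": 0, "deaths": 0})
    for row in rows:
        totals[row["continent"]]["cases"] += row["cases"]
        totals[row["continent"]]["deaths"] += row["deaths"]
    return dict(totals)


def to_csv(rows):
    """Serialize the rows as CSV text (a structured format for later analysis)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_rows(SAMPLE_HTML, "cases-table")
totals = totals_by_continent(rows)
print(totals)
print(to_csv(rows))
```

On a real page, the table id and column layout would have to be checked in the page source first, and blank cells may mean "not reported" rather than zero.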

Common questions

The `requests` library is used to send an HTTP request to the web server to fetch the HTML content of a webpage. Following this, the `BeautifulSoup` library is employed to parse and navigate through the HTML or XML data, creating a nested/tree structure of the data which allows for efficient data extraction. These libraries complement each other as `requests` handles data retrieval, while `BeautifulSoup` handles data parsing and extraction.

By using `requests` to access and retrieve the HTML of a webpage, and then using `BeautifulSoup` to parse this HTML, specific elements of interest can be efficiently extracted and stored in a structured format. This organized dataset can then be further processed and analyzed for insights, making the combination of these tools effective for data extraction and preparation.

Parsing HTML data into a tree structure is significant because HTML data is often nested and complex, and simple string processing would fail to accurately capture the hierarchical organization of elements. The tree structure allows for efficient navigation and manipulation of specific elements of interest, enabling precise data extraction necessary for creating structured datasets.

To install BeautifulSoup in a command prompt or terminal, you can use the command `pip install beautifulsoup4`. In a Jupyter notebook, the setup might differ slightly using `!pip install beautifulsoup4` or `%pip install beautifulsoup4` within a cell. Importing is consistent, using `from bs4 import BeautifulSoup` in both environments.

Large-scale web scraping can significantly impact a website's server and its performance by overwhelming it with numerous, frequent requests that mimic regular browsing traffic. This can lead to slower load times or even crashes, affecting the service provider’s ability to serve other users. Consequently, websites may implement rate limiting or block IPs associated with scraping to protect their resources and maintain performance stability.

Automated web scraping offers the benefit of timely and comprehensive data collection, crucial during evolving situations like the COVID-19 pandemic. However, risks include potential legal implications if sites like Worldometer do not allow scraping, as well as data integrity issues if scraped data is not verified against reliable sources. Ethical considerations regarding user consent and the impact on site performance also exist.

Researching the Terms of Service is crucial to avoid potential legal issues, as many websites explicitly prohibit automated data extraction techniques like web scraping. Ensuring compliance with these terms helps in mitigating risks of legal actions and aligns with ethical standards expected when interacting with third-party web data.

The web scraping process begins with sending an HTTP request to the server hosting the target webpage, utilizing libraries like `requests` to download the HTML content. Once retrieved, the HTML content needs to be parsed to create a navigable structure, typically using a library like `BeautifulSoup`. The parsed data then allows for tree traversal, enabling the identification and extraction of specified elements for analysis and use.

When engaging in web scraping, it's essential to consider whether you are violating the Terms of Service of the website being scraped. Some websites may not explicitly allow automated scraping, which can lead to ethical and legal challenges. For educational purposes and when scraping pages respectfully, issues are unlikely, but it's critical to research and ensure that the site's rules are not being breached.

Web scraping is powerful because it automates the collection of large volumes of data, enabling detailed analysis and insights that would be time-consuming to gather manually. However, it poses privacy concerns as the data might include sensitive information or be used without the website owners' consent, raising ethical issues regarding data ownership and user privacy.
