
Web Scraping Techniques and Ethics

The document outlines the syllabus for a course on data integration and processing, focusing on web scraping techniques, legality, and ethical considerations. It details the components of web scraping, including web crawlers and scrapers, and emphasizes the use of Python and libraries like BeautifulSoup for data extraction. Additionally, it addresses the importance of adhering to website rules and privacy concerns while scraping data from the web.

Uploaded by

renosecure05
Copyright
© All Rights Reserved

Data Integration and Processing
Mr. Aakash Kadam
SYLLABUS
• DATA COLLECTION, CLEANING AND INGESTION
• DATA INTEGRATION
• DATA PROCESSING AND ANALYTICS
• ADVANCED ML PIPELINES
MODULE-1
1. Web Scraping – Basics, Definition, Components of Web Scraping, Working of a Web Scraper, Python as a
Language for Web Scraping, Different Types of Web Scraping, Legality of Web Scraping, Understanding the
robots.txt file, Understanding requests, Understanding Headers, Ethical Use of Headers, Understanding
BeautifulSoup, Understanding Parsers, Different Types of Parsers, Different Functions under BeautifulSoup
Valuable Data Trapped inside Websites
• The web is the world's largest database, but most of its information isn't structured for analysis.
• Data on product prices, market trends, and news is locked within the visual layout of HTML, making it difficult
to extract and use at scale.

[Figure: unstructured web content contrasted with structured data]
What is Web Scraping?

• Web Scraping is an automated method to obtain large amounts of data from websites.
• This process involves converting data that is typically found in HTML format on websites into structured
formats like spreadsheets, databases, CSV files, Excel files etc.
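
The conversion described above can be sketched with the Python standard library alone. This is a minimal illustration, not a production scraper: the HTML table, the `TableToRows` class, and the product values are all made up for the example, and the CSV is written to an in-memory buffer rather than a file.

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # row currently being built
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical HTML standing in for a downloaded page.
html = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Pen</td><td>10</td></tr>
  <tr><td>Notebook</td><td>45</td></tr>
</table>
"""

parser = TableToRows()
parser.feed(html)

# Write the extracted rows as CSV text.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing a parser class by hand like this becomes tedious for real pages, which is the gap BeautifulSoup fills later in this module.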
Components involved in Web Scraping

• Web Crawler:
1. It is used for indexing web pages.
2. It visits a website and reads its pages in order to build entries for a search engine index.
3. It visits every page and reads each one through to the last line.

• Web Scraper:
1. It is a tool that extracts large amounts of data from websites and saves it to the local machine.
2. It need not visit all the pages of a website; it targets only the pages that hold the required data.
Working of a Web Scraper - Overview
Language for Web Scraping

• Python's combination of simplicity and power makes it the dominant language for web scraping.

• Its ecosystem provides a variety of libraries specifically designed to handle every step of the scraping process.
Different Types of Web Scrapers

• Based on Development Type:
1. Self-Built Web Scrapers
2. Pre-built Web Scrapers

• Based on Platform:
1. Browser Extension Web Scrapers
2. Software Web Scrapers

• Based on Execution Environment:
1. Cloud Web Scrapers
2. Local Web Scrapers
Legality of Web Scraping
• The legality of web scraping is a nuanced issue that depends mainly on the following factors:
1. Purpose of Scraping
2. Nature of the data being extracted
3. Laws of the operating jurisdiction

• Web scraping itself is not illegal by default, but its legality depends on what is scraped, how it is scraped,
and how the scraped data is used. Because different countries interpret digital access differently, the legal
landscape remains complex.
• Laws around web scraping are still unclear in many places.
Common Ethical Concerns
1. Compliance with Website Rules and Terms : Developers must ensure they have permission to scrape the data
and comply with the website's terms of service. Before conducting any scraping activities, it is crucial to ensure
that you are scraping public data and are in no way breaching third-party rights.
2. Adherence to robots.txt Directives: Developers should check the robots.txt file for guidance before
conducting any scraping activities. The robots.txt file outlines rules regarding what data can or cannot be scraped.
3. Respecting Server Integrity and Rate Limits: A major ethical concern is ignoring rate limiting and making too
many requests in a short time frame. Websites utilize mechanisms to detect and prevent scraping activities that
burden their servers. Developers should avoid disrupting the scraped site's servers.
4. Privacy and Personal Data: Scraping personally identifiable information can violate user privacy.
5. Intellectual Property Rights (IPR): Scraping and republishing original content without permission can
constitute plagiarism or copyright infringement.
6. Bypassing Access Controls: Some scrapers attempt to bypass anti-bot mechanisms like CAPTCHA, login walls,
or paywalls. These actions cross clear ethical lines and are often illegal.
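
Concern 3 above (respecting rate limits) can be sketched as a small throttle that enforces a minimum gap between consecutive requests. The `Throttle` class, the page paths, and the 0.2-second interval are assumptions chosen to keep the demo fast; a real scraper would use a longer interval suited to the target site, and would perform the actual fetch where the comment indicates.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self._last_call = None  # time of the previous request, if any

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)  # pause until the interval has passed
        self._last_call = time.monotonic()


# Hypothetical page paths; the interval is deliberately short for the demo.
throttle = Throttle(min_interval_seconds=0.2)
start = time.monotonic()
for page in ["/page1", "/page2", "/page3"]:
    throttle.wait()
    # The actual request for `page` would be made here.
elapsed = time.monotonic() - start
print(f"3 requests spread over {elapsed:.2f} seconds")
```

The first call goes through immediately; only the later calls are delayed, so three requests take at least two intervals.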
Understanding the robots.txt file
Sample robots.txt file (annotations):
• Rules apply to all the bots.
• Bots should not crawl any URL beginning with /admin/.
• Bots should not crawl any URL beginning with /private/.
• Bots are allowed to crawl /public/ paths.
• Bots should wait 5 seconds between requests.
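
A robots.txt file matching the annotations above might look like the one embedded in the sketch below. It is a reconstruction from those annotations, not necessarily the exact file shown on the slide. The sketch checks the rules with Python's standard `urllib.robotparser` module; the bot name `MyCourseBot` and the `example.com` URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed reconstruction of a robots.txt matching the annotations above.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 5
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network access is needed.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

AGENT = "MyCourseBot"  # hypothetical bot name
can_admin = parser.can_fetch(AGENT, "https://example.com/admin/users")
can_private = parser.can_fetch(AGENT, "https://example.com/private/report")
can_public = parser.can_fetch(AGENT, "https://example.com/public/index.html")
delay = parser.crawl_delay(AGENT)

print(can_admin, can_private, can_public, delay)
```

For a live site, the same parser can load the file itself with `set_url()` followed by `read()` instead of `parse()`.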


Let’s start with Scraping
The Scout: Requests
• requests is a Python library used to send HTTP requests to websites.
• It’s the first point of contact with the target site.
• It allows your Python script to behave like a web browser and fetch web pages
[Figure: the web browser sends an HTTP request to the web server; the response carries the data, a status code, and metadata]
The Scout: Requests
Understanding the Status Codes
Code    Meaning
200     Success
403     Forbidden (Blocked)
404     Page not found
500     Server Error
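
One way a scraper might act on these codes is sketched below. The `describe_status` function and the action it suggests for each code are illustrative assumptions, not a fixed rule. With the requests library, the code to pass in would typically come from `response.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Map an HTTP status code to a suggested scraper action (illustrative policy)."""
    if 200 <= code < 300:
        return "success: parse the response body"
    if code == HTTPStatus.FORBIDDEN:       # 403
        return "forbidden: stop and review the site's rules"
    if code == HTTPStatus.NOT_FOUND:       # 404
        return "not found: skip this URL"
    if 500 <= code < 600:
        return "server error: retry later with a delay"
    return "other: inspect manually"


for code in (200, 403, 404, 500):
    print(code, "->", describe_status(code))
```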


The Scout: Requests
Using Headers
• Headers are pieces of metadata sent along with an HTTP request. They help the server understand who is
requesting, how, and what format is expected.
• When your browser loads a webpage, it automatically sends many headers.
• When scraping, you must send at least some headers manually—otherwise the website may detect your
script as a bot and block it.
The Scout: Requests
Format of Headers
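import requests

url = "https://example.com"  # placeholder target URL
botHeader = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9"
}
# Send a GET request with the custom headers attached
response = requests.get(url, headers=botHeader)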
The Scout: Requests
Ethical use of Headers
• Key Ethical Principle
“If your scraping could impact a server, be transparent. Tell the server who you actually are.
If your scraping is small-scale and harmless, headers just help avoid blocking.”
• Some rules which we should follow while scraping:
1. Always Identify Your Scraper Honestly (if scraping at scale).
2. Never Use Headers to Bypass Login or Security
3. Use Only Necessary Headers
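
A transparent scraper might identify itself as sketched below. The bot name, info URL, and contact address are hypothetical placeholders. The request is only prepared, not sent, so the outgoing headers can be inspected without touching any server.

```python
import requests

# Hypothetical bot name, info page, and contact address; replace with real ones.
HONEST_HEADERS = {
    "User-Agent": "CourseScraperBot/1.0 (+https://example.com/bot-info; contact@example.com)",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
session.headers.update(HONEST_HEADERS)

# Prepare (but do not send) a request to see which headers would go out.
request = requests.Request("GET", "https://example.com/public/")
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Only two headers are set by hand here, in line with rule 3; the session fills in a few defaults of its own, and no login cookies or tokens are added.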
The Decoder: BeautifulSoup
• Parses the raw HTML, turning it into a structured object that's easy to navigate and search.
• BeautifulSoup is a Python library used for:
1. reading HTML or XML documents
2. navigating the webpage structure
3. searching for tags, classes, IDs
4. extracting specific information

• “BeautifulSoup converts messy HTML code into a structured, searchable Python object.”
The Decoder: BeautifulSoup
HTML to Tree Structure
• Raw HTML is difficult to process directly.
• Without BeautifulSoup, you would be manually parsing strings — hard and error-prone.
• BeautifulSoup solves this by turning HTML into a tree structure:

[Figure: sample HTML code and the tree structure it is converted into]
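
The tree idea can be sketched with a small, made-up HTML document. The `outline` helper is a hypothetical function written for this example; it walks the parsed tree and records each tag at its nesting depth.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration.
html = """
<html>
  <head><title>Shop</title></head>
  <body>
    <h1 id="heading">Products</h1>
    <ul>
      <li class="item">Pen</li>
      <li class="item">Notebook</li>
    </ul>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Navigate the tree by moving between parents and children.
title_text = soup.title.string     # text inside <title>
h1_parent = soup.h1.parent.name    # name of the tag that contains <h1>


def outline(tag, depth=0):
    """Return the tag names of the tree, indented by nesting depth."""
    lines = ["  " * depth + tag.name]
    # find_all(True, recursive=False) yields only the direct child tags.
    for child in tag.find_all(True, recursive=False):
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(soup.html)
print("\n".join(tree_lines))
```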
The Decoder: BeautifulSoup
Parsing
• Parsing means taking raw HTML (which is just text) and converting it into a structured format that a
program can understand and extract data from.
• Different Types of Parsers in BeautifulSoup:
a. html.parser (built into Python; no installation required)
b. lxml (Requires installation)
c. html5lib (Requires installation)
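
The parser is chosen by name in the second argument to `BeautifulSoup`. The sketch below tries all three on a deliberately malformed fragment and records which ones are available. The fragment and the `results` dictionary are assumptions for illustration; a parser that is not installed raises `FeatureNotFound`, which the loop catches.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken_html = "<a></p>"  # deliberately invalid markup

results = {}
for parser_name in ["html.parser", "lxml", "html5lib"]:
    try:
        soup = BeautifulSoup(broken_html, parser_name)
    except FeatureNotFound:
        results[parser_name] = None  # this parser is not installed
        continue
    results[parser_name] = str(soup)  # the repaired markup as text

for parser_name, markup in results.items():
    print(parser_name, "->", markup if markup is not None else "not installed")
```

The Beautiful Soup documentation uses this same fragment to show that the parsers repair invalid markup differently, so it is worth naming the parser explicitly to keep results consistent across machines.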
The Decoder: BeautifulSoup
Finding Your Target
1. To scrape a website, you must first understand its HTML
structure.
2. By ‘inspecting’ an element, you can identify the HTML
tags and attributes (like class or id) that contain your
target data.
The Decoder: BeautifulSoup
Important Functions
1. find() - Finds the first occurrence of a tag.
2. find_all() - Finds all matching tags, returns a list.
3. get_text() – Accesses the text within the tag.
4. Searching by Attributes (class_, id)
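
The four functions above can be tried on a small, made-up catalog snippet. The HTML, class names, and id are hypothetical; on a real site they would come from inspecting the page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<div id="catalog">
  <h2 class="title">Stationery</h2>
  <p class="product"><span class="name">Pen</span> <span class="price">10</span></p>
  <p class="product"><span class="name">Notebook</span> <span class="price">45</span></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find(): first matching tag only
first_product = soup.find("p")

# find_all(): every matching tag, filtered here by class
all_products = soup.find_all("p", class_="product")

# Searching by id
catalog = soup.find(id="catalog")

# get_text(): the text inside a tag
heading_text = soup.find("h2", class_="title").get_text()
names = [p.find("span", class_="name").get_text() for p in all_products]
prices = [int(p.find("span", class_="price").get_text()) for p in all_products]

print(heading_text, names, prices)
```

Note the trailing underscore in `class_`: `class` is a reserved word in Python, so BeautifulSoup uses `class_` as the keyword argument.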
