Web Scraping Techniques and Ethics

The document outlines the syllabus for a course on data integration and processing, focusing on web scraping techniques, legality, and ethical considerations. It details the components of web scraping, including web crawlers and scrapers, and emphasizes the use of Python and libraries like BeautifulSoup for data extraction. Additionally, it addresses the importance of adhering to website rules and privacy concerns while scraping data from the web.

Uploaded by

renosecure05

Data Integration and Processing
Mr. Aakash Kadam

SYLLABUS
• Data Collection, Cleaning and Ingestion
• Data Integration
• Data Processing and Analytics
• Advanced ML Pipelines

MODULE-1
1. Web Scraping – Basics, Definition, Components of Web Scraping, Working of a Web Scraper, Python as a
Language for Web Scraping, Different Types of Web Scraping, Legality of Web Scraping, Understanding
robots.txt file, Understanding requests, Understanding Headers, Ethical Use of Headers, Understanding
BeautifulSoup, Understanding Parsers, Different Types of Parsers, Different Functions under BeautifulSoup
Valuable Data Trapped inside Websites
• The web is the world's largest database, but most of its information isn't structured for analysis.
• Data on product prices, market trends, and news is locked within the visual layout of HTML, making it difficult
to extract and use at scale.

[Figure: unstructured data versus structured data]
What is Web Scraping?

Web Scraping is an automated method to obtain large amounts of data from websites.

This process involves converting data that is typically found in HTML format on websites into structured formats like spreadsheets, databases, CSV files, Excel files etc.
Components involved in Web Scraping

• Web Crawler:
1. It is used for indexing web pages.
2. It visits a website and reads its web pages in order to build entries for a search engine index.
3. It visits every page and reads each one through to the last line.

• Web Scraper:
1. It extracts a large amount of data from websites and saves it to the local machine.
2. It need not visit every page of a website to get the information it needs.
Working of a Web Scraper - Overview
Language for Web Scraping

• Python's combination of simplicity and power makes it one of the most widely used languages for web scraping.

• Its ecosystem provides a variety of libraries specifically designed to handle every step of the scraping process.
Different Types of Web Scrapers
Types of Web Scrapers:

• Based on Development Type:
1. Self-Built Web Scrapers
2. Pre-built Web Scrapers

• Based on Platform:
1. Browser Extension Web Scrapers
2. Software Web Scrapers

• Based on Execution Environment:
1. Cloud Web Scrapers
2. Local Web Scrapers
Legality of Web Scraping
• The legality of web scraping is a nuanced issue that depends mainly on the following factors:
1. Purpose of Scraping
2. Nature of the data being extracted
3. Laws of the operating jurisdiction

• Web scraping itself is not illegal by default, but its legality depends on what is scraped, how it is scraped,
and how the scraped data is used. Because different countries interpret digital access differently, the legal
landscape remains complex.
• Laws around web scraping are still unclear in many places.
Common Ethical Concerns
1. Compliance with Website Rules and Terms: Developers must ensure they have permission to scrape the data
and comply with the website's terms of service. Before conducting any scraping activities, it is crucial to ensure
that you are scraping public data and are in no way breaching third-party rights.
2. Adherence to robots.txt Directives: Developers should check the robots.txt file for guidance before
conducting any scraping activities. The robots.txt file outlines which parts of a site bots may or may not access.
3. Respecting Server Integrity and Rate Limits: A major ethical concern is ignoring rate limiting and making too
many requests in a short time frame. Websites utilize mechanisms to detect and prevent scraping activities that
burden their servers. Developers should avoid disrupting the scraped site's servers.
4. Privacy and Personal Data: Scraping personally identifiable information can violate user privacy.
5. Intellectual Property Rights (IPR): Scraping and republishing original content without permission can
constitute plagiarism or copyright infringement.
6. Bypassing Access Controls: Some scrapers attempt to bypass anti-bot mechanisms like CAPTCHA, login walls
or paywalls. These actions cross clear ethical lines and are often illegal.
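
The rate-limit concern in point 3 can be illustrated with a small throttle that spaces out requests. This is only a minimal sketch using the standard library: the class name PoliteThrottle and its parameters are hypothetical, and the right delay depends on the site being scraped.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests (illustrative helper)."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock          # injectable so the logic can be checked without waiting
        self._sleep = sleep
        self._last_request = None    # time of the previous request, if any

    def wait(self):
        """Pause until at least min_delay seconds have passed since the last call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
```

A scraper would create one throttle per site, for example PoliteThrottle(5), and call wait() immediately before each request so that bursts of requests are spread out.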
Understanding the robots.txt file
Sample robots.txt file
• Rules apply to all the bots.
• Bots should not crawl any URL beginning with /admin/.
• Bots should not crawl any URL beginning with /private/.
• Bots are allowed to crawl /public/ paths.
• Bots should wait 5 seconds between requests.
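
The sample file itself did not survive in this copy, so the text below is a reconstruction from the annotations above and should be read as an assumption rather than the original slide content. The sketch feeds it to Python's built-in urllib.robotparser to check which paths a bot may fetch. The domain and the bot name are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Reconstructed from the annotations; the exact file text is an assumption.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())  # parse from text, no network access needed

BASE = "https://example.com"   # placeholder domain
BOT_NAME = "course-demo-bot"   # hypothetical user agent; falls under the "*" rules

for path in ["/admin/users", "/private/notes.html", "/public/index.html"]:
    allowed = parser.can_fetch(BOT_NAME, BASE + path)
    print(path, "allowed" if allowed else "disallowed")

# Crawl-delay is a widely used but non-standard directive; robotparser exposes it here.
print("crawl delay:", parser.crawl_delay(BOT_NAME))
```

For a live site, the same parser can load the real file with set_url() followed by read() instead of parse().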

Let’s start with Scraping
The Scout: Requests
• requests is a Python library used to send HTTP requests to websites.
• It’s the first point of contact with the target site.
• It allows your Python script to behave like a web browser and fetch web pages.
[Figure: the web browser sends an HTTP request to the web server, which responds with data, a status code, and metadata]
Understanding the Status Codes
Code    Meaning
200     Success
403     Forbidden (Blocked)
404     Page not found
500     Server Error
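
A scraper usually checks the status code before trying to parse anything. The sketch below encodes the table in a dictionary. The helper names describe and should_parse are hypothetical, and treating only 200 as parseable is a simplifying assumption.

```python
# Meanings copied from the status code table.
STATUS_MEANINGS = {
    200: "Success",
    403: "Forbidden (Blocked)",
    404: "Page not found",
    500: "Server Error",
}


def describe(status_code):
    """Return the table's meaning for a code, or a fallback for other codes."""
    return STATUS_MEANINGS.get(status_code, "Other status code")


def should_parse(status_code):
    """Simplification: only a 200 response is assumed to carry the requested page."""
    return status_code == 200


for code in (200, 403, 404, 500):
    print(code, describe(code), "-> parse" if should_parse(code) else "-> skip")
```

With the requests library, the code to check is available as response.status_code on the object returned by requests.get().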

Using Headers
• Headers are pieces of metadata sent along with an HTTP request. They help the server understand who is
requesting, how, and what format is expected.
• When your browser loads a webpage, it automatically sends many headers.
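• When scraping, the requests library sends only its own default headers (for example, a User-Agent that begins
with python-requests), so you often need to set some headers yourself; otherwise the website may identify your
script as a bot and block it.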
Format of Headers
import requests

url = "https://example.com"  # placeholder URL for illustration
botHeader = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9"
}
response = requests.get(url, headers=botHeader)
Ethical use of Headers
• Key Ethical Principle
“If your scraping could impact a server, be transparent. Tell the server who you actually are.
If your scraping is small-scale and harmless, headers just help avoid blocking.”
• Some rules which we should follow while scraping:
1. Always identify your scraper honestly (if scraping at scale).
2. Never use headers to bypass login or security.
3. Use only the headers you need.
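
One way to follow rules 1 and 3 is to send a User-Agent that names the scraper and gives a contact address, and nothing else that is not needed. The sketch below is an assumption about how that might look: the function name, bot name, and contact address are all hypothetical, and the "name/version (+contact)" layout is a common convention rather than a requirement.

```python
def build_honest_headers(bot_name, version, contact):
    """Return a small header set that identifies the scraper and its operator."""
    return {
        # Tells the server who is calling and how to reach the operator.
        "User-Agent": f"{bot_name}/{version} (+{contact})",
        "Accept-Language": "en-US,en;q=0.9",
    }


headers = build_honest_headers(
    bot_name="CourseScraperBot",            # hypothetical scraper name
    version="0.1",
    contact="mailto:student@example.com",   # placeholder contact address
)
print(headers["User-Agent"])
```

The resulting dictionary can be passed to requests.get(url, headers=headers) in the same way as the header format shown earlier.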
The Decoder: BeautifulSoup
• Parses the raw HTML, turning it into a structured object that's easy to navigate and search.
• BeautifulSoup is a Python library used for:
1. reading HTML or XML documents
2. navigating the webpage structure
3. searching for tags, classes, IDs
4. extracting specific information

• “BeautifulSoup converts messy HTML code into a structured, searchable Python object.”
HTML to Tree Structure
• Raw HTML is difficult to process directly.
• Without BeautifulSoup, you would be manually parsing strings — hard and error-prone.
• BeautifulSoup solves this by turning HTML into a tree structure:

[Figure: sample HTML code converted into a tree structure]
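
To make the tree idea concrete without installing anything, the sketch below uses Python's built-in html.parser module to print how tags nest inside each other. It is not BeautifulSoup itself, only an illustration of the structure BeautifulSoup builds for you. The sample HTML and the TreePrinter class are made up for this example, and the class assumes every tag is explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical sample page; every tag is explicitly closed.
SAMPLE_HTML = """
<html>
  <body>
    <h1>Products</h1>
    <ul>
      <li>Laptop</li>
      <li>Phone</li>
    </ul>
  </body>
</html>
"""


class TreePrinter(HTMLParser):
    """Record each tag and text node, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1                      # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1                      # closing a tag returns to the parent level

    def handle_data(self, data):
        text = data.strip()
        if text:                             # skip whitespace between tags
            self.lines.append("  " * self.depth + repr(text))


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
print("\n".join(printer.lines))
```

The indentation in the printed output mirrors the parent and child relationships: html contains body, body contains h1 and ul, and ul contains the two li items.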
Parsing
• Parsing means:
a. Taking raw HTML (which is just text)
b. and converting it into a structured format
c. that a program can understand and extract data from.
• Different Types of Parsers in BeautifulSoup:
a. html.parser (built into Python; no installation required)
b. lxml (Requires installation)
c. html5lib (Requires installation)
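
The parser is chosen by the second argument to BeautifulSoup. The sketch below, which assumes the bs4 package is installed, tries each parser from the list on a deliberately broken snippet and skips any parser that is missing. The snippet is made up for illustration, and the parsers may repair it differently.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN_HTML = "<a></p>"  # deliberately invalid: a stray closing tag

for parser_name in ["html.parser", "lxml", "html5lib"]:
    try:
        soup = BeautifulSoup(BROKEN_HTML, parser_name)
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        print(f"{parser_name}: not installed")
        continue
    print(f"{parser_name}: {soup}")
```

Because each parser handles invalid markup in its own way, it is usually safer to name the parser explicitly so that a script behaves the same on every machine.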
Finding Your Target
1. To scrape a website, you must first understand its HTML structure.
2. By ‘inspecting’ an element, you can identify the HTML tags and attributes (like class or id) that contain your target data.
Important Functions
1. find() - Finds the first occurrence of a tag.
2. find_all() - Finds all matching tags, returns a list.
3. get_text() – Accesses the text within the tag.
4. Searching by Attributes (class_, id)
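
The four items above can be tried together on a small document. This is a minimal sketch that assumes the bs4 package is installed. The HTML, class names, and id are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment used only for this illustration.
SAMPLE_HTML = """
<div id="catalog">
  <p class="product">Laptop</p>
  <p class="product">Phone</p>
  <p class="note">Prices exclude tax.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_p = soup.find("p")                          # first <p> tag only
all_p = soup.find_all("p")                        # every <p> tag, as a list-like result
products = soup.find_all("p", class_="product")   # filter by CSS class (note the underscore)
catalog = soup.find(id="catalog")                 # filter by id attribute

print(first_p.get_text())
print(len(all_p))
print([p.get_text() for p in products])
print(catalog.name)
```

The trailing underscore in class_ is needed because class is a reserved word in Python. Note also that find() returns None when nothing matches, so real scripts should check the result before calling get_text().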
