Final Me2

This document outlines fundamental web scraping techniques using Python and the BeautifulSoup library, detailing the process of programmatically retrieving text and link data from webpages. It explains web scraping concepts, key libraries, and steps for extracting data, along with practical examples of scripts for extracting paragraphs, links, headings, and images from a webpage. The document emphasizes the importance of ethical standards and the utility of web scraping in various fields such as data science and market research.

Uploaded by

Yogesh Kumar

Machine Exercise: 2

Aim:
This exercise aims to demonstrate fundamental web scraping techniques. The goal is to use Python
with the BeautifulSoup library to programmatically retrieve text and link data from a specific
webpage.

What is Web Scraping?

Web scraping is a process for automatically collecting information from websites. It works by
programmatically fetching a webpage's HTML source code and then parsing it to find and extract
specific data points. This extracted data can include text, images, hyperlinks, or tabular data.

The process begins with a script sending an HTTP request to a website's server. The server responds
by sending back the webpage's content, typically in HTML format. The scraping script then scans
this content, looking for predefined patterns or elements to pull out the desired information. The
collected data is then often stored in a structured file format like CSV, JSON, or a database for later
use.
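
The storage step described above can be sketched as follows. This is a minimal illustration that uses only Python's standard library; the records and the helper names (to_csv, to_json) are hypothetical and stand in for whatever a scraper has already extracted.

```python
import csv
import io
import json

# Hypothetical records standing in for data a scraper has already extracted.
records = [
    {"text": "Admissions", "href": "/admissions"},
    {"text": "Contact Us", "href": "/contact"},
]


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize the same rows to an indented JSON array."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


csv_text = to_csv(records, ["text", "href"])
json_text = to_json(records)
print(csv_text)
print(json_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; a real script would more likely pass an open file object to csv.DictWriter.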

Specialized tools and libraries, particularly in programming languages like Python, are used to
perform these tasks. Requests fetches pages over HTTP, BeautifulSoup parses the returned HTML, and
Selenium drives a real browser for pages that build their content with JavaScript. Together they can
precisely target and extract content from specific HTML tags like <p>, <a>, and <div>.

This technique is valuable because it enables data collection from websites that lack dedicated APIs
or downloadable data sets. It supports the gathering of large datasets efficiently, facilitates real-time
data tracking (such as market prices or news updates), and is crucial for data analysis, machine
learning, and informed decision-making. Its applications are widespread across data science, business
intelligence, market research, and academia.

Libraries for Web Scraping

A variety of powerful libraries are available for web scraping, particularly in Python and R, which
automate tasks like making web requests, parsing HTML, navigating page structures, and extracting
data for storage and analysis.

Python Libraries
1. BeautifulSoup: A library for parsing HTML and XML documents.
   o Features: Can use different parsers, such as lxml and html5lib. Provides an easy way to
     locate and extract elements by tag name and attribute. Best suited for small
     to moderately sized scraping projects.
2. Requests: A library designed for handling HTTP requests.
   o Features: Simplifies sending GET and POST requests. Manages details like request
     headers, cookies, and sessions. Commonly used in conjunction with BeautifulSoup
     for scraping.
3. Scrapy: A comprehensive framework for web crawling and scraping.
   o Features: Offers a complete solution for building sophisticated crawlers. Supports
     asynchronous requests for enhanced speed. Includes built-in feed exports for formats
     like JSON and CSV; item pipelines can write scraped data to databases.
4. Selenium: A tool for automating browser actions.
   o Features: Essential for scraping dynamic websites that rely on JavaScript. Simulates
     user interactions like clicking, scrolling, and form submission. Suited to content that
     loads only after the initial page view.
5. html5lib: A pure-Python parser that follows the HTML5 specification.
   o Features: Parses malformed or broken HTML the way a web browser does. Can
     be used as a parser backend for BeautifulSoup. It is generally slower but more
     robust for complex or messy pages.

R Libraries
1. rvest: An R package for easy web scraping.
   o Features: Allows users to select HTML elements using CSS selectors or XPath.
     Provides simple functions like html_nodes() and html_text() for parsing. Well-suited for
     extracting data from static web pages.
2. httr: A library for making HTTP requests in R.
   o Features: Supports various request methods, including GET and POST. Helps manage
     authentication, headers, and cookies. Often paired with rvest for a complete scraping
     workflow.
3. RSelenium: An R package for browser automation.
   o Features: Provides control over a browser instance to scrape dynamic content. Can
     automate actions like clicking links and filling forms. Useful for scraping sites that
     load content through JavaScript.

Steps to Extract Data using Web Scraping

1. Define the Target Data: Clearly identify the specific information you want to collect, such
   as text, links, or images, and pinpoint its location on the webpage.
2. Examine the Page Structure: Use your browser's developer tools (like "Inspect
   Element") to examine the HTML source code and find the unique tags, classes, or IDs that
   contain your target data.
3. Build the Scraping Script: Develop a program using a web scraping library (e.g.,
   BeautifulSoup, Scrapy) that sends a request to the website and then systematically extracts the
   identified data.
4. Save the Extracted Data: Store the scraped data in a useful format like a CSV file, a JSON
   document, or a database, preparing it for the next steps.
5. Clean and Prepare the Data: Process the collected data to handle any inconsistencies,
   remove duplicates, or correct errors, ensuring it is ready for analysis.
6. Analyze the Data: Apply statistical methods, machine learning models, or data visualization
   techniques to the prepared data to derive insights and create reports.
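
The cleaning step in the list above might look like the sketch below. It assumes the scraper returned a list of raw strings; the function name clean_texts and the sample strings are hypothetical.

```python
import re


def clean_texts(raw_texts):
    """Collapse whitespace, drop empty strings, and remove duplicates in order."""
    seen = set()
    cleaned = []
    for raw in raw_texts:
        # Replace every run of spaces, tabs, or newlines with a single space.
        text = re.sub(r"\s+", " ", raw).strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Hypothetical scraped strings with typical noise.
scraped = [
    "  Welcome to\n the campus ",
    "Admissions",
    "",
    "Admissions",
    "Welcome to the campus",
]
print(clean_texts(scraped))
```

Normalizing whitespace before checking for duplicates matters here: the first and last sample strings only compare equal after normalization.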

Key HTML Elements Used for Web Scraping
Understanding common HTML elements is crucial for effective web scraping. The following are
some of the most frequently targeted elements:

1. Division Tag (<div>): Serves as a generic container for grouping other elements. Often used
to logically section off content like an article, a product description, or a sidebar.
2. Paragraph Tag (<p>): Holds blocks of plain text, such as article content, product
descriptions, or user reviews.
3. Anchor Tag (<a>): Defines a hyperlink. The URL is contained in the href attribute, while the
link text is found between the opening and closing tags.
4. Image Tag (<img>): Used to embed images. The image's source URL is located in the src
attribute.
5. Span Tag (<span>): An inline container for a small portion of text or other elements.
Frequently used to apply specific styling to elements like prices, ratings, or labels.
6. Table Tags (<table>, <tr>, <td>, <th>): Used for presenting data in a structured, tabular format.
<table> defines the table, <tr> a row, <td> a data cell, and <th> a header cell.
7. List Tags (<ul>, <ol>, <li>): <ul> creates an unordered list, <ol> creates an ordered list, and <li>
defines an individual list item. These are helpful for extracting data from menus, feature lists,
or bullet points.
8. Heading Tags (<h1> to <h6>): Used to define headings of varying importance, with <h1>
being the most important and <h6> the least. Ideal for extracting titles and section headers
from a page.
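
To illustrate how the table tags described in item 6 map onto rows and cells, here is a small sketch built on html.parser from Python's standard library rather than BeautifulSoup, so it runs without extra installs. The class name TableParser and the sample table are hypothetical, and the sketch assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample markup, not taken from any real page.
sample_html = """
<table>
  <tr><th>Course</th><th>Seats</th></tr>
  <tr><td>B.Tech CSE</td><td>120</td></tr>
</table>
"""
parser = TableParser()
parser.feed(sample_html)
print(parser.rows)
```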

About the Data Extracted from the College Website
The data for this example was collected from the publicly available official website of Amritsar
Group of Colleges ([Link]). The primary goal was to provide a practical demonstration of
basic web scraping by extracting text directly from a real-world educational website.

A Python script, utilizing the BeautifulSoup and Requests libraries, was used to access the website.
The script retrieved the HTML content and then extracted various types of plain text, including:

• Informational content from the page, such as general announcements and navigation labels.
• Text found within common HTML tags like <div>, <span>, <p>, and <a>.
• Headings and descriptive text from the main homepage.

This extracted data is intended to serve as a practical illustration of how to parse and process content
from a webpage. The scraping was strictly limited to publicly available information, ensuring no
private or sensitive data was accessed and adhering to ethical standards. The example highlights how
data extraction can be applied for educational purposes, content analysis, or simply for hands-on
practice with web scraping concepts.
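
One common way to put the ethical standards mentioned above into practice is to consult a site's robots.txt before fetching a page. The sketch below is an assumption about how that check could be done with urllib.robotparser from the standard library; it is not part of the original exercise. The rules, URLs, and user-agent name are hypothetical, and the rules are parsed from an inline list so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real script would load the site's own file,
# for example with RobotFileParser.set_url() followed by read().
robots_lines = [
    "User-agent: *",
    "Disallow: /admin/",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(robots_lines)


def is_allowed(url, user_agent="example-scraper"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/about"))
print(is_allowed("https://example.com/admin/login"))
```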

Q 1) Extract all paragraph texts from a webpage.

Description: Write a Python script to fetch a webpage and extract all text content from the
paragraph (<p>) tags.

Step 1: Import Libraries
Description: Import the necessary libraries for making HTTP requests and parsing HTML.

import requests
from bs4 import BeautifulSoup

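Step 2: Fetch the Webpage
Description: Use the requests library to get the HTML content from the target URL.

url = "[Link]"  # placeholder: the target URL is elided in this copy
try:
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for 4xx/5xx responses
    html_content = response.text
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
    raise SystemExit(1)  # Stop here; there is nothing to parse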

Step 3: Parse with BeautifulSoup
Description: Create a BeautifulSoup object to parse the HTML content, making it easy to search.

# 'html.parser' is Python's built-in parser, so no extra install is needed
soup = BeautifulSoup(html_content, 'html.parser')

Step 4: Find and Extract Text from Paragraphs
Description: Use find_all() to locate all <p> tags and iterate through them to print the text.

paragraphs = soup.find_all('p')
print("Extracted Paragraphs:")
for p in paragraphs:
    print("-" * 50)
    # Print the paragraph text without leading or trailing whitespace
    print(p.text.strip())

Output -

Q 2) Retrieve all links from a webpage and print them.

Description: Write a Python script to retrieve all hyperlinks (the href attribute) from the anchor
(<a>) tags on a webpage and print them.

Step 1: Import Libraries, Fetch, and Parse
Description: The first steps are the same as in Q 1, ensuring we have a parsed BeautifulSoup
object to work with.

import requests
from bs4 import BeautifulSoup

url = "[Link]"
try:
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for 4xx/5xx responses
    # The 'lxml' parser requires the lxml package to be installed
    soup = BeautifulSoup(response.text, 'lxml')
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
    raise SystemExit(1)

Step 2: Find and Extract Links
Description: Use find_all() to locate all <a> tags. Iterate through the results and extract the
href attribute if it exists.

links = soup.find_all('a')
print("Extracted Links (href attributes):")
print("-" * 50)
for link in links:
    href = link.get('href')  # Returns None when the tag has no href
    link_text = link.text.strip()
    # Only print links that have an href attribute
    if href:
        print(f"Text: '{link_text}' | Link: {href}")

Output
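
As a follow-up to Q 2: href values are often relative (for example /admissions), so a common next step is to resolve them against the page URL. The sketch below assumes this is wanted and uses urllib.parse from the standard library; the function name absolutize, the base URL, and the sample href values are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def absolutize(base_url, hrefs):
    """Resolve hrefs against base_url; keep unique http(s) links in order."""
    resolved = []
    for href in hrefs:
        full = urljoin(base_url, href.strip())
        # Skip non-web schemes such as mailto: and links already collected.
        if urlparse(full).scheme in ("http", "https") and full not in resolved:
            resolved.append(full)
    return resolved


# Hypothetical base URL and href values of the kinds a page typically contains.
base = "https://example.edu/index.html"
hrefs = [
    "/admissions",
    "about.html",
    "https://example.edu/contact",
    "mailto:info@example.edu",
    "/admissions",
]
result = absolutize(base, hrefs)
print(result)
```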

Q3) Extract all heading text from a webpage.

import requests
from bs4 import BeautifulSoup

url = '[Link]'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    print("Headings:")

    # Check each heading level from <h1> to <h6>
    for i in range(1, 7):
        headings = soup.find_all(f'h{i}')
        for h in headings:
            print(f"h{i}: {h.get_text(strip=True)}")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

OUTPUT:
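
The loop in Q3 prints headings grouped by level, so the order in which they appear on the page is lost. If an outline in document order is wanted instead, one possible approach is sketched below. It uses html.parser from the standard library rather than BeautifulSoup so that it runs offline; the class name HeadingOutline and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingOutline(HTMLParser):
    """Record (level, text) for every heading in the order it appears."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # level of the heading currently open, if any
        self._parts = []    # text fragments of that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            self.outline.append((self._level, "".join(self._parts).strip()))
            self._level = None


# Made-up sample markup.
sample_html = (
    "<h1>College</h1><p>Intro</p>"
    "<h2>Courses</h2><h3>B.Tech</h3><h2>Contact</h2>"
)
parser = HeadingOutline()
parser.feed(sample_html)
for level, text in parser.outline:
    # Indent each heading by its level to show the page structure.
    print("  " * (level - 1) + f"h{level}: {text}")
```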

Q4) Extract the plain text (first 500 characters) from a webpage.

import requests
from bs4 import BeautifulSoup
import textwrap

url = '[Link]'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    print("Complete Page Text (First 500 characters, wrapped):\n")
    # separator=' ' keeps words from adjacent tags from running together
    page_text = soup.get_text(separator=' ', strip=True)[:500]
    wrapped_text = textwrap.fill(page_text, width=80)
    print(wrapped_text)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Output:
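
Slicing with [:500] as in Q4 can cut the text in the middle of a word. Where that matters, textwrap.shorten from the standard library offers an alternative that collapses whitespace and truncates at a word boundary. The helper name preview and the sample text below are hypothetical.

```python
import textwrap


def preview(text, limit=500):
    """Collapse whitespace and cut at a word boundary to fit within limit."""
    # The placeholder is appended only when the text had to be truncated.
    return textwrap.shorten(text, width=limit, placeholder=" ...")


# Made-up sample text with irregular spacing and a line break.
sample = "Web scraping   collects\ninformation from websites automatically. " * 20
short = preview(sample, limit=60)
print(short)
print(len(short))
```

Text that already fits within the limit is returned with its whitespace normalized and no placeholder added.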

Q5) Extract all image URLs from a webpage.

import requests
from bs4 import BeautifulSoup

url = '[Link]'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    images = soup.find_all('img')
    print("Image URLs:")
    for img in images:
        src = img.get('src')  # Returns None when the tag has no src
        if src:
            print(src)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Output:
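
Once the src values from Q5 are collected, a quick summary by file type can be useful. The sketch below is one hypothetical way to do that with the standard library; the function name extension_counts and the sample src values are assumptions for illustration.

```python
import posixpath
from collections import Counter
from urllib.parse import urlparse


def extension_counts(srcs):
    """Count image URLs by file extension, ignoring queries and data URIs."""
    counts = Counter()
    for src in srcs:
        if src.startswith("data:"):
            continue  # inline image data has no file name
        # urlparse separates the path from any ?query before splitext runs.
        ext = posixpath.splitext(urlparse(src).path)[1].lower()
        counts[ext or "(none)"] += 1
    return counts


# Hypothetical src values.
srcs = [
    "/img/logo.PNG",
    "banner.jpg?v=2",
    "https://cdn.example.com/a/photo.jpg",
    "data:image/gif;base64,R0lGOD",
]
counts = extension_counts(srcs)
print(counts)
```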

---------------------------------
signature
