Machine Exercise: 2
Aim:
This exercise aims to demonstrate fundamental web scraping techniques. The goal is to use Python
with the BeautifulSoup library to programmatically retrieve text and link data from a specific
webpage.
What is Web Scraping?
Web scraping is a process for automatically collecting information from websites. It works by
programmatically fetching a webpage's HTML source code and then parsing it to find and extract
specific data points. This extracted data can include text, images, hyperlinks, or tabular data.
The process begins with a script sending an HTTP request to a website's server. The server responds
by sending back the webpage's content, typically in HTML format. The scraping script then scans
this content, looking for predefined patterns or elements to pull out the desired information. The
collected data is then often stored in a structured file format like CSV, JSON, or a database for later
use.
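The storage step described above can be sketched with Python's standard library alone. This is a minimal illustration, not part of the original exercise: the records list and the helper names to_csv and to_json are hypothetical stand-ins for data a scraper has already extracted.

```python
import csv
import io
import json

# Hypothetical records, standing in for data a scraper has already extracted.
records = [
    {"text": "Home", "href": "/"},
    {"text": "Admissions", "href": "/admissions"},
]


def to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "href"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize the same rows as an indented JSON array."""
    return json.dumps(rows, indent=2)


csv_text = to_csv(records)
json_text = to_json(records)
print(csv_text)
print(json_text)
```

Writing to an in-memory buffer keeps the sketch free of file paths; in a real script the same writer could be pointed at a file opened with open(path, "w", newline="").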
Specialized tools and libraries, particularly in programming languages like Python, are used to
perform these tasks. Requests fetches pages, BeautifulSoup parses the returned HTML, and Selenium
drives a real browser for pages that build their content with JavaScript. Together they make it
possible to target and extract content from specific HTML tags like <p>, <a>, and <div>.
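As a rough sketch of what targeting specific tags looks like with BeautifulSoup, the example below parses a small inline snippet instead of a live page. The HTML, the class name "notice", and the variable names are assumptions made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used in place of a fetched page.
SAMPLE_HTML = """
<div class="notice">
  <p>Admissions are open.</p>
  <p>Classes begin in August.</p>
  <a href="/apply">Apply now</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the <div> by its class, then search only inside it.
notice = soup.find("div", class_="notice")
paragraph_texts = [p.get_text(strip=True) for p in notice.find_all("p")]

# The link target lives in the href attribute of the <a> tag.
link = notice.find("a")
link_target = link.get("href")

print(paragraph_texts)
print(link_target)
```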
This technique is valuable because it enables data collection from websites that lack dedicated APIs
or downloadable data sets. It supports the gathering of large datasets efficiently, facilitates real-time
data tracking (such as market prices or news updates), and is crucial for data analysis, machine
learning, and informed decision-making. Its applications are widespread across data science, business
intelligence, market research, and academia.
Libraries for Web Scraping
A variety of powerful libraries are available for web scraping, particularly in Python and R, which
automate tasks like making web requests, parsing HTML, navigating page structures, and extracting
data for storage and analysis.
Python Libraries
1. BeautifulSoup: A powerful tool for parsing HTML and XML documents.
o Features: Can use different parsers like lxml and html5lib. Provides an easy way to
locate and extract elements using their tag names and attributes. Best suited for small
to moderately sized scraping projects.
2. Requests: A library designed for handling HTTP requests.
o Features: Simplifies sending GET and POST requests. Manages details like request
headers, cookies, and sessions. Is commonly used in conjunction with BeautifulSoup
for scraping.
3. Scrapy: A comprehensive framework for web crawling and scraping.
o Features: Offers a complete solution for building sophisticated crawlers. Supports
asynchronous requests for enhanced speed. Includes built-in features for exporting
scraped data to formats like JSON, CSV, and databases.
4. Selenium: A tool for automating browser actions.
o Features: Essential for scraping dynamic websites that rely on JavaScript. Simulates
user interactions like clicking, scrolling, and form submission. Perfect for content that
loads only after the initial page view.
5. html5lib: A standards-compliant HTML5 parser that reads pages the way a web browser does.
o Features: Known for its accuracy in parsing even malformed or broken HTML. Can
be used as a reliable parser backend for BeautifulSoup. It is generally slower but more
robust for complex or messy pages.
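To illustrate the header and session handling mentioned for Requests in the list above, the sketch below builds a request without sending it, so it runs without network access. The User-Agent string and the example.com URL are hypothetical placeholders.

```python
import requests

# A Session carries settings (headers, cookies) across several requests.
session = requests.Session()
session.headers.update({"User-Agent": "lab-exercise-scraper/0.1"})  # hypothetical value

# Build the request and let the session merge in its headers.
# Nothing is sent over the network here; the request is only prepared.
request = requests.Request("GET", "https://example.com/search", params={"page": 2})
prepared = session.prepare_request(request)

print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

Passing the prepared request to session.send() would perform the actual HTTP call; inspecting it first is a convenient way to check what will be sent.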
R Libraries
1. rvest: An R package for easy web scraping.
o Features: Allows users to select HTML elements using CSS selectors or XPath.
Provides simple functions like html_nodes() and html_text() for parsing. Well-suited for
extracting data from static web pages.
2. httr: A library for making HTTP requests in R.
o Features: Supports various request methods, including GET and POST. Helps manage
authentication, headers, and cookies. Often paired with rvest for a complete scraping
workflow.
3. RSelenium: An R package for browser automation.
o Features: Provides control over a browser instance to scrape dynamic content. Can
automate actions like clicking links and filling forms. Useful for scraping sites that
load content through JavaScript.
Steps to Extract Data using Web Scraping
1. Define the Target Data: Clearly identify the specific information you want to collect, such
as text, links, or images, and pinpoint its location on the webpage.
2. Examine the Page Structure: Use your browser's developer tools (like "Inspect
Element") to examine the HTML source code and find the unique tags, classes, or IDs that
contain your target data.
3. Build the Scraping Script: Develop a program using a web scraping library (e.g.,
BeautifulSoup, Scrapy) that sends a request to the website and then systematically extracts the
identified data.
4. Save the Extracted Data: Store the scraped data in a useful format like a CSV file, a JSON
document, or a database, preparing it for the next steps.
5. Clean and Prepare the Data: Process the collected data to handle any inconsistencies,
remove duplicates, or correct errors, ensuring it is ready for analysis.
6. Analyze the Data: Apply statistical methods, machine learning models, or data visualization
techniques to the prepared data to derive insights and create reports.
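The cleaning step in the list above might look like the following for a list of scraped links. It is a minimal sketch using only built-in Python; the function name clean_links and the sample values are hypothetical.

```python
def clean_links(raw_links):
    """Strip whitespace, drop empty entries, and remove duplicates.

    The first occurrence of each link is kept, so the original order is preserved.
    """
    seen = set()
    cleaned = []
    for link in raw_links:
        if link is None:
            continue
        link = link.strip()
        if not link or link in seen:
            continue
        seen.add(link)
        cleaned.append(link)
    return cleaned


# Hypothetical raw values, including a duplicate, padding, and missing entries.
raw = ["/about", " /about ", "", None, "/contact", "/about"]
cleaned = clean_links(raw)
print(cleaned)
```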
Key HTML Elements Used for Web Scraping
Understanding common HTML elements is crucial for effective web scraping. The following are
some of the most frequently targeted elements:
1. Division Tag (<div>): Serves as a generic container for grouping other elements. Often used
to logically section off content like an article, a product description, or a sidebar.
2. Paragraph Tag (<p>): Holds blocks of plain text, such as article content, product
descriptions, or user reviews.
3. Anchor Tag (<a>): Defines a hyperlink. The URL is contained in the href attribute, while the
link text is found between the opening and closing tags.
4. Image Tag (<img>): Used to embed images. The image's source URL is located in the src
attribute.
5. Span Tag (<span>): An inline container for a small portion of text or other elements.
Frequently used to apply specific styling to elements like prices, ratings, or labels.
6. Table Tags (<table>, <tr>, <td>, <th>): Used for presenting data in a structured, tabular format.
<table> defines the table, <tr> a row, <td> a data cell, and <th> a header cell.
7. List Tags (<ul>, <ol>, <li>): <ul> creates an unordered list, <ol> creates an ordered list, and <li>
defines an individual list item. These are helpful for extracting data from menus, feature lists,
or bullet points.
8. Heading Tags (<h1> to <h6>): Used to define headings of varying importance, with <h1>
being the most important and <h6> the least. Ideal for extracting titles and section headers
from a page.
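As one illustration of how the elements listed above map onto code, the sketch below turns a small <table> into a list of dictionaries keyed by the header cells. The table contents and variable names are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical table used in place of a real page.
TABLE_HTML = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Notebook</td><td>40</td></tr>
  <tr><td>Pen</td><td>10</td></tr>
</table>
"""

soup = BeautifulSoup(TABLE_HTML, "html.parser")
table_rows = soup.find("table").find_all("tr")

# The first row holds the <th> header cells; the rest hold <td> data cells.
headers = [th.get_text(strip=True) for th in table_rows[0].find_all("th")]
rows = []
for tr in table_rows[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(dict(zip(headers, cells)))

print(rows)
```

Cell values come back as strings; converting them to numbers belongs to the cleaning step.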
About the Data Extracted from the College Website
The data for this example was collected from the publicly available official website of Amritsar
Group of Colleges ([Link]). The primary goal was to provide a practical demonstration of
basic web scraping by extracting text directly from a real-world educational website.
A Python script, utilizing the BeautifulSoup and Requests libraries, was used to access the website.
The script retrieved the HTML content and then extracted various types of plain text, including:
Informational content from the page, such as general announcements and navigation labels.
Text found within common HTML tags like <div>, <span>, <p>, and <a>.
Headings and descriptive text from the main homepage.
This extracted data is intended to serve as a practical illustration of how to parse and process content
from a webpage. The scraping was strictly limited to publicly available information, ensuring no
private or sensitive data was accessed and adhering to ethical standards. The example highlights how
data extraction can be applied for educational purposes, content analysis, or simply for hands-on
practice with web scraping concepts.
Q 1) Extract all paragraph texts from a webpage.
Description: Write a Python script to fetch a webpage and extract all text content from the
paragraph (<p>) tags.
Step 1: Import Libraries
Description: Import the necessary libraries for making HTTP requests and parsing HTML.
import requests
from bs4 import BeautifulSoup
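Step 2: Fetch the Webpage
Description: Use the requests library to get the HTML content from the target URL.
url = "[Link]"
try:
    # A timeout keeps the script from hanging on an unresponsive server
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raise an exception for 4xx/5xx responses
    html_content = response.text
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
    raise SystemExit(1)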
Step 3: Parse with BeautifulSoup
Description: Create a BeautifulSoup object to parse the HTML content, making it easy to search.
soup = BeautifulSoup(html_content, 'html.parser')
Step 4: Find and Extract Text from Paragraphs
Description: Use find_all() to locate all <p> tags and iterate through them to print the text.
paragraphs = soup.find_all('p')
print("Extracted Paragraphs:")
for p in paragraphs:
    # Print a separator line before each paragraph
    print("-" * 50)
    print(p.get_text())
Output:
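As a possible variation on Q 1, the parsing logic can be wrapped in a function that takes an HTML string, which makes it testable without a network connection. The function name extract_paragraphs and the sample HTML are hypothetical; the sketch also skips paragraphs that contain only whitespace.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(html):
    """Return the text of every non-empty <p> tag in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for p in soup.find_all("p"):
        # The " " separator keeps text from nested tags apart.
        text = p.get_text(" ", strip=True)
        if text:
            texts.append(text)
    return texts


# Hypothetical input with a nested tag and an empty paragraph.
sample = "<p>First <b>bold</b> paragraph.</p><p>   </p><p>Second paragraph.</p>"
paragraph_list = extract_paragraphs(sample)
print(paragraph_list)
```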
Q 2) Retrieve all links from a webpage and print them.
Description: Write a Python script to retrieve all hyperlinks (the href attribute) from the anchor
(<a>) tags on a webpage and print them.
Step 1: Import Libraries, Fetch, and Parse
Description: The first steps are the same as in Q 1, ensuring we have a parsed BeautifulSoup
object to work with.
import requests
from bs4 import BeautifulSoup
url = "[Link]"
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # The 'lxml' parser requires the separate lxml package to be installed
    soup = BeautifulSoup(response.text, 'lxml')
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
    raise SystemExit(1)
Step 2: Find and Extract Links
Description: Use find_all() to locate all <a> tags. Iterate through the results and extract the
href attribute if it exists.
links = soup.find_all('a')
print("Extracted Links (href attributes):")
print("-" * 50)
for link in links:
    href = link.get('href')  # Returns None when the attribute is missing
    link_text = link.get_text()
    # Only print links that have an href attribute
    if href:
        print(f"Text: '{link_text}' | Link: {href}")
Output:
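The href values printed in Q 2 are often relative (for example "/contact"). One way to turn them into full URLs is urljoin from the standard library, sketched below. The base URL, the sample hrefs, and the function name to_absolute are hypothetical; the sketch also drops non-web schemes such as mailto.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical address of the page the links were scraped from.
BASE_URL = "https://example.com/about/index.html"


def to_absolute(base_url, hrefs):
    """Resolve each href against base_url and keep only http(s) links."""
    absolute = []
    for href in hrefs:
        full = urljoin(base_url, href)
        if urlparse(full).scheme in ("http", "https"):
            absolute.append(full)
    return absolute


hrefs = ["/contact", "news.html", "https://example.org/", "mailto:info@example.com"]
absolute_links = to_absolute(BASE_URL, hrefs)
for url in absolute_links:
    print(url)
```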
Q 3) Extract all heading text from a webpage.
import requests
from bs4 import BeautifulSoup
url = '[Link]'
response = requests.get(url, timeout=10)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    print("Headings:")
    # Loop over heading levels h1 through h6
    for i in range(1, 7):
        headings = soup.find_all(f'h{i}')
        for h in headings:
            print(f"h{i}: {h.get_text(strip=True)}")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
Output:
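The Q 3 script groups headings by level, so all h1 headings print before any h2. If the order in which headings appear on the page matters, find_all() also accepts a list of tag names and returns matches in document order. The sketch below assumes a small invented HTML string and a hypothetical function name.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def headings_in_order(html):
    """Return (tag name, text) pairs for all headings, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [(tag.name, tag.get_text(strip=True)) for tag in soup.find_all(HEADING_TAGS)]


# Hypothetical page outline.
sample = "<h1>College</h1><h2>Courses</h2><h3>Engineering</h3><h2>Contact</h2>"
outline = headings_in_order(sample)
for name, text in outline:
    print(f"{name}: {text}")
```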
Q 4) Extract the plain text (first 500 characters) from a webpage.
import requests
from bs4 import BeautifulSoup
import textwrap
url = '[Link]'
response = requests.get(url, timeout=10)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    print("Page Text (first 500 characters, wrapped):\n")
    # separator=' ' keeps text from adjacent tags from running together
    page_text = soup.get_text(separator=' ', strip=True)[:500]
    wrapped_text = textwrap.fill(page_text, width=80)
    print(wrapped_text)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
Output:
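Slicing with [:500] can cut the preview in the middle of a word. A possible alternative, sketched here with the standard library only, is to collapse whitespace and let textwrap.shorten() truncate at a word boundary. The function name preview, the "..." placeholder, and the sample text are assumptions for illustration.

```python
import textwrap


def preview(text, limit=500):
    """Collapse runs of whitespace and truncate at a word boundary."""
    collapsed = " ".join(text.split())
    # shorten() drops whole words from the end until the result,
    # including the placeholder, fits within the given width.
    return textwrap.shorten(collapsed, width=limit, placeholder="...")


# Hypothetical page text with irregular spacing and blank lines.
raw = "Welcome   to\n\n the   college   website.  Admissions are open."
short = preview(raw, limit=30)
print(short)
```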
Q 5) Extract all image URLs from a webpage.
import requests
from bs4 import BeautifulSoup
url = '[Link]'
response = requests.get(url, timeout=10)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    images = soup.find_all('img')
    print("Image URLs:")
    for img in images:
        src = img.get('src')
        # Skip <img> tags that have no src attribute
        if src:
            print(src)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
Output:
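Image src values are frequently relative paths, and the alt attribute often carries a useful description. The sketch below, which is an illustration and not the script used in the exercise, collects both and resolves each src against a base URL. The function name extract_images, the sample HTML, and the example.com base are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_images(html, base_url):
    """Return a list of {'url', 'alt'} dicts for every <img> that has a src."""
    soup = BeautifulSoup(html, "html.parser")
    images = []
    for img in soup.find_all("img"):
        src = img.get("src")
        if not src:
            continue  # Nothing to download without a source URL
        images.append({"url": urljoin(base_url, src), "alt": img.get("alt", "")})
    return images


# Hypothetical markup: one absolute path, one tag without src, one relative path.
sample = '<img src="/static/logo.png" alt="Logo"><img alt="no source"><img src="banner.jpg">'
images = extract_images(sample, "https://example.com/home/")
for image in images:
    print(image["url"], "|", image["alt"])
```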
---------------------------------
signature