Final Web Scraping Complete Detailed

The document provides a comprehensive guide on web scraping using the Requests library and BeautifulSoup in Python. It includes detailed explanations of various methods for sending HTTP requests, parsing HTML, and extracting data, along with practical code examples and a mini project. Additionally, it covers legal considerations, advanced topics, and common interview questions related to web scraping.

Uploaded by

tapanideaprime
Copyright
© All Rights Reserved

ALL METHODS OF REQUESTS & BEAUTIFULSOUP

Theory + Code Examples (Complete Reference Notes)


PART A: REQUESTS LIBRARY – ALL IMPORTANT METHODS
1. requests.get()
Used to retrieve data from a server. Most commonly used method in web scraping.
import requests

response = requests.get("[Link]")
print(response.status_code)
print(response.text)

2. requests.post()
Used to send data to the server (forms, login, APIs).
data = {"username": "user", "password": "pass"}
response = requests.post("[Link]", data=data)
print(response.text)

3. requests.put()
Used to update existing data on a server (mostly APIs).
data = {"name": "Updated Name"}
requests.put("[Link]", json=data)

4. requests.delete()
Used to delete data on the server.
requests.delete("[Link]")

5. requests.head()
Fetches only headers, no response body. Useful for checking availability.
response = requests.head("[Link]")
print(response.headers)

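6. requests.options()
Checks allowed HTTP methods for a resource. Not every server returns an Allow header, so read it with get().
response = requests.options("[Link]")
print(response.headers.get("Allow"))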

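7. Headers & Params
Headers carry request metadata such as the User-Agent. Params are appended to the URL as a query string.
headers = {"User-Agent": "Mozilla/5.0"}
params = {"page": 1}

requests.get("[Link]", headers=headers, params=params)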
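To see what headers and params actually do, a request can be prepared without being sent. This is a minimal sketch; the URL https://example.com/search is a placeholder chosen for illustration and is not from the notes above.

```python
import requests

# Placeholder URL and values (assumptions for illustration only).
headers = {"User-Agent": "Mozilla/5.0"}
params = {"page": 1}

# Build the request without sending it, so no network access is needed.
prepared = requests.Request(
    "GET", "https://example.com/search", headers=headers, params=params
).prepare()

print(prepared.url)                    # URL with the query string built from params
print(prepared.headers["User-Agent"])  # header value that would be sent
```

Inspecting a prepared request like this is a quick way to debug a query string before hitting a real server.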

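8. Sessions
A Session object persists cookies across requests and reuses the underlying connection, which speeds up repeated requests to the same host.
session = requests.Session()
session.get("[Link]")
session.get("[Link]")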
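The way a session carries state from one request to the next can be sketched offline. The URL, cookie name, and cookie value below are hypothetical; in real use the cookie would normally be set by a server response (for example after a login).

```python
import requests

session = requests.Session()
# Defaults set on the session apply to every request made through it.
session.headers.update({"User-Agent": "Mozilla/5.0"})
# Cookies normally arrive from server responses; one is set by hand here.
session.cookies.set("token", "abc123")

# Prepare (but do not send) a request to inspect what the session adds.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/profile")
)
print(prepared.headers["User-Agent"])
print(session.cookies.get("token"))
```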
PART B: BEAUTIFULSOUP – ALL IMPORTANT METHODS
1. BeautifulSoup()
Creates a parse tree from HTML.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")

2. find()
Returns the first matching tag.
soup.find("h1")

3. find_all()
Returns all matching tags as a list.
soup.find_all("a")

4. select()
Uses CSS selectors.
soup.select("[Link] > p")

5. get_text()
Extracts only text content.
soup.find("p").get_text()

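6. attrs & get()
attrs returns all attributes of a tag as a dictionary; get() reads a single attribute and returns None if it is missing.
link = soup.find("a")
print(link.attrs)
print(link.get("href"))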

7. parent / children / descendants
tag = soup.find("p")
print(tag.parent)

for child in tag.children:
    print(child)

8. next_sibling / previous_sibling
tag = soup.find("h1")
print(tag.next_sibling)

9. find_next() / find_previous()
soup.find("h1").find_next("p")

10. prettify()
Formats HTML for readability.
print(soup.prettify())

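11. Decompose & Extract
decompose() removes a tag from the tree and destroys it; extract() removes the tag and returns it.
tag = soup.find("script")
tag.decompose()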

12. Summary Table (Conceptual)


Requests handles HTTP communication.
BeautifulSoup handles HTML parsing and navigation.
WEB SCRAPING USING REQUESTS & BEAUTIFULSOUP
Complete Theory + Code + MCQs + Interview Q&A + Mini Project
1. Introduction to Web Scraping
Web scraping is an automated technique to extract data from websites using software. It simulates how a
browser requests a web page and then processes the returned HTML to collect useful information. Web
scraping is widely used in data science, research, price comparison, news aggregation, and machine learning
dataset creation.

Applications of Web Scraping:

• Price monitoring (Amazon, Flipkart)
• Job portals data collection
• News and article aggregation
• Data collection for ML models

2. Requests Library – Detailed Explanation


The Requests library is used to send HTTP requests in Python. It supports GET, POST, PUT, DELETE
methods and handles sessions, cookies, and headers.
import requests

url = "[Link]"
response = requests.get(url)

print(response.status_code)
print(response.text)
Status Codes:
200 – Success
404 – Page not found
403 – Forbidden
500 – Server error
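A small helper can turn these codes into readable labels. The sketch below uses only the standard library; the function name describe_status is made up for illustration. With Requests, the value passed in would be response.status_code, and response.raise_for_status() can be used to raise an exception on 4xx and 5xx responses.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (error)' for an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase for the code
    if 200 <= code < 300:
        kind = "success"
    elif code >= 400:
        kind = "error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"

for code in (200, 403, 404, 500):
    print(describe_status(code))
```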

3. HTTP Headers (IMPORTANT)


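Headers describe the client to the server. By default, Requests identifies itself with a User-Agent
of the form python-requests/<version>, which many websites reject. Sending a browser-like User-Agent
avoids the most basic of these blocks.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

response = requests.get(url, headers=headers)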

4. BeautifulSoup Library
BeautifulSoup parses HTML/XML documents and creates a navigable tree structure. It allows searching data
using tags, attributes, and CSS selectors.
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "lxml")


print(soup.title)
Common Methods:
find(), find_all(), select(), get_text()

5. Real Website Scraping Example


import requests
from bs4 import BeautifulSoup

url = "[Link]"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

quotes = soup.find_all("span", class_="text")
authors = soup.find_all("small", class_="author")

for q, a in zip(quotes, authors):
    print(q.text, "-", a.text)

6. Pagination Handling
page = 1
while True:
    url = f"[Link]"  # the page number goes into the URL
    r = requests.get(url)
    if r.status_code != 200:
        break  # stop on any error response

    soup = BeautifulSoup(r.text, "lxml")
    quotes = soup.find_all("span", class_="text")
    if not quotes:
        break  # stop when a page has no more items

    for q in quotes:
        print(q.text)

    page += 1
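The stopping logic of this loop can be tried without touching a real site. In the sketch below, FAKE_PAGES, fetch, and scrape_all are hypothetical names, and the fetcher is a stand-in for a real requests.get() call. A max_pages cap is added as a safety net so a misbehaving site cannot keep the loop running forever.

```python
from bs4 import BeautifulSoup

# Stand-in for a real site: page number -> HTML (hypothetical data).
FAKE_PAGES = {
    1: '<span class="text">Quote A</span><span class="text">Quote B</span>',
    2: '<span class="text">Quote C</span>',
    3: "<p>No quotes here</p>",
}

def fetch(page):
    """Hypothetical fetcher; a real one would call requests.get()."""
    return FAKE_PAGES.get(page)

def scrape_all(fetch_page, max_pages=100):
    """Collect quote texts page by page until a page is missing or empty."""
    results = []
    for page in range(1, max_pages + 1):  # hard cap avoids endless loops
        html = fetch_page(page)
        if html is None:
            break
        soup = BeautifulSoup(html, "html.parser")
        quotes = soup.find_all("span", class_="text")
        if not quotes:
            break
        results.extend(q.get_text() for q in quotes)
    return results

print(scrape_all(fetch))
```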
7. MINI PROJECT: Job Listings Scraper
Objective: Scrape job title, company name, and location from a job listing website and store data in CSV
format.
import csv
import requests
from bs4 import BeautifulSoup

url = "[Link]"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

jobs = soup.find_all("div", class_="job")

with open("[Link]", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Company", "Location"])

    for job in jobs:
        title = job.find("h2").text
        company = job.find("span", class_="company").text
        location = job.find("span", class_="location").text
        writer.writerow([title, company, location])
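Because find() returns None when a tag is missing, calling .text on the result raises an AttributeError for any listing that lacks a field. One possible guard is sketched below. The markup and the helper name text_or_default are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the class names used above.
html = """
<div class="job"><h2>Data Analyst</h2>
  <span class="company">Acme</span><span class="location">Pune</span></div>
<div class="job"><h2>ML Engineer</h2>
  <span class="company">Globex</span></div>
"""

def text_or_default(parent, name, default="N/A", **kwargs):
    """Return stripped text of the first match, or default if it is missing."""
    tag = parent.find(name, **kwargs)
    return tag.get_text(strip=True) if tag else default

soup = BeautifulSoup(html, "html.parser")
rows = [
    [
        text_or_default(job, "h2"),
        text_or_default(job, "span", class_="company"),
        text_or_default(job, "span", class_="location"),
    ]
    for job in soup.find_all("div", class_="job")
]

buffer = io.StringIO()  # in-memory file; a real run would use open(...)
writer = csv.writer(buffer)
writer.writerow(["Title", "Company", "Location"])
writer.writerows(rows)
print(buffer.getvalue())
```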

8. Advanced Topics
• Sessions & cookies
• Login-based scraping
• Delays using time.sleep()
• Avoiding IP blocking
• robots.txt rules
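Two of these topics, delays and robots.txt rules, can be sketched with the standard library alone. The rules text and URLs below are made up for illustration; a real scraper would load the site's own robots.txt and place its requests.get() call where the print is.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse text directly, no download

def allowed(url, agent="*"):
    """True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

delay = parser.crawl_delay("*") or 1  # fall back to 1 second if unspecified

for url in ("https://example.com/jobs", "https://example.com/private/data"):
    if allowed(url):
        print("fetch", url)   # the real request would go here
        time.sleep(delay)     # pause before the next request
    else:
        print("skip", url)
```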
9. MCQs
1. Which library is used to parse HTML?
A) NumPy B) Requests C) BeautifulSoup D) Pandas
Answer: C

2. Which HTTP status code means Forbidden?


A) 200 B) 404 C) 403 D) 500
Answer: C

10. Interview Questions & Answers


Q1. What is web scraping?
A. Automated extraction of data from websites.

Q2. Difference between Requests and BeautifulSoup?
A. Requests fetches data over HTTP; BeautifulSoup parses the returned HTML.

Q3. What is User-Agent?
A. A request header that identifies the client (browser or script) to the server.

11. Legal & Ethical Considerations


Always follow robots.txt, avoid excessive requests, and scrape only public data. Never scrape private or
copyrighted content without permission.
Mini Project: Flipkart Web Scraping (Industry-Oriented)

Project Objective:
To collect publicly available product information from Flipkart using Python in a clean,
industry-standard approach suitable for data analysis tasks.

This project demonstrates how companies collect market price data for analysis and comparison.

Tools & Technologies

• Python
• Jupyter Notebook
• requests – HTTP communication
• BeautifulSoup (bs4) – HTML parsing
• pandas – data storage & analysis

Business Use Case

E-commerce companies and analysts scrape product data to:


• Track competitor pricing
• Analyze product popularity
• Build pricing dashboards
• Support business decisions

Step 1: Import Required Libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd

Step 2: Send Request with Browser Headers

url = "[Link]"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

response = requests.get(url, headers=headers)
print(response.status_code)

Step 3: Parse HTML Response

soup = BeautifulSoup(response.text, "lxml")

Step 4: Extract Required Fields

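# Class names are generated by the site and may have changed; confirm them in the browser's inspector.
products = soup.find_all("div", class_="_1AtVbE")

records = []

for item in products:
    name = item.find("div", class_="_4rR01T")
    price = item.find("div", class_="_30jeq3")
    rating = item.find("div", class_="_3LWZlK")

    # Keep only cards that have both a name and a price.
    if name and price:
        records.append({
            "Product Name": name.text,
            "Price": price.text,
            "Rating": rating.text if rating else "N/A"
        })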

Step 5: Create Structured Dataset

df = pd.DataFrame(records)
print(df.head())

Sample Output

Product Name         Price    Rating
------------------------------------
Samsung Galaxy F14   ₹9,999   4.2
Redmi 12C            ₹7,499   4.1
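The scraped Price and Rating values are strings, so they usually need converting before any numeric analysis. A possible cleaning step is sketched below. The helper names parse_price and parse_rating are hypothetical, and the price helper assumes whole-rupee amounts with no decimal part.

```python
import re

def parse_price(text):
    """Convert a price string such as '₹9,999' to an int (whole rupees assumed)."""
    digits = re.sub(r"[^\d]", "", text)  # drop currency symbol, commas, spaces
    return int(digits) if digits else None

def parse_rating(text):
    """Convert a rating string to float; 'N/A' or other non-numbers give None."""
    try:
        return float(text)
    except ValueError:
        return None

print(parse_price("₹9,999"), parse_rating("4.2"), parse_rating("N/A"))
```

With pandas, the same helpers could be applied column-wise, for example df["Price"].map(parse_price).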

Step 6: Export Data


df.to_csv("flipkart_products.csv", index=False)

Conclusion

This project follows a clean and professional workflow used in industry:


• Ethical data collection
• Structured data storage
• Reusable and scalable code

The same approach is used in real-world data engineering and analytics roles.
HTML Tags (Basics for Web Scraping)

HTML tags define the structure of a webpage.

Common tags:
<div> container
<a> link
<span> inline container
<img> image

Example:
<h1>Product</h1>
<p>Price ₹999</p>

Tag in BeautifulSoup

Tag represents an HTML element.

Example:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h1>Flipkart</h1>", "html.parser")
type(soup.h1)

NavigableString

NavigableString represents text inside a tag.

Example:
<p>Price ₹999</p>

type(soup.p.string)

BeautifulSoup – All Important Functions

find(), find_all(), select(), get_text()


attrs, get()
parent, children, next_sibling
find_next(), find_previous()
decompose(), extract()
HTML Comments in BeautifulSoup

from bs4 import Comment

soup.find(string=lambda text: isinstance(text, Comment))
HTML TAGS – DETAILED WITH EXAMPLES

HTML tags define elements of a webpage.

Example HTML:
<html>
<body>
<div class="product">
<h1>Mobile Phone</h1>
<p class="price">₹9999</p>
<a href="/mobile">View</a>
</div>
</body>
</html>

In BeautifulSoup:
soup.div → returns first <div> tag
soup.h1.text → Mobile Phone

TAG OBJECT (BeautifulSoup)

A Tag represents an HTML element.

Code:
type(soup.div)

Output:
<class 'bs4.element.Tag'>

Access attributes:
soup.div['class']

Output:
['product']

NAVIGABLESTRING – DETAILED

NavigableString represents text inside a tag.

Code:
type(soup.h1.string)

Output:
<class 'bs4.element.NavigableString'>

Text value:
soup.h1.string

Output:
Mobile Phone

find() FUNCTION

find() returns ONLY the first matching tag.

Code:
soup.find("p")

Output:
<p class="price">₹9999</p>

If not found:
soup.find("table")

Output:
None

find_all() FUNCTION

find_all() returns ALL matching tags as a list.

Code:
soup.find_all("a")

Output:
[<a href="/mobile">View</a>]

select() FUNCTION (CSS SELECTOR)

select() uses CSS selectors.

Code:
soup.select("div.product p.price")

Output:
[<p class="price">₹9999</p>]

get_text() FUNCTION

Extracts only text content.

Code:
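soup.find("div").get_text(" ", strip=True)

Output:
Mobile Phone ₹9999 View

Without arguments, get_text() also returns the newlines between the tags; the separator and strip=True produce the single clean line shown.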

attrs & get()

Code:
tag = soup.find("a")
tag.attrs

Output:
{'href': '/mobile'}

tag.get("href")

Output:
/mobile
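The difference between get() and square-bracket access matters when an attribute may be missing. A short sketch, using the same <a> tag as above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/mobile">View</a>', "html.parser")
tag = soup.find("a")

href = tag.get("href")                    # attribute exists
title = tag.get("title")                  # attribute is absent, so None
fallback = tag.get("title", "untitled")   # custom default instead of None

try:
    tag["title"]                          # square brackets raise on a missing key
except KeyError:
    print("no title attribute")
```

get() is usually the safer choice in scraping loops, where some elements lack the attribute.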

PARENT, CHILDREN, SIBLINGS

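soup.h1.parent

Output:
<div class="product">...</div>

for child in soup.div.children:
    print(child)

Output:
<h1>Mobile Phone</h1>
<p class="price">₹9999</p>
<a href="/mobile">View</a>

Note: .children also yields the newline strings between the tags; they are omitted from the output above.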
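When only the tags are wanted, the whitespace strings that sit between them can be filtered out. The sketch below shows one way to do this with an isinstance check against Tag, using the example HTML from this section.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<div class="product">
<h1>Mobile Phone</h1>
<p class="price">₹9999</p>
<a href="/mobile">View</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.div.children)  # tags plus newline strings
tag_children = [c for c in soup.div.children if isinstance(c, Tag)]

print(len(all_children), len(tag_children))
print([c.name for c in tag_children])

# find_next_sibling() searches for a tag, so it skips the newline
# string that .next_sibling would return here.
print(soup.h1.find_next_sibling("p").name)
```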

find_next() & find_previous()

soup.h1.find_next("p")
Output:
<p class="price">₹9999</p>

soup.p.find_previous("h1")

Output:
<h1>Mobile Phone</h1>

decompose() & extract()

decompose() removes tag permanently

Code:
soup.a.decompose()

extract() removes and returns tag

tag = soup.p.extract()
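The two methods can be compared side by side on a small document. The HTML below is made up for illustration, including the <script> tag, which the earlier example does not contain.

```python
from bs4 import BeautifulSoup

html = (
    '<div><script>track()</script>'
    '<p class="price">9999</p>'
    '<a href="/mobile">View</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

soup.script.decompose()      # destroyed in place; the tag is gone for good
removed = soup.a.extract()   # detached from the tree but still usable

print(soup.div)              # only the <p> remains inside the <div>
print(removed["href"])       # the extracted tag keeps its attributes
```

decompose() suits cleanup of unwanted tags such as scripts and styles, while extract() is useful when the removed piece is needed later.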

HTML COMMENTS

HTML Comment Example:
<!-- Product End -->

Code:
from bs4 import Comment
comment = soup.find(string=lambda t: isinstance(t, Comment))

Output:
Product End
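A fuller sketch of the same idea collects every comment and then strips them from the tree. The HTML is a made-up fragment; note that the comment text keeps its surrounding spaces until strip() is applied.

```python
from bs4 import BeautifulSoup, Comment

html = "<div><p>Mobile Phone</p><!-- Product End --></div>"
soup = BeautifulSoup(html, "html.parser")

# Comment is a string subclass, so a string filter can select it.
comments = soup.find_all(string=lambda t: isinstance(t, Comment))
cleaned = [c.strip() for c in comments]  # remove the inner padding spaces
print(cleaned)

for c in comments:
    c.extract()                          # remove each comment from the tree
print(soup.div)
```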

INDUSTRY SUMMARY

• Tags represent elements
• NavigableString stores text
• find() → first match
• find_all() → list
• select() → CSS selector
• get_text() → text content without tags
• These are core industry scraping concepts
