
Web Scraping Basics and JSoup Guide

What is Web Scraping? - Why is it important? - How to use it safely and legally? - HTML, CSS - Introduction to JSoup Library - JSoup workflow - VNExpress Scraping Demo - Manga Scraping Demo - Potential issues with Web Scraping

Uploaded by

VinhElysia

Vietnam National University of HCMC

International University
School of Computer Science and Engineering

Web Scraping

(IT069IU)

Le Duy Tan, Ph.D.


📧 ldtan@[Link]

🌐 [Link]

Previously,

- Design Patterns:
  - Creational
    - Singleton
    - Factory Method
  - Structural
  - Behavioral

Agenda
- What is Web Scraping?
- Why is it important?
- How to use it safely and legally?
- HTML, CSS
- Introduction to JSoup Library
- JSoup workflow
- VNExpress Scraping Demo
- Manga Scraping Demo
- Potential issues with Web Scraping

What is Web Scraping?

Web Pages → Web Scraping → Structured Data

Why is Web Scraping Used?
- Search engines
- Social Media Scraping
- Price intelligence
- Lead generation
- Research and Development
- Job listings
- Machine Learning
- Personal Hobby
- Download music, manga, movies…

❓ Question: Why don't we use the official APIs of the websites instead of web scraping?

Website Components

Web Component Overview

Analogy of HTML, CSS & JavaScript

What is HTML (HyperText Markup Language)?

HTML Element

Common HTML Tags


Basic HTML Structure

Sample Basic HTML


HTML in VNExpress Website

What is CSS?


HTML + CSS = ❤

Another Example of CSS

Define a CSS class “nickname” for your HTML content:
Style rules for CSS class name or HTML tag name:


HTML Structure & CSS Selector


🤔 Can you guess which CSS selector targets which HTML element?

Example:
- div
- div[my_attribute]
- div[my_attribute="jerry"]
- div[new_attribute="charles"]
- p
- div[my_attribute="jerry"] div
- .my_class_name
- .my_class_name h1
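As a minimal sketch of how these selectors behave, the snippet below runs each of them against a small HTML fragment using JSoup's `Jsoup.parse` and `select`. The HTML fragment and the class name `SelectorDemo` are made up for illustration and are not the slide's original example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorDemo {
    // Hypothetical HTML fragment, invented for illustration only.
    static final String HTML =
        "<div my_attribute=\"jerry\"><div>inner</div><p>first</p></div>"
      + "<div new_attribute=\"charles\"><p>second</p></div>"
      + "<div class=\"my_class_name\"><h1>Title</h1></div>";

    // Returns how many elements in the fragment match the given CSS selector.
    static int count(String cssSelector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssSelector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "div",
            "div[my_attribute]",
            "div[my_attribute=\"jerry\"]",
            "div[new_attribute=\"charles\"]",
            "p",
            "div[my_attribute=\"jerry\"] div",
            ".my_class_name",
            ".my_class_name h1"
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector) + " match(es)");
        }
    }
}
```

A space between two selectors means "descendant of", so `div[my_attribute="jerry"] div` only matches a `div` nested inside the `jerry` one.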
JSoup - Java HTML Parser
- A very convenient Java library for fetching URLs and for extracting and manipulating data.
- Find and extract data using CSS selectors.

[Link]

Setup JSoup for your project


- Download the JSoup jar library from [Link]
- To use JSoup, add its jar library to your project via your IDE:

JSoup Workflow
1. Get the whole HTML document response from the server via a URL:
   a. Document doc = [Link](url).get()
2. Select the specific HTML element(s) we want to scrape from the document:
   a. Elements items = [Link]("CSS Selector Query")
      i. You need to loop through the Elements to process each Element
   b. Element item = [Link]("CSS Selector Query")
      i. No need to loop, since you only have one Element
3. For each Element, extract the content inside the selected HTML element:
   a. [Link]()
   b. [Link]("attribute-name")
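The three steps above can be sketched as follows. This assumes the elided `[Link]` calls correspond to JSoup's `Jsoup.connect`, `Document.select`, `Document.selectFirst`, `Element.text`, and `Element.attr`. The class name `WorkflowSketch`, the URL `https://example.com`, and the selectors `a[href]` and `h1` are placeholders, not taken from the slides.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class WorkflowSketch {
    // Steps 2a and 3: select many elements, then extract text and an attribute from each.
    static List<String> extractLinks(Document doc) {
        List<String> result = new ArrayList<>();
        Elements items = doc.select("a[href]");   // many elements, so loop over them
        for (Element item : items) {
            result.add(item.text() + " -> " + item.attr("href"));
        }
        return result;
    }

    // Step 2b: a single element, so no loop is needed (null if nothing matches).
    static String firstHeading(Document doc) {
        Element item = doc.selectFirst("h1");
        return item == null ? null : item.text();
    }

    public static void main(String[] args) throws IOException {
        // Step 1: fetch and parse the whole HTML document (placeholder URL).
        Document doc = Jsoup.connect("https://example.com").get();
        System.out.println(firstHeading(doc));
        extractLinks(doc).forEach(System.out::println);
    }
}
```

Keeping the extraction methods separate from the network call makes them easy to try on a `Document` built with `Jsoup.parse` from a saved HTML string.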
[Link]

Super Basic Web Scraping 101

Output:

How to use it safely and legally
- A website should have a file “[Link]” that shows which links are
  - allowed to be scraped
  - not allowed to be scraped
- Here is an example “[Link]”:
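Assuming the file referred to above is the conventional `robots.txt`, a scraper can check a path against it before fetching. The sketch below is a deliberately simplified reader: it only looks at the `User-agent: *` group, treats rules as plain path prefixes, and lets the longest matching rule win. Wildcards and other directives are ignored, and the class name `RobotsCheck` and the sample rules are invented for illustration.

```java
class RobotsCheck {
    // Simplified check: only the "User-agent: *" group, plain prefix rules,
    // longest matching rule wins. A path with no matching rule is allowed.
    static boolean isAllowed(String robotsTxt, String path) {
        boolean inStarGroup = false;
        boolean prevWasAgent = false;
        String bestRule = "";
        boolean bestAllows = true;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.split("#", 2)[0].trim();   // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;                      // blank or malformed line
            String key = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                inStarGroup = (prevWasAgent && inStarGroup) || value.equals("*");
                prevWasAgent = true;
                continue;
            }
            prevWasAgent = false;
            boolean isRule = key.equals("allow") || key.equals("disallow");
            if (inStarGroup && isRule && !value.isEmpty()
                    && path.startsWith(value) && value.length() > bestRule.length()) {
                bestRule = value;
                bestAllows = key.equals("allow");
            }
        }
        return bestAllows;
    }

    public static void main(String[] args) {
        // Hypothetical rules, not copied from any real site.
        String sample = "User-agent: *\n"
                      + "Disallow: /private/\n"
                      + "Allow: /private/public-page\n";
        System.out.println(isAllowed(sample, "/news"));
        System.out.println(isAllowed(sample, "/private/secret"));
        System.out.println(isAllowed(sample, "/private/public-page"));
    }
}
```

Passing this check does not by itself make scraping legal; a site's terms of use may add further restrictions.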


VNExpress Scraping Exercise


[Your Fun Game]

- Open [Link] in the Chrome browser
- Open DevTools (right-click -> Inspect, or press F12)
- Your task: find out whether the articles on this page share any pattern (tag name, CSS class name), so that we can use this information to scrape the title, description, and link of every article on the homepage!
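Once a pattern has been found in the inspector, the extraction could look roughly like the sketch below. The markup pattern used here (`article.story`, `h3.story-title a`, `p.story-summary`) is entirely hypothetical; the real tag and class names have to be read off the live page. The class name `ArticleScraperSketch` and the base URL are also placeholders.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class ArticleScraperSketch {
    // Simple holder for one scraped article.
    record Article(String title, String description, String link) {}

    // Assumes (hypothetically) that every article block matches "article.story".
    static List<Article> extract(Document doc) {
        List<Article> articles = new ArrayList<>();
        for (Element story : doc.select("article.story")) {
            Element titleLink = story.selectFirst("h3.story-title a");
            Element summary = story.selectFirst("p.story-summary");
            if (titleLink == null) continue;      // skip blocks without a title link
            articles.add(new Article(
                titleLink.text(),
                summary == null ? "" : summary.text(),
                titleLink.absUrl("href")));       // resolve relative links against the base URI
        }
        return articles;
    }

    public static void main(String[] args) {
        // Made-up HTML standing in for a downloaded homepage.
        String html = "<article class=\"story\">"
                    + "<h3 class=\"story-title\"><a href=\"/news/1\">First headline</a></h3>"
                    + "<p class=\"story-summary\">Short summary.</p>"
                    + "</article>";
        Document doc = Jsoup.parse(html, "https://example.com/");
        extract(doc).forEach(System.out::println);
    }
}
```

Checking for `null` after `selectFirst` matters in practice, because homepages often mix article blocks with ads or banners that share a container but lack a title or summary.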
Let’s scrape VNExpress!


VNExpress Homepage Scraping Result

Put The Result Into a Nice Table
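The slide's own table is not reproduced here. As one possible sketch, scraped rows can be printed as a fixed-width text table using only the standard library. The class name `TablePrinter`, the column width, and the sample rows are assumptions made for illustration.

```java
import java.util.List;

class TablePrinter {
    // Builds a fixed-width text table; cells longer than `width` are cut and end in "...".
    // `width` is assumed to be at least 3.
    static String format(List<String[]> rows, int width) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            for (String cell : row) {
                String shown = cell.length() > width
                        ? cell.substring(0, width - 3) + "..."
                        : cell;
                sb.append(String.format("| %-" + width + "s ", shown));
            }
            sb.append("|\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up rows standing in for scraped articles.
        List<String[]> rows = List.of(
            new String[] {"Title", "Description", "Link"},
            new String[] {"First headline", "A short summary of the article", "/news/1"});
        System.out.print(format(rows, 20));
    }
}
```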


Your Homework
- Scrape all the links in the footer of the English website of VNExpress ([Link])

- Return a dictionary in which each key is the name of a menu item and the value is the link/URL of that menu item. There should be 22 items in the dictionary, looking like this:

Italian Recipe Scraping Homework
[Link]

Scrape all Italian recipes, each of which should have:

- Recipe title
- Short description
- Preparation time
- Cook time
- Number of servings


A Fun Project with Manga

Our Challenge
- Let’s scrape one of the best manga websites, “MangaDoom”!
[Link]


Let’s pick the best manga to scrape!

- “One Piece” is my favourite manga of all time! Let’s scrape the whole manga!

Let’s do it together!
Scrape our favorite manga!


I'm a collector myself! Web Scraping is awesome!

Potential Problems & Solutions


Potential Issues for Scraping


- Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA)
- IP blocking
- Geo-blocking
- Dynamic Websites

Static website vs Dynamic website
- So far, we have only learned enough to scrape static websites, not dynamic ones.
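One rough way to tell the two apart is to fetch the HTML exactly as the server sends it and check whether text you can see in the browser is already there. If it is missing, the content is probably filled in later by JavaScript, which a plain HTML parser does not run. The sketch below uses only the standard `java.net.http` client; the class name `StaticCheck`, the URL, and the search text are placeholders, and the check is a heuristic rather than a definitive test.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class StaticCheck {
    // Heuristic: if text seen in the browser is already in the raw HTML,
    // the page is likely static enough for a plain HTML parser.
    static boolean appearsInRawHtml(String rawHtml, String visibleText) {
        return rawHtml.toLowerCase().contains(visibleText.toLowerCase());
    }

    // Fetches the HTML as sent by the server; no JavaScript is executed.
    static String fetchRaw(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Placeholder URL and text; replace with a page and a phrase you see in the browser.
        String raw = fetchRaw("https://example.com");
        System.out.println(appearsInRawHtml(raw, "Example Domain"));
    }
}
```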


Selenium to scrape dynamic websites

Fun Quiz
- Open popular ecommerce websites in Vietnam:
- [Link]
- [Link]
- [Link]
- [Link]
- Open your favourite website ever:
- [Link]
- Open other websites:
- [Link]
- [Link]
- [Link]
c_Qu%E1%BB%91c_t%E1%BA%BF,_%C4%90%E1%BA%A1i_h%E1%BB%8Dc_Qu%E1%BB%91c_
gia_Th%C3%A0nh_ph%E1%BB%91_H%E1%BB%93_Ch%C3%AD_Minh

[Question] Which ones are static and which are dynamic? Which ones can be scraped with JSoup?

Ultimate Challenge
- MangaSee is the only website that has the FULL-COLOR chapters of big manga like One Piece and Bleach!!!
- [Task] Can you figure out how to scrape either the full-color One Piece or Bleach?
- [Link]
- [Link]

One of the Best Manga Downloader Projects
- [Link]
- The Free Manga Downloader is a free, open-source application written in Object Pascal for managing and downloading manga from various websites.
- Based on what you have learned so far about web scraping, you can now start contributing to this open-source project to help the manga community!


Recap
- What is Web Scraping?
- Why is it important?
- How to use it safely and legally?
- HTML, CSS
- Introduction to JSoup Library
- JSoup workflow
- VNExpress Scraping Demo
- Manga Scraping Demo
- Potential issues with Web Scraping
Thank you for listening!

“Live as if you were to die tomorrow.
Learn as if you were to live forever!”
Mahatma Gandhi

