
About Me

Elijah Appiah from Ghana.
Ph.D. Economics at NIDA in Bangkok, Thailand.
Economist by profession and Data Scientist by passion.
Enthusiastic about working with data daily.
Technical skills in LaTeX, Microsoft Office (Word, Excel, PowerPoint), SPSS, Stata, EViews, Python, R, Power BI, Tableau, and Google TensorFlow.

Augustine Otobi Ogbaji from Nigeria.
Postgraduate student at the University of Calabar and a faculty member at SICSS-Calabar.
Data Science and Machine Learning Engineer.
Passionate about Artificial Intelligence.


IDE and Packages for this Lecture

The main IDE for R is RStudio.
Packages:

rvest
xml2
httr


Outline

Introduction to HTML and Web Scraping
Navigation and Selection with CSS
HTTP Request


What is Web Scraping?

Process of extracting data from websites.
The goal is to gather data for uses such as research, analysis, and integration into applications.


HTML Basics – Elements, Tags, Attributes

HTML – Hypertext Markup Language (structure of a website)

Elements

HTML documents are structured using elements, which define the structure and content of a web page.

Examples: p (paragraph), h1 (first-level heading), img (image), b (bold text).

Attributes

Attributes provide additional information about an HTML element, and are placed within the opening tag. They are typically in the form of name-value pairs.

<a href="https://www.example.com">Visit Example.com</a>


Anatomy of a Webpage

Document Declaration (document type and version)

<!DOCTYPE html>

Root Element (container for the entire HTML document)

<html> head and body elements go here </html>

Head (contains metadata and information about the document itself)

Body (visible content that users see and interact with when they visit a web page)


Anatomy of HTML Document

<!DOCTYPE html>                                             <!-- Document type declaration -->
<html>                                                      <!-- Root element -->
  <head>                                                    <!-- Head -->
    <title>Sample Page</title>
  </head>
  <body>                                                    <!-- Body -->
    <h1>Welcome to Web Scraping!</h1>
    <p>This is a simple HTML example.</p>
    <a href="https://www.example.com">Visit Example.com</a>
  </body>
</html>


Reading HTML with R

Read HTML into R

install.packages("rvest")
library(rvest)

html <- read_html(x)  # x is a URL, a file path, or a string of HTML

Check structure of the HTML object

install.packages("xml2")
library(xml2)

xml_structure(html)
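
To try both steps without a live website, a literal HTML string can stand in for x. The following is a minimal sketch: page_source is a hypothetical name, and its content is the sample document from this lecture.

```r
library(rvest)
library(xml2)

# Assumption: a literal HTML string is used in place of a URL or file path.
page_source <- '<!DOCTYPE html>
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Welcome to Web Scraping!</h1>
    <p>This is a simple HTML example.</p>
    <a href="https://www.example.com">Visit Example.com</a>
  </body>
</html>'

# Parse the string into an xml_document.
html <- read_html(page_source)

# Print the nesting of elements as an indented tree.
xml_structure(html)

# Inspect the class of the parsed object and the name of its root element.
print(class(html))
print(xml_name(html))
```

Replacing page_source with a URL should work the same way, as long as the page is reachable.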


Now, let’s practice


Navigate HTML – Like a Tree


Navigating HTML – Navigating Nodes with Selectors

Navigating Nodes (Parents and Children)

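html_node(html, "p")                          # first <p>; html_node() is superseded by html_element()
html %>% html_node("p")                       # the same call, written with the pipe
html %>% html_nodes("p")                      # all <p>; html_nodes() is superseded by html_elements()
html %>% html_elements("body")                # the <body> element
html %>% html_elements("div p")               # <p> elements nested inside a <div>
html %>% html_elements("div, p")              # all <div> elements and all <p> elements
html %>% html_element("p") %>% html_text()    # text of the first <p>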
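
The difference between "div p" and "div, p" is easier to see on a tiny page. The sketch below builds one with rvest's minimal_html(); the page content is invented for illustration.

```r
library(rvest)

# Hypothetical page: one <p> inside a <div>, one <p> outside, one <div> without a <p>.
html <- minimal_html('
  <div><p>Inside a div</p></div>
  <p>Outside any div</p>
  <div><span>No paragraph here</span></div>
')

# html_element() returns only the first match.
first_p <- html %>% html_element("p") %>% html_text()

# html_elements() returns every match.
all_p <- html %>% html_elements("p") %>% html_text()

# "div p": only <p> elements that are descendants of a <div>.
div_p <- html %>% html_elements("div p") %>% html_text()

# "div, p": every <div> plus every <p>.
div_or_p <- html %>% html_elements("div, p")

print(first_p)
print(all_p)
print(div_p)
print(length(div_or_p))
```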


Navigating HTML – Navigating Attributes

Navigating Attributes

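html %>% html_element("a") %>% html_attr("href")

html_attr() vs. html_attrs(): html_attr() returns the value of one named attribute; html_attrs() returns all attributes of each element.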
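
A short sketch of the two functions side by side, using two invented links (the URLs and the title attribute are placeholders):

```r
library(rvest)

# Hypothetical page with two links; only the first has a title attribute.
html <- minimal_html('
  <a href="https://www.example.com" title="Example">Visit Example.com</a>
  <a href="https://www.example.org">Visit Example.org</a>
')

links <- html %>% html_elements("a")

# html_attr(): one attribute, returned as a character vector (one value per element).
hrefs <- links %>% html_attr("href")

# A missing attribute comes back as NA.
titles <- links %>% html_attr("title")

# html_attrs(): all attributes, returned as a list of named character vectors.
all_attrs <- links %>% html_attrs()

print(hrefs)
print(titles)
print(all_attrs)
```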


Now, let’s practice


Scraping Tables

Read and view tables

table <- read_html(x)

table %>% html_table()

table %>% html_table(header = TRUE)

Scrape a table from Wikipedia

https://en.wikipedia.org/wiki/List_of_Nobel_laureates
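
Before scraping the Wikipedia page, it may help to see what html_table() returns on a small inline table. This is a sketch with placeholder rows, not real laureate data.

```r
library(rvest)

# Hypothetical table; the rows are placeholders for illustration only.
html <- minimal_html('
  <table>
    <tr><th>Year</th><th>Laureate</th></tr>
    <tr><td>2001</td><td>Laureate A</td></tr>
    <tr><td>2002</td><td>Laureate B</td></tr>
  </table>
')

# On a whole document, html_table() returns a list with one tibble per <table>.
tables <- html %>% html_table()

# On a single <table> element, it returns one tibble.
first_table <- html %>% html_element("table") %>% html_table()

print(length(tables))
print(first_table)
```

On a real page with many tables, the list can be indexed (for example tables[[1]]) to pick the one needed.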


Now, let’s practice


CSS

CSS – Cascading Style Sheets (formats HTML documents)

Styling HTML Elements

CSS is applied to HTML elements using selectors. Selectors target specific elements on a web page, such as headings, paragraphs, and links, or the classes and IDs that you define.
CSS rules consist of property-value pairs (e.g. color: blue; font-size: 12px;).

Type (Element) Selectors

<h1>Hello</h1>

h1 {

property: value;

}

Class Selectors

<h1 class="hello">Hello</h1>

.hello {

property: value;

}

ID Selectors

<div id="hello">Hello</div>

#hello {

property: value;

}


CSS – Type Selectors

type {

property: value;

}

type1, type2 {

property: value;

}

html %>% html_elements("type")

html %>% html_elements("type1, type2")

* {

property: value;

}

html %>% html_elements("*")
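
As a concrete instance of the patterns above, the sketch below replaces the placeholder "type" with real element names on an invented page.

```r
library(rvest)

# Hypothetical page with two headings and two paragraphs.
html <- minimal_html('
  <h1>Title</h1>
  <p>First paragraph</p>
  <h2>Subtitle</h2>
  <p>Second paragraph</p>
')

# Single type selector: every <p>.
paragraphs <- html %>% html_elements("p") %>% html_text()

# Grouped type selectors: every <h1> and every <h2>.
heading_names <- html %>% html_elements("h1, h2") %>% html_name()

# Universal selector: every element in the document, including <html> and <body>.
all_names <- html %>% html_elements("*") %>% html_name()

print(paragraphs)
print(heading_names)
print(all_names)
```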


Now, let’s practice


CSS – Classes and IDs

.class {

property: value;

}

.class1 {

property: value;

}

.class2 {

property: value;

}

html %>% html_elements(".class")

html %>% html_elements("h1.class")

html %>% html_elements(".class1.class2")

#id {

property: value;

}

html %>% html_elements("#id")
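
The class and ID patterns can be checked on a small invented page. In this sketch the class names title and main and the ID footer are hypothetical.

```r
library(rvest)

# Hypothetical page: the <h1> carries two classes, each <p> carries one.
html <- minimal_html('
  <h1 class="title main">Heading</h1>
  <p class="title">Paragraph with class title</p>
  <p class="main">Paragraph with class main</p>
  <div id="footer">Footer</div>
')

# .title: any element with class "title".
by_class <- html %>% html_elements(".title") %>% html_name()

# h1.title: only <h1> elements with class "title".
h1_with_class <- html %>% html_elements("h1.title") %>% html_text()

# .title.main: elements that carry both classes at once.
both_classes <- html %>% html_elements(".title.main") %>% html_name()

# #footer: the element whose id is "footer".
by_id <- html %>% html_elements("#footer") %>% html_text()

print(by_class)
print(h1_with_class)
print(both_classes)
print(by_id)
```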


Now, let’s practice


CSS – Pseudo Classes

<ol>

<li>First Item</li>

<li>Second Item</li>

<li>Third Item</li>

</ol>

li:first-child {color: blue;}

li:nth-child(2) {color: red;}

li:last-child {color: green;}

html %>% html_elements("li:first-child")

1. First Item

2. Second Item

3. Third Item
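
The same three pseudo-classes can be used as selectors in rvest. A minimal sketch on the ordered list shown above:

```r
library(rvest)

# The ordered list from this slide, parsed from a string.
html <- minimal_html('
  <ol>
    <li>First Item</li>
    <li>Second Item</li>
    <li>Third Item</li>
  </ol>
')

# Each pseudo-class picks a list item by its position among its siblings.
first_item  <- html %>% html_elements("li:first-child")  %>% html_text()
second_item <- html %>% html_elements("li:nth-child(2)") %>% html_text()
last_item   <- html %>% html_elements("li:last-child")   %>% html_text()

print(first_item)
print(second_item)
print(last_item)
```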


Now, let’s practice


CSS - Combinators

The CSS Combinators are:

Combinator	Meaning
space	Descendant
>	Child
+	Adjacent sibling
~	General sibling
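
A sketch of the four combinators applied to one invented page. The structure (a div containing a paragraph and a blockquote, followed by two sibling paragraphs) is an assumption chosen so that each combinator gives a different result.

```r
library(rvest)

# Hypothetical page: <div> holds one direct <p> and one <p> nested deeper;
# two more <p> elements follow the <div> as siblings.
html <- minimal_html('
  <div>
    <p>Child paragraph</p>
    <blockquote><p>Nested paragraph</p></blockquote>
  </div>
  <p>Adjacent sibling</p>
  <p>Later sibling</p>
')

# space: any <p> that is a descendant of a <div>, at any depth.
descendant <- html %>% html_elements("div p") %>% html_text()

# >: only <p> elements that are direct children of a <div>.
child <- html %>% html_elements("div > p") %>% html_text()

# +: the <p> that immediately follows a <div>.
adjacent <- html %>% html_elements("div + p") %>% html_text()

# ~: every <p> that follows a <div> and shares its parent.
general <- html %>% html_elements("div ~ p") %>% html_text()

print(descendant)
print(child)
print(adjacent)
print(general)
```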


Now, let’s practice


HTTP Requests

HTTP – HyperText Transfer Protocol
Set of rules that dictate how web browsers, or clients, communicate with a web server.

Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview


Anatomy of HTTP Requests

Request:

A request is sent to the web server.

Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview

Response:

A response is received from the web server.

Common status codes:

200 (OK), 404 (Not Found), 3xx (redirection), 5xx (server errors).
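
The meaning of a status code can also be looked up from R. The sketch below uses httr's http_status() on plain numbers, so no request is sent; the four codes are just examples.

```r
library(httr)

# Example codes: success, redirect, client error, server error.
codes <- c(200, 301, 404, 500)

# http_status() returns a list with a category, a reason phrase, and a message.
reasons <- vapply(codes, function(code) http_status(code)$reason, character(1))
categories <- vapply(codes, function(code) http_status(code)$category, character(1))

print(data.frame(code = codes, category = categories, reason = reasons))
```

The same function accepts a response object, so http_status(response) can be used after a real request.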


Request Methods – GET and POST

GET

Fetch a resource without submitting data (GET /index.html).

POST

Send data to a server (after filling out a form on a page).

Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview
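
A rough sketch of the two methods in httr. It assumes the public test service httpbin.org as a target, and send_get and send_post are hypothetical helper names; nothing is sent until one of them is called.

```r
library(httr)

# Offline illustration: with GET, parameters travel inside the URL itself.
get_url <- modify_url("https://httpbin.org/get",
                      query = list(q = "rvest", page = "2"))
print(get_url)

# Hypothetical helper: fetch a resource, passing parameters as a query string.
send_get <- function() {
  GET("https://httpbin.org/get", query = list(q = "rvest", page = "2"))
}

# Hypothetical helper: send data in the request body, like a submitted form.
send_post <- function() {
  POST("https://httpbin.org/post",
       body = list(name = "Eli", topic = "scraping"),
       encode = "form")
}
```

With an internet connection, status_code(send_get()) and status_code(send_post()) would show whether each request succeeded.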


HTTP Request with “httr” Package

install.packages("httr")

library(httr)

response <- GET(url = "https://wikipedia.com")

content(response)

status_code(response)


Now, let’s practice


Why HTTP Request?

The web server already records your IP address with every request.
Request headers, such as User-Agent, let you identify yourself to the server.

Modify Headers

response <- GET("https://Wikipedia.com", user_agent("Hello, it's me, Eli!"))
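
Header settings can be built once and reused across requests. In this sketch the user-agent string, the e-mail address, and the helper name polite_get are all hypothetical; no request is sent until polite_get() is called.

```r
library(httr)

# Hypothetical identification: a scraper name plus a contact address.
ua <- user_agent("my-scraper/0.1 (contact: me@example.com)")

# add_headers() sets any other request header, here the From header.
extra <- add_headers(From = "me@example.com")

# Hypothetical helper: every request made through it carries both settings.
polite_get <- function(url) {
  GET(url, ua, extra)
}

# Inspect what would be sent, without contacting a server.
print(ua$options$useragent)
print(extra$headers)
```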


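Making Slow Requests – purrr Package

install.packages("purrr")

library(purrr)
library(rvest)  # provides read_html()

url_list <- c("http://example.com/page1",
              "http://example.com/page2",
              "http://example.com/page3")

# Wrap read_html() so that successive calls are at least 3 seconds apart
slow_request <- slowly(read_html, rate = rate_delay(3))

for (url in url_list) {
  html <- slow_request(url)  # note: html holds only the most recent page
}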
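
The throttling itself can be checked without touching a website. In the sketch below, fake_fetch is a hypothetical stand-in for read_html(), the delay is shortened to 0.2 seconds, and map() is used in place of the for loop so that every result is kept.

```r
library(purrr)

# Hypothetical stand-in for read_html(): records the URL and the call time.
fake_fetch <- function(url) {
  list(url = url, fetched_at = Sys.time())
}

# Wrap the function so that calls after the first wait 0.2 seconds.
slow_fetch <- slowly(fake_fetch, rate = rate_delay(0.2))

url_list <- c("http://example.com/page1",
              "http://example.com/page2",
              "http://example.com/page3")

# map() calls slow_fetch() once per URL and collects the results in a list.
start <- Sys.time()
pages <- map(url_list, slow_fetch)
elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))

print(length(pages))
print(elapsed)
```

For real scraping, swapping fake_fetch for read_html and using a longer delay gives a list with one parsed page per URL.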
