1 of 37

Web Scraping with R

2 of 37

About Me

  • Elijah Appiah from Ghana.
  • Ph.D. Economics at NIDA in Bangkok, Thailand.
  • Economist by profession and Data Scientist by passion.
  • Enthusiastic about working with data daily.
  • Technical skills in LaTeX, Microsoft Office (Word, Excel, PowerPoint), SPSS, Stata, EViews, Python, R, Power BI, Tableau, and Google TensorFlow.
  • Augustine Otobi Ogbaji from Nigeria.
  • Postgraduate student at the University of Calabar and faculty member at SICSS-Calabar.
  • Data Science and Machine Learning Engineer.
  • Passionate about Artificial Intelligence.

3 of 37

IDE and Packages for this Lecture

  • The main IDE for R is RStudio
  • Packages:
    • rvest
    • xml2
    • httr

4 of 37

Outline

  • Introduction to HTML and Web Scraping
  • Navigation and Selection with CSS
  • HTTP Request

5 of 37

What is Web Scraping?

  • Process of extracting data from websites.
  • The goal is to gather data for research, analysis, or integration into applications.

6 of 37

HTML Basics – Elements, Tags, Attributes

  • HTML – Hypertext Markup Language (structure of a website)

Elements

  • HTML documents are structured using elements, which define the structure and content of a web page.

p (paragraph), h1 (first-level heading), img (image), b (bold formatting), etc…

7 of 37

HTML Basics – Elements, Tags, Attributes

  • HTML – Hypertext Markup Language (structure of a website)

Tags

  • Tags are element names enclosed in angle brackets that mark the beginning and end of an HTML element. Most HTML elements have both an opening and a closing tag; a few, such as img, have only an opening tag.

<p>This is a simple paragraph.</p>

<h1>Welcome to our Website</h1>

8 of 37

HTML Basics – Elements, Tags, Attributes

  • HTML – Hypertext Markup Language (structure of a website)

Attributes

  • Attributes provide additional information about an HTML element, and are placed within the opening tag. They are typically in the form of name-value pairs.

<a href="https://www.example.com">Visit Example.com</a>

<img src="image.jpg" alt="A beautiful landscape">

9 of 37

Anatomy of a Webpage

  • Document Declaration (document type and version)

<!DOCTYPE html>

  • Root Element (container for the entire HTML document)

<html> head and body elements go here </html>

  • Head (contains metadata and information about the document itself)

<head> … </head>

  • Body (visible content that users see and interact with when they visit a web page)

<body> … </body>

10 of 37

Anatomy of HTML Document

<!DOCTYPE html>

<html>

<head>

<title>Sample Page</title>

</head>

<body>

<h1>Welcome to Web Scraping!</h1>

<p>This is a simple HTML example.</p>

<a href="https://www.example.com">Visit Example.com</a>

</body>

</html>

Labels, top to bottom: document type declaration (<!DOCTYPE html>), root element (<html>), head (<head>), body (<body>).

11 of 37

Reading HTML with R

  • Read HTML into R

install.packages("rvest")

library(rvest)

html <- read_html(x)

  • Check structure of the HTML object

install.packages("xml2")

library(xml2)

xml_structure(html)
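As a minimal sketch of the two steps above, the sample page from the earlier slide can be read from a literal string instead of a URL, so it runs offline. The variable names `page` and `title_text` are only illustrative.

```r
library(rvest)
library(xml2)

# Illustrative input: the sample HTML document, stored as a string
page <- '<!DOCTYPE html>
<html>
<head><title>Sample Page</title></head>
<body>
<h1>Welcome to Web Scraping!</h1>
<p>This is a simple HTML example.</p>
<a href="https://www.example.com">Visit Example.com</a>
</body>
</html>'

# read_html() accepts a URL, a file path, or literal HTML
html <- read_html(page)

# Print the nesting of elements as an indented tree
xml_structure(html)

# Pull the text of the <title> element to confirm the parse worked
title_text <- html %>% html_element("title") %>% html_text()
title_text
```

In practice `x` would usually be the URL of the page you want to scrape.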

12 of 37

Now, let’s practice

13 of 37

Navigate HTML – Like a Tree

14 of 37

Navigating HTML – Navigating Nodes with Selectors

  • Navigating Nodes (Parents and Children)

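html_node(html, "p")  # older name for html_element()

html %>% html_node("p")  # same call, written with the pipe

html %>% html_nodes("p")  # older name for html_elements()

html %>% html_elements("body")

html %>% html_elements("div p")  # p elements nested inside a div

html %>% html_elements("div, p")  # every div and every p

html %>% html_element("p") %>% html_text()  # text of the first p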
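The difference between these selectors is easier to see on a small page. The sketch below assumes a made-up document with one paragraph inside a div and one outside it; all variable names are illustrative.

```r
library(rvest)

# Made-up page: one <p> nested in a <div>, one <p> directly in <body>
html <- read_html('<body>
<div><p>Inside div</p></div>
<p>Outside div</p>
</body>')

# html_element() returns only the first match
first_p <- html %>% html_element("p") %>% html_text()

# html_elements() returns every match
all_p <- html %>% html_elements("p") %>% html_text()

# "div p" keeps only paragraphs that sit inside a div
div_p <- html %>% html_elements("div p") %>% html_text()

# "div, p" is a union: every div plus every p
div_or_p <- html %>% html_elements("div, p")

first_p
all_p
div_p
length(div_or_p)
```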

15 of 37

Navigating HTML – Navigating Attributes

  • Navigating Attributes

html %>% html_element("a") %>% html_attr("href")

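html_attr() vs. html_attrs(): html_attr() returns the value of one named attribute; html_attrs() returns all attributes of an element.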
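A short sketch of the two functions side by side, using a made-up link with an extra title attribute (the title attribute and the variable names are assumptions added for illustration):

```r
library(rvest)

# Made-up link with two attributes
html <- read_html('<a href="https://www.example.com" title="Example">Visit Example.com</a>')
link <- html %>% html_element("a")

# One attribute, by name
href <- html_attr(link, "href")

# All attributes of the element, as a named character vector
attrs <- html_attrs(link)

# Asking for an attribute that is absent gives NA rather than an error
missing_attr <- html_attr(link, "class")

href
attrs
missing_attr
```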

16 of 37

Now, let’s practice

17 of 37

Scraping Tables

18 of 37

Scraping Tables

  • Read and view tables

table <- read_html(x)

table %>% html_table()

table %>% html_table(header = TRUE)

  • Scrape a table from Wikipedia

https://en.wikipedia.org/wiki/List_of_Nobel_laureates
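Before trying the Wikipedia page, it may help to see what html_table() returns on a tiny table. The sketch below uses placeholder rows (not real laureate data) so that it runs offline.

```r
library(rvest)

# Placeholder table; the names and years are made up for illustration
html <- read_html('<table>
<tr><th>Name</th><th>Year</th></tr>
<tr><td>Laureate A</td><td>1901</td></tr>
<tr><td>Laureate B</td><td>1902</td></tr>
</table>')

# Called on a whole document, html_table() returns a list,
# with one data frame per <table> found
tables <- html %>% html_table()
first <- tables[[1]]

# Called on a single <table> element, it returns one data frame
single <- html %>% html_element("table") %>% html_table()

first
```

On a real page with many tables, you would typically pick one by position (for example `tables[[1]]`) or select it first with a CSS selector.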

19 of 37

Now, let’s practice

20 of 37

CSS

  • CSS – Cascading Style Sheets (styles HTML documents)

Styling HTML Elements

  • CSS can be applied to HTML elements using selectors. Selectors target specific elements on a web page, such as headings, paragraphs, links, or classes and IDs that you define.
  • CSS rules consist of property-value pairs (e.g. color: blue, font-size: 12px)

Type (Element) Selectors

<h1>Hello</h1>

h1 {

property: value;

}

Class Selectors

<h1 class="hello">Hello</h1>

.hello {

property: value;

}

ID Selectors

<div id="hello">Hello</div>

#hello {

property: value;

}

21 of 37

CSS – Type Selectors

type {

property: value;

}

type1, type2 {

property: value;

}

html %>% html_elements("type")

html %>% html_elements("type1, type2")

* {

property: value;

}

html %>% html_elements("*")

22 of 37

Now, let’s practice

23 of 37

CSS – Classes and IDs

.class {

property: value;

}

.class1 {

property: value;

}

.class2 {

property: value;

}

html %>% html_elements(".class")

html %>% html_elements("h1.class")

html %>% html_elements(".class1.class2")

#id {

property: value;

}

html %>% html_elements("#id")
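The class and ID patterns above can be tried on a small made-up page. The class names `title` and `main` and the ID `intro` are assumptions chosen for this sketch.

```r
library(rvest)

# Made-up page with two classes and one ID
html <- read_html('<body>
<h1 class="title main">Heading</h1>
<p class="title">Paragraph with class title</p>
<p id="intro">Paragraph with id intro</p>
</body>')

# Every element carrying the class "title"
by_class <- html %>% html_elements(".title")

# Only h1 elements carrying the class "title"
h1_class <- html %>% html_elements("h1.title") %>% html_text()

# Elements carrying both classes at once
both <- html %>% html_elements(".title.main") %>% html_text()

# The element with the given ID
by_id <- html %>% html_elements("#intro") %>% html_text()

length(by_class)
h1_class
both
by_id
```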

24 of 37

Now, let’s practice

25 of 37

CSS – Pseudo Classes

<ol>

<li>First Item</li>

<li>Second Item</li>

<li>Third Item</li>

</ol>

li:first-child {color: blue;}

li:nth-child(2) {color: red;}

li:last-child {color: green;}

html %>% html_elements("li:first-child")

1. First Item

2. Second Item

3. Third Item
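The same three pseudo-classes can be used as selectors in rvest. This sketch reuses the ordered list from the slide; the variable names are illustrative.

```r
library(rvest)

# The ordered list from the slide
html <- read_html('<ol>
<li>First Item</li>
<li>Second Item</li>
<li>Third Item</li>
</ol>')

# Select list items by their position among their siblings
first  <- html %>% html_elements("li:first-child")  %>% html_text()
second <- html %>% html_elements("li:nth-child(2)") %>% html_text()
last   <- html %>% html_elements("li:last-child")   %>% html_text()

first
second
last
```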

26 of 37

Now, let’s practice

27 of 37

CSS - Combinators

  • The CSS Combinators are:

Combinator    Meaning
space         Descendant
>             Child
+             Adjacent sibling
~             General sibling
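To see how the four combinators differ, the sketch below applies each one to the same made-up page: a div holding one direct paragraph and one paragraph nested deeper, followed by two sibling paragraphs. The page and variable names are assumptions for illustration.

```r
library(rvest)

# Made-up page for comparing combinators
html <- read_html('<body>
<div>
<p>Child of div</p>
<blockquote><p>Grandchild of div</p></blockquote>
</div>
<p>First sibling after div</p>
<p>Second sibling after div</p>
</body>')

# space: any p nested inside a div, at any depth
descendant <- html %>% html_elements("div p") %>% html_text()

# >: only p elements that are direct children of a div
child <- html %>% html_elements("div > p") %>% html_text()

# +: the p that immediately follows a div
adjacent <- html %>% html_elements("div + p") %>% html_text()

# ~: every p that follows a div and shares its parent
general <- html %>% html_elements("div ~ p") %>% html_text()

descendant
child
adjacent
general
```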

28 of 37

Now, let’s practice

29 of 37

HTTP Requests

  • HTTP – HyperText Transfer Protocol
  • Set of rules that dictate how web browsers, or clients, communicate with a web server.

Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview

30 of 37

Anatomy of HTTP Requests

Request:

A request is sent to the web server.

Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview

Response:

A response is received from the web server.

Common status codes:

200 (OK), 404 (Not Found), 3xx (redirection), 5xx (server errors).
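As a small offline sketch, httr's http_status() can translate a numeric status code into its category and reason, which is a quick way to look up what a code means. The particular codes chosen here are just examples.

```r
library(httr)

# Example codes, one from each family mentioned above plus a 404
codes <- c(200, 301, 404, 500)

# http_status() also accepts a response object returned by GET()
categories <- vapply(codes, function(code) http_status(code)$category, character(1))
reasons    <- vapply(codes, function(code) http_status(code)$reason, character(1))

data.frame(code = codes, category = categories, reason = reasons)
```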

31 of 37

Request Methods – GET and POST

GET

  • Fetch a resource without submitting data (GET /index.html).

POST

  • Send data to a server (after filling out a form on a page).

Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview

32 of 37

HTTP Request with “httr” Package

install.packages(“httr”)

library(httr)

response <- GET(url = "https://wikipedia.com")

content(response)

status_code(response)

33 of 37

Now, let’s practice

34 of 37

Why HTTP Request?

  • The web server already logs your IP address with every request.
  • Request headers such as User-Agent let you identify yourself to the server.

Modify Headers

response <- GET("https://Wikipedia.com", user_agent("Hello, it's me, Eli!"))

35 of 37

Making Slow Requests – purrr Package

install.packages("purrr")

library(purrr)

library(rvest)  # provides read_html()

url_list <- c("http://example.com/page1",

"http://example.com/page2",

"http://example.com/page3")

# Wait 3 seconds between successive calls

slow_request <- slowly(read_html, rate = rate_delay(3))

for(url in url_list){

html <- slow_request(url)  # html is overwritten on each pass

}
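The loop above keeps only the last page. A possible variation, sketched here under the assumption that you want every page, uses purrr's map() to collect the results in a list. To keep the sketch runnable offline, literal HTML strings stand in for URLs and the delay is shortened to 0.2 seconds; both are illustrative choices.

```r
library(purrr)
library(rvest)

# Stand-ins for URLs: read_html() also accepts literal HTML
pages <- c("<p>page 1</p>", "<p>page 2</p>", "<p>page 3</p>")

# Same wrapper as on the slide, with a shorter pause for the demo
slow_read <- slowly(read_html, rate = rate_delay(0.2))

start <- Sys.time()
results <- map(pages, slow_read)  # one parsed document per input
elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))

# Extract the paragraph text from each parsed page
texts <- map_chr(results, ~ html_text(html_element(.x, "p")))

texts
elapsed
```

With real URLs, a delay of a few seconds, as on the slide, is a more considerate choice for the server.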

36 of 37

Now, let’s practice

37 of 37

Thank You