Web Scraping with R
About Me
IDE and Packages for this Lecture
Outline
What is Web Scraping?
HTML Basics – Elements, Tags, Attributes
Elements
p (paragraph), h1 (first-level heading), img (image), b (bold formatting), etc…
Tags
<p>This is a simple paragraph.</p>
<h1>Welcome to our Website</h1>
Attributes
<a href="https://www.example.com">Visit Example.com</a>
<img src="image.jpg" alt="A beautiful landscape">
Anatomy of a Webpage
<!DOCTYPE html>
<html> head and body tags enter here </html>
<head> … </head>
<body> … </body>
Anatomy of an HTML Document
<!DOCTYPE html>
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<h1>Welcome to Web Scraping!</h1>
<p>This is a simple HTML example.</p>
<a href="https://www.example.com">Visit Example.com</a>
</body>
</html>
Document type declaration
Root element
Head
Body
Reading HTML with R
install.packages("rvest")
library(rvest)
# x is a URL, a file path, or a string of HTML
html <- read_html(x)
install.packages("xml2")
library(xml2)
# Print the nested structure of the parsed document
xml_structure(html)
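To try `read_html()` without a live website, the sample page from the earlier slide can be passed in as a string. This is a minimal sketch; the names `page_source` and `page_title` are illustrative and not part of the lecture.

```r
library(rvest)
library(xml2)

# The sample page from the "Anatomy" slide, stored as a string.
page_source <- '<!DOCTYPE html>
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<h1>Welcome to Web Scraping!</h1>
<p>This is a simple HTML example.</p>
<a href="https://www.example.com">Visit Example.com</a>
</body>
</html>'

# read_html() accepts a URL, a file path, or a literal HTML string.
html <- read_html(page_source)

# Print how the elements are nested inside each other.
xml_structure(html)

# Extract the page title as plain text.
page_title <- html %>% html_element("title") %>% html_text()
```

Working from a string keeps the example reproducible; swapping `page_source` for a URL should follow the same steps.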
Now, let’s practice
Navigate HTML – Like a Tree
Navigating HTML – Navigating Nodes with Selectors
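# html_node()/html_nodes() are the older names; rvest 1.0.0 and later prefer html_element()/html_elements()
html_node(html, "p")
html %>% html_node("p")
html %>% html_nodes("p")
html %>% html_elements("body")
html %>% html_elements("div p")   # <p> elements nested inside a <div>
html %>% html_elements("div, p")  # all <div> and all <p> elements
html %>% html_element("p") %>% html_text()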
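The difference between the singular and plural functions, and between the "div p" and "div, p" selectors, is easier to see on a small document. A minimal sketch, assuming rvest 1.0.0 or later (for `minimal_html()`); the HTML snippet and variable names are made up for illustration.

```r
library(rvest)

html <- minimal_html('
  <div>
    <p>First paragraph, inside a div.</p>
    <p>Second paragraph, inside a div.</p>
  </div>
  <p>Third paragraph, outside the div.</p>
')

# html_element() returns only the first match.
first_p <- html %>% html_element("p") %>% html_text()

# html_elements() returns every match.
all_p <- html %>% html_elements("p") %>% html_text()

# "div p" (space): only <p> elements nested inside a <div>.
div_p <- html %>% html_elements("div p") %>% html_text()

# "div, p" (comma): every <div> and every <p>.
div_or_p <- html %>% html_elements("div, p")
```

Here `all_p` has three entries, `div_p` has two, and `div_or_p` has four (one div plus three paragraphs).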
Navigating HTML – Navigating Attributes
html %>% html_element("a") %>% html_attr("href")
html_attr() vs. html_attrs()
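The contrast can be sketched on the image tag from the earlier slide: `html_attr()` asks for one attribute by name, while `html_attrs()` returns all of them. The variable names below are illustrative.

```r
library(rvest)

html <- minimal_html('<img src="image.jpg" alt="A beautiful landscape">')
img <- html %>% html_element("img")

# html_attr(): one named attribute, returned as a character string.
img_src <- html_attr(img, "src")

# html_attrs(): all attributes of the element, as a named character vector.
img_all <- html_attrs(img)

# An attribute that is not present yields NA rather than an error.
img_width <- html_attr(img, "width")
```

When applied to several elements at once, `html_attrs()` returns a list with one named vector per element.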
Now, let’s practice
Scraping Tables
table <- read_html(x)
table %>% html_table()
table %>% html_table(header = TRUE)
https://en.wikipedia.org/wiki/List_of_Nobel_laureates
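Before pointing `html_table()` at a large page, it may help to see what it returns for a tiny table. A minimal sketch; the table contents are hypothetical placeholder data, not taken from any real page.

```r
library(rvest)

# Hypothetical table standing in for a real web page.
html <- minimal_html('
  <table>
    <tr><th>Fruit</th><th>Count</th></tr>
    <tr><td>Apple</td><td>3</td></tr>
    <tr><td>Pear</td><td>5</td></tr>
  </table>
')

# On a whole document, html_table() returns a list with one tibble per <table>.
tables <- html %>% html_table()
fruit <- tables[[1]]

# On a single <table> element, it returns that one tibble directly.
fruit_direct <- html %>% html_element("table") %>% html_table()
```

On a page with many tables, indexing the list (`tables[[1]]`, `tables[[2]]`, and so on) or selecting a specific `<table>` element first are two ways to get at the one you need.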
Now, let’s practice
CSS
Styling HTML Elements
HTML Selectors
<h1>Hello</h1>
h1 {
property: value;
}
Class Selectors
<h1 class="hello">Hello</h1>
.hello {
property: value;
}
ID Selectors
<div id="hello">Hello</div>
#hello {
property: value;
}
CSS – Type Selectors
type {
property: value;
}
type1, type2 {
property: value;
}
html %>% html_elements("type")
html %>% html_elements("type1, type2")
* {
property: value;
}
html %>% html_elements("*")
Now, let’s practice
CSS – Classes and IDs
.class {
property: value;
}
.class1 {
property: value;
}
.class2 {
property: value;
}
html %>% html_elements(".class")
html %>% html_elements("h1.class")
html %>% html_elements(".class1.class2")
#id {
property: value;
}
html %>% html_elements("#id")
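The class and ID selectors can be tried on a small document. A minimal sketch; the class names (`title`, `main`, `note`), the id `intro`, and the variable names are invented for illustration.

```r
library(rvest)

html <- minimal_html('
  <h1 class="title main">Heading</h1>
  <p class="main">Paragraph with one class.</p>
  <p id="intro" class="note">Paragraph with an id.</p>
')

# .class: every element carrying the class "main".
main_text <- html %>% html_elements(".main") %>% html_text()

# type.class: only <h1> elements with the class "main".
h1_main <- html %>% html_elements("h1.main") %>% html_text()

# .class1.class2: elements that carry both classes at once.
both <- html %>% html_elements(".title.main") %>% html_text()

# #id: the element whose id is "intro".
intro <- html %>% html_element("#intro") %>% html_text()
```

Note that `.title.main` (no space) requires both classes on the same element, whereas `.title .main` (with a space) would look for a `.main` element nested inside a `.title` element.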
Now, let’s practice
CSS – Pseudo Classes
<ol>
<li>First Item</li>
<li>Second Item</li>
<li>Third Item</li>
</ol>
li:first-child {color: blue;}
li:nth-child(2) {color: red;}
li:last-child {color: green;}
html %>% html_elements("li:first-child")
1. First Item
2. Second Item
3. Third Item
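The same three pseudo-classes can be used as selectors in rvest. A minimal sketch using the list from this slide; the variable names are illustrative.

```r
library(rvest)

html <- minimal_html('
  <ol>
    <li>First Item</li>
    <li>Second Item</li>
    <li>Third Item</li>
  </ol>
')

# Each pseudo-class picks an item by its position among its siblings.
first_item  <- html %>% html_elements("li:first-child")  %>% html_text()
second_item <- html %>% html_elements("li:nth-child(2)") %>% html_text()
last_item   <- html %>% html_elements("li:last-child")   %>% html_text()
```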
Now, let’s practice
CSS - Combinators
Combinator | Meaning |
space | Descendant |
> | Child |
+ | Adjacent sibling |
~ | General sibling |
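The four combinators in the table can be compared on one small document. A minimal sketch; the HTML structure and variable names are made up for illustration.

```r
library(rvest)

html <- minimal_html('
  <div>
    <h2>Title</h2>
    <p>Direct child, right after the heading.</p>
    <blockquote><p>Nested one level deeper.</p></blockquote>
    <p>Direct child, later sibling.</p>
  </div>
')

descendant <- html %>% html_elements("div p")   # any <p> inside the div
child      <- html %>% html_elements("div > p") # <p> directly under the div
adjacent   <- html %>% html_elements("h2 + p")  # <p> immediately after the <h2>
general    <- html %>% html_elements("h2 ~ p")  # every later sibling <p> of the <h2>
```

The descendant selector finds all three paragraphs, the child and general-sibling selectors each find the two that sit directly under the div, and the adjacent-sibling selector finds only the one right after the heading.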
Now, let’s practice
HTTP Requests
Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview
Anatomy of HTTP Requests
Request:
A request is sent to the web server.
Response:
A response is received from the web server.
Common status codes:
200 (OK), 404 (NOT FOUND), 300s (redirect), 500s (server errors).
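The first digit of a status code identifies its general class. The grouping above could be sketched in base R as follows; `status_class()` is a hypothetical helper written for this example, not a function from any package.

```r
# Hypothetical helper: map a status code to its general class
# using the hundreds digit.
status_class <- function(code) {
  switch(as.character(code %/% 100),
    "1" = "informational",
    "2" = "success",
    "3" = "redirect",
    "4" = "client error",
    "5" = "server error",
    "unknown"  # anything outside 100-599
  )
}

status_class(200)  # "success"
status_class(404)  # "client error"
status_class(503)  # "server error"
```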
Request Methods – GET and POST
GET
POST
HTTP Request with “httr” Package
install.packages("httr")
library(httr)
response <- GET(url = "https://wikipedia.com")
content(response)
status_code(response)
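In practice it can be useful to check the status before using the content. One possible way to combine the calls above is sketched below; `fetch_page()` is a hypothetical helper name, and the example URL is only a placeholder.

```r
library(httr)

# Hypothetical helper: fetch a page and fail early on a bad status.
fetch_page <- function(url) {
  response <- GET(url)
  # stop_for_status() turns 4xx and 5xx responses into R errors.
  stop_for_status(response)
  # Return the body as text rather than as a parsed object.
  content(response, as = "text", encoding = "UTF-8")
}

# Usage (needs an internet connection):
# page <- fetch_page("https://www.example.com")
```

The returned text can then be handed to `read_html()` for parsing.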
Now, let’s practice
Why HTTP Request?
Modify Headers
response <- GET("https://wikipedia.com", user_agent("Hello, it's me, Eli!"))
Making Slow Requests – purrr Package
install.packages("purrr")
library(purrr)
library(rvest)  # provides read_html()
url_list <- c("http://example.com/page1",
"http://example.com/page2",
"http://example.com/page3")
# Wait 3 seconds between successive calls
slow_request <- slowly(read_html, rate = rate_delay(3))
for(url in url_list){
html <- slow_request(url)
}
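The effect of `slowly()` can be observed without touching a real website. In the sketch below, `fake_request()` is a hypothetical stand-in for `read_html()` that only records when it was called, and the delay is shortened to half a second so the example finishes quickly.

```r
library(purrr)

# Stand-in for read_html(): no network, just records the call time.
fake_request <- function(url) {
  list(url = url, time = Sys.time())
}

# Wait 0.5 seconds between calls (a longer delay suits real sites).
slow_request <- slowly(fake_request, rate = rate_delay(0.5))

url_list <- c("http://example.com/page1",
              "http://example.com/page2",
              "http://example.com/page3")

# map() keeps every result, unlike a loop that overwrites one variable.
results <- map(url_list, slow_request)

# Seconds elapsed between consecutive calls.
gaps <- diff(map_dbl(results, ~ as.numeric(.x$time)))
```

Each value in `gaps` should be close to the configured delay. Replacing `fake_request` with `read_html` gives a list of parsed pages, one per URL.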
Now, let’s practice
THANK
YOU