1 of 47

Data Wrangling in R

Advanced data I/O

2 of 47

Google Sheets

3 of 47

https://docs.google.com/spreadsheets

4 of 47

https://speakerdeck.com/jennybc/googlesheets-talk-at-user2015

5 of 47

Reading data with the googlesheets package

install.packages("googlesheets")

library(googlesheets)

?gs_read
?"cell-specification"

6 of 47

https://docs.google.com/spreadsheets/d/1WBrH655fxqKW1QqvD5hnqvvEWIvRzDJcKEgjjFeYxeM/edit#gid=0

7 of 47

8 of 47

sheets_url = "https://docs.google.com/spreadsheets/d/1WBrH655fxqKW1QqvD5hnqvvEWIvRzDJcKEgjjFeYxeM/edit?usp=sharing"

gsurl1 = gs_url(sheets_url)
dat = gs_read(gsurl1)

date_read = date()

date_read

9 of 47

https://docs.google.com/spreadsheets/d/1j9vbv8MrVV7EK15vyz-rnhjiXhRkmIFEHgdv1_p1cCc/edit?usp=sharing

10 of 47

sheets_url = "https://docs.google.com/spreadsheets/d/1j9vbv8MrVV7EK15vyz-rnhjiXhRkmIFEHgdv1_p1cCc/edit?usp=sharing"

# Only necessary on rstudio.cloud

options(httr_oob_default=TRUE)

# Will ask you to log in

gs_auth()

gsurl1 = gs_url(sheets_url)
dat = gs_read(gsurl1)

11 of 47

Google Sheets

https://bit.ly/1Cgzjxb

12 of 47

JSON

13 of 47

https://en.wikipedia.org/wiki/JSON

14 of 47

Why JSON matters

https://developer.github.com/v3/search/
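API responses like GitHub's arrive as JSON text, and the jsonlite package turns that text into ordinary R objects. A minimal offline sketch; the JSON string and its field names are invented for illustration:

```r
library(jsonlite)

# A hand-written JSON string (illustrative only)
txt <- '{"name": "example-repo", "stars": 42, "languages": ["R", "C"]}'
parsed <- fromJSON(txt)

parsed$name       # scalar fields become length-one vectors
parsed$languages  # JSON arrays become R vectors
```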

15 of 47

github_url = "https://api.github.com/users/jtleek/repos"

# install.packages("jsonlite")
library(jsonlite)
jsonData <- fromJSON(github_url)

date_read = date()

date_read

16 of 47

Data frame structure from JSON

dim(jsonData)

jsonData$name

# One of the columns is a data frame!

table(sapply(jsonData, class))
dim(jsonData$owner)
names(jsonData$owner)
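The data-frame column comes from jsonlite's simplification rules: a JSON array of objects becomes a data frame, and a nested object inside each record becomes a data-frame column. A small offline sketch with made-up records shaped like the GitHub response:

```r
library(jsonlite)

# Two made-up records whose "owner" field is itself a JSON object
txt <- '[{"name": "repo1", "owner": {"login": "someuser", "id": 1}},
         {"name": "repo2", "owner": {"login": "someuser", "id": 1}}]'
dat <- fromJSON(txt)

class(dat)        # a data frame with one row per record
class(dat$owner)  # the nested objects become a data-frame column
dat$owner$login
```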

17 of 47

JSON Lab

https://bit.ly/2JNLUil

18 of 47

Web Scraping

19 of 47

This is data

20 of 47

View the source

21 of 47

What the computer sees

22 of 47

Ways to see the source

# Chrome:

# 1. right click on page

# 2. select "view source"

# Firefox:

# 1. right click on page

# 2. select "view source"

# Microsoft Edge:

# 1. right click on page

# 2. select "view source"

# Safari:

# 1. click on "Safari"

# 2. select "Preferences"

# 3. go to "Advanced"

# 4. check "Show Develop menu in menu bar"

# 5. click on "Develop"

# 6. select "show page source"

# 7. alternatively to steps 5-6, right click on page and select "view source"

23 of 47

Inspect element

24 of 47

Copy XPath

25 of 47

rvest package

recount_url = "http://bowtie-bio.sourceforge.net/recount/"

# install.packages("rvest")
library(rvest)
htmlfile = read_html(recount_url)

nds = html_nodes(htmlfile,
                 xpath = '//*[@id="recounttab"]/table')
dat = html_table(nds)
dat = as.data.frame(dat)

head(dat)
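The same xpath-plus-html_table pattern can be tried without a network connection, since read_html also accepts a literal HTML string. The table below is made up, but mimics the shape of the recount table (here the id sits on the table itself, so the xpath has no trailing /table):

```r
library(rvest)

# Parse an in-memory HTML fragment (made-up rows)
page = read_html('<table id="recounttab">
  <tr><th>accession</th><th>species</th></tr>
  <tr><td>SRP000941</td><td>human</td></tr>
</table>')

nds = html_nodes(page, xpath = '//*[@id="recounttab"]')
tab = as.data.frame(html_table(nds[[1]]))
tab
```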

26 of 47

http://motherboard.vice.com/read/70000-okcupid-users-just-had-their-data-published

27 of 47

https://www.theguardian.com/science/2012/may/23/text-mining-research-tool-forbidden

28 of 47

APIs

29 of 47

Application Programming Interfaces

https://developers.facebook.com/

30 of 47

In biology too!

http://www.ncbi.nlm.nih.gov/books/NBK25501/

31 of 47

Step 0: Did someone do this already?

https://ropensci.org/

32 of 47

Do it yourself

33 of 47

Read the docs

https://developer.github.com/v3/

34 of 47

Read the docs

35 of 47

Read the docs

36 of 47

A dissected example

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

37 of 47

The base URL

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

38 of 47

Search repositories

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

39 of 47

Create a query

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

40 of 47

Date repo was created

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

41 of 47

Language repo is in

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

42 of 47

Ignore repos from “cran”

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

43 of 47

# install.packages("httr")
library(httr)

query_url = "https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran"

req = GET(query_url)
names(content(req))
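Rather than pasting the query string together by hand, httr can assemble it; a sketch using modify_url, which builds and percent-encodes the URL without sending any request:

```r
library(httr)

# Build the same search URL programmatically; modify_url handles
# percent-encoding of the query values
url = modify_url("https://api.github.com",
                 path  = "search/repositories",
                 query = list(q = "created:2014-08-13 language:r -user:cran"))
url
```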

44 of 47

Not all APIs are “open”

https://apps.twitter.com/

45 of 47

myapp = oauth_app("twitter",
                  key = "yourConsumerKeyHere",
                  secret = "yourConsumerSecretHere")
sig = sign_oauth1.0(myapp,
                    token = "yourTokenHere",
                    token_secret = "yourTokenSecretHere")
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)

46 of 47

But you can get cool data

json1 = content(homeTL)
json2 = jsonlite::fromJSON(jsonlite::toJSON(json1))
json2[1, 1:4]

                      created_at           id             id_str
1 Mon Jan 13 05:18:04 +0000 2014 4.225984e+17 422598398940684288
                                                                                                                                     text
1 Now that P. Norvig's regex golf IPython notebook hit Slashdot, let's see if our traffic spike tops the previous one: http://t.co/Vc6JhZXOo8

47 of 47

Web + APIs lab

https://bit.ly/2JIwlIt