1 of 47

Data Wrangling in R

Advanced data I/O

2 of 47

Google Sheets

3 of 47

https://docs.google.com/spreadsheets

4 of 47

https://speakerdeck.com/jennybc/googlesheets-talk-at-user2015

5 of 47

Reading data with the googlesheets package

install.packages("googlesheets")

library(googlesheets)

?gs_read
?"cell-specification"

6 of 47

https://docs.google.com/spreadsheets/d/1WBrH655fxqKW1QqvD5hnqvvEWIvRzDJcKEgjjFeYxeM/edit#gid=0

7 of 47

8 of 47

sheets_url = "https://docs.google.com/spreadsheets/d/1WBrH655fxqKW1QqvD5hnqvvEWIvRzDJcKEgjjFeYxeM/edit?usp=sharing"

gsurl1 = gs_url(sheets_url)
dat = gs_read(gsurl1)

date_read = date()

date_read

9 of 47

https://docs.google.com/spreadsheets/d/1j9vbv8MrVV7EK15vyz-rnhjiXhRkmIFEHgdv1_p1cCc/edit?usp=sharing

10 of 47

sheets_url = "https://docs.google.com/spreadsheets/d/1j9vbv8MrVV7EK15vyz-rnhjiXhRkmIFEHgdv1_p1cCc/edit?usp=sharing"

# Only necessary on rstudio.cloud

options(httr_oob_default=TRUE)

# Will ask you to log in

gs_auth()

gsurl1 = gs_url(sheets_url)
dat = gs_read(gsurl1)

11 of 47

Google Sheets

https://bit.ly/1Cgzjxb

12 of 47

JSON

13 of 47

https://en.wikipedia.org/wiki/JSON

14 of 47

Why JSON matters

https://developer.github.com/v3/search/
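API responses like GitHub's arrive as JSON text, and the jsonlite package turns that text into ordinary R objects. A minimal offline sketch; the JSON string and its field names are invented for illustration:

```r
library(jsonlite)

# A hand-written JSON string (illustrative only)
txt <- '{"name": "example-repo", "stars": 42, "languages": ["R", "C"]}'
parsed <- fromJSON(txt)

parsed$name       # scalar fields become length-one vectors
parsed$languages  # JSON arrays become R vectors
```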

15 of 47

github_url = "https://api.github.com/users/jtleek/repos"

# install.packages("jsonlite")
library(jsonlite)
jsonData <- fromJSON(github_url)

date_read = date()

date_read

16 of 47

Data frame structure from JSON

dim(jsonData)

jsonData$name

# One of the columns is a data frame!

table(sapply(jsonData, class))
dim(jsonData$owner)
names(jsonData$owner)
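The data-frame column comes from jsonlite's simplification rules: a JSON array of objects becomes a data frame, and a nested object inside each record becomes a data-frame column. A small offline sketch with made-up records shaped like the GitHub response:

```r
library(jsonlite)

# Two made-up records whose "owner" field is itself a JSON object
txt <- '[{"name": "repo1", "owner": {"login": "someuser", "id": 1}},
         {"name": "repo2", "owner": {"login": "someuser", "id": 1}}]'
dat <- fromJSON(txt)

class(dat)        # a data frame with one row per record
class(dat$owner)  # the nested objects become a data-frame column
dat$owner$login
```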

17 of 47

JSON Lab

https://bit.ly/2JNLUil

18 of 47

Web Scraping

19 of 47

This is data

20 of 47

View the source

21 of 47

What the computer sees

22 of 47

Ways to see the source

# Chrome:

# 1. right click on page

# 2. select "view source"

# Firefox:

# 1. right click on page

# 2. select "view source"

# Microsoft Edge:

# 1. right click on page

# 2. select "view source"

# Safari:

# 1. click on "Safari"

# 2. select "Preferences"

# 3. go to "Advanced"

# 4. check "Show Develop menu in menu bar"

# 5. click on "Develop"

# 6. select "show page source"

# 7. alternatively to steps 5-6, right click on page and select "view source"

23 of 47

Inspect element

24 of 47

Copy XPath

25 of 47

rvest package

recount_url = "http://bowtie-bio.sourceforge.net/recount/"

# install.packages("rvest")
library(rvest)
htmlfile = read_html(recount_url)

nds = html_nodes(htmlfile,
                 xpath = '//*[@id="recounttab"]/table')
dat = html_table(nds)
dat = as.data.frame(dat)

head(dat)
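The same xpath-plus-html_table pattern can be tried without a network connection, since read_html also accepts a literal HTML string. The table below is made up, but mimics the shape of the recount table (here the id sits on the table itself, so the xpath has no trailing /table):

```r
library(rvest)

# Parse an in-memory HTML fragment (made-up rows)
page = read_html('<table id="recounttab">
  <tr><th>accession</th><th>species</th></tr>
  <tr><td>SRP000941</td><td>human</td></tr>
</table>')

nds = html_nodes(page, xpath = '//*[@id="recounttab"]')
tab = as.data.frame(html_table(nds[[1]]))
tab
```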

26 of 47

http://motherboard.vice.com/read/70000-okcupid-users-just-had-their-data-published

27 of 47

https://www.theguardian.com/science/2012/may/23/text-mining-research-tool-forbidden

28 of 47

APIs

29 of 47

Application Programming Interfaces

https://developers.facebook.com/

30 of 47

In biology too!

http://www.ncbi.nlm.nih.gov/books/NBK25501/

31 of 47

Step 0: Did someone do this already?

https://ropensci.org/

32 of 47

Do it yourself

33 of 47

Read the docs

https://developer.github.com/v3/

34 of 47

Read the docs

35 of 47

Read the docs

36 of 47

A dissected example

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

37 of 47

The base URL

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

38 of 47

Search repositories

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

39 of 47

Create a query

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

40 of 47

Date repo was created

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

41 of 47

Language repo is in

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

42 of 47

Ignore repos from “cran”

https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type

43 of 47

# install.packages("httr")
library(httr)

query_url = "https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran"

req = GET(query_url)
names(content(req))
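Rather than pasting the query string together by hand, httr can assemble it; a sketch using modify_url, which builds and percent-encodes the URL without sending any request:

```r
library(httr)

# Build the same search URL programmatically; modify_url handles
# percent-encoding of the query values
url = modify_url("https://api.github.com",
                 path  = "search/repositories",
                 query = list(q = "created:2014-08-13 language:r -user:cran"))
url
```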

44 of 47

Not all APIs are “open”

https://apps.twitter.com/

45 of 47

myapp = oauth_app("twitter",
                  key = "yourConsumerKeyHere",
                  secret = "yourConsumerSecretHere")
sig = sign_oauth1.0(myapp,
                    token = "yourTokenHere",
                    token_secret = "yourTokenSecretHere")
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)

46 of 47

But you can get cool data

json1 = content(homeTL)
json2 = jsonlite::fromJSON(jsonlite::toJSON(json1))
json2[1, 1:4]

                      created_at           id             id_str
1 Mon Jan 13 05:18:04 +0000 2014 4.225984e+17 422598398940684288
                                                                                                                                     text
1 Now that P. Norvig's regex golf IPython notebook hit Slashdot, let's see if our traffic spike tops the previous one: http://t.co/Vc6JhZXOo8

47 of 47

Web + APIs lab

https://bit.ly/2JIwlIt