1 of 36

R Project in Google Summer of Code

Toby Dylan Hocking

Assistant Professor

Northern Arizona University

toby.hocking@r-project.org

Funded by NSF POSE program, project #2303612.

Slides adapted from Arun Srinivasan, and datatable-intro vignette - thanks!

2 of 36

talk overview

  1. Google Summer of Code
  2. data.table in GSOC
  3. animint2 in GSOC

3 of 36

who am i?

  • BA, MS, PhD in Stats/Math (machine learning)
  • Assistant Professor of computer science since 2018
  • Using R since 2003! 20+ years! Author of 10+ packages
  • data.table user since 2015, contributor since 2019
  • Principal Investigator, NSF Pathways to Enable Open-Source Ecosystems (POSE) project, 2023-2025, about data.table

4 of 36

1/3

Google Summer of Code

5 of 36

  • 3-month free/open-source coding project, paid for by Google
  • You do not have to be a coding expert to participate
  • Goal is to teach you, a new contributor, how to contribute to free/open-source software projects
  • I have been co-administrator of R project in Google Summer of Code for 10+ years, and I have mentored 20+ contributors (typically college students)
  • Your mentor should not be from the same institution as you

What is GSOC?

6 of 36

Guide and example

7 of 36

  • Mentors post project ideas on our GitHub wiki
  • A potential contributor should read project ideas, identify a project that is interesting, do tests for that project, then contact mentors
  • Mentors will help potential contributors to write an application, to be submitted to Google before 2 April 2024
  • Admins and Mentors rank applications by impact/feasibility
  • Top applications are funded by Google
  • Bonding/coding/evals from May to Aug 2024

How to participate

8 of 36

2/3

data.table in GSOC and NSF POSE project

9 of 36

data.frame in R

     id   val
  1  b    4
  2  a    2
  3  a    3
  4  c    1
  5  c    5
  6  b    6

  • 2D columnar data structure
    • rows and columns
  • subset rows: DF[DF$id != "a", ]
  • select columns: DF[, "val"]
  • subset rows & select columns: DF[DF$id != "a", "val"]
  • that’s pretty much it…
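The three bracket operations listed above can be tried on the six-row table from this slide. This is a minimal base R sketch; the intermediate variable names (rows, vals, both) are made up for illustration.

```r
# Rebuild the two-column data.frame DF shown on this slide.
DF <- data.frame(
  id = c("b", "a", "a", "c", "c", "b"),
  val = c(4, 2, 3, 1, 5, 6)
)

# Subset rows: keep the rows whose id is not "a".
rows <- DF[DF$id != "a", ]

# Select a column: a single column name returns a plain vector.
vals <- DF[, "val"]

# Subset rows and select a column in one call.
both <- DF[DF$id != "a", "val"]
print(both)
```

Note that DF has to be written twice in the row subset, which is one of the things the data.table syntax later in the talk avoids.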

2 column data.frame

DF


10 of 36

data.table

     id   val
  1  b    4
  2  a    2
  3  a    3
  4  c    1
  5  c    5
  6  b    6

  • Like data.frame, but with more powerful R code syntax, and C code implementation
  • R package on CRAN since 2006
  • Created by Matt Dowle, co-author Arun Srinivasan since 2013, 50+ contributors
  • 1463 other CRAN packages require data.table (in most popular 0.05% of all CRAN packages, rank 11/19932 as of 1 Oct 2023)

2 column data.table

DT


11 of 36

Comparing data.table and tidyverse

  • tidyverse R package 1.0 on CRAN in 2016
  • tidyverse packages tibble + readr + tidyr + dplyr ~= data.table
  • tidyverse uses DF |> ... |>, data.table uses DT[...][...]
  • tidyverse is verbose (lots of code), data.table is concise (little code)
    • example: tibble |> filter(x=="a") |> group_by(z) |> summarise(m=mean(y))
    • vs: DT[x=="a", .(m=mean(y)), by=z]
  • tidyverse has many dependencies, data.table has none (easier to install)
  • tidyverse has frequent breaking changes, data.table ensures backwards compatibility (easier for users to upgrade to new data.table versions)

https://teachdatascience.com/tidyverse/
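To make the concise data.table form from this slide concrete, here is a small sketch that runs it. The column names x, y, z come from the slide; the values in the table are invented for illustration only.

```r
library(data.table)

# Hypothetical toy data with the column names used on the slide.
DT <- data.table(
  x = c("a", "a", "a", "b", "a"),
  y = c(1, 3, 10, 100, 20),
  z = c("g1", "g1", "g2", "g2", "g2")
)

# Filter rows (x == "a"), compute the mean of y, one result row per value of z.
res <- DT[x == "a", .(m = mean(y)), by = z]
print(res)
```

The filter, the summary and the grouping all sit inside one pair of square brackets, which is where the brevity comes from.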

12 of 36

Why is data.table so popular/powerful?

(efficiency)

13 of 36

two kinds of data.table efficiency

  • Efficient R code syntax (saves programming time)
  • Efficient C code implementation (saves time and memory, so larger data sets can be analyzed using smaller computational resources)

14 of 36

data.table R code syntax

General form: DT[i, j, by]

  • i: on which rows?
  • j: what to do?
  • by: grouped by what?
  • SQL analogy: i is WHERE, j is SELECT | UPDATE, by is GROUP BY
  • think in terms of rows, what to do with columns, and groups
  • Matt Dowle's 2014 useR talk https://youtu.be/qLrdYhizEMg?t=1m54s
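The DT[i, j, by] form can be explored one argument at a time. The sketch below reuses the id/val table from the earlier slides; the result names (kept, total, per_id, filtered, DT2) are made up for illustration.

```r
library(data.table)

DT <- data.table(
  id = c("b", "a", "a", "c", "c", "b"),
  val = c(4, 2, 3, 1, 5, 6)
)

# i only: which rows? (like SQL WHERE)
kept <- DT[id != "a"]

# j only: what to do? (like SQL SELECT)
total <- DT[, sum(val)]

# j and by: one sum per group (like SQL GROUP BY)
per_id <- DT[, .(total = sum(val)), by = id]

# i, j and by together: filter first, then summarize per group.
filtered <- DT[val > 1, .(total = sum(val)), by = id]

# j can also update by reference with := (like SQL UPDATE);
# copy() is used here so the original DT stays unchanged.
DT2 <- copy(DT)
DT2[id == "a", val := 0]
```

Groups appear in the result in order of first appearance, so per_id has rows b, a, c.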

15 of 36

data.frame(DF) vs data.table(DT)

sum(DF[DF$code != "abd", "valA"])

DT[code != "abd", sum(valA)]

  • Consider the subset of rows where the code column is not "abd", then compute the sum of the values in the valA column.
  • DF needs to be repeated, no repetition of DT.
  • sum can be placed in the square brackets [ ] with DT, rather than outside with DF.
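A quick check that the two expressions on this slide give the same answer. The column names code and valA come from the slide; the four rows of data are invented for illustration.

```r
library(data.table)

# Hypothetical data with the column names used on the slide.
DF <- data.frame(
  code = c("abd", "xyz", "abd", "pqr"),
  valA = c(1, 10, 2, 20)
)
DT <- as.data.table(DF)

# data.frame: DF is named twice, and sum() wraps the brackets.
sum_df <- sum(DF[DF$code != "abd", "valA"])

# data.table: columns are visible inside the brackets, so sum() goes in j.
sum_dt <- DT[code != "abd", sum(valA)]

print(c(sum_df, sum_dt))
```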

16 of 36

two kinds of data.table efficiency

  • Efficient R code syntax (saves programming time)
  • Efficient C code implementation (saves time and memory, so larger data sets can be analyzed using smaller computational resources)

17 of 36


data.table::fread is an extremely efficient CSV file reader
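For readers who have not used fread, a minimal sketch of calling it follows. The tiny CSV file is written to a temporary path just for this example, so it says nothing about speed; it only shows the interface.

```r
library(data.table)

# Write a small CSV file to a temporary location (illustration only).
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,val", "b,4", "a,2", "a,3"), csv_file)

# fread detects the separator, header and column types automatically.
DT <- fread(csv_file)
str(DT)

# The select argument reads only the named columns.
val_only <- fread(csv_file, select = "val")
```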

18 of 36

data.table computes summaries 100x faster than others

In machine learning, K-fold cross-validation is used to estimate the loss of a hyper-parameter, such as the number of epochs of training of a neural network.

data.table can efficiently compute the average loss over the K=10 folds, for each of the N epochs.

[Figure: an input table with columns fold, epoch, loss is summarized into an output table with one row per epoch (1, ..., N) and columns epoch, Mean, SD, Length.]
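The per-epoch summary described on this slide can be sketched as follows. The loss values are simulated with rnorm purely for illustration, N is set to 5 to keep the output small, and the names loss_dt and stats_dt are made up.

```r
library(data.table)

set.seed(1)
K <- 10  # number of cross-validation folds
N <- 5   # number of epochs (small, for illustration)

# One row per (fold, epoch) combination, with a simulated loss value.
loss_dt <- CJ(fold = 1:K, epoch = 1:N)
loss_dt[, loss := 1 / epoch + rnorm(.N, sd = 0.01)]

# For each epoch: mean and SD of the loss over folds, and the number of folds.
stats_dt <- loss_dt[, .(
  Mean = mean(loss),
  SD = sd(loss),
  Length = .N
), by = epoch]
print(stats_dt)
```

The epoch with the smallest Mean would then be the selected number of epochs.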

19 of 36


data.table::fwrite is an extremely efficient CSV file writer
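A minimal sketch of fwrite, writing a two-row table to a temporary file and reading it back. The table contents and file path are made up for illustration, and a file this small does not demonstrate the speed advantage.

```r
library(data.table)

DT <- data.table(id = c("b", "a"), val = c(4L, 2L))

# Write the table as CSV to a temporary file (illustration only).
out_file <- tempfile(fileext = ".csv")
fwrite(DT, out_file)

# Inspect the raw text that was written.
csv_lines <- readLines(out_file)
print(csv_lines)

# Round trip: read the file back and compare with the original.
back <- fread(out_file)
```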

20 of 36

most underrated package

21 of 36

powerful

22 of 36

data.table data.table data.table

23 of 36

great sadness

24 of 36

Using data.table for efficient big data analysis

See https://github.com/tdhock/2023-10-LatinR-data.table for full 3 hour tutorial presentation slides, with code, figures, exercises...

25 of 36

Contributing to data.table

we need your help!

26 of 36

  • data.table mascot is a sea lion, which barks "R R R"
  • data.table community has a new blog, The Raft, https://rdatatable-community.github.io/The-Raft/
    • "sea lions often float together on the ocean's surface in groups called 'rafts.'" - Marine Mammal Center

Community blog

27 of 36

  • data.table has an active issue/pull request (PR) tracker https://github.com/Rdatatable/data.table/
  • 1000+ open issues, 100+ open PRs
  • if you have any time/interest, we could use your help!
  • easy first contribution: try reproducing an issue (very helpful to know if an issue is reproducible)
  • very inclusive community: after you submit your first PR, you will be invited to join the GitHub group!
  • now is a very exciting time to get involved, as we recently created a formal written document describing decentralized project/community governance

GitHub repository

28 of 36

  • In 2023-2025, the National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table
  • 20 translation awards, US$500 each, to make documentation and messages more accessible. Ideas:
    • Translate errors/warnings/messages (potools package can help)
    • Translate most important vignettes (intro, import, reshape)
    • Translate other documentation (cheat sheets, slides, etc)
  • Priority languages: Portuguese, Spanish, Chinese, French, Russian, Arabic, Hindi
  • Call for proposals: https://rdatatable-community.github.io/The-Raft/

Translation Awards

29 of 36

  • In 2023-2025, the National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table
  • Eight travel awards, US$2700 each
  • Candidates should give a talk about data.table at a conference with a relevant audience (potential data.table users or contributors)
  • Call for proposals on https://rdatatable-community.github.io/The-Raft/

Travel awards

30 of 36

  • concise, consistent syntax
  • time and memory efficient
  • No dependencies (easy to install)
  • No breaking changes (easy to upgrade)
  • Looking for R-GSOC contributors to close issues
  • translation awards, US$500 each
  • travel awards, US$2700 each

Summary of data.table

31 of 36

3/3

animint2 in GSOC: animated interactive ggplots

32 of 36

  • Use ggplot R code to define data viz, rendered on web page.
  • Linked plots, direct manipulation: click on data in one plot/layer to show/hide elements in another.

animint2: animated interactive ggplots

33 of 36

  • Created by Toby Dylan Hocking in 2013
  • Originally imported ggplot2, uses its own fork of ggplot2 since 2017 (thanks!)
  • Improved in GSOC by 9 contributors over the years
  • animint2 maintainers have to know JavaScript, but animint2 users do not! (only R/ggplot code)
  • Lots of similar R packages, main advantage of animint2 is familiar ggplot interface, and simple API for interactions (clickSelects/showSelected) https://github.com/rstats-gsoc/gsoc2024/wiki/Animated-interactive-ggplots#related-work
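To give a feel for the clickSelects/showSelected interface mentioned above, here is a rough sketch of a two-plot data viz. It assumes the animint2 package is installed and that clickSelects and showSelected are passed as geom parameters; the data, the plot names (overview, detail) and the column names are all invented for illustration.

```r
library(animint2)

# Hypothetical detail data: four steps for each of three groups.
detail_df <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  step = rep(1:4, times = 3),
  value = c(1, 2, 3, 4,  5, 4, 3, 2,  2, 2, 3, 3)
)

# Hypothetical overview data: one row per group.
group_mean <- tapply(detail_df$value, detail_df$group, mean)
group_max <- tapply(detail_df$value, detail_df$group, max)
overview_df <- data.frame(
  group = names(group_mean),
  mean_value = as.numeric(group_mean),
  max_value = as.numeric(group_max)
)

viz <- animint(
  # Clicking a point in this plot changes the selected group.
  overview = ggplot() +
    geom_point(aes(x = mean_value, y = max_value),
               clickSelects = "group", data = overview_df, size = 5),
  # This plot shows only the line for the currently selected group.
  detail = ggplot() +
    geom_line(aes(x = step, y = value, group = group),
              showSelected = "group", data = detail_df)
)
```

Printing viz in an interactive R session should render the two linked plots to a web page; only R/ggplot code is needed on the user side.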

animint2 in GSOC

34 of 36

  • New feature in 2023: animint2::update_gallery()
  • Main gallery is https://animint.github.io/gallery/
  • It is a list/table of animint2 data visualizations
  • Each is stored in a GitHub repo, with a link to source code for data viz
  • Gallery is a GitHub repo too, with meta-data about where to find data viz
  • GSOC project: port old-style gallery https://rcdata.nau.edu/genomic-ml/animint-gallery/
  • Automatic/default screenshots

animint2 gallery updates

35 of 36

animint2 axis updates

36 of 36

Thank you! Questions?


Please use/adapt these slides as you like, as long as you give me some credit, as I have done for Arun above.