1 of 36

R Project in Google Summer of Code

Toby Dylan Hocking

Assistant Professor

Northern Arizona University

toby.hocking@r-project.org

Funded by NSF POSE program, project #2303612.

Slides adapted from Arun Srinivasan, and datatable-intro vignette - thanks!

2 of 36

talk overview

Google Summer of Code
data.table in GSOC
animint2 in GSOC

3 of 36

who am i?

BA, MS, PhD in Stats/Math (machine learning)
Assistant Professor of computer science since 2018
Using R since 2003! 20+ years! Author of 10+ packages
data.table user since 2015, contributor since 2019
Principal Investigator, NSF Pathways to Enable Open-Source Ecosystems (POSE) project, 2023-2025, about data.table

4 of 36

1/3

Google Summer of Code

5 of 36

What is GSOC?

3-month free/open-source coding project, paid for by Google
You do not have to be a coding expert to participate
Goal is to teach you, a new contributor, how to contribute to free/open-source software projects
I have been co-administrator of R project in Google Summer of Code for 10+ years, and I have mentored 20+ contributors (typically college students)
Your mentor should not be from the same institution as you

6 of 36

Guide and example

List of all organizations: https://summerofcode.withgoogle.com/programs/2024/organizations
R project idea list -- potential mentors, please add ideas here if you think they could fit into the 3-month coding time frame: https://github.com/rstats-gsoc/gsoc2024/wiki/table%20of%20proposed%20coding%20projects
Timeline: https://github.com/rstats-gsoc/gsoc2025/wiki

7 of 36

How to participate

Mentors post project ideas on our GitHub wiki
A potential contributor should read project ideas, identify a project that is interesting, do tests for that project, then contact mentors
Mentors will help potential contributors to write an application, to be submitted to Google before 2 April 2024
Admins and Mentors rank applications by impact/feasibility
Top applications are funded by Google
Bonding/coding/evals from May to Aug 2024

8 of 36

2/3

data.table in GSOC and NSF POSE project

9 of 36

data.frame in R

	id	val
1	b	4
2	a	2
3	a	3
4	c	1
5	c	5
6	b	6

2D columnar data structure

rows and columns

subset rows: DF[DF$id != "a", ]
select columns: DF[, "val"]
subset rows & select columns: DF[DF$id != "a", "val"]
that’s pretty much it…
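
As a minimal sketch, the three operations above can be run on the 2-column example table from this slide. The result names (rows_not_a, val_col, val_not_a) are illustrative and do not appear in the original slides.

```r
# Rebuild the 2-column example data.frame shown on this slide.
DF <- data.frame(
  id = c("b", "a", "a", "c", "c", "b"),
  val = c(4, 2, 3, 1, 5, 6)
)

# Subset rows: keep rows whose id is not "a".
rows_not_a <- DF[DF$id != "a", ]

# Select columns: a single column name returns a plain vector.
val_col <- DF[, "val"]

# Subset rows and select a column in one call.
val_not_a <- DF[DF$id != "a", "val"]
```

Note that DF has to be written twice whenever the row condition refers to one of its own columns.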

2 column data.frame

10 of 36

data.table

	id	val
1	b	4
2	a	2
3	a	3
4	c	1
5	c	5
6	b	6

Like data.frame, but with more powerful R code syntax, and C code implementation
R package on CRAN since 2006
Created by Matt Dowle, co-author Arun Srinivasan since 2013, 50+ contributors
1463 other CRAN packages require data.table (in most popular 0.05% of all CRAN packages, rank 11/19932 as of 1 Oct 2023)

2 column data.table

11 of 36

Comparing data.table and tidyverse

tidyverse R package 1.0 on CRAN in 2016
tidyverse packages tibble + readr + tidyr + dplyr ~= data.table
tidyverse uses DF |> ... |>, data.table uses DT[...][...]
tidyverse is verbose (lots of code), data.table is concise (little code)
example: tibble |> filter(x=="a") |> group_by(z) |> summarise(m=mean(y))
vs: DT[x=="a", .(m=mean(y)), by=z]
tidyverse has many dependencies, data.table has none (easier to install)
tidyverse has frequent breaking changes, data.table ensures backwards compatibility (easier for users to upgrade to new data.table versions)

https://teachdatascience.com/tidyverse/
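
The data.table side of the comparison above can be tried with a small sketch. The column names x, y, z come from the slide's example; the data values and the result names (means, means_sorted) are assumptions made for illustration.

```r
library(data.table)

# Hypothetical data using the column names from the slide's example.
DT <- data.table(
  x = c("a", "a", "b", "a"),
  y = c(1, 3, 10, 5),
  z = c("u", "u", "u", "v")
)

# Keep rows where x == "a", then compute the mean of y for each value of z.
means <- DT[x == "a", .(m = mean(y)), by = z]

# Chaining DT[...][...]: the second pair of brackets works on the first result,
# here sorting the group means in decreasing order.
means_sorted <- DT[x == "a", .(m = mean(y)), by = z][order(-m)]
```

Chaining plays the same role as the pipe in the tidyverse version: each step receives the table produced by the previous one.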

12 of 36

Why is data.table so popular/powerful?

(efficiency)

13 of 36

two kinds of data.table efficiency

Efficient R code syntax (saves programming time)
Efficient C code implementation (saves time and memory, so larger data sets can be analyzed using smaller computational resources)

14 of 36

data.table R code syntax

General form: DT[i, j, by]

i: on which rows?
j: what to do?
by: grouped by what?

think in terms of rows, what to do with columns, and groups
Matt's 2014 useR talk https://youtu.be/qLrdYhizEMg?t=1m54s

SQL analogy: i = WHERE, j = SELECT | UPDATE, by = GROUP BY
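
A small sketch of the general form, using the 2-column id/val example table from the earlier slides. The row condition (val > 1) and the names total and res are assumptions chosen for illustration.

```r
library(data.table)

# The 2-column example table used earlier in the talk.
DT <- data.table(
  id = c("b", "a", "a", "c", "c", "b"),
  val = c(4, 2, 3, 1, 5, 6)
)

# i:  on which rows?  -> rows where val > 1
# j:  what to do?     -> sum of val, stored in a column named total
# by: grouped by what -> one result row per value of id
res <- DT[val > 1, .(total = sum(val)), by = id]
```

Groups appear in the result in order of first appearance in the data (b, then a, then c).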

15 of 36

data.frame(DF) vs data.table(DT)

sum(DF[DF$code != "abd", "valA"])

DT[code != "abd", sum(valA)]

Consider the subset of rows where the code column is not "abd", then compute the sum of values in the valA column.
DF needs to be repeated, no repetition of DT.
sum can be placed in the square brackets [ ] with DT, rather than outside with DF.
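
To check that the two expressions agree, here is a sketch on made-up data. The column names code and valA come from the slide; the values and the names sum_df and sum_dt are hypothetical.

```r
library(data.table)

# Hypothetical data with the column names used on this slide.
DF <- data.frame(
  code = c("abd", "xyz", "abd", "qrs"),
  valA = c(1, 2, 3, 4)
)
DT <- as.data.table(DF)

# data.frame: DF is written twice, and sum() wraps the whole subset.
sum_df <- sum(DF[DF$code != "abd", "valA"])

# data.table: columns are visible inside [ ], so sum() goes in j.
sum_dt <- DT[code != "abd", sum(valA)]
```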

16 of 36

two kinds of data.table efficiency

Efficient R code syntax (saves programming time)
Efficient C code implementation (saves time and memory, so larger data sets can be analyzed using smaller computational resources)

17 of 36

data.table::fread is an extremely efficient CSV file reader

Source code: https://tdhock.github.io/blog/2023/dt-atime-figures/
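
A minimal sketch of calling fread, assuming the CSV content is supplied in memory through the text argument rather than from a file on disk. The data are the first rows of the id/val example table, and the name dt is illustrative. This only shows usage; it does not reproduce the timings in the figure.

```r
library(data.table)

# Each element of the character vector is one line of CSV input.
csv_lines <- c("id,val", "b,4", "a,2", "a,3")

# fread detects the separator, the header, and the column types.
dt <- fread(text = csv_lines)
```

For a real file, the path would be passed as the first argument instead of text.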

18 of 36

data.table computes summaries 100x faster than others

In machine learning, K-fold cross-validation is used to estimate the loss of a hyper-parameter, such as the number of epochs of training of a neural network.

data.table can efficiently compute the average loss over the K=10 folds, for each of the N epochs.

[Figure: a table with columns fold, epoch, loss is summarized into a table with one row per epoch and columns epoch, Mean, Length]
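
The per-epoch summary described on this slide can be sketched as follows. The loss values are synthetic and deterministic (an assumption made so the result is easy to check), N = 5 epochs is arbitrary, and the output column names Mean and Length follow the labels visible in the slide's figure.

```r
library(data.table)

K <- 10  # number of cross-validation folds
N <- 5   # number of epochs (arbitrary choice for this sketch)

# One row per (fold, epoch) combination.
loss_dt <- CJ(fold = 1:K, epoch = 1:N)

# Synthetic loss: decreases with epoch, with a small per-fold offset.
loss_dt[, loss := 1 / epoch + fold / 100]

# For each epoch, average the loss over folds and count the folds.
summary_dt <- loss_dt[, .(Mean = mean(loss), Length = .N), by = epoch]
```

The epoch with the smallest Mean would then be selected as the hyper-parameter value.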

19 of 36

data.table::fwrite is an extremely efficient CSV file writer

Source code: https://tdhock.github.io/blog/2023/dt-atime-figures/
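
A minimal round-trip sketch: write a small table with fwrite, then read it back with fread. The temporary file path and the names csv_path and DT_back are assumptions for illustration; this shows usage only, not the benchmark in the figure.

```r
library(data.table)

DT <- data.table(
  id = c("b", "a", "a"),
  val = c(4L, 2L, 3L)
)

# Write to a temporary CSV file, then read it back.
csv_path <- tempfile(fileext = ".csv")
fwrite(DT, csv_path)
DT_back <- fread(csv_path)

# Clean up the temporary file.
unlink(csv_path)
```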

20 of 36

most underrated package

21 of 36

powerful

22 of 36

data.table data.table data.table

23 of 36

great sadness

24 of 36

Using data.table for efficient big data analysis

See https://github.com/tdhock/2023-10-LatinR-data.table for full 3 hour tutorial presentation slides, with code, figures, exercises...

25 of 36

Contributing to data.table

we need your help!

26 of 36

Community blog

data.table mascot is a sea lion, which barks "R R R"
data.table community has a new blog, The Raft, https://rdatatable-community.github.io/The-Raft/
sea lions often float together on the ocean's surface in groups called "rafts." - Marine Mammal Center

27 of 36

GitHub repository

data.table has an active issue/pull request (PR) tracker: https://github.com/Rdatatable/data.table/
1000+ open issues, 100+ open PRs
if you have any time/interest, we could use your help!
easy first contribution: try reproducing an issue (very helpful to know if an issue is reproducible)
very inclusive community -- after you submit your first PR, you will be invited to join the GitHub group!
now is a very exciting time to get involved, as we recently created a formal written document describing decentralized project/community governance

28 of 36

Translation Awards

In 2023-2025, National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table
20 translation awards, US$500 each, in order to make documentation and messages more accessible. Ideas:
Translate errors/warnings/messages (potools package can help)
Translate most important vignettes (intro, import, reshape)
Translate other documentation (cheat sheets, slides, etc)
Priority: Portuguese, Spanish, Chinese, French, Russian, Arabic, Hindi
Call for proposals: https://rdatatable-community.github.io/The-Raft/

29 of 36

Travel awards

In 2023-2025, National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table
Eight travel awards, US$2700 each
Candidates should give a talk about data.table at a conference with a relevant audience (potential data.table users or contributors)
Call for proposals: https://rdatatable-community.github.io/The-Raft/

30 of 36

Summary of data.table

concise, consistent syntax
time and memory efficient
No dependencies (easy to install)
No breaking changes (easy to upgrade)
Looking for R-GSOC contributors to close issues
translation awards, US$500 each
travel awards, US$2700 each

31 of 36

3/3

animint2 in GSOC: animated interactive ggplots

32 of 36

animint2: animated interactive ggplots

Use ggplot R code to define data viz, rendered on web page.
Linked plots, direct manipulation: click on data in one plot/layer to show/hide elements in another.

33 of 36

animint2 in GSOC

Created by Toby Dylan Hocking in 2013
Originally imported ggplot2; has used a fork of ggplot2 since 2017 (thanks!)
Improved in GSOC by 9 contributors over the years
animint2 maintainers have to know JavaScript, but animint2 users do not! (only R/ggplot code)
Lots of similar R packages; main advantages of animint2 are the familiar ggplot interface and a simple API for interactions (clickSelects/showSelected). Related work: https://github.com/rstats-gsoc/gsoc2024/wiki/Animated-interactive-ggplots#related-work

34 of 36

animint2 gallery updates

New feature in 2023: animint2::update_gallery()
Main gallery is https://animint.github.io/gallery/
It is a list/table of animint2 data visualizations
Each is stored in a GitHub repo, with a link to source code for data viz
Gallery is a GitHub repo too, with meta-data about where to find data viz
GSOC project: port old-style gallery https://rcdata.nau.edu/genomic-ml/animint-gallery/
Automatic/default screenshots

35 of 36

animint2 axis updates

Experimental support for updating axes in response to changing selection, https://github.com/animint/animint2/blob/master/tests/testthat/test-renderer3-update-axes-multiple-ss.R
Example that works: https://tdhock.github.io/2023-11-21-auc-improved/
GSOC project about fixing related bugs, using JavaScript instead of R code.

36 of 36

Thank you! Questions?

Toby Dylan Hocking

Assistant Professor

Northern Arizona University

toby.hocking@r-project.org

Funded by NSF POSE program, project #2303612.

Slides adapted from Arun Srinivasan, and datatable-intro vignette - thanks!

Please use/adapt these slides as you like, as long as you give me some credit, as I have done for Arun above.