R Project in Google Summer of Code
Toby Dylan Hocking
Assistant Professor
Northern Arizona University
toby.hocking@r-project.org
Funded by NSF POSE program, project #2303612.
Slides adapted from Arun Srinivasan, and datatable-intro vignette - thanks!
talk overview
who am i?
1/3
Google Summer of Code
What is GSOC?
Guide and example
How to participate
2/3
data.table in GSOC and NSF POSE project
data.frame in R
  | id | val |
1 | b  | 4   |
2 | a  | 2   |
3 | a  | 3   |
4 | c  | 1   |
5 | c  | 5   |
6 | b  | 6   |
2 column data.frame
DF
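As a minimal sketch, the 2 column data.frame shown on this slide could be built in base R as follows. The values are copied from the slide; the variable name a_vals is only illustrative.

```r
# Rebuild the 6-row, 2-column example table from the slide.
DF <- data.frame(
  id = c("b", "a", "a", "c", "c", "b"),
  val = c(4, 2, 3, 1, 5, 6)
)

# Inspect the column types and the first values.
str(DF)

# Base R subsetting: rows where id is "a", then the val column.
a_vals <- DF[DF$id == "a", "val"]
print(a_vals)
```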
data.table
  | id | val |
1 | b  | 4   |
2 | a  | 2   |
3 | a  | 3   |
4 | c  | 1   |
5 | c  | 5   |
6 | b  | 6   |
2 column data.table
DT
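The same table as a data.table, as a small sketch that assumes the data.table package is installed. The values are copied from the slide; a_vals is an illustrative name.

```r
library(data.table)

# Same 6 rows and 2 columns as the data.frame example.
DT <- data.table(
  id = c("b", "a", "a", "c", "c", "b"),
  val = c(4, 2, 3, 1, 5, 6)
)

# A data.table is also a data.frame, so most data.frame code accepts it.
print(is.data.frame(DT))

# Inside the square brackets, columns can be used by name without DT$.
a_vals <- DT[id == "a", val]
print(a_vals)
```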
Comparing data.table and tidyverse
https://teachdatascience.com/tidyverse/
Why is data.table so popular/powerful?
(efficiency)
two kinds of data.table efficiency
data.table R code syntax
General form: DT[i, j, by]
i: on which rows?
j: what to do?
by: grouped by what?
SQL equivalents: i = WHERE, j = SELECT | UPDATE, by = GROUP BY
data.frame(DF) vs data.table(DT)
sum(DF[DF$code != "abd", "valA"])
DT[code != "abd", sum(valA)]
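The comparison above can be run end to end with a small sketch. The column names code and valA come from the slide, but the data values are invented for illustration, and the by example is an assumption about how the third argument would typically be added.

```r
library(data.table)

# Hypothetical data: column names follow the slide, values are made up.
DT <- data.table(
  code = c("abd", "abc", "abc", "xyz", "xyz", "abd"),
  valA = c(4, 2, 3, 1, 5, 6)
)
DF <- as.data.frame(DT)

# Base R: DF is repeated and the column name is quoted.
base_sum <- sum(DF[DF$code != "abd", "valA"])

# data.table: i selects rows, j computes on columns.
dt_sum <- DT[code != "abd", sum(valA)]

# Adding by evaluates j once per group.
by_code <- DT[code != "abd", .(total = sum(valA)), by = code]
print(by_code)
```

Both forms return the same sum; the data.table form avoids repeating the table name and extends to grouped summaries by adding by.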
two kinds of data.table efficiency
data.table::fread is an extremely efficient CSV file reader
Source code: https://tdhock.github.io/blog/2023/dt-atime-figures/
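A minimal usage sketch of fread, assuming the data.table package is installed. The temporary file and its contents are invented for illustration; this is not the benchmark code from the blog post linked above.

```r
library(data.table)

# Write a tiny CSV file to a temporary path (illustrative data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,val", "b,4", "a,2", "a,3"), csv_path)

# fread detects the separator, header, and column types automatically.
DT <- fread(csv_path)
print(DT)

# select reads only the named columns, which saves time on wide files.
val_only <- fread(csv_path, select = "val")

unlink(csv_path)
```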
data.table computes summaries 100x faster than others
In machine learning, K-fold cross-validation is used to estimate the loss of a hyper-parameter, such as the number of epochs of training of a neural network.
data.table can efficiently compute the average loss over the K=10 folds, for each of the N epochs.
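The summary described above can be sketched as follows. The loss values are simulated (not real training results), N is set to 5 to keep the output small, and the names loss_dt and stats_dt are hypothetical.

```r
library(data.table)

set.seed(1)
n_epochs <- 5   # N on the slide; kept small here
n_folds <- 10   # K = 10 folds, as on the slide

# One row per (fold, epoch) combination.
loss_dt <- CJ(fold = 1:n_folds, epoch = 1:n_epochs)

# Simulated loss that decreases with epoch, plus noise.
loss_dt[, loss := 1 / epoch + rnorm(.N, sd = 0.05)]

# One row per epoch: mean, standard deviation, and number of folds.
stats_dt <- loss_dt[, .(
  Mean = mean(loss),
  SD = sd(loss),
  Length = .N
), by = epoch]
print(stats_dt)
```

The epoch with the smallest Mean would be the usual choice for the hyper-parameter.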
Figure: input columns fold, epoch, loss; summary columns epoch, Mean, SD, Length, with one row for each epoch 1, ..., N.
data.table::fwrite is an extremely efficient CSV file writer
Source code: https://tdhock.github.io/blog/2023/dt-atime-figures/
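A minimal usage sketch of fwrite, assuming the data.table package is installed. The table and the temporary file are illustrative only; this is not the benchmark code from the blog post linked above.

```r
library(data.table)

# Small illustrative table.
DT <- data.table(id = c("b", "a", "a"), val = c(4L, 2L, 3L))

# Write the table to a temporary CSV file.
csv_path <- tempfile(fileext = ".csv")
fwrite(DT, csv_path)

# Check the header line, then read the file back with fread.
header <- readLines(csv_path, n = 1)
round_trip <- fread(csv_path)

unlink(csv_path)
```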
most underrated package
powerful
data.table data.table data.table
great sadness
Using data.table for efficient big data analysis
See https://github.com/tdhock/2023-10-LatinR-data.table for the full 3-hour tutorial: presentation slides with code, figures, and exercises.
Contributing to data.table
we need your help!
Community blog
GitHub repository
Translation Awards
Travel awards
Summary of data.table
3/3
animint2 in GSOC: animated interactive ggplots
animint2: animated interactive ggplots
animint2 in GSOC
animint2 gallery updates
animint2 axis updates
Thank you! Questions?
Toby Dylan Hocking
Assistant Professor
Northern Arizona University
toby.hocking@r-project.org
Funded by NSF POSE program, project #2303612.
Slides adapted from Arun Srinivasan, and datatable-intro vignette - thanks!
Please use or adapt these slides as you like,
as long as you give me some credit,
as I have done for Arun above.