R package development class presentation
Toby Dylan Hocking
Assistant Professor
Northern Arizona University
toby.hocking@r-project.org
Funded by NSF POSE program, project #2303612.
Slides adapted from Arun Srinivasan, and datatable-intro vignette - thanks!
Talk overview
Who am I?
Part 1 of 3: data.table NSF POSE project
data.frame in R
|   | id | val |
| 1 | b  | 4   |
| 2 | a  | 2   |
| 3 | a  | 3   |
| 4 | c  | 1   |
| 5 | c  | 5   |
| 6 | b  | 6   |
2 column data.frame
DF
data.table
|   | id | val |
| 1 | b  | 4   |
| 2 | a  | 2   |
| 3 | a  | 3   |
| 4 | c  | 1   |
| 5 | c  | 5   |
| 6 | b  | 6   |
2 column data.table
DT
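As a minimal sketch, the two tables above could be built as follows. It assumes the data.table package is installed; the object names DF and DT follow the slides.

```r
library(data.table)

# The six rows shown on the slides.
DF <- data.frame(
  id = c("b", "a", "a", "c", "c", "b"),
  val = c(4, 2, 3, 1, 5, 6))

# Same data as a data.table; a data.table is also a data.frame,
# so functions that accept a data.frame generally accept it too.
DT <- data.table(
  id = c("b", "a", "a", "c", "c", "b"),
  val = c(4, 2, 3, 1, 5, 6))

class(DF)  # "data.frame"
class(DT)  # "data.table" "data.frame"
```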
Comparing data.table and tidyverse
https://teachdatascience.com/tidyverse/
Why is data.table so popular/powerful?
(efficiency)
Two kinds of data.table efficiency
data.table R code syntax
General form: DT[i, j, by]
i: on which rows? (SQL: WHERE)
j: what to do? (SQL: SELECT | UPDATE)
by: grouped by what? (SQL: GROUP BY)
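The DT[i, j, by] form can be sketched on the six-row example table from the earlier slides. The filter val > 1 and the output column name total are illustrative choices, not taken from the slides.

```r
library(data.table)

DT <- data.table(
  id = c("b", "a", "a", "c", "c", "b"),
  val = c(4, 2, 3, 1, 5, 6))

# i:  keep only rows with val > 1      (SQL: WHERE val > 1)
# j:  compute the sum of val           (SQL: SELECT SUM(val) AS total)
# by: one result row per value of id   (SQL: GROUP BY id)
res <- DT[val > 1, .(total = sum(val)), by = id]
print(res)
```

Groups appear in order of first occurrence in the data, so the result has rows for b, a, and c with totals 10, 5, and 5.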
data.frame (DF) vs data.table (DT)
sum(DF[DF$code != "abd", "valA"])
DT[code != "abd", sum(valA)]
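To check that the two expressions above compute the same thing, here is a small sketch. Only the column names code and valA come from the slide; the data values are hypothetical.

```r
library(data.table)

# Hypothetical data; only the column names come from the slide.
DF <- data.frame(
  code = c("abd", "xyz", "abd", "qrs"),
  valA = c(1, 2, 3, 4))
DT <- as.data.table(DF)

# Base R: DF must be repeated, and the column is named as a string.
base_sum <- sum(DF[DF$code != "abd", "valA"])

# data.table: columns are referred to directly inside the brackets.
dt_sum <- DT[code != "abd", sum(valA)]

stopifnot(base_sum == dt_sum)
```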
data.table::fread is an extremely efficient CSV file reader
Source code: https://tdhock.github.io/blog/2023/dt-atime-figures/
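A minimal usage sketch of fread, assuming data.table is installed. The tiny CSV file is written to a temporary path purely for illustration; it is not the benchmark from the linked blog post.

```r
library(data.table)

# Write a small CSV file to a temporary location (illustration only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,val", "b,4", "a,2", "a,3"), csv_path)

# fread detects the separator, header, and column types automatically.
DT_read <- fread(csv_path)
str(DT_read)

unlink(csv_path)  # clean up the temporary file
```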
data.table computes summaries 100x faster than others
In machine learning, K-fold cross-validation is used to estimate the loss of a hyper-parameter, such as the number of epochs of training of a neural network.
data.table can efficiently compute the average loss over the K=10 folds, for each of the N epochs.
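The summary described above might look like the following sketch. The loss values are simulated with runif, and N = 5 epochs is an arbitrary choice for illustration; only the column names epoch, fold, loss, Mean, SD, and Length come from the slides.

```r
library(data.table)

set.seed(1)
n_epochs <- 5   # N in the slides; small hypothetical value
n_folds <- 10   # K = 10 folds

# One row per (epoch, fold) combination, with a simulated loss value.
loss_dt <- CJ(epoch = 1:n_epochs, fold = 1:n_folds)
loss_dt[, loss := runif(.N)]

# For each epoch, summarize the loss over the 10 folds.
summary_dt <- loss_dt[, .(
  Mean = mean(loss),
  SD = sd(loss),
  Length = .N
), by = epoch]
print(summary_dt)
```

The result has one row per epoch, and Length is 10 in every row because each epoch has one loss value per fold.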
[Diagram: input table with columns epoch (1, ..., N), fold (1, ..., 10), and loss; output summary table with columns epoch, Mean, SD, Length.]
data.table::fwrite is an extremely efficient CSV file writer
Source code: https://tdhock.github.io/blog/2023/dt-atime-figures/
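A minimal usage sketch of fwrite, assuming data.table is installed. The table and the temporary file path are illustrative only and unrelated to the benchmark in the linked blog post.

```r
library(data.table)

DT_out <- data.table(
  id = c("b", "a", "a"),
  val = c(4, 2, 3))

# Write the table as CSV to a temporary file (illustration only).
out_path <- tempfile(fileext = ".csv")
fwrite(DT_out, out_path)

# Inspect the raw text that was written, then read it back.
csv_lines <- readLines(out_path)
DT_back <- fread(out_path)

unlink(out_path)  # clean up the temporary file
```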
most underrated package
powerful
data.table data.table data.table
great sadness
Using data.table for efficient big data analysis
See https://github.com/tdhock/2023-10-LatinR-data.table for full 3 hour tutorial presentation slides, with code, figures, exercises...
Contributing to data.table
we need your help!
Community blog and survey
GitHub repository
Translation Awards
Travel awards
Summary of data.table
Part 2 of 3: rOpenSci Statistical Software Peer Review
Sharing R packages on CRAN
Please rather use the Authors@R field and declare Maintainer, Authors
and Contributors with their appropriate roles with person() calls.
e.g. something like:
Authors@R: c(person("Alice", "Developer", role = c("aut", "cre","cph"),
email = "alice.developer@some.domain.net"),
person("Bob", "Dev", role = "aut") )
Please always write package names, software names and API (application
programming interface) names in single quotes in title and description.
e.g: --> 'C'; 'PCRE'; ...
Please note that package names are case sensitive.
Example of review from CRAN
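The person() calls requested in the review can be tried directly in an R session, since person() is provided by the utils package that ships with R. The sketch below reuses the placeholder names from the review; it only shows how the objects behave, not how CRAN processes the DESCRIPTION file.

```r
# Build the same author list that would go in the Authors@R field.
authors <- c(
  person("Alice", "Developer", role = c("aut", "cre", "cph"),
         email = "alice.developer@some.domain.net"),
  person("Bob", "Dev", role = "aut"))

# Printing shows each person with their roles.
print(authors)

# The maintainer is the person whose roles include "cre".
is_maintainer <- vapply(
  seq_along(authors),
  function(i) "cre" %in% authors[i]$role,
  logical(1))
```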
Review of form, not content. How to get feedback about the quality of your code/documentation/etc?
Peer-reviewed research papers
(**) Section 1 and abstract: The proposed model uses time information
along the data to constrain the inference during the execution of the dynamic
programming algorithm. I don’t understand the reason why the authors are
discussing train and test data. Training is typically associated with learning
coefficients of variables. In this case, the learning step involves determining the
intervals with the presence or absence of a change. These intervals are provided
by experts using already known data (training set). It seems to me that
referring to apriori information would be less confusing.
...
(**) Page 10 Algorithm and line 48: "... by the Decode sub-routine".
We would need more details about the recovering of vector m. Is this
backtracking step the reason why the authors used notation μ inside the functional
description of the costs? Do we have vector equality m = μ? If not, why?
Example of peer review
Detailed review of content, but this is as detailed as it gets in terms of code review in a typical academic journal.
rOpenSci Guide and example
Part 3 of 3: Google Summer of Code
What is GSOC?
Guide and example
Thank you! Questions?
Please use or adapt these slides as you like,
as long as you give me some credit,
as I have done for Arun above.