1 of 36

R package development class presentation

Toby Dylan Hocking

Assistant Professor

Northern Arizona University

toby.hocking@r-project.org

Funded by NSF POSE program, project #2303612.

Slides adapted from Arun Srinivasan, and datatable-intro vignette - thanks!

2 of 36

talk overview

data.table NSF POSE project
rOpenSci Statistical Software Peer Review
Google Summer of Code

3 of 36

who am i?

BA, MS, PhD in Stats/Math (machine learning)
Assistant Professor of computer science since 2018
Using R since 2003! 20+ years! Author of 10+ packages
data.table user since 2015, contributor since 2019
Principal Investigator, NSF Pathways to Enable Open-Source Ecosystems (POSE) project, 2023-2025, about data.table

4 of 36

1/3

data.table NSF POSE project

5 of 36

data.frame in R

	id	val
1	b	4
2	a	2
3	a	3
4	c	1
5	c	5
6	b	6

2D columnar data structure

rows and columns

subset rows: DF[DF$id != "a", ]
select columns: DF[, "val"]
subset rows & select columns: DF[DF$id != "a", "val"]
that’s pretty much it…

2 column data.frame

DF

	id	val
1	b	4
2	a	2
3	a	3
4	c	1
5	c	5
6	b	6
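
The three operations listed on this slide can be tried on the 2 column data.frame shown here. A minimal sketch in base R; the object names rows, vals, and both are only for illustration:

```r
# Build the 2 column data.frame shown on the slide.
DF <- data.frame(
  id = c("b", "a", "a", "c", "c", "b"),
  val = c(4, 2, 3, 1, 5, 6))

# subset rows: keep the rows whose id is not "a"
rows <- DF[DF$id != "a", ]

# select columns: a single column comes back as a plain vector
vals <- DF[, "val"]

# subset rows & select columns in one call
both <- DF[DF$id != "a", "val"]
print(both)
```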

6 of 36

data.table

	id	val
1	b	4
2	a	2
3	a	3
4	c	1
5	c	5
6	b	6

Like data.frame, but with more powerful R code syntax, and C code implementation
R package on CRAN since 2006
Created by Matt Dowle, co-author Arun Srinivasan since 2013, 50+ contributors
1463 other CRAN packages require data.table (among the most popular 0.05% of all CRAN packages, rank 11/19932 as of 1 Oct 2023)

2 column data.table

DT

	id	val
1	b	4
2	a	2
3	a	3
4	c	1
5	c	5
6	b	6
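
A minimal sketch of creating the same 2 column table as a data.table, assuming the data.table package is installed. A data.table is also a data.frame, so it can be passed to functions that expect a data.frame.

```r
library(data.table)

# Same data as the data.frame example, stored as a data.table.
DT <- data.table(
  id = c("b", "a", "a", "c", "c", "b"),
  val = c(4, 2, 3, 1, 5, 6))

# A data.table inherits from data.frame.
print(class(DT))
print(is.data.frame(DT))

# Column names can be used directly inside the square brackets.
not_a <- DT[id != "a", val]
print(not_a)
```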

7 of 36

Comparing data.table and tidyverse

tidyverse R package 1.0 on CRAN in 2016
tidyverse packages tibble + readr + tidyr + dplyr ~= data.table
tidyverse uses DF |> ... |>, data.table uses DT[...][...]
tidyverse is verbose (lots of code), data.table is concise (little code)
example: tibble |> filter(x=="a") |> group_by(z) |> summarise(m=mean(y))
vs: DT[x=="a", .(m=mean(y)), by=z]
tidyverse has many dependencies, data.table has none (easier to install)
tidyverse has frequent breaking changes, data.table ensures backwards compatibility (easier for users to upgrade to new data.table versions)
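
The concise data.table form from the comparison above can be run on a small table. This is only a sketch: the column names x, y, z come from the slide, but the values are made up for illustration.

```r
library(data.table)

# Toy data (hypothetical values), with the column names used on the slide.
DT <- data.table(
  x = c("a", "a", "b", "a"),
  y = c(1, 3, 10, 5),
  z = c("u", "u", "v", "v"))

# Rows where x is "a", mean of y, computed for each value of z.
res <- DT[x == "a", .(m = mean(y)), by = z]
print(res)

# Chaining: a second pair of square brackets operates on the result of the first.
sorted <- DT[x == "a", .(m = mean(y)), by = z][order(-m)]
print(sorted)
```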

https://teachdatascience.com/tidyverse/

8 of 36

Why is data.table so popular/powerful?

(efficiency)

9 of 36

two kinds of data.table efficiency

Efficient R code syntax (saves programming time)
Efficient C code implementation (saves time and memory, so larger data sets can be analyzed using smaller computational resources)

10 of 36

data.table R code syntax

General form: DT[i, j, by]

i: on which rows?
j: what to do?
by: grouped by what?

think in terms of rows, what to do with columns, and groups
Matt's 2014 useR talk https://youtu.be/qLrdYhizEMg?t=1m54s

SQL analogy: i is WHERE, j is SELECT | UPDATE, by is GROUP BY
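
As a sketch of the general form DT[i, j, by], using the 2 column table from the earlier slides. The threshold val > 1 and the names totals and DT2 are arbitrary choices for illustration.

```r
library(data.table)

DT <- data.table(
  id = c("b", "a", "a", "c", "c", "b"),
  val = c(4, 2, 3, 1, 5, 6))

# i: rows with val > 1; j: sum of val; by: one result row per id.
# Roughly: SELECT id, SUM(val) AS total FROM DT WHERE val > 1 GROUP BY id
totals <- DT[val > 1, .(total = sum(val)), by = id]
print(totals)

# j can also update columns in place with :=, similar to SQL UPDATE ... WHERE.
DT2 <- copy(DT)  # copy so that the original DT is not modified
DT2[id == "a", val := val * 10]
print(DT2)
```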

11 of 36

data.frame(DF) vs data.table(DT)

sum(DF[DF$code != "abd", "valA"])

DT[code != "abd", sum(valA)]

Consider the subset of rows where the code column is not "abd", then compute the sum of the values in the valA column.
The name DF must be repeated, whereas DT is written only once.
With DT, sum can be placed inside the square brackets [ ], rather than outside as with DF.
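
Both lines can be checked on a small example. The column names code and valA are from the slide; the values below are assumed for illustration.

```r
library(data.table)

# Toy data (hypothetical values).
DF <- data.frame(
  code = c("abd", "xyz", "abd", "klm"),
  valA = c(1, 2, 3, 4))
DT <- as.data.table(DF)

# data.frame: DF is written twice, and sum wraps the whole expression.
df_sum <- sum(DF[DF$code != "abd", "valA"])

# data.table: DT is written once, and sum goes inside the square brackets.
dt_sum <- DT[code != "abd", sum(valA)]

print(c(df_sum, dt_sum))
```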

12 of 36

two kinds of data.table efficiency

Efficient R code syntax (saves programming time)
Efficient C code implementation (saves time and memory, so larger data sets can be analyzed using smaller computational resources)

13 of 36

data.table::fread is an extremely efficient CSV file reader

Source code: https://tdhock.github.io/blog/2023/dt-atime-figures/
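
A minimal sketch of reading a CSV file with fread. The file here is a tiny temporary one created just for this example; meaningful timings need much larger files.

```r
library(data.table)

# Create a small CSV file to read (contents are made up for illustration).
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,val", "b,4", "a,2", "a,3"), csv_file)

# fread detects the separator, header, and column types automatically.
DT <- fread(csv_file)
print(DT)
```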

14 of 36

data.table computes summaries 100x faster than others

In machine learning, K-fold cross-validation is used to estimate the loss of a hyper-parameter, such as the number of epochs of training of a neural network.

data.table can efficiently compute the average loss over the K=10 folds, for each of the N epochs.

[Diagram: input table with columns epoch (1, ..., N), fold (1, ..., 10), and loss; output table with one row per epoch and columns epoch, Mean, SD, Length]
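
The summary over folds can be sketched as follows. The loss values are simulated with rnorm and N=5 is an arbitrary choice, so only the shape of the computation is meaningful; the names loss_dt and summary_dt are for illustration.

```r
library(data.table)

set.seed(1)
K <- 10  # number of cross-validation folds
N <- 5   # number of epochs (arbitrary for this sketch)

# One row for each combination of fold and epoch.
loss_dt <- CJ(fold = 1:K, epoch = 1:N)

# Simulated loss values, standing in for real validation losses.
loss_dt[, loss := rnorm(.N)]

# For each epoch: mean, standard deviation, and number of folds.
summary_dt <- loss_dt[, .(
  Mean = mean(loss),
  SD = sd(loss),
  Length = .N
), by = epoch]
print(summary_dt)
```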

15 of 36

data.table::fwrite is an extremely efficient CSV file writer

Source code: https://tdhock.github.io/blog/2023/dt-atime-figures/
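
A minimal sketch of writing a CSV file with fwrite, then reading it back with fread. The table is tiny and made up for illustration; meaningful timings need much larger data.

```r
library(data.table)

DT <- data.table(id = c("b", "a", "a"), val = c(4, 2, 3))

# Write the table to a temporary CSV file.
csv_file <- tempfile(fileext = ".csv")
fwrite(DT, csv_file)

# Inspect the raw lines of the file, then read it back in.
csv_lines <- readLines(csv_file)
print(csv_lines)
DT_back <- fread(csv_file)
```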

16 of 36

most underrated package

17 of 36

powerful

18 of 36

data.table data.table data.table

19 of 36

great sadness

20 of 36

Using data.table for efficient big data analysis

See https://github.com/tdhock/2023-10-LatinR-data.table for full 3 hour tutorial presentation slides, with code, figures, exercises...

21 of 36

Contributing to data.table

we need your help!

22 of 36

Community blog and survey

data.table mascot is a sea lion, which barks "R R R"
data.table community has a new blog, The Raft, https://rdatatable-community.github.io/The-Raft/
sea lions often float together on the ocean's surface in groups called "rafts." - Marine Mammal Center

23 of 36

GitHub repository

data.table has an active issue/Pull Request (PR) tracker https://github.com/Rdatatable/data.table/
1000+ open issues, 100+ open PRs
if you have any time/interest, we could use your help!
easy first contribution: try reproducing an issue (very helpful to know if an issue is reproducible)
very inclusive community: after you submit your first PR, you will be invited to join the GitHub group!
now is a very exciting time to get involved, as we recently created a formal written document describing decentralized project/community governance

24 of 36

Translation Awards

In 2023-2025, the National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table
20 translation awards, US$500 each, to make documentation and messages more accessible. Ideas:
Translate errors/warnings/messages (potools package can help)
Translate most important vignettes (intro, import, reshape)
Translate other documentation (cheat sheets, slides, etc)
Priority: Portuguese, Spanish, Chinese, French, Russian, Arabic, Hindi
Call for proposals: https://rdatatable-community.github.io/The-Raft/

25 of 36

Travel awards

In 2023-2025, the National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table
Eight travel awards, US$2700 each
Candidates should give a talk about data.table at a conference with a relevant audience (potential data.table users or contributors)
Call for proposals: https://rdatatable-community.github.io/The-Raft/

26 of 36

Summary of data.table

concise, consistent syntax
time and memory efficient
No dependencies (easy to install)
No breaking changes (easy to upgrade)
Inclusive user/developer community with opportunities to contribute:
translation awards, US$500 each
travel awards, US$2700 each

27 of 36

2/3

rOpenSci Statistical Software Peer Review

28 of 36

Sharing R packages on CRAN

R package is the formal unit of code sharing
Package includes documentation, tests, vignettes, ...
CRAN is the Comprehensive R Archive Network, the simplest and most widely used package repository
CRAN regularly checks every package to make sure it is working correctly (and compatible with all other packages). This is a huge benefit for the R community.
But there is very little review of correctness of package content/code

29 of 36

Example of review from CRAN

Please rather use the Authors@R field and declare Maintainer, Authors and Contributors with their appropriate roles with person() calls.
e.g. something like:
Authors@R: c(person("Alice", "Developer", role = c("aut", "cre","cph"),
                    email = "alice.developer@some.domain.net"),
             person("Bob", "Dev", role = "aut") )

Please always write package names, software names and API (application programming interface) names in single quotes in title and description.
e.g: --> 'C'; 'PCRE'; ...
Please note that package names are case sensitive.

Review of form, not content. How to get feedback about the quality of your code/documentation/etc?
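
The Authors@R field contains R code, so it can be checked in an R session before submission. A minimal sketch using person() from the utils package, with the placeholder names from the CRAN message; the name n_maintainers is only for illustration.

```r
# person() is in the utils package, which is attached by default.
authors <- c(
  person("Alice", "Developer", role = c("aut", "cre", "cph"),
         email = "alice.developer@some.domain.net"),
  person("Bob", "Dev", role = "aut"))

# Print to check names, emails, and roles.
print(authors)

# Count the people with the "cre" (maintainer) role; a package has one maintainer.
n_maintainers <- sum(sapply(
  seq_along(authors),
  function(i) "cre" %in% authors[i]$role))
print(n_maintainers)
```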

30 of 36

Peer-reviewed research papers

Peer-reviewed paper is the formal unit of sharing research results
Peer reviewers typically read the manuscript in depth, and provide substantive feedback/suggestions about the content of the paper
But typically peer reviewers do not read the code required to do the computations in those papers. There are some exceptions, for example Journal of Statistical Software.

31 of 36

Example of peer review

(**) Section 1 and abstract: The proposed model uses time information along the data to constrain the inference during the execution of the dynamic programming algorithm. I don’t understand the reason why the authors are discussing train and test data. Training is typically associated with learning coefficients of variables. In this case, the learning step involves determining the intervals with the presence or absence of a change. These intervals are provided by experts using already known data (training set). It seems to me that referring to apriori information would be less confusing.

...

(**) Page 10 Algorithm and line 48: "... by the Decode sub-routine". We would need more details about the recovering of vector m. Is this backtracking step the reason why the authors used notation μ inside the functional description of the costs? Do we have vector equality m = μ? If not, why?

Detailed review of content, but this is as detailed as it gets in terms of code review in a typical academic journal.

32 of 36

rOpenSci Guide and example

Goal is to provide thorough peer review of the content of R packages (code/documentation/etc), analogous to what journals do for peer-reviewed research papers.
https://stats-devguide.ropensci.org/
case study of canaper R package peer review that I supervised: https://github.com/ropensci/software-review/issues/475

33 of 36

3/3

Google Summer of Code

34 of 36

What is GSOC?

3 month free/open-source coding project, paid for by Google
You don't have to be a coding expert to participate
Goal is to teach you how to contribute to free/open-source software projects
I have been co-administrator of R project in Google Summer of Code for 10+ years, and I have mentored 20+ contributors (typically college students)
Your mentor should not be from the same institution as you

35 of 36

Guide and example

List of all organizations https://summerofcode.withgoogle.com/programs/2024/organizations
R project idea list: https://github.com/rstats-gsoc/gsoc2024/wiki/table%20of%20proposed%20coding%20projects

36 of 36

Thank you! Questions?

Toby Dylan Hocking

Assistant Professor

Northern Arizona University

toby.hocking@r-project.org

Funded by NSF POSE program, project #2303612.

Slides adapted from Arun Srinivasan, and datatable-intro vignette - thanks!

Please use/adapt these slides as you like, as long as you give me some credit, as I have done for Arun above.