1 of 116

Using and contributing to the data.table package for efficient big data analysis

Download these slides: http://ml.nau.edu/latinr.pdf

Toby Dylan Hocking

Assistant Professor

Northern Arizona University

toby.hocking@r-project.org

Funded by NSF POSE program, project #2303612.

Slides adapted from Arun Srinivasan, and datatable-intro vignette - thanks!

2 of 116

who am i?

  • BA, MS, PhD in Stats/Math (machine learning)
  • Assistant Professor of computer science since 2018
  • Using R since 2003! 20 years! Author of 10+ packages
  • data.table user since 2015, contributor since 2019
  • Principal Investigator, NSF Pathways to Enable Open-Source Ecosystems (POSE) project, 2023-2025, about data.table

3 of 116

talk overview

  1. What is data.table?
  2. Why is data.table so popular/powerful? (efficiency)
  3. Using data.table for efficient big data analysis
  4. Contributing to data.table
     • translate documentation from English to another language, get US$500
     • travel awards: get up to US$2700 to talk about data.table at conferences
     • participate in community, Issues and Pull Requests on GitHub

4 of 116

1/4

What is data.table?

5 of 116

data.frame in R

2 column data.frame, DF:

  id val
1  b   4
2  a   2
3  a   3
4  c   1
5  c   5
6  b   6
  • 2D columnar data structure
    • rows and columns
  • subset rows — DF[DF$id != "a", ]
  • select columns — DF[, "val"]
  • subset rows & select columns — DF[DF$id != "a", "val"]
  • that’s pretty much it…


6 of 116

data.table

2 column data.table, DT:

  id val
1  b   4
2  a   2
3  a   3
4  c   1
5  c   5
6  b   6

  • Like data.frame, but with more powerful R code syntax, and C code implementation
  • R package on CRAN since 2006
  • Created by Matt Dowle, co-author Arun Srinivasan since 2013, 50+ contributors
  • 1463 other CRAN packages require data.table (in most popular 0.05% of all CRAN packages, rank 11/19932 as of 1 Oct 2023)


7 of 116

Comparing data.table and tidyverse

  • tidyverse R package 1.0 on CRAN in 2016
  • tidyverse packages tibble + readr + tidyr + dplyr ~= data.table
  • tidyverse uses DF |> ... |>, data.table uses DT[...][...]
  • tidyverse is verbose (lots of code), data.table is concise (little code).
    Example: tibble |> filter(x=="a") |> group_by(z) |> summarise(m=mean(y))
    vs: DT[x=="a", .(m=mean(y)), by=z]
  • tidyverse has many dependencies, data.table has none (easier to install)
  • tidyverse has frequent breaking changes, data.table ensures backwards compatibility (easier for users to upgrade to new data.table versions)

https://teachdatascience.com/tidyverse/

8 of 116

2/4

Why is data.table so popular/powerful?

(efficiency)

9 of 116

two kinds of data table efficiency

  • Efficient R code syntax (saves programming time)
  • Efficient C code implementation (saves time and memory, so larger data sets can be analyzed using smaller computational resources)

10 of 116

data table R code syntax

General form: DT[i, j, by]

i = on which rows, j = what to do, by = grouped by what?

  • think in terms of — rows, what to do with columns, and groups
  • Matt's 2014 useR talk https://youtu.be/qLrdYhizEMg?t=1m54s

SQL analogy: i ~ WHERE, j ~ SELECT | UPDATE, by ~ GROUP BY
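The SQL analogy can be tried on a tiny table (a sketch, using the 2-column id/val example from the earlier slides):

```r
library(data.table)
DT <- data.table(id = c("b","a","a","c","c","b"), val = c(4,2,3,1,5,6))
# SQL: SELECT id, SUM(val) AS total FROM DT WHERE id <> 'a' GROUP BY id
DT[id != "a", .(total = sum(val)), by = id]
# total is 10 for id "b" (4+6) and 6 for id "c" (1+5)
```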

11 of 116

data table R code syntax

2 column data.table, DT:

   id val
1:  b   4
2:  a   2
3:  a   3
4:  c   1
5:  c   5
6:  b   6

  • data.tables are columnar data structures as well
    • 2D — rows and columns
  • subset rows — DT[id != "a", ]
  • select columns — DT[, val]
  • compute on columns — DT[, mean(val)]
  • subset rows & select / compute on columns — DT[id != "a", mean(val)]
  • virtual 3rd dimension — group by


12 of 116

data.frame(DF) vs data.table(DT)

sum(DF[DF$code != "abd", "valA"])

DT[code != "abd", sum(valA)]

  • Consider the subset of rows where the code column is not "abd", then compute the sum of values in the valA column.
  • DF needs to be repeated, no repetition of DT.
  • sum can be placed in the square brackets [ ] with DT, rather than outside with DF.
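A runnable sketch of the comparison above, on a toy table with the code and valA columns:

```r
library(data.table)
DF <- data.frame(code = c("abd","xyz","xyz"), valA = c(1, 2, 3))
DT <- as.data.table(DF)
sum(DF[DF$code != "abd", "valA"])  # base R: DF repeated, sum() outside [ ]
DT[code != "abd", sum(valA)]       # data.table: no repetition, sum() inside [ ]
# both return 5
```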

13 of 116

data.frame(DF) vs data.table(DT)

DF[DF$code == "abd", "valA"] <- NA

DT[code == "abd", valA := NA]

  • For the subset of rows with "abd" in the code column, set values in the valA column to missing/NA.
  • Need to repeat DF, no need to repeat DT.
  • DF uses arrow assignment <-; DT uses walrus :=

14 of 116

data.frame(DF) vs data.table(DT)

aggregate(cbind(valA, valB) ~ id,
          DF[DF$code != "abd", ],
          sum)

DT[code != "abd",
   .(sum(valA), sum(valB)),
   by = id]

  • Summarize by group: consider the subset of rows where the code column is not "abd", then compute the sum of values in the valA and valB columns, for each unique value of the id column.
  • Everything goes inside DT's square brackets [ ], whereas using DF we need to use other functions (aggregate, cbind)

15 of 116

data.frame(DF) vs data.table(DT)

DF1 <- merge(DF1, DF2, all.x=TRUE)
DF1[, "valA"] <- ifelse(is.na(DF1$val),
                        DF1$valA, DF1$val)
DF1$val <- NULL

DT1[DT2, valA := val, on = .(id, code)]

  • Modify while joining: for every row of DT1 which matches a row of DT2 on id and code columns, set valA to val
  • Easy to do with DT, := assignment, and on= argument
  • Difficult but possible to do in base R with DF
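The update join above can be run on toy tables with the id, code, valA, and val columns (a sketch, not the slide's actual data):

```r
library(data.table)
DT1 <- data.table(id = 1:3, code = "x", valA = c(10, 20, 30))
DT2 <- data.table(id = 2:3, code = "x", val = c(200, 300))
DT1[DT2, valA := val, on = .(id, code)]  # update join: modifies DT1 in place
DT1$valA
# rows matched in DT2 are overwritten: 10 200 300
```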

16 of 116

two kinds of data table efficiency

  • Efficient R code syntax (saves programming time)
  • Efficient C code implementation (saves time and memory, so larger data sets can be analyzed using smaller computational resources)

17 of 116


data.table::fread is an extremely efficient CSV file reader

18 of 116

data.table computes summaries 100x faster than others

In machine learning, K-fold cross-validation is used to estimate the loss of a hyper-parameter, such as the number of epochs of training of a neural network.

data.table can efficiently compute the average loss over the K=10 folds, for each of the N epochs.

[Diagram: a table of loss values, with one row per epoch (1..N) and fold (1..10), summarized into Mean, SD, and Length of the loss for each epoch.]

19 of 116


data.table::fwrite is an extremely efficient CSV file writer

20 of 116

most underrated package

21 of 116

powerful

22 of 116

data.table data.table data.table

23 of 116

great sadness

24 of 116

3/4

Using data.table for efficient big data analysis

25 of 116

Install data.table, get PDF docs

> install.packages("remotes")
> remotes::install_github("Rdatatable/data.table")

(need a compiler: Xcode for Mac, or Rtools for Windows)

OR, if you do not have a compiler: install from CRAN then update:

> install.packages("data.table")
> data.table::update_dev_pkg()

  • Exercise: install development version from GitHub, which contains the most recent new features
  • Download cheat sheet http://ml.nau.edu/dtcheat.pdf
  • Getting started in R - tinyverse edition http://ml.nau.edu/tiny.pdf

26 of 116

Create data.table from R objects

> data.table(
+   ID = c("b","b","b","a","a","c"),
+   a = 1:2,
+   b = 7,
+   c = 13:18
+ )
   ID a b c
1:  b 1 7 13
2:  b 2 7 14
3:  b 1 7 15
4:  a 2 7 16
5:  a 1 7 17
6:  c 2 7 18

  • data.table(column_name=column_value), just like data.frame
  • values in short columns are recycled to the length of the longest column

27 of 116

data.table vs frame columns

> head(data.table("10^8 m^3"=Nile, year=1871:1970))

10^8 m^3 year

1: 1120 1871

2: 1160 1872

3: 963 1873

4: 1210 1874

5: 1160 1875

6: 1160 1876

> head(data.frame("10^8 m^3"=Nile, year=1871:1970))

X10.8.m.3 year

1 1120 1871

2 1160 1872

3 963 1873

4 1210 1874

5 1160 1875

6 1160 1876

  • non-syntactic column names are kept by default in data.table; data.frame mangles them unless check.names=FALSE

28 of 116

data.table vs frame list columns

> data.table(L=list("scalar", c("vec","tor")),x=1:2)

L x

1: scalar 1

2: vec,tor 2

> data.frame(L=I(list("scalar", c("vec","tor"))),x=1:2)

L x

1 scalar 1

2 vec, tor 2

  • need to use I( ) to create list columns in a data.frame, but not in a data.table

29 of 116

Exercises

  • Your turn: convert the built-in iris data set to a data table via data.table(iris)
  • Create data table with one row, and columns for your name (character), years experience with R (numeric), birthplace (character), city of residence (character), language(s) spoken (list of character)

30 of 116

Read CSV from file

> if(!file.exists("flights14.csv"))download.file(
+   "http://ml.nau.edu/flights14.csv",
+   "flights14.csv")
> (flights <- data.table::fread("flights14.csv"))

year month day dep_delay arr_delay carrier origin dest air_time distance hour

1: 2014 1 1 14 13 AA JFK LAX 359 2475 9

2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11

3: 2014 1 1 2 9 AA JFK LAX 351 2475 19

4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7

5: 2014 1 1 2 1 AA JFK LAX 350 2475 13

---

253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14

253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8

253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11

253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11

253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8

  • Airline NYC-flights14 data set

31 of 116

Read CSV from shell command

> data.table::fread("egrep 'LAX|year' flights14.csv")

year month day dep_delay arr_delay carrier origin dest air_time distance hour

1: 2014 1 1 14 13 AA JFK LAX 359 2475 9

2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11

3: 2014 1 1 2 9 AA JFK LAX 351 2475 19

4: 2014 1 1 2 1 AA JFK LAX 350 2475 13

5: 2014 1 1 4 0 AA EWR LAX 339 2454 18

---

14430: 2014 10 31 -6 -38 UA JFK LAX 323 2475 11

14431: 2014 10 31 9 -21 UA EWR LAX 323 2454 13

14432: 2014 10 31 40 14 UA EWR LAX 320 2454 18

14433: 2014 10 31 3 -32 UA EWR LAX 320 2454 20

14434: 2014 10 31 10 -27 UA EWR LAX 326 2454 10

  • Useful with commands that output CSV data (no intermediate file)
  • egrep useful if you do not need/want to read full data set into R

32 of 116

Exercise: Read CSV from URL

  • Your turn, try the following code:

flights <- data.table::fread("http://ml.nau.edu/flights14.csv")

  • Convenient, but I usually download.file( ) first, to avoid having to download the file again after re-starting R

33 of 116

Subset rows using i

General form: DT[i, j, by]

i = on which rows, j = what to do, by = grouped by what?

  • think in terms of basic units — rows, columns and groups
  • data.table syntax provides placeholder for each of them

34 of 116

Logical row subset

> flights[origin == "JFK" & month == 6]

year month day dep_delay arr_delay carrier origin dest air_time distance hour

1: 2014 6 1 -9 -5 AA JFK LAX 324 2475 8

2: 2014 6 1 -10 -13 AA JFK LAX 329 2475 12

3: 2014 6 1 18 -1 AA JFK LAX 326 2475 7

4: 2014 6 1 -6 -16 AA JFK LAX 320 2475 10

5: 2014 6 1 -4 -45 AA JFK LAX 326 2475 18

---

8418: 2014 6 30 -3 -6 MQ JFK PIT 62 340 14

8419: 2014 6 30 -5 -32 MQ JFK RDU 65 427 14

8420: 2014 6 30 -3 -16 MQ JFK DCA 39 213 17

8421: 2014 6 30 -2 7 MQ JFK DCA 52 213 7

8422: 2014 6 30 -7 -18 MQ JFK RDU 67 427 8

  • Get all the flights with "JFK" as the origin airport in the month of June.
  • DT[rows, ] comma not necessary (but allowed)

35 of 116

Integer row subset

> flights[1:2]

year month day dep_delay arr_delay carrier origin dest air_time distance hour

1: 2014 1 1 14 13 AA JFK LAX 359 2475 9

2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11

> flights[-(1:2)]

year month day dep_delay arr_delay carrier origin dest air_time distance hour

1: 2014 1 1 2 9 AA JFK LAX 351 2475 19

2: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7

3: 2014 1 1 2 1 AA JFK LAX 350 2475 13

4: 2014 1 1 4 0 AA EWR LAX 339 2454 18

5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21

---

253310: 2014 10 31 1 -30 UA LGA IAH 201 1416 14

253311: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8

253312: 2014 10 31 -8 16 MQ LGA RDU 83 431 11

253313: 2014 10 31 -4 15 MQ LGA DTW 75 502 11

253314: 2014 10 31 -5 1 MQ LGA SDF 110 659 8

  • Get the first two flights, or all except the first two.
  • Integer subset in DT works just like in DF.

36 of 116

Sort rows

> flights[order(origin, -dest)]

year month day dep_delay arr_delay carrier origin dest air_time distance hour

1: 2014 1 5 6 49 EV EWR XNA 195 1131 8

2: 2014 1 6 7 13 EV EWR XNA 190 1131 8

3: 2014 1 7 -6 -13 EV EWR XNA 179 1131 8

4: 2014 1 8 -7 -12 EV EWR XNA 184 1131 8

5: 2014 1 9 16 7 EV EWR XNA 181 1131 8

---

253312: 2014 10 31 -1 -22 WN LGA ATL 112 762 9

253313: 2014 10 31 -5 -23 WN LGA ATL 112 762 20

253314: 2014 4 6 -6 -1 EV LGA AGS 110 678 10

253315: 2014 4 7 2 1 EV LGA AGS 111 678 11

253316: 2014 4 11 0 -19 EV LGA AGS 102 678 10

  • Sort flights first by column origin in ascending order, and then by dest in descending order
  • data.table sort is so efficient that it was integrated into base R in 2016!

37 of 116

Exercises

  • Your turn, how many rows are there with carrier "AA" and dest "LAX" ?
  • If you sort that table, what is the first and last month/day/hour?
  • Hint: You can use DT[ ... ][ ... ] in a chain. Similar to piping, DF |> ... |> ...

38 of 116

Compute columns in j

General form: DT[i, j, by]

i = on which rows, j = what to do, by = grouped by what?

  • think in terms of basic units — rows, columns and groups
  • data.table syntax provides placeholder for each of them

39 of 116

Select one column

> str(flights[, "arr_delay" ])

Classes ‘data.table’ and 'data.frame': 253316 obs. of 1 variable:

$ arr_delay: int 13 13 9 -26 1 0 -18 -14 -17 -14 ...

- attr(*, ".internal.selfref")=<externalptr>

> str(flights[, arr_delay]) #like DT[["arr_delay"]] or DT$arr_delay

int [1:253316] 13 13 9 -26 1 0 -18 -14 -17 -14 ...

  • DT[, "arr_delay"] -> data.table with one column
  • DT[, arr_delay] -> vector, like DT$arr_delay or DT[["arr_delay"]] in base R
  • Exercise: use hist( ) to plot a histogram of departure delays

40 of 116

Select two columns with literal

> flights[, data.table(arr_delay, dep_delay)]

arr_delay dep_delay

1: 13 14

2: 13 -3

---

253315: 15 -4

253316: 1 -5

> flights[, c("arr_delay", "dep_delay")]

arr_delay dep_delay

1: 13 14

2: 13 -3

---

253315: 15 -4

253316: 1 -5

  • If data.table/list/. is used in j, then a data.table is returned
  • Can also use c("name") in j

41 of 116

Select two columns with variable

> (DT = data.table(x=1:2, y=3:4, select_cols=c("foo", "bar")))
   x y select_cols
1: 1 3         foo
2: 2 4         bar
> select_cols = c("x","y")
> DT[, select_cols]
[1] "foo" "bar"
> DT[, ..select_cols]
   x y
1: 1 3
2: 2 4
> DT[, select_cols, with=FALSE]
   x y
1: 1 3
2: 2 4

  • Can use with=FALSE or double dot notation .. (like one level up in unix file path, look for select_cols in environment, not as a column name)

42 of 116

Select all columns except two

> delay_cols = c("arr_delay", "dep_delay")

> head(flights[, !..delay_cols])

year month day carrier origin dest air_time distance hour

1: 2014 1 1 AA JFK LAX 359 2475 9

2: 2014 1 1 AA JFK LAX 363 2475 11

3: 2014 1 1 AA JFK LAX 351 2475 19

4: 2014 1 1 AA LGA PBI 157 1035 7

5: 2014 1 1 AA JFK LAX 350 2475 13

6: 2014 1 1 AA EWR LAX 339 2454 18

> head(flights[, -..delay_cols])

year month day carrier origin dest air_time distance hour

1: 2014 1 1 AA JFK LAX 359 2475 9

2: 2014 1 1 AA JFK LAX 363 2475 11

3: 2014 1 1 AA JFK LAX 351 2475 19

4: 2014 1 1 AA LGA PBI 157 1035 7

5: 2014 1 1 AA JFK LAX 350 2475 13

6: 2014 1 1 AA EWR LAX 339 2454 18

  • Can use either ! or - to negate the selection of columns

43 of 116

Select range of columns

> head(flights[, year:day])

year month day

1: 2014 1 1

2: 2014 1 1

3: 2014 1 1

4: 2014 1 1

5: 2014 1 1

6: 2014 1 1

> head(flights[, !(year:day)])

dep_delay arr_delay carrier origin dest air_time distance hour

1: 14 13 AA JFK LAX 359 2475 9

2: -3 13 AA JFK LAX 363 2475 11

3: 2 9 AA JFK LAX 351 2475 19

4: -8 -26 AA LGA PBI 157 1035 7

5: 2 1 AA JFK LAX 350 2475 13

6: 4 0 AA EWR LAX 339 2454 18

  • Can use column1:column2 to select those two, and all columns in between (and can negate with ! or -)

44 of 116

Rename and compute columns

> flights[, data.table(destination=dest, air_hours=air_time/60)]

destination air_hours

1: LAX 5.983333

2: LAX 6.050000

3: LAX 5.850000

4: PBI 2.616667

5: LAX 5.833333

---

253312: IAH 3.350000

253313: IAH 3.150000

253314: RDU 1.383333

253315: DTW 1.250000

253316: SDF 1.833333

  • Exercise: try using .( ) or list( ) instead of data.table( ) in j
  • .( ) is an alias for list( ) -- items in list are columns of table

45 of 116

Summarize all rows

> flights[, .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]

m_arr m_dep

1: 8.146702 12.46526

> flights[, .(range_arr = range(arr_delay), stat=c("min","max"))]

range_arr stat

1: -112 min

2: 1494 max

  • Inside data.table( )/list( )/.( ) in j, summary functions can be used,
  • in this case we use mean, which returns a single number
  • we also use range, which returns two numbers
  • we compute mean, min, max arrival delay over all flights
  • Exercise: compute min/max air_time and distance

46 of 116

Create/delete columns in j

General form: DT[i, j, by]

i = on which rows, j = what to do, by = grouped by what?

  • think in terms of basic units — rows, columns and groups
  • We use DT[, variable := value] or set(DT, j="variable", value=value)

47 of 116

Comparing column assignment in base R and data.table

Base R code vs data.table code:

Literal:
  DF$next <- DF$current + 1
  DT[, next := current + 1 ]

Variable (my_var_name <- "next"):
  DF[[my_var_name]] <- DF$current + 1
  set(DT, j=my_var_name, value=DT$current + 1)

  • In column named "next" set value to current + 1
  • Possible to use base R code with data.table
  • Historically, := has been more efficient (time/memory)
  • Recent versions of R have improved efficiency

48 of 116

Create column := value

> month.dt <- flights[, .(month, air_time, distance)]

> head(month.dt)

month air_time distance

1: 10 44 96

2: 10 39 96

3: 10 37 184

4: 10 35 184

5: 10 50 184

6: 10 36 184

> month.dt[, air_hours := air_time/60]

> head(month.dt)

month air_time distance air_hours

1: 10 44 96 0.7333333

2: 10 39 96 0.6500000

3: 10 37 184 0.6166667

4: 10 35 184 0.5833333

5: 10 50 184 0.8333333

6: 10 36 184 0.6000000

  • DT[, variable := value] can be used to create new columns, or update
  • Same effect as DT$variable <- value, or DT[["variable"]] <- value

49 of 116

Delete column := NULL

> head(month.dt)

month air_time distance air_hours

1: 10 44 96 0.0000000

2: 10 39 96 0.0000000

3: 10 37 184 0.6166667

4: 10 35 184 0.5833333

5: 10 50 184 0.8333333

6: 10 36 184 0.6000000

> month.dt[, air_time := NULL]

> head(month.dt)

month distance air_hours

1: 10 96 0.0000000

2: 10 96 0.0000000

3: 10 184 0.6166667

4: 10 184 0.5833333

5: 10 184 0.8333333

6: 10 184 0.6000000

  • DT[, variable := NULL] deletes column
  • All := operations are by reference (efficient)
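Because := modifies by reference, plain assignment of a data.table does not make a copy; use copy( ) when an independent table is needed. A minimal sketch:

```r
library(data.table)
DT <- data.table(x = 1:3)
DT2 <- DT            # NOT a copy: DT2 and DT are the same table
DT2[, y := x * 2]    # := works by reference, so this also adds y to DT
names(DT)            # "x" "y"
DT3 <- copy(DT)      # copy() creates an independent table
DT3[, z := 0L]       # only DT3 gets z
names(DT)            # still "x" "y": DT is unaffected
```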

50 of 116

set( ) to create/delete column

> my_var_name = "new"

> set(flights, j=my_var_name, value="HI!")

> flights

year month day dep_delay arr_delay carrier origin dest air_time distance hour new

1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 HI!

2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 HI!

---

253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11 HI!

253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8 HI!

> set(flights, j=my_var_name, value=NULL)

> flights

year month day dep_delay arr_delay carrier origin dest air_time distance hour

1: 2014 1 1 14 13 AA JFK LAX 359 2475 9

2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11

---

253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11

253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8

  • set(DT, j="col_name", value=col_val) same as DT[, col_name := col_val]
  • Useful when the column name is stored in a variable, as above

51 of 116

Exercises

  • Create a new column air_hours := air_time/60 as in previous slides
  • Create a new column arrival_hour := hour + air_hours
  • Use flights[, range(hour)] to find the normal range of the hour variable
  • Do a subset to find how many rows have arrival_hour > max(hour)
  • Create a new column arrival_day which is either "same" or "next", using ifelse(arrival_hour>max(hour), "next", "same")

52 of 116

Compute on a subset with i and j

General form: DT[i, j, by]

i = on which rows, j = what to do, by = grouped by what?

  • think in terms of basic units — rows, columns and groups
  • data.table syntax provides placeholder for each of them

53 of 116

Summarize a subset of rows

> flights[origin == "JFK" & month == 6L, length(dest)]

[1] 8422

> flights[origin == "JFK" & month == 6L, .N]

[1] 8422

> flights[origin == "JFK" & month == 6L, .(m_arr=mean(arr_delay), num_flights=.N)]

m_arr num_flights

1: 5.839349 8422

  • How many trips have been made in 2014 from "JFK" airport in the month of June, and what was the mean arrival delay?
  • Special symbol .N can be used in [ ] to get the number of rows
  • Exercise: for all flights with a positive departure delay, what was the mean arrival delay? for flights with a negative departure delay?

54 of 116

Update column :=

> head(month.dt)

month air_time distance air_hours

1: 10 44 96 0.7333333

2: 10 39 96 0.6500000

3: 10 37 184 0.6166667

4: 10 35 184 0.5833333

5: 10 50 184 0.8333333

6: 10 36 184 0.6000000

> month.dt[distance<100, air_hours := 0]

> head(month.dt)

month air_time distance air_hours

1: 10 44 96 0.0000000

2: 10 39 96 0.0000000

3: 10 37 184 0.6166667

4: 10 35 184 0.5833333

5: 10 50 184 0.8333333

6: 10 36 184 0.6000000

  • DT[ i , variable := value] can be used to update rows matching i
  • set(DT, i= , j="variable", value= ) useful for programming

55 of 116

Exercises

  • Continuing previous exercise ...
  • Create a new column air_hours := air_time/60 as in previous slides
  • Create a new column arrival_hour := hour + air_hours
  • Create a new column arrival_day which is either "same" or "next", using ifelse(arrival_hour>max(hour), "next", "same")
  • Update arrival_hour := arrival_hour-24 for rows with arrival_hour>24
  • Use flights[, range(arrival_hour)] to confirm that values are between 0 and 24

56 of 116

Compute for each group with j and by

General form: DT[i, j, by]

i = on which rows, j = what to do, by = grouped by what?

  • think in terms of basic units — rows, columns and groups
  • data.table syntax provides placeholder for each of them

57 of 116

Summarize for each group

> flights[, .(num_flights=.N), by=origin]

origin num_flights

1: JFK 81483

2: LGA 84433

3: EWR 87400

> flights[, .(num_flights=.N, mean_arr_delay=mean(arr_delay)), by=origin]

origin num_flights mean_arr_delay

1: JFK 81483 7.731465

2: LGA 84433 6.601968

3: EWR 87400 10.026121

  • by=origin means to treat each unique value of origin as a group
  • for each group/origin, we compute the number of flights, and the mean arrival delay

58 of 116

Summarize for each group

> flights[, .(num_flights=.N, mean_arr_delay=mean(arr_delay)), by=.(origin, dest)]

origin dest num_flights mean_arr_delay

1: JFK LAX 10208 3.301332

2: LGA PBI 2307 4.939315

3: EWR LAX 4226 8.425461

4: JFK MIA 2750 1.630909

5: JFK SEA 1815 3.168595

---

217: LGA AVL 2 -31.500000

218: LGA GSP 3 6.000000

219: LGA SBN 2 -15.500000

220: EWR SBN 6 -2.333333

221: LGA DAL 15 -16.000000

  • several variables can be specified via .( ) in by, which means to use each unique combination of all variables as a group

59 of 116

New by/grouping variables

> flights[, .(
+   num_flights=.N,
+   mean_arr_delay=mean(arr_delay)
+ ), by=.(
+   origin=tolower(origin),
+   destination=dest
+ )]

origin destination num_flights mean_arr_delay

1: jfk LAX 10208 3.301332

2: lga PBI 2307 4.939315

---

220: ewr SBN 6 -2.333333

221: lga DAL 15 -16.000000

  • by variables can be renamed and computed upon
  • Exercise: for all flights with a positive or negative departure delay, what was the mean arrival delay? Hint: use sign(dep_delay) in by.

60 of 116

Using all three arguments together

General form: DT[i, j, by]

i = on which rows, j = what to do, by = grouped by what?

  • think in terms of basic units — rows, columns and groups
  • data.table syntax provides placeholder for each of them

61 of 116

Subset rows before summarizing

> flights[carrier=="AA", .(num_flights=.N, mean_arr_delay=mean(arr_delay)), by=.(origin, dest)]

origin dest num_flights mean_arr_delay

1: JFK LAX 3387 2.4186596

2: LGA PBI 245 -8.9510204

3: EWR LAX 62 4.8064516

4: JFK MIA 1876 1.0964819

5: JFK SEA 298 10.0033557

---

20: JFK EGE 85 35.4352941

21: JFK DFW 474 1.6751055

22: JFK SAN 299 17.1304348

23: JFK DCA 172 12.3139535

24: EWR PHX 121 -2.1404959

origin dest num_flights mean_arr_delay

  • i can be used at the same time as j and by
  • for the subset of flights with AA as carrier, for each origin and dest, get number of flights, and the mean arrival delay

62 of 116

DT[i] inside j with by

> flights[, data.table(air_time,distance)[1], by=.(origin,dest)]

origin dest air_time distance

1: JFK LAX 359 2475

2: LGA PBI 157 1035

3: EWR LAX 339 2454

4: JFK MIA 161 1089

5: JFK SEA 349 2422

---

217: LGA AVL 86 599

218: LGA GSP 90 610

219: LGA SBN 94 651

220: EWR SBN 97 637

221: LGA DAL 210 1381

  • For every group, defined by unique values of origin and dest, select the first row, and air_time,distance columns
  • Note: .( ) is synonym for list( ), so data.table( ) must be used here

63 of 116

.SD[i] inside j with by

> flights[, .SD[1], by=.(origin,dest)]

origin dest year month day dep_delay arr_delay carrier air_time distance hour

1: JFK LAX 2014 1 1 14 13 AA 359 2475 9

2: LGA PBI 2014 1 1 -8 -26 AA 157 1035 7

3: EWR LAX 2014 1 1 4 0 AA 339 2454 18

4: JFK MIA 2014 1 1 -1 -17 AA 161 1089 15

5: JFK SEA 2014 1 1 -2 -14 AA 349 2422 18

---

217: LGA AVL 2014 7 5 -5 -16 EV 86 599 9

218: LGA GSP 2014 8 23 24 34 EV 90 610 10

219: LGA SBN 2014 8 29 -3 -33 EV 94 651 18

220: EWR SBN 2014 10 26 20 19 EV 97 637 19

221: LGA DAL 2014 10 28 -5 -11 VX 210 1381 15

  • For every group, defined by unique values of origin and dest, select the first row, and all columns
  • Note: .SD is a data.table; it means "Subset of Data" for a group
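.SD pairs naturally with the .SDcols argument, which restricts the columns it contains. A sketch on a toy table (not the flights data):

```r
library(data.table)
DT <- data.table(g = c("a","a","b"), x = c(1, 2, 3), y = c(10, 20, 30))
DT[, .SD[1], by = g]    # first row of each group, all non-grouping columns
DT[, lapply(.SD, mean), by = g, .SDcols = c("x", "y")]  # per-group means of x, y
# group "a": x = 1.5, y = 15; group "b": x = 3, y = 30
```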

64 of 116

.SD[i] inside j with by

  • Exercise: for every origin and dest, get the last row. Hint: remember that .N means the number of rows in a table
  • Exercise: for every origin and dest, get the first and last rows.

65 of 116

Write to and read from CSV in chunks

General form: DT[i, j, by]

i = on which rows, j = what to do, by = grouped by what?

  • think in terms of basic units — rows, columns and groups
  • data.table syntax provides placeholder for each of them

66 of 116

fwrite - efficient CSV writing

> fwrite(flights, "flights14.csv")

  • Write entire flights data set to disk

67 of 116

DT[, j=fwrite( ), by=variable]

> flights[, fwrite(
+   data.table(air_time, distance),
+   paste0(month,".csv")
+ ), by=month]
Empty data.table (0 rows and 1 cols): month
> Sys.glob("*.csv")
[1] "10.csv" "1.csv" "2.csv" "3.csv" "4.csv"
[6] "5.csv" "6.csv" "7.csv" "8.csv" "9.csv"

  • Write one file per month, 1.csv, 2.csv, etc.
  • by=month means that in j, month is scalar, so paste0(month,".csv") can be used to create 1.csv, 2.csv, etc

68 of 116

fread CSV, combine with rbindlist

> month.dt.list <- list() #(1)
> for(month.csv in Sys.glob("*.csv")){
+   month <- sub(".csv", "", month.csv)
+   month.dt.list[[month.csv]] <- data.table(month, fread(month.csv)) #(2)
+ }
> (month.dt <- data.table::rbindlist(month.dt.list)) #(3)

month air_time distance

1: 10 44 96

2: 10 39 96

3: 10 37 184

4: 10 35 184

5: 10 50 184

---

253312: 9 356 2586

253313: 9 347 2586

253314: 9 356 2586

253315: 9 345 2586

253316: 9 592 4962

  • (1) initialize empty list, (2) in each iteration of for loop, assign a table to an element of that list, (3) rbindlist to create a single table with all data
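The same pattern can be written with lapply( ) plus the idcol argument of rbindlist( ), which turns list names into a column. A self-contained sketch that first writes two small CSVs to a temporary directory (toy data, not the flights files):

```r
library(data.table)
dir <- file.path(tempdir(), "months")
dir.create(dir, showWarnings = FALSE)
fwrite(data.table(air_time = c(44, 39)), file.path(dir, "10.csv"))
fwrite(data.table(air_time = c(60, 70)), file.path(dir, "9.csv"))
csv.files <- Sys.glob(file.path(dir, "*.csv"))
dt.list <- lapply(csv.files, fread)           # read each file
names(dt.list) <- sub(".csv$", "", basename(csv.files))
(month.dt <- rbindlist(dt.list, idcol = "month"))  # month column from list names
```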

69 of 116

Read several CSV using by

> data.table(month.csv=Sys.glob("*.csv"))[, fread(month.csv), by=month.csv]

month.csv air_time distance

1: 10.csv 44 96

2: 10.csv 39 96

3: 10.csv 37 184

4: 10.csv 35 184

5: 10.csv 50 184

---

253312: 9.csv 356 2586

253313: 9.csv 347 2586

253314: 9.csv 356 2586

253315: 9.csv 345 2586

253316: 9.csv 592 4962

  • by=month.csv means to compute j=fread(month.csv) for each unique value of month.csv (each CSV file)
  • Results are combined via rbindlist and returned as a single table

70 of 116

Exercise

  • Use flights[, fwrite(.SD), by=origin] to write one CSV file for every New York airport
  • Create CSV.dt, a data.table of CSV file names, using Sys.glob("*.csv")
  • Use CSV.dt[, fread(origin.csv), by=origin.csv] to read them back into R

71 of 116

Joins: when i is a data.table

General form: DT[i, j, by]

i = on which rows, j = what to do, by = grouped by what?

  • SQL: DT RIGHT JOIN i
  • at least one row returned for each row in i (with matching data from DT)

72 of 116

Join DT with table i: DT[ i ]

> (three.days <- rbind(
+   data.table(month=3, day=17, event="St.Patrick"),
+   data.table(month=9, day=26, event="Birthday"),
+   data.table(month=12, day=25, event="Christmas")))
   month day      event
1:     3  17 St.Patrick
2:     9  26   Birthday
3:    12  25  Christmas

> flights[three.days, on=c("month","day")]

year month day dep_delay arr_delay carrier origin dest air_time distance hour event

1: 2014 3 17 17 2 US LGA PHL 35 96 15 St.Patrick

2: 2014 3 17 0 1 US LGA PHL 41 96 6 St.Patrick

3: 2014 3 17 -7 -16 EV EWR ALB 30 143 8 St.Patrick

4: 2014 3 17 37 33 EV EWR ALB 31 143 22 St.Patrick

5: 2014 3 17 19 3 EV EWR ALB 30 143 13 St.Patrick

---

1723: 2014 9 26 -2 -15 UA JFK SFO 318 2586 7 Birthday

1724: 2014 9 26 -6 -22 UA JFK SFO 329 2586 11 Birthday

1725: 2014 9 26 8 -11 UA EWR HNL 589 4962 9 Birthday

1726: 2014 9 26 -3 -26 HA JFK HNL 595 4983 9 Birthday

1727: NA 12 25 NA NA <NA> <NA> <NA> NA NA NA Christmas

  • Returns every row in flights which matches one row in three.days,
  • with an NA row at the end for the day which did not match

73 of 116

Join two tables, mult argument

> flights[three.days, on=c("month","day"), mult="first"]

year month day dep_delay arr_delay carrier origin dest air_time distance hour event

1: 2014 3 17 17 2 US LGA PHL 35 96 15 St.Patrick

2: 2014 9 26 -2 -4 US LGA PHL 30 96 5 Birthday

3: NA 12 25 NA NA <NA> <NA> <NA> NA NA NA Christmas

> flights[three.days, on=c("month","day"), mult="last"]

year month day dep_delay arr_delay carrier origin dest air_time distance hour event

1: 2014 3 17 -8 13 HA JFK HNL 675 4983 9 St.Patrick

2: 2014 9 26 -3 -26 HA JFK HNL 595 4983 9 Birthday

3: NA 12 25 NA NA <NA> <NA> <NA> NA NA NA Christmas

  • mult="all" is default, meaning to return all matching rows
  • "first" or "last" are the other options, returning one row from flights for each row of three.days

74 of 116

Join two tables, no NA rows

> flights[three.days, on=c("month","day"), nomatch=0L]

year month day dep_delay arr_delay carrier origin dest air_time distance hour event

1: 2014 3 17 17 2 US LGA PHL 35 96 15 St.Patrick

2: 2014 3 17 0 1 US LGA PHL 41 96 6 St.Patrick

3: 2014 3 17 -7 -16 EV EWR ALB 30 143 8 St.Patrick

4: 2014 3 17 37 33 EV EWR ALB 31 143 22 St.Patrick

5: 2014 3 17 19 3 EV EWR ALB 30 143 13 St.Patrick

---

1722: 2014 9 26 -5 -37 UA JFK SFO 319 2586 10 Birthday

1723: 2014 9 26 -2 -15 UA JFK SFO 318 2586 7 Birthday

1724: 2014 9 26 -6 -22 UA JFK SFO 329 2586 11 Birthday

1725: 2014 9 26 8 -11 UA EWR HNL 589 4962 9 Birthday

1726: 2014 9 26 -3 -26 HA JFK HNL 595 4983 9 Birthday

  • nomatch=0L to return 0 rows if there is no match (instead of 1 NA row)

75 of 116

Join two tables using setkey

> setkey(three.days, month, day)

> setkey(flights, month, day)

> flights[three.days]

year month day dep_delay arr_delay carrier origin dest air_time distance hour event

1: 2014 3 17 17 2 US LGA PHL 35 96 15 St.Patrick

2: 2014 3 17 0 1 US LGA PHL 41 96 6 St.Patrick

3: 2014 3 17 -7 -16 EV EWR ALB 30 143 8 St.Patrick

4: 2014 3 17 37 33 EV EWR ALB 31 143 22 St.Patrick

5: 2014 3 17 19 3 EV EWR ALB 30 143 13 St.Patrick

---

1723: 2014 9 26 -2 -15 UA JFK SFO 318 2586 7 Birthday

1724: 2014 9 26 -6 -22 UA JFK SFO 329 2586 11 Birthday

1725: 2014 9 26 8 -11 UA EWR HNL 589 4962 9 Birthday

1726: 2014 9 26 -3 -26 HA JFK HNL 595 4983 9 Birthday

1727: NA 12 25 NA NA <NA> <NA> <NA> NA NA NA Christmas

  • setkey(DT, col1, col2) sorts DT by col1 and col2, and marks those columns as the key
  • joins are faster, and there is no need to specify on=, if the data already have a key
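A minimal sketch of these two bullets (the toy tables DT and lookup are invented for illustration): setkey sorts in place, and then DT[i] joins on the key without on=.

```r
library(data.table)
DT <- data.table(id = c("b", "a", "c"), val = 1:3)
setkey(DT, id)   # sorts DT by id in place, and marks id as the key
key(DT)          # "id"
lookup <- data.table(id = c("a", "c"))
DT[lookup]       # join on the key column, no on= needed
```

After setkey, the rows of DT are stored in id order (a, b, c), which is what makes the binary-search join fast.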

76 of 116

Join two tables other way

> (join.dt <- three.days[flights, on=c("month","day")][order(is.na(event))])

month day event year dep_delay arr_delay carrier origin dest air_time distance hour

1: 3 17 St.Patrick 2014 17 2 US LGA PHL 35 96 15

2: 3 17 St.Patrick 2014 0 1 US LGA PHL 41 96 6

3: 3 17 St.Patrick 2014 -7 -16 EV EWR ALB 30 143 8

4: 3 17 St.Patrick 2014 37 33 EV EWR ALB 31 143 22

5: 3 17 St.Patrick 2014 19 3 EV EWR ALB 30 143 13

---

253312: 10 31 <NA> 2014 -4 -21 UA JFK SFO 337 2586 17

253313: 10 31 <NA> 2014 -2 -37 UA JFK SFO 344 2586 18

253314: 10 31 <NA> 2014 -6 -38 UA JFK SFO 343 2586 9

253315: 10 31 <NA> 2014 77 23 UA EWR HNL 589 4962 10

253316: 10 31 <NA> 2014 -5 1 HA JFK HNL 633 4983 9

> table(join.dt$event, useNA="always")

Birthday St.Patrick <NA>

815 911 251590

  • joining the other way around always returns at least one row for each flight, with NA in the event column if there is no match
  • Remember: DT[ i ] means to return a row for each row in i

77 of 116

Exercise

  • Create an airports data.table as below

  • Join using flights[airports, on=.(origin=code)] which means to join the origin column in flights with the code column in airports

code name

1: JFK Kennedy

2: LGA Laguardia

3: EWR Newark

78 of 116

How to compute histogram of distances?

  • or mean for each distance bin (0, 1000, 2000, etc)

79 of 116

Define histogram bins

grid.point <- seq(0, 5000, by=1000)

grid.dt <- data.table(grid.point, distance=grid.point)

80 of 116

Rolling join, roll=Inf

> setkey(grid.dt, distance)

> setkey(flights, distance)

> (join.dt <- grid.dt[flights, roll=Inf])

grid.point distance year month day dep_delay arr_delay carrier origin dest air_time hour

1: 0 80 2014 1 30 9 17 US EWR PHL 46 15

2: 0 80 2014 1 30 -13 -25 US EWR PHL 34 17

---

253315: 4000 4983 2014 10 30 -6 1 HA JFK HNL 627 9

253316: 4000 4983 2014 10 31 -5 1 HA JFK HNL 633 9

  • each grid point matches the flights at or after it: the grid value is rolled forward (last observation carried forward)

81 of 116

Rolling join, roll=-Inf

> setkey(grid.dt, distance)

> setkey(flights, distance)

> (join.dt <- grid.dt[flights, roll=-Inf])

grid.point distance year month day dep_delay arr_delay carrier origin dest air_time hour

1: 1000 80 2014 1 30 9 17 US EWR PHL 46 15

2: 1000 80 2014 1 30 -13 -25 US EWR PHL 34 17

---

253315: 5000 4983 2014 10 30 -6 1 HA JFK HNL 627 9

253316: 5000 4983 2014 10 31 -5 1 HA JFK HNL 633 9

  • each grid point matches the flights at or before it: the grid value is rolled backward (next observation carried backward)

82 of 116

Rolling join, roll="nearest"

> setkey(grid.dt, distance)

> setkey(flights, distance)

> (join.dt <- grid.dt[flights, roll="nearest"])

grid.point distance year month day dep_delay arr_delay carrier origin dest air_time hour

1: 0 80 2014 1 30 9 17 US EWR PHL 46 15

2: 0 80 2014 1 30 -13 -25 US EWR PHL 34 17

---

253315: 5000 4983 2014 10 30 -6 1 HA JFK HNL 627 9

253316: 5000 4983 2014 10 31 -5 1 HA JFK HNL 633 9

  • each grid point matches the flights nearest to it in distance
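The three roll variants are easiest to see on a toy example (the tables grid and obs are invented; the only grid points are 0 and 10):

```r
library(data.table)
grid <- data.table(g = c(0, 10), x = c(0, 10), key = "x")
obs  <- data.table(x = c(3, 7),  key = "x")
grid[obs, roll = Inf]        # g = 0, 0  : grid value at or before each x
grid[obs, roll = -Inf]       # g = 10, 10: grid value at or after each x
grid[obs, roll = "nearest"]  # g = 0, 10 : closest grid value to each x
```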

83 of 116

Summarize after rolling join

> join.dt[, .(num_flights=.N, mean_minutes=mean(air_time)), by=grid.point]

grid.point num_flights mean_minutes

1: 0 55212 53.33366

2: 1000 136285 133.56513

3: 2000 48593 283.92688

4: 3000 12665 348.11062

5: 5000 561 618.78253

  • This gives the histogram (num_flights) and a per-bin summary (mean_minutes) in one step

84 of 116

Non-equi join on=.(DTcol < icol)

> (rect.dt <- data.table(
+   grid.point,
+   min_dist=grid.point-200,
+   max_dist=grid.point+400))

grid.point min_dist max_dist

1: 0 -200 400

2: 1000 800 1400

3: 2000 1800 2400

4: 3000 2800 3400

5: 4000 3800 4400

6: 5000 4800 5400

  • Join all rows which satisfy the given inequalities; grid.point is NA when a flight's distance falls in no bin (the bins here do not cover all distances)

> head(join.dt <- rect.dt[flights, .(

+ distance, air_time, grid.point

+ ), on=.(

+ min_dist<distance, max_dist>distance

+ )])

distance air_time grid.point

1: 2475 359 NA

2: 2475 363 NA

3: 2475 351 NA

4: 1035 157 1000

5: 2475 350 NA

6: 2454 339 NA

85 of 116

Compute for every row in table i

General form: DT[i, j, by=.EACHI]

  • i: on which rows
  • j: what to do
  • by=.EACHI: grouped by each row of i
  • think in terms of basic units: rows, columns and groups
  • data.table syntax provides a placeholder for each of them

86 of 116

Summarize in join via by=.EACHI

> flights[three.days, .(

+ mean_air_time=mean(air_time),

+ num_flights=.N

+ ), by=.EACHI, on=.(month,day)]

month day mean_air_time num_flights

1: 3 17 168.9890 911

2: 9 26 151.4147 815

3: 12 25 NA 0

  • DT[ i, j, by=.EACHI, on= ] means: for each row in i, look up the rows in DT which match using the on= columns, and compute/return j
  • Not the same as by=.(month,day) which would do the computation for every unique combination of month and day (much less efficient if we only want the result for a few combinations)
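A toy contrast (the tables DT and i are invented): by=.EACHI computes j only for the groups that appear in i, instead of for every group in DT.

```r
library(data.table)
DT <- data.table(grp = c("a", "a", "b", "b", "c"), v = 1:5)
i  <- data.table(grp = c("a", "c"))
# one output row per row of i; group "b" is never touched
DT[i, .(n = .N, total = sum(v)), on = "grp", by = .EACHI]
# grp "a": n=2, total=3; grp "c": n=1, total=5
```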

87 of 116

Data reshaping: wide to long (melt)

Wide data: few rows, many columns
Long data: many rows, few columns

melt( ): wide to long
dcast( ): long to wide

similar to stats::reshape(direction="long"), tidyr::pivot_longer( )
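A minimal melt sketch (toy data invented; the column names valueA/valueB echo a later slide):

```r
library(data.table)
wide <- data.table(id = 1:2, valueA = c(10, 20), valueB = c(1, 2))
# each measure column becomes a block of rows in the long output
melt(wide, id.vars = "id", measure.vars = c("valueA", "valueB"))
# 4 rows; columns: id, variable (valueA or valueB), value
```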

88 of 116

melt: wide to long data reshape

melt examples adapted from:

Hocking TD. Wide-to-tall data reshaping using regular expressions and the nc package. R Journal (2021), doi:10.32614/RJ-2021-029

  • how to make the plot below of iris data?
  • iris has columns Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
  • would need facet_grid(part ~ dim) + geom_histogram(aes(cm, fill=Species)),
  • where dim column has values Length or Width, part values are Petal or Sepal

89 of 116

melt: wide to long data reshape

melt(DT, id.vars=c("Species","flower"))

90 of 116

melt: wide to long data reshape

melt(DT, measure.vars=c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width"))

91 of 116

melt: wide to long data reshape

melt(DT, measure.vars=patterns(".*[.].*"))

92 of 116

melt: wide to long data reshape

melt(DT, measure.vars=patterns("regular expression"))

  • Exercise: melt the arr_delay and dep_delay columns of flights data, by specifying them as measure.vars
  • Additionally specify id.vars to limit which columns are copied to output

melt(DT, measure.vars=c("valueA","valueB"))

melt(DT, id.vars=c("ID1","ID2"))

93 of 116

melt with measure( ), new on GitHub

melt(DT, measure.vars=measure(part,dim,pattern="(.*)[.](.*)"))

  • Capture groups ( ) in pattern used to define output variable columns

94 of 116

melt with measure( ), new on GitHub

melt(DT, measure.vars=measure(part,dim,sep="."))

  • Melts the input columns that split into the most groups (2 in this case)
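A runnable sketch of this slide (measure( ) requires a recent data.table, per the slide "new on GitHub" at the time; two iris rows are enough to see the shape):

```r
library(data.table)
DT <- data.table(iris)[1:2]
# sep="." splits names like Sepal.Length into 2 groups (part, dim);
# Species does not split, so it stays as an id variable
melt(DT, measure.vars = measure(part, dim, sep = "."))
# 8 rows; columns: Species, part (Sepal/Petal), dim (Length/Width), value
```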

95 of 116

melt with measure( ), new on GitHub

  • Exercise: make the plot below of iris data
  • First convert iris to data.table, then reshape using melt(DT, measure=measure(____________), value.name="cm")
  • library(ggplot2); ggplot( ) + facet_grid(part ~ dim) + geom_histogram(aes(cm, fill=Species), data=output_of_melt)

96 of 116

melt with measure( ), new on GitHub

melt(DT, measure.vars=measure(part, value.name, sep="."))

97 of 116

melt with measure( ), new on GitHub

melt(DT, measure.vars=measure(value.name, dim, sep="."))

98 of 116

melt with measure( ), new on GitHub

  • Exercise: to show that sepals are larger than petals, make the plot below of iris data. First convert iris to data.table.
  • Then use melt with measure( ) to reshape
  • library(ggplot2); ggplot( ) + facet_grid(. ~ dim) + geom_point(aes(Petal, Sepal, color=Species), data=output_of_melt) + geom_abline(slope=1, intercept=0)

99 of 116

melt exercise for flights data

  • Exercise: make the plot below of flights data
  • Use melt with either patterns( ) or measure( ) to reshape
  • ggplot( ) + geom_histogram(aes(delay_minutes, fill=variable), data=output_of_melt)

100 of 116

Data reshaping: long to wide (dcast)

Wide data: few rows, many columns
Long data: many rows, few columns

melt( ): wide to long
dcast( ): long to wide

similar to stats::reshape(direction="wide"), tidyr::pivot_wider( )

101 of 116

dcast: long to wide reshape

> iris.dt=data.table(iris)[, flower := .I] # .I means row number in data.table

> iris.long=melt(iris.dt,measure=measure(part,dim,sep="."))

> head(iris.long)

Species flower part dim value

1: setosa 1 Sepal Length 5.1

2: setosa 2 Sepal Length 4.9

3: setosa 3 Sepal Length 4.7

4: setosa 4 Sepal Length 4.6

5: setosa 5 Sepal Length 5.0

6: setosa 6 Sepal Length 5.4

# Formula indicates where to put different values of variables: rows ~ cols

> iris.wide=dcast(iris.long, flower ~ part + dim, sep=".")

> head(iris.wide)

flower Petal.Length Petal.Width Sepal.Length Sepal.Width

1: 1 1.4 0.2 5.1 3.5

2: 2 1.4 0.2 4.9 3.0

3: 3 1.3 0.2 4.7 3.2

4: 4 1.5 0.2 4.6 3.1

5: 5 1.4 0.2 5.0 3.6

6: 6 1.7 0.4 5.4 3.9

  • dcast can get the original wide iris data back from the long version

102 of 116

dcast: long to wide reshape

> iris.long=melt(iris.dt,measure=measure(part,dim,sep="."),value.name="cm")

> iris.long

Species flower part dim cm

1: setosa 1 Sepal Length 5.1

2: setosa 2 Sepal Length 4.9

3: setosa 3 Sepal Length 4.7

4: setosa 4 Sepal Length 4.6

5: setosa 5 Sepal Length 5.0

---

596: virginica 146 Petal Width 2.3

597: virginica 147 Petal Width 1.9

598: virginica 148 Petal Width 2.0

599: virginica 149 Petal Width 2.3

600: virginica 150 Petal Width 1.8

> dcast(iris.long, Species ~ dim, fun.aggregate=mean, value.var="cm")

Species Length Width

1: setosa 3.234 1.837

2: versicolor 5.098 2.048

3: virginica 6.070 2.500

  • dcast can compute a summary/aggregation function

103 of 116

dcast: long to wide reshape

> iris.long=melt(iris.dt,measure=measure(part,dim,sep="."),value.name="cm")

> iris.long

Species flower part dim cm

1: setosa 1 Sepal Length 5.1

2: setosa 2 Sepal Length 4.9

3: setosa 3 Sepal Length 4.7

4: setosa 4 Sepal Length 4.6

5: setosa 5 Sepal Length 5.0

---

596: virginica 146 Petal Width 2.3

597: virginica 147 Petal Width 1.9

598: virginica 148 Petal Width 2.0

599: virginica 149 Petal Width 2.3

600: virginica 150 Petal Width 1.8

> dcast(iris.long, Species ~ dim, fun.aggregate=list(min, mean, max), value.var="cm")

Species cm_min_Length cm_min_Width cm_mean_Length cm_mean_Width cm_max_Length cm_max_Width

1: setosa 1.0 0.1 3.234 1.837 5.8 4.4

2: versicolor 3.0 1.0 5.098 2.048 7.0 3.4

3: virginica 4.5 1.4 6.070 2.500 7.9 3.8

  • dcast can compute several summary/aggregation functions

104 of 116

dcast: long to wide reshape

> iris.long=melt(iris.dt,measure=measure(part,value.name,sep="."))

> iris.long

Species flower part Length Width

1: setosa 1 Sepal 5.1 3.5

2: setosa 2 Sepal 4.9 3.0

3: setosa 3 Sepal 4.7 3.2

4: setosa 4 Sepal 4.6 3.1

5: setosa 5 Sepal 5.0 3.6

---

296: virginica 146 Petal 5.2 2.3

297: virginica 147 Petal 5.0 1.9

298: virginica 148 Petal 5.2 2.0

299: virginica 149 Petal 5.4 2.3

300: virginica 150 Petal 5.1 1.8

> dcast(iris.long, Species ~ part, fun.aggregate=mean, value.var=c("Length","Width"))

Species Length_Petal Length_Sepal Width_Petal Width_Sepal

1: setosa 1.462 5.006 0.246 3.428

2: versicolor 4.260 5.936 1.326 2.770

3: virginica 5.552 6.588 2.026 2.974

  • dcast can compute for several value columns

105 of 116

dcast: long to wide reshape

> iris.long=melt(iris.dt,measure=measure(part,value.name,sep="."))

> iris.long

Species flower part Length Width

1: setosa 1 Sepal 5.1 3.5

2: setosa 2 Sepal 4.9 3.0

3: setosa 3 Sepal 4.7 3.2

4: setosa 4 Sepal 4.6 3.1

5: setosa 5 Sepal 5.0 3.6

---

296: virginica 146 Petal 5.2 2.3

297: virginica 147 Petal 5.0 1.9

298: virginica 148 Petal 5.2 2.0

299: virginica 149 Petal 5.4 2.3

300: virginica 150 Petal 5.1 1.8

~ . means no reshaping variables in columns

> dcast(iris.long, Species + part ~ ., fun.aggregate=list(min, max), value.var=c("Length","Width"))

Species part Length_min Width_min Length_max Width_max

1: setosa Petal 1.0 0.1 1.9 0.6

2: setosa Sepal 4.3 2.3 5.8 4.4

3: versicolor Petal 3.0 1.0 5.1 1.8

4: versicolor Sepal 4.9 2.0 7.0 3.4

5: virginica Petal 4.5 1.4 6.9 2.5

6: virginica Sepal 4.9 2.2 7.9 3.8

  • dcast can compute several functions for several value columns

106 of 116

Exercise

  • Make a table as above, with one column for every origin, and one row for every destination (entries are flight counts). Hint: use dest ~ origin to create a different dest on each row, and a different origin on each column.
  • Specify value.var=c("arr_delay","dep_delay") with fun.aggregate=mean to compare the average delays. Use this info to pick the best airport for any given destination.

dest EWR JFK LGA

1: ABQ 0 278 0

2: ACK 0 277 0

3: AGS 0 0 3

4: ALB 169 0 0

5: ANC 13 0 0

...

107 of 116

import data.table in your package

  • data.table is already imported by 1400+ CRAN packages
  • "Imports: data.table" in DESCRIPTION
  • "import(data.table)" in NAMESPACE
  • . <- j_variable <- NULL in first line of R function to avoid CRAN NOTE about no visible function/variable
  • data.table::setDTthreads(2) in examples/tests to avoid CRAN NOTE about taking too much time (data.table uses 1/2 of all CPUs by default)
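As a minimal sketch of the DESCRIPTION and NAMESPACE bullets above (the package name mypkg is hypothetical):

```
## in DESCRIPTION:
Package: mypkg
Imports: data.table

## in NAMESPACE:
import(data.table)
```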

108 of 116

Avoiding CRAN NOTE 1

my_fun <- function(DT) DT[, .(m=mean(x)), by=y]

* checking R code for possible problems ... NOTE

my_fun: no visible global function definition for ‘.’

my_fun: no visible binding for global variable ‘x’

my_fun: no visible binding for global variable ‘y’

...

Undefined global functions or variables:

. x y

  • To avoid the CRAN check NOTE above, define the undefined symbols (., x, y) as NULL in the first line of the function:

my_fun <- function(DT){

x <- y <- . <- NULL

DT[, .(m=mean(x)), by=y]

}

109 of 116

Avoiding CRAN NOTE 2

Flavor: r-devel-linux-x86_64-debian-gcc

Check: examples, Result: NOTE

Examples with CPU time > 2.5 times elapsed time

user system elapsed ratio

aum_line_search 12.349 0.322 1.935 6.548

aum_line_search_grid 10.033 0.308 1.781 5.806

aum_diffs_penalty 4.730 0.169 1.635 2.996

  • CRAN requires packages to use max 2 CPUs during checks
  • data.table default uses 1/2 of all CPUs on machine
  • To avoid CRAN check NOTE as above, use R code like below in examples and tests:

data.table::setDTthreads(2)

110 of 116

4/4

Contributing to data.table

111 of 116

Community blog and survey

  • data.table mascot is a sea lion, which barks "R R R"
  • data.table community has a new blog, The Raft, https://rdatatable-community.github.io/The-Raft/ ("sea lions often float together on the ocean's surface in groups called 'rafts.'" - Marine Mammal Center)
  • Please fill out our community survey, https://tinyurl.com/datatable-survey (~10 minutes, tell us how you use data.table! Help us choose what to prioritize in the future)

112 of 116

GitHub repository

  • data.table has an active issue/Pull Request (PR) tracker https://github.com/Rdatatable/data.table/
  • 1000+ open issues, 100+ open PRs
  • if you have any time/interest, we could use your help!
  • easy first contribution: try reproducing an issue (very helpful to know whether an issue is reproducible)
  • very inclusive community: after you submit your first PR, you will be invited to join the GitHub group!
  • now is a very exciting time to get involved, as we are currently creating a formal written document describing de-centralized project governance, issue #5676

113 of 116

Translation Awards

  • In 2023-2025, the National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table
  • 20 translation awards, US$500 each, in order to make documentation and messages more accessible; ideas:
  • Translate errors/warnings/messages (potools package can help)
  • Translate the most important vignettes (intro, import, reshape)
  • Translate other documentation (cheat sheets, slides, etc.)
  • Priority languages: Portuguese, Spanish, Chinese, French, Russian, Arabic, Hindi
  • Call for proposals: https://rdatatable-community.github.io/The-Raft/

114 of 116

Travel awards

  • In 2023-2025, the National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table
  • Eight travel awards, US$2700 each
  • Candidates should give a talk about data.table at a conference with a relevant audience (potential data.table users or contributors)
  • Call for proposals coming soon on https://rdatatable-community.github.io/The-Raft/

115 of 116

Summary of data.table

  • concise, consistent syntax
  • fast, memory efficient
  • no dependencies (easy to install)
  • no breaking changes (easy to upgrade)
  • inclusive user/developer community with opportunities to contribute:
  • translation awards, US$500 each
  • travel awards, US$2700 each

116 of 116

Thank you! Questions?

Toby Dylan Hocking

Assistant Professor

Northern Arizona University

toby.hocking@r-project.org

Funded by NSF POSE program, project #2303612.

Slides adapted from Arun Srinivasan, and datatable-intro vignette - thanks!

Please use/adapt these slides as you like, as long as you give me some credit, as I have done for Arun.