Using and contributing to the data.table package for efficient big data analysis
Download these slides: http://ml.nau.edu/latinr.pdf
Toby Dylan Hocking
Assistant Professor
Northern Arizona University
toby.hocking@r-project.org
Funded by NSF POSE program, project #2303612.
Slides adapted from Arun Srinivasan, and datatable-intro vignette - thanks!
Who am I?
talk overview
Part 1/4: What is data.table?
data.frame in R
  id val
1  b   4
2  a   2
3  a   3
4  c   1
5  c   5
6  b   6
2 column data.frame, named DF
data.table
   id val
1:  b   4
2:  a   2
3:  a   3
4:  c   1
5:  c   5
6:  b   6
2 column data.table, named DT
Comparing data.table and tidyverse
https://teachdatascience.com/tidyverse/
Part 2/4: Why is data.table so popular/powerful? (efficiency)
two kinds of data.table efficiency: concise code (programmer efficiency) and fast computation (computational efficiency)
data table R code syntax
General form: DT[i, j, by]
i: on which rows?
j: what to do?
by: grouped by what?
SQL analogy: i ~ WHERE, j ~ SELECT | UPDATE, by ~ GROUP BY
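For example, a minimal sketch using the toy DT from the next slide (this particular grouping is my illustration, not from the slide):
library(data.table)
DT <- data.table(id=c("b","a","a","c","c","b"), val=c(4,2,3,1,5,6))
DT[val > 1, sum(val), by=id] # i: WHERE val > 1; j: SELECT sum(val); by: GROUP BY id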
data table R code syntax
   id val
1:  b   4
2:  a   2
3:  a   3
4:  c   1
5:  c   5
6:  b   6
2 column data.table, named DT
data.frame(DF) vs data.table(DT)
sum(DF[DF$code != "abd", "valA"])
DT[code != "abd", sum(valA)]
data.frame(DF) vs data.table(DT)
DF[DF$code == "abd", "valA"] <- NA
DT[code == "abd", valA := NA]
data.frame(DF) vs data.table(DT)
aggregate(cbind(valA, valB) ~ id,
  DF[DF$code != "abd", ],
  sum)
DT[code != "abd",
  .(sum(valA), sum(valB)),
  by = id]
data.frame(DF) vs data.table(DT)
DF1 <- merge(DF1, DF2, all.x=TRUE)
DF1[, "valA"] <- ifelse(is.na(DF1$val),
  DF1$valA, DF1$val)
DF1$val <- NULL
DT1[DT2, valA := val, on = .(id, code)]
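To try the first three comparisons above, here is a minimal sketch of example tables; only the column names (id, code, valA, valB, val) come from the slides, the values are assumptions:
library(data.table)
DF <- data.frame(id=rep(c("a","b","c"), 2), code=c("abd","xyz"),
                 valA=1:6, valB=(1:6)/2, val=6:1) # hypothetical values
DT <- as.data.table(DF)
sum(DF[DF$code != "abd", "valA"]) # base R
DT[code != "abd", sum(valA)]      # data.table equivalent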
two kinds of data.table efficiency: concise code (programmer efficiency) and fast computation (computational efficiency)
data.table::fread is an extremely efficient CSV file reader
Source code: https://tdhock.github.io/blog/2023/dt-atime-figures/
data.table computes summaries 100x faster than others
In machine learning, K-fold cross-validation is used to estimate the loss of a hyper-parameter, such as the number of epochs of training of a neural network.
data.table can efficiently compute the average loss over the K=10 folds, for each of the N epochs.
[Diagram: a table with columns epoch (1 to N), loss, and fold (1 to 10) is summarized into a table with one row per epoch and columns Mean, SD, Length.]
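A sketch of that computation with simulated losses (the loss values and simulation are my assumptions; the Mean/SD/Length output columns follow the diagram):
library(data.table)
set.seed(1)
loss.dt <- CJ(epoch=1:100, fold=1:10)                   # one row per (epoch, fold)
loss.dt[, loss := exp(-epoch/20) + rnorm(.N, sd=0.01)]  # hypothetical validation loss
loss.dt[, .(Mean=mean(loss), SD=sd(loss), Length=.N), by=epoch]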
data.table::fwrite is an extremely efficient CSV file writer
Source code: https://tdhock.github.io/blog/2023/dt-atime-figures/
most underrated package
powerful
great sadness
3/4
Using data.table for efficient big data analysis
Install data.table, get PDF docs
> install.packages("remotes")
> remotes::install_github("Rdatatable/data.table")
(needs a compiler: Xcode for Mac, or Rtools for Windows)
OR, if you do not have a compiler: install from CRAN, then update:
> install.packages("data.table")
> data.table::update_dev_pkg()
Create data.table from R objects
> data.table(
+ ID = c("b","b","b","a","a","c"),
+ a = 1:2,
+ b = 7,
+ c = 13:18
+ )
ID a b c
1: b 1 7 13
2: b 2 7 14
3: b 1 7 15
4: a 2 7 16
5: a 1 7 17
6: c 2 7 18
data.table vs data.frame: column names
> head(data.table("10^8 m^3"=Nile, year=1871:1970))
10^8 m^3 year
1: 1120 1871
2: 1160 1872
3: 963 1873
4: 1210 1874
5: 1160 1875
6: 1160 1876
> head(data.frame("10^8 m^3"=Nile, year=1871:1970))
X10.8.m.3 year
1 1120 1871
2 1160 1872
3 963 1873
4 1210 1874
5 1160 1875
6 1160 1876
data.table vs data.frame: list columns
> data.table(L=list("scalar", c("vec","tor")),x=1:2)
L x
1: scalar 1
2: vec,tor 2
> data.frame(L=I(list("scalar", c("vec","tor"))),x=1:2)
L x
1 scalar 1
2 vec, tor 2
Exercises
Read CSV from file
> if(!file.exists("flights14.csv"))download.file(
+ "http://ml.nau.edu/flights14.csv",
+ "flights14.csv")
> (flights <- data.table::fread("flights14.csv"))
year month day dep_delay arr_delay carrier origin dest air_time distance hour
1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
---
253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14
253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8
253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11
253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11
253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8
Read CSV from shell command
> data.table::fread("egrep 'LAX|year' flights14.csv")
year month day dep_delay arr_delay carrier origin dest air_time distance hour
1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
4: 2014 1 1 2 1 AA JFK LAX 350 2475 13
5: 2014 1 1 4 0 AA EWR LAX 339 2454 18
---
14430: 2014 10 31 -6 -38 UA JFK LAX 323 2475 11
14431: 2014 10 31 9 -21 UA EWR LAX 323 2454 13
14432: 2014 10 31 40 14 UA EWR LAX 320 2454 18
14433: 2014 10 31 3 -32 UA EWR LAX 320 2454 20
14434: 2014 10 31 10 -27 UA EWR LAX 326 2454 10
Exercise: Read CSV from URL
flights <- data.table::fread("http://ml.nau.edu/flights14.csv")
Subset rows using i
General form: DT[i, j, by]
i: on which rows?
j: what to do?
by: grouped by what?
Logical row subset
> flights[origin == "JFK" & month == 6]
year month day dep_delay arr_delay carrier origin dest air_time distance hour
1: 2014 6 1 -9 -5 AA JFK LAX 324 2475 8
2: 2014 6 1 -10 -13 AA JFK LAX 329 2475 12
3: 2014 6 1 18 -1 AA JFK LAX 326 2475 7
4: 2014 6 1 -6 -16 AA JFK LAX 320 2475 10
5: 2014 6 1 -4 -45 AA JFK LAX 326 2475 18
---
8418: 2014 6 30 -3 -6 MQ JFK PIT 62 340 14
8419: 2014 6 30 -5 -32 MQ JFK RDU 65 427 14
8420: 2014 6 30 -3 -16 MQ JFK DCA 39 213 17
8421: 2014 6 30 -2 7 MQ JFK DCA 52 213 7
8422: 2014 6 30 -7 -18 MQ JFK RDU 67 427 8
Integer row subset
> flights[1:2]
year month day dep_delay arr_delay carrier origin dest air_time distance hour
1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
> flights[-(1:2)]
year month day dep_delay arr_delay carrier origin dest air_time distance hour
1: 2014 1 1 2 9 AA JFK LAX 351 2475 19
2: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
3: 2014 1 1 2 1 AA JFK LAX 350 2475 13
4: 2014 1 1 4 0 AA EWR LAX 339 2454 18
5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
---
253310: 2014 10 31 1 -30 UA LGA IAH 201 1416 14
253311: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8
253312: 2014 10 31 -8 16 MQ LGA RDU 83 431 11
253313: 2014 10 31 -4 15 MQ LGA DTW 75 502 11
253314: 2014 10 31 -5 1 MQ LGA SDF 110 659 8
Sort rows
> flights[order(origin, -dest)]
year month day dep_delay arr_delay carrier origin dest air_time distance hour
1: 2014 1 5 6 49 EV EWR XNA 195 1131 8
2: 2014 1 6 7 13 EV EWR XNA 190 1131 8
3: 2014 1 7 -6 -13 EV EWR XNA 179 1131 8
4: 2014 1 8 -7 -12 EV EWR XNA 184 1131 8
5: 2014 1 9 16 7 EV EWR XNA 181 1131 8
---
253312: 2014 10 31 -1 -22 WN LGA ATL 112 762 9
253313: 2014 10 31 -5 -23 WN LGA ATL 112 762 20
253314: 2014 4 6 -6 -1 EV LGA AGS 110 678 10
253315: 2014 4 7 2 1 EV LGA AGS 111 678 11
253316: 2014 4 11 0 -19 EV LGA AGS 102 678 10
Exercises
Compute columns in j
General form: DT[i, j, by]
i: on which rows?
j: what to do?
by: grouped by what?
Select one column
> str(flights[, "arr_delay" ])
Classes ‘data.table’ and 'data.frame': 253316 obs. of 1 variable:
$ arr_delay: int 13 13 9 -26 1 0 -18 -14 -17 -14 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(flights[, arr_delay]) #like DT[["arr_delay"]] or DT$arr_delay
int [1:253316] 13 13 9 -26 1 0 -18 -14 -17 -14 ...
Select two columns with literal
> flights[, data.table(arr_delay, dep_delay)]
arr_delay dep_delay
1: 13 14
2: 13 -3
---
253315: 15 -4
253316: 1 -5
> flights[, c("arr_delay", "dep_delay")]
arr_delay dep_delay
1: 13 14
2: 13 -3
---
253315: 15 -4
253316: 1 -5
Select two columns with variable
> (DT=data.table(x=1:2, y=3:4, select_cols=c("foo","bar")))
   x y select_cols
1: 1 3         foo
2: 2 4         bar
> select_cols=c("x","y")
> DT[, select_cols] # the column named select_cols, not columns x,y!
[1] "foo" "bar"
> DT[, ..select_cols]
x y
1: 1 3
2: 2 4
> DT[, select_cols, with=FALSE]
x y
1: 1 3
2: 2 4
Select all columns except two
> delay_cols = c("arr_delay", "dep_delay")
> head(flights[, !..delay_cols])
year month day carrier origin dest air_time distance hour
1: 2014 1 1 AA JFK LAX 359 2475 9
2: 2014 1 1 AA JFK LAX 363 2475 11
3: 2014 1 1 AA JFK LAX 351 2475 19
4: 2014 1 1 AA LGA PBI 157 1035 7
5: 2014 1 1 AA JFK LAX 350 2475 13
6: 2014 1 1 AA EWR LAX 339 2454 18
> head(flights[, -..delay_cols])
year month day carrier origin dest air_time distance hour
1: 2014 1 1 AA JFK LAX 359 2475 9
2: 2014 1 1 AA JFK LAX 363 2475 11
3: 2014 1 1 AA JFK LAX 351 2475 19
4: 2014 1 1 AA LGA PBI 157 1035 7
5: 2014 1 1 AA JFK LAX 350 2475 13
6: 2014 1 1 AA EWR LAX 339 2454 18
Select range of columns
> head(flights[, year:day])
year month day
1: 2014 1 1
2: 2014 1 1
3: 2014 1 1
4: 2014 1 1
5: 2014 1 1
6: 2014 1 1
> head(flights[, !(year:day)])
dep_delay arr_delay carrier origin dest air_time distance hour
1: 14 13 AA JFK LAX 359 2475 9
2: -3 13 AA JFK LAX 363 2475 11
3: 2 9 AA JFK LAX 351 2475 19
4: -8 -26 AA LGA PBI 157 1035 7
5: 2 1 AA JFK LAX 350 2475 13
6: 4 0 AA EWR LAX 339 2454 18
Rename and compute columns
> flights[, data.table(destination=dest, air_hours=air_time/60)]
destination air_hours
1: LAX 5.983333
2: LAX 6.050000
3: LAX 5.850000
4: PBI 2.616667
5: LAX 5.833333
---
253312: IAH 3.350000
253313: IAH 3.150000
253314: RDU 1.383333
253315: DTW 1.250000
253316: SDF 1.833333
Summarize all rows
> flights[, .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
m_arr m_dep
1: 8.146702 12.46526
> flights[, .(range_arr = range(arr_delay), stat=c("min","max"))]
range_arr stat
1: -112 min
2: 1494 max
Create/delete columns in j
General form: DT[i, j, by]
i: on which rows?
j: what to do?
by: grouped by what?
Comparing column assignment in base R and data.table
| column name | base R code                         | data.table code                              |
| Literal     | DF$`next` <- DF$current + 1         | DT[, `next` := current + 1]                  |
| Variable    | DF[[my_var_name]] <- DF$current + 1 | set(DT, j=my_var_name, value=DT$current + 1) |
where my_var_name <- "next" (backticks needed because next is a reserved word in R)
Create column := value
> month.dt <- flights[, .(month, air_time, distance)]
> head(month.dt)
month air_time distance
1: 10 44 96
2: 10 39 96
3: 10 37 184
4: 10 35 184
5: 10 50 184
6: 10 36 184
> month.dt[, air_hours := air_time/60]
> head(month.dt)
month air_time distance air_hours
1: 10 44 96 0.7333333
2: 10 39 96 0.6500000
3: 10 37 184 0.6166667
4: 10 35 184 0.5833333
5: 10 50 184 0.8333333
6: 10 36 184 0.6000000
Delete column := NULL
> head(month.dt)
month air_time distance air_hours
1: 10 44 96 0.0000000
2: 10 39 96 0.0000000
3: 10 37 184 0.6166667
4: 10 35 184 0.5833333
5: 10 50 184 0.8333333
6: 10 36 184 0.6000000
> month.dt[, air_time := NULL]
> head(month.dt)
month distance air_hours
1: 10 96 0.0000000
2: 10 96 0.0000000
3: 10 184 0.6166667
4: 10 184 0.5833333
5: 10 184 0.8333333
6: 10 184 0.6000000
set( ) to create/delete column
> my_var_name = "new"
> set(flights, j=my_var_name, value="HI!")
> flights
year month day dep_delay arr_delay carrier origin dest air_time distance hour new
1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 HI!
2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 HI!
---
253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11 HI!
253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8 HI!
> set(flights, j=my_var_name, value=NULL)
> flights
year month day dep_delay arr_delay carrier origin dest air_time distance hour
1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
---
253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11
253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8
Exercises
Compute on a subset with i and j
General form: DT[i, j, by]
i: on which rows?
j: what to do?
by: grouped by what?
Summarize a subset of rows
> flights[origin == "JFK" & month == 6L, length(dest)]
[1] 8422
> flights[origin == "JFK" & month == 6L, .N]
[1] 8422
> flights[origin == "JFK" & month == 6L, .(m_arr=mean(arr_delay), num_flights=.N)]
m_arr num_flights
1: 5.839349 8422
Update column :=
> head(month.dt)
month air_time distance air_hours
1: 10 44 96 0.7333333
2: 10 39 96 0.6500000
3: 10 37 184 0.6166667
4: 10 35 184 0.5833333
5: 10 50 184 0.8333333
6: 10 36 184 0.6000000
> month.dt[distance<100, air_hours := 0]
> head(month.dt)
month air_time distance air_hours
1: 10 44 96 0.0000000
2: 10 39 96 0.0000000
3: 10 37 184 0.6166667
4: 10 35 184 0.5833333
5: 10 50 184 0.8333333
6: 10 36 184 0.6000000
Exercises
Compute for each group with j and by
General form: DT[i, j, by]
i: on which rows?
j: what to do?
by: grouped by what?
Summarize for each group
> flights[, .(num_flights=.N), by=origin]
origin num_flights
1: JFK 81483
2: LGA 84433
3: EWR 87400
> flights[, .(num_flights=.N, mean_arr_delay=mean(arr_delay)), by=origin]
origin num_flights mean_arr_delay
1: JFK 81483 7.731465
2: LGA 84433 6.601968
3: EWR 87400 10.026121
Summarize for each group
> flights[, .(num_flights=.N, mean_arr_delay=mean(arr_delay)), by=.(origin, dest)]
origin dest num_flights mean_arr_delay
1: JFK LAX 10208 3.301332
2: LGA PBI 2307 4.939315
3: EWR LAX 4226 8.425461
4: JFK MIA 2750 1.630909
5: JFK SEA 1815 3.168595
---
217: LGA AVL 2 -31.500000
218: LGA GSP 3 6.000000
219: LGA SBN 2 -15.500000
220: EWR SBN 6 -2.333333
221: LGA DAL 15 -16.000000
New by/grouping variables
> flights[, .(num_flights=.N, mean_arr_delay=mean(arr_delay)),
+   by=.(origin=tolower(origin), destination=dest)]
     origin destination num_flights mean_arr_delay
  1:    jfk         LAX       10208       3.301332
  2:    lga         PBI        2307       4.939315
 ---
220:    ewr         SBN           6      -2.333333
221:    lga         DAL          15     -16.000000
Using all three arguments together
General form: DT[i, j, by]
i: on which rows?
j: what to do?
by: grouped by what?
Subset rows before summarizing
> flights[carrier=="AA", .(num_flights=.N, mean_arr_delay=mean(arr_delay)), by=.(origin, dest)]
origin dest num_flights mean_arr_delay
1: JFK LAX 3387 2.4186596
2: LGA PBI 245 -8.9510204
3: EWR LAX 62 4.8064516
4: JFK MIA 1876 1.0964819
5: JFK SEA 298 10.0033557
---
20: JFK EGE 85 35.4352941
21: JFK DFW 474 1.6751055
22: JFK SAN 299 17.1304348
23: JFK DCA 172 12.3139535
24: EWR PHX 121 -2.1404959
DT[i] inside j with by
> flights[, data.table(air_time,distance)[1], by=.(origin,dest)]
origin dest air_time distance
1: JFK LAX 359 2475
2: LGA PBI 157 1035
3: EWR LAX 339 2454
4: JFK MIA 161 1089
5: JFK SEA 349 2422
---
217: LGA AVL 86 599
218: LGA GSP 90 610
219: LGA SBN 94 651
220: EWR SBN 97 637
221: LGA DAL 210 1381
.SD[i] inside j with by (.SD = Subset of Data: all columns of the current group, except the by columns)
> flights[, .SD[1], by=.(origin,dest)]
origin dest year month day dep_delay arr_delay carrier air_time distance hour
1: JFK LAX 2014 1 1 14 13 AA 359 2475 9
2: LGA PBI 2014 1 1 -8 -26 AA 157 1035 7
3: EWR LAX 2014 1 1 4 0 AA 339 2454 18
4: JFK MIA 2014 1 1 -1 -17 AA 161 1089 15
5: JFK SEA 2014 1 1 -2 -14 AA 349 2422 18
---
217: LGA AVL 2014 7 5 -5 -16 EV 86 599 9
218: LGA GSP 2014 8 23 24 34 EV 90 610 10
219: LGA SBN 2014 8 29 -3 -33 EV 94 651 18
220: EWR SBN 2014 10 26 20 19 EV 97 637 19
221: LGA DAL 2014 10 28 -5 -11 VX 210 1381 15
Write to and read from CSV in chunks
General form: DT[i, j, by]
i: on which rows?
j: what to do?
by: grouped by what?
fwrite - efficient CSV writing
> fwrite(flights, "flights14.csv")
DT[, j=fwrite( ), by=variable]
> flights[, fwrite(
+ data.table(air_time, distance),
+ paste0(month,".csv")
+ ), by=month]
Empty data.table (0 rows and 1 cols): month
> Sys.glob("*.csv")
[1] "10.csv" "1.csv" "2.csv" "3.csv" "4.csv"
[6] "5.csv" "6.csv" "7.csv" "8.csv" "9.csv"
fread CSV, combine with rbindlist
> month.dt.list <- list() #(1)
> for(month.csv in Sys.glob("*.csv")){
+ month <- sub(".csv", "", month.csv)
+ month.dt.list[[month.csv]] <- data.table(month, fread(month.csv)) #(2)
+ }
> (month.dt <- data.table::rbindlist(month.dt.list)) #(3)
month air_time distance
1: 10 44 96
2: 10 39 96
3: 10 37 184
4: 10 35 184
5: 10 50 184
---
253312: 9 356 2586
253313: 9 347 2586
253314: 9 356 2586
253315: 9 345 2586
253316: 9 592 4962
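A more compact variant of steps (1)-(3), a sketch using rbindlist's idcol argument (here the id column keeps the .csv suffix, unlike the loop above):
library(data.table)
csv.list <- sapply(Sys.glob("*.csv"), fread, simplify=FALSE) # named list of tables
month.dt <- rbindlist(csv.list, idcol="month.csv")           # id column from list names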
Read several CSV using by
> data.table(month.csv=Sys.glob("*.csv"))[, fread(month.csv), by=month.csv]
month.csv air_time distance
1: 10.csv 44 96
2: 10.csv 39 96
3: 10.csv 37 184
4: 10.csv 35 184
5: 10.csv 50 184
---
253312: 9.csv 356 2586
253313: 9.csv 347 2586
253314: 9.csv 356 2586
253315: 9.csv 345 2586
253316: 9.csv 592 4962
Exercise
Joins: when i is a data.table
General form: DT[i, j, by]
i: on which rows?
j: what to do?
by: grouped by what?
Join DT with table i: DT[ i ]
> (three.days <- rbind(
+ data.table(month=3, day=17, event="St.Patrick"),
+ data.table(month=9, day=26, event="Birthday"),
+ data.table(month=12, day=25, event="Christmas")))
month day event
1: 3 17 St.Patrick
2: 9 26 Birthday
3: 12 25 Christmas
> flights[three.days, on=c("month","day")]
year month day dep_delay arr_delay carrier origin dest air_time distance hour event
1: 2014 3 17 17 2 US LGA PHL 35 96 15 St.Patrick
2: 2014 3 17 0 1 US LGA PHL 41 96 6 St.Patrick
3: 2014 3 17 -7 -16 EV EWR ALB 30 143 8 St.Patrick
4: 2014 3 17 37 33 EV EWR ALB 31 143 22 St.Patrick
5: 2014 3 17 19 3 EV EWR ALB 30 143 13 St.Patrick
---
1723: 2014 9 26 -2 -15 UA JFK SFO 318 2586 7 Birthday
1724: 2014 9 26 -6 -22 UA JFK SFO 329 2586 11 Birthday
1725: 2014 9 26 8 -11 UA EWR HNL 589 4962 9 Birthday
1726: 2014 9 26 -3 -26 HA JFK HNL 595 4983 9 Birthday
1727: NA 12 25 NA NA <NA> <NA> <NA> NA NA NA Christmas
Join two tables, mult argument
> flights[three.days, on=c("month","day"), mult="first"]
year month day dep_delay arr_delay carrier origin dest air_time distance hour event
1: 2014 3 17 17 2 US LGA PHL 35 96 15 St.Patrick
2: 2014 9 26 -2 -4 US LGA PHL 30 96 5 Birthday
3: NA 12 25 NA NA <NA> <NA> <NA> NA NA NA Christmas
> flights[three.days, on=c("month","day"), mult="last"]
year month day dep_delay arr_delay carrier origin dest air_time distance hour event
1: 2014 3 17 -8 13 HA JFK HNL 675 4983 9 St.Patrick
2: 2014 9 26 -3 -26 HA JFK HNL 595 4983 9 Birthday
3: NA 12 25 NA NA <NA> <NA> <NA> NA NA NA Christmas
Join two tables, no NA rows
> flights[three.days, on=c("month","day"), nomatch=0L]
year month day dep_delay arr_delay carrier origin dest air_time distance hour event
1: 2014 3 17 17 2 US LGA PHL 35 96 15 St.Patrick
2: 2014 3 17 0 1 US LGA PHL 41 96 6 St.Patrick
3: 2014 3 17 -7 -16 EV EWR ALB 30 143 8 St.Patrick
4: 2014 3 17 37 33 EV EWR ALB 31 143 22 St.Patrick
5: 2014 3 17 19 3 EV EWR ALB 30 143 13 St.Patrick
---
1722: 2014 9 26 -5 -37 UA JFK SFO 319 2586 10 Birthday
1723: 2014 9 26 -2 -15 UA JFK SFO 318 2586 7 Birthday
1724: 2014 9 26 -6 -22 UA JFK SFO 329 2586 11 Birthday
1725: 2014 9 26 8 -11 UA EWR HNL 589 4962 9 Birthday
1726: 2014 9 26 -3 -26 HA JFK HNL 595 4983 9 Birthday
Join two tables using setkey
> setkey(three.days, month, day)
> setkey(flights, month, day)
> flights[three.days]
year month day dep_delay arr_delay carrier origin dest air_time distance hour event
1: 2014 3 17 17 2 US LGA PHL 35 96 15 St.Patrick
2: 2014 3 17 0 1 US LGA PHL 41 96 6 St.Patrick
3: 2014 3 17 -7 -16 EV EWR ALB 30 143 8 St.Patrick
4: 2014 3 17 37 33 EV EWR ALB 31 143 22 St.Patrick
5: 2014 3 17 19 3 EV EWR ALB 30 143 13 St.Patrick
---
1723: 2014 9 26 -2 -15 UA JFK SFO 318 2586 7 Birthday
1724: 2014 9 26 -6 -22 UA JFK SFO 329 2586 11 Birthday
1725: 2014 9 26 8 -11 UA EWR HNL 589 4962 9 Birthday
1726: 2014 9 26 -3 -26 HA JFK HNL 595 4983 9 Birthday
1727: NA 12 25 NA NA <NA> <NA> <NA> NA NA NA Christmas
Join two tables other way
> (join.dt <- three.days[flights, on=c("month","day")][order(is.na(event))])
month day event year dep_delay arr_delay carrier origin dest air_time distance hour
1: 3 17 St.Patrick 2014 17 2 US LGA PHL 35 96 15
2: 3 17 St.Patrick 2014 0 1 US LGA PHL 41 96 6
3: 3 17 St.Patrick 2014 -7 -16 EV EWR ALB 30 143 8
4: 3 17 St.Patrick 2014 37 33 EV EWR ALB 31 143 22
5: 3 17 St.Patrick 2014 19 3 EV EWR ALB 30 143 13
---
253312: 10 31 <NA> 2014 -4 -21 UA JFK SFO 337 2586 17
253313: 10 31 <NA> 2014 -2 -37 UA JFK SFO 344 2586 18
253314: 10 31 <NA> 2014 -6 -38 UA JFK SFO 343 2586 9
253315: 10 31 <NA> 2014 77 23 UA EWR HNL 589 4962 10
253316: 10 31 <NA> 2014 -5 1 HA JFK HNL 633 4983 9
> table(join.dt$event, useNA="always")
Birthday St.Patrick <NA>
815 911 251590
Exercise
code name
1: JFK Kennedy
2: LGA Laguardia
3: EWR Newark
How to compute histogram of distances?
Define histogram bins
grid.point <- seq(0, 5000, by=1000)
grid.dt <- data.table(grid.point, distance=grid.point)
Rolling join, roll=Inf
> setkey(grid.dt, distance)
> setkey(flights, distance)
> (join.dt <- grid.dt[flights, roll=Inf])
grid.point distance year month day dep_delay arr_delay carrier origin dest air_time hour
1: 0 80 2014 1 30 9 17 US EWR PHL 46 15
2: 0 80 2014 1 30 -13 -25 US EWR PHL 34 17
---
253315: 4000 4983 2014 10 30 -6 1 HA JFK HNL 627 9
253316: 4000 4983 2014 10 31 -5 1 HA JFK HNL 633 9
Rolling join, roll=-Inf
> setkey(grid.dt, distance)
> setkey(flights, distance)
> (join.dt <- grid.dt[flights, roll=-Inf])
grid.point distance year month day dep_delay arr_delay carrier origin dest air_time hour
1: 1000 80 2014 1 30 9 17 US EWR PHL 46 15
2: 1000 80 2014 1 30 -13 -25 US EWR PHL 34 17
---
253315: 5000 4983 2014 10 30 -6 1 HA JFK HNL 627 9
253316: 5000 4983 2014 10 31 -5 1 HA JFK HNL 633 9
Rolling join, roll="nearest"
> setkey(grid.dt, distance)
> setkey(flights, distance)
> (join.dt <- grid.dt[flights, roll="nearest"])
grid.point distance year month day dep_delay arr_delay carrier origin dest air_time hour
1: 0 80 2014 1 30 9 17 US EWR PHL 46 15
2: 0 80 2014 1 30 -13 -25 US EWR PHL 34 17
---
253315: 5000 4983 2014 10 30 -6 1 HA JFK HNL 627 9
253316: 5000 4983 2014 10 31 -5 1 HA JFK HNL 633 9
Summarize after rolling join
> join.dt[, .(num_flights=.N, mean_minutes=mean(air_time)), by=grid.point]
grid.point num_flights mean_minutes
1: 0 55212 53.33366
2: 1000 136285 133.56513
3: 2000 48593 283.92688
4: 3000 12665 348.11062
5: 5000 561 618.78253
Non-equi join on=.(DTcol < icol)
> (rect.dt <- data.table(
+   grid.point, min_dist=grid.point-200, max_dist=grid.point+400))
   grid.point min_dist max_dist
1:          0     -200      400
2:       1000      800     1400
3:       2000     1800     2400
4:       3000     2800     3400
5:       4000     3800     4400
6:       5000     4800     5400
> head(join.dt <- rect.dt[flights, .(
+ distance, air_time, grid.point
+ ), on=.(
+ min_dist<distance, max_dist>distance
+ )])
distance air_time grid.point
1: 2475 359 NA
2: 2475 363 NA
3: 2475 351 NA
4: 1035 157 1000
5: 2475 350 NA
6: 2454 339 NA
Compute for every row in table i
General form: DT[i, j, by=.EACHI]
i: on which rows?
j: what to do?
by=.EACHI: grouped by each row of table i
Summarize in join via by=.EACHI
> flights[three.days, .(
+ mean_air_time=mean(air_time),
+ num_flights=.N
+ ), by=.EACHI, on=.(month,day)]
month day mean_air_time num_flights
1: 3 17 168.9890 911
2: 9 26 151.4147 815
3: 12 25 NA 0
Data reshaping: wide to long (melt)
melt( ) converts wide data (few rows, many columns) to long data (many rows, few columns); dcast( ) is the inverse.
similar to stats::reshape(direction="long"), tidyr::pivot_longer( )
melt: wide to long data reshape
melt(DT, id.vars=c("Species","flower"))
melt(DT, measure.vars=c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width"))
melt(DT, measure.vars=patterns(".*[.].*"))
In general, the columns to melt can be specified in three ways:
melt(DT, measure.vars=patterns("regular expression"))
melt(DT, measure.vars=c("valueA","valueB"))
melt(DT, id.vars=c("ID1","ID2"))
melt with measure( ), new on GitHub
melt(DT, measure.vars=measure(part, dim, pattern="(.*)[.](.*)"))
melt(DT, measure.vars=measure(part, dim, sep="."))
melt(DT, measure.vars=measure(part, value.name, sep="."))
melt(DT, measure.vars=measure(value.name, dim, sep="."))
melt examples adapted from:
Hocking TD. Wide-to-tall data reshaping using regular expressions and the nc package. R Journal (2021), doi:10.32614/RJ-2021-029
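A runnable sketch of the DT assumed by the melt( ) calls above (iris with a row id, as defined in the dcast section below):
library(data.table)
DT <- data.table(iris)[, flower := .I]       # .I = row number
melt(DT, id.vars=c("Species","flower"))[1:3] # long format: variable, value columns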
melt exercise for flights data
Data reshaping: long to wide (dcast)
dcast( ) converts long data (many rows, few columns) to wide data (few rows, many columns); melt( ) is the inverse.
similar to stats::reshape(direction="wide"), tidyr::pivot_wider( )
dcast: long to wide reshape
> iris.dt=data.table(iris)[, flower := .I] # .I means row number in data.table
> iris.long=melt(iris.dt,measure=measure(part,dim,sep="."))
> head(iris.long)
Species flower part dim value
1: setosa 1 Sepal Length 5.1
2: setosa 2 Sepal Length 4.9
3: setosa 3 Sepal Length 4.7
4: setosa 4 Sepal Length 4.6
5: setosa 5 Sepal Length 5.0
6: setosa 6 Sepal Length 5.4
# Formula indicates where to put different values of variables: rows ~ cols
> iris.wide=dcast(iris.long, flower ~ part + dim, sep=".")
> head(iris.wide)
flower Petal.Length Petal.Width Sepal.Length Sepal.Width
1: 1 1.4 0.2 5.1 3.5
2: 2 1.4 0.2 4.9 3.0
3: 3 1.3 0.2 4.7 3.2
4: 4 1.5 0.2 4.6 3.1
5: 5 1.4 0.2 5.0 3.6
6: 6 1.7 0.4 5.4 3.9
dcast: long to wide reshape
> iris.long=melt(iris.dt,measure=measure(part,dim,sep="."),value.name="cm")
> iris.long
Species flower part dim cm
1: setosa 1 Sepal Length 5.1
2: setosa 2 Sepal Length 4.9
3: setosa 3 Sepal Length 4.7
4: setosa 4 Sepal Length 4.6
5: setosa 5 Sepal Length 5.0
---
596: virginica 146 Petal Width 2.3
597: virginica 147 Petal Width 1.9
598: virginica 148 Petal Width 2.0
599: virginica 149 Petal Width 2.3
600: virginica 150 Petal Width 1.8
> dcast(iris.long, Species ~ dim, fun.aggregate=mean, value.var="cm")
Species Length Width
1: setosa 3.234 1.837
2: versicolor 5.098 2.048
3: virginica 6.070 2.500
dcast: long to wide reshape
> iris.long=melt(iris.dt,measure=measure(part,dim,sep="."),value.name="cm")
> iris.long
Species flower part dim cm
1: setosa 1 Sepal Length 5.1
2: setosa 2 Sepal Length 4.9
3: setosa 3 Sepal Length 4.7
4: setosa 4 Sepal Length 4.6
5: setosa 5 Sepal Length 5.0
---
596: virginica 146 Petal Width 2.3
597: virginica 147 Petal Width 1.9
598: virginica 148 Petal Width 2.0
599: virginica 149 Petal Width 2.3
600: virginica 150 Petal Width 1.8
> dcast(iris.long, Species ~ dim, fun.aggregate=list(min, mean, max), value.var="cm")
Species cm_min_Length cm_min_Width cm_mean_Length cm_mean_Width cm_max_Length cm_max_Width
1: setosa 1.0 0.1 3.234 1.837 5.8 4.4
2: versicolor 3.0 1.0 5.098 2.048 7.0 3.4
3: virginica 4.5 1.4 6.070 2.500 7.9 3.8
dcast: long to wide reshape
> iris.long=melt(iris.dt,measure=measure(part,value.name,sep="."))
> iris.long
Species flower part Length Width
1: setosa 1 Sepal 5.1 3.5
2: setosa 2 Sepal 4.9 3.0
3: setosa 3 Sepal 4.7 3.2
4: setosa 4 Sepal 4.6 3.1
5: setosa 5 Sepal 5.0 3.6
---
296: virginica 146 Petal 5.2 2.3
297: virginica 147 Petal 5.0 1.9
298: virginica 148 Petal 5.2 2.0
299: virginica 149 Petal 5.4 2.3
300: virginica 150 Petal 5.1 1.8
> dcast(iris.long, Species ~ part, fun.aggregate=mean, value.var=c("Length","Width"))
Species Length_Petal Length_Sepal Width_Petal Width_Sepal
1: setosa 1.462 5.006 0.246 3.428
2: versicolor 4.260 5.936 1.326 2.770
3: virginica 5.552 6.588 2.026 2.974
dcast: long to wide reshape
> iris.long=melt(iris.dt,measure=measure(part,value.name,sep="."))
> iris.long
Species flower part Length Width
1: setosa 1 Sepal 5.1 3.5
2: setosa 2 Sepal 4.9 3.0
3: setosa 3 Sepal 4.7 3.2
4: setosa 4 Sepal 4.6 3.1
5: setosa 5 Sepal 5.0 3.6
---
296: virginica 146 Petal 5.2 2.3
297: virginica 147 Petal 5.0 1.9
298: virginica 148 Petal 5.2 2.0
299: virginica 149 Petal 5.4 2.3
300: virginica 150 Petal 5.1 1.8
~ . means no reshaping variables in columns
> dcast(iris.long, Species + part ~ ., fun.aggregate=list(min, max), value.var=c("Length","Width"))
Species part Length_min Width_min Length_max Width_max
1: setosa Petal 1.0 0.1 1.9 0.6
2: setosa Sepal 4.3 2.3 5.8 4.4
3: versicolor Petal 3.0 1.0 5.1 1.8
4: versicolor Sepal 4.9 2.0 7.0 3.4
5: virginica Petal 4.5 1.4 6.9 2.5
6: virginica Sepal 4.9 2.2 7.9 3.8
Exercise
dest EWR JFK LGA
1: ABQ 0 278 0
2: ACK 0 277 0
3: AGS 0 0 3
4: ALB 169 0 0
5: ANC 13 0 0
...
import data.table in your package
Avoiding CRAN NOTE 1
my_fun <- function(DT) DT[, .(m=mean(x)), by=y]
* checking R code for possible problems ... NOTE
my_fun: no visible global function definition for ‘.’
my_fun: no visible binding for global variable ‘x’
my_fun: no visible binding for global variable ‘y’
...
Undefined global functions or variables:
. x y
Fix: define the symbols as NULL local variables, to silence the NOTE:
my_fun <- function(DT){
x <- y <- . <- NULL
DT[, .(m=mean(x)), by=y]
}
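An alternative fix (not shown on the slide) is to declare the symbols once for the whole package:
# e.g. in a package source file such as R/globals.R (file name is just a convention):
utils::globalVariables(c(".", "x", "y"))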
Avoiding CRAN NOTE 2
Flavor: r-devel-linux-x86_64-debian-gcc
Check: examples, Result: NOTE
Examples with CPU time > 2.5 times elapsed time
user system elapsed ratio
aum_line_search 12.349 0.322 1.935 6.548
aum_line_search_grid 10.033 0.308 1.781 5.806
aum_diffs_penalty 4.730 0.169 1.635 2.996
Fix: limit the number of threads used by data.table, via data.table::setDTthreads(2)
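A sketch of where that call could go, e.g. at the top of each example or test that exercises data.table (the placement is a suggestion, not from the slide):
if (requireNamespace("data.table", quietly=TRUE)) {
  data.table::setDTthreads(2) # keep CPU time within 2.5x elapsed time on CRAN
}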
Part 4/4: Contributing to data.table
~10 minutes, tell us how you use data.table!
(help us choose what to prioritize in the future)
Community blog and survey
GitHub repository
Translation Awards
Travel awards
Summary of data.table
Thank you! Questions?
Toby Dylan Hocking
Assistant Professor
Northern Arizona University
toby.hocking@r-project.org
Funded by NSF POSE program, project #2303612.
Slides adapted from Arun Srinivasan, and datatable-intro vignette - thanks!
Please use/adapt these slides as you like,
as long as you give me some credit,
as I have done for Arun above.