# data.table or data.frame?

[This article was first published on **R**, and kindly contributed to R-bloggers.]

I spent a portion of today trying to convince a colleague that there are times when the `data.table` package is faster than traditional methods in R. It took a few of the tests below to prove the point.

Generate a data.frame of characters and numbers for easy plotting.

```r
df <- data.frame(letters = as.character(sample(letters[1:10], 1e+08, replace = TRUE)),
                 numbers = sample(1:100, 1e+08, replace = TRUE))
head(df)
##   letters numbers
## 1       f      69
## 2       j      65
## 3       h      29
## 4       c      69
## 5       j      12
## 6       e      65
```
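Because `sample()` is random, the exact values will differ between runs. A seeded, smaller variant (an illustrative sketch, not part of the original post; `df_small` is a hypothetical name) makes the later examples reproducible and quick to test:

```r
# Reproducible, smaller variant of the same data: 1e6 rows instead of 1e8
set.seed(42)  # fix the RNG so the sampled data can be reproduced
df_small <- data.frame(letters = sample(letters[1:10], 1e+06, replace = TRUE),
                       numbers = sample(1:100, 1e+06, replace = TRUE))
str(df_small)
```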

Aggregate using the base R function `aggregate`.

```r
start <- proc.time()
aggregate(numbers ~ letters, data = df, FUN = sum)
##    letters   numbers
## 1        a 504884636
## 2        b 504587923
## 3        c 505357057
## 4        d 505106809
## 5        e 504788174
## 6        f 505219078
## 7        g 504796095
## 8        h 504693166
## 9        i 505079861
## 10       j 505044118
aggregate_time <- proc.time() - start
aggregate_time
##    user  system elapsed
##  120.13   30.51  261.79
```

Aggregate using `ddply` from the `plyr` package.

```r
require("plyr")
## Loading required package: plyr
start <- proc.time()
ddply(df, .(letters), summarize, sums = sum(numbers))
##    letters      sums
## 1        a 504884636
## 2        b 504587923
## 3        c 505357057
## 4        d 505106809
## 5        e 504788174
## 6        f 505219078
## 7        g 504796095
## 8        h 504693166
## 9        i 505079861
## 10       j 505044118
ddply_time <- proc.time() - start
ddply_time
##   user  system elapsed
##  22.04   27.38  192.99
```

Aggregate using the `data.table` package.

```r
require("data.table")
## Loading required package: data.table
start <- proc.time()
dt <- data.table(df, key = "letters")
dt[, list(sums = sum(numbers)), by = c("letters")]
##     letters      sums
##  1:       a 504884636
##  2:       b 504587923
##  3:       c 505357057
##  4:       d 505106809
##  5:       e 504788174
##  6:       f 505219078
##  7:       g 504796095
##  8:       h 504693166
##  9:       i 505079861
## 10:       j 505044118
dt_time <- proc.time() - start
dt_time
##   user  system elapsed
##  7.102   7.017  55.957
```
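The timing above deliberately includes the `data.table(df, ...)` conversion, which copies the data.frame. One way to avoid that copy (a sketch using toy data, not part of the original benchmark) is `setDT()`, which converts a data.frame to a data.table in place, by reference:

```r
library(data.table)

# Toy data so the group sums are easy to verify by hand
df2 <- data.frame(letters = rep(letters[1:3], each = 2),
                  numbers = c(1, 2, 3, 4, 5, 6))

setDT(df2)            # convert to data.table by reference: no copy is made
setkey(df2, letters)  # optional: sort/index by the grouping column

df2[, list(sums = sum(numbers)), by = letters]
##    letters sums
## 1:       a    3
## 2:       b    7
## 3:       c   11
```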

Comparison of the system times.

```r
# how many times slower is aggregate
aggregate_time[2]/ddply_time[2]
## sys.self
##    1.114
aggregate_time[2]/dt_time[2]
## sys.self
##    4.347

# how many times slower is ddply
ddply_time[2]/aggregate_time[2]
## sys.self
##   0.8975
ddply_time[2]/dt_time[2]
## sys.self
##    3.902

# how many times slower is data.table
dt_time[2]/aggregate_time[2]
## sys.self
##     0.23
dt_time[2]/ddply_time[2]
## sys.self
##   0.2563
```

Based on 100 million observations (1e+08), with the time to convert to a data.table included in the elapsed time.

- ddply requires ~0.8975x the system time of aggregate
- aggregate requires ~4.347x more system time than data.table
- ddply requires ~3.902x more system time than data.table
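The full 1e+08-row benchmark takes minutes to run. The comparison can be reproduced in miniature (a sketch at a smaller n; exact timings will vary by machine, and `plyr` is omitted to keep it minimal):

```r
library(data.table)

set.seed(1)
n <- 1e+06
df <- data.frame(letters = sample(letters[1:10], n, replace = TRUE),
                 numbers = sample(1:100, n, replace = TRUE))

# Time base R aggregate
t_agg <- system.time(agg_res <- aggregate(numbers ~ letters, data = df, FUN = sum))

# Time data.table, counting the data.frame -> data.table conversion as in the post
t_dt <- system.time({
  dt <- data.table(df, key = "letters")
  dt_res <- dt[, list(sums = sum(numbers)), by = letters]
})

# Both methods should agree on the group sums (both return groups sorted by letter)
all.equal(agg_res$numbers, dt_res$sums)

# How many times slower aggregate is, by elapsed time
t_agg["elapsed"] / t_dt["elapsed"]
```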

**Conclusion - data.table for the win.**
