How and when to use an alternative to as.data.frame
data.frame() and as.data.frame() are such ubiquitous functions that we rarely think twice about using them to create dataframes or to convert other objects to dataframes.
However, they are slow. Extremely slow.
This is somewhat surprising: given how widely they are used, and that the ‘data.frame’ object is the de facto standard for tabular data in R, one might expect their constructors to be efficient.
However, this is the direct result of extensive error checking and validation code, which is perhaps understandable for something so widely used. The constructors need to handle a wide variety of possible inputs, and so try to do their best or fail gracefully.
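To illustrate the kind of checking involved, here is a minimal sketch using only base R. The inputs are made up for illustration. It shows data.frame() recycling a shorter column, repairing a non-syntactic column name, and raising an error when the column lengths are incompatible.

```r
# Columns whose lengths divide evenly are recycled to a common length
df_recycled <- data.frame(a = 1:4, b = 1:2)
nrow(df_recycled)

# Non-syntactic names are repaired, as check.names = TRUE by default
df_names <- data.frame(`a b` = 1)
names(df_names)

# Incompatible column lengths are caught and reported as an error
res <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(res, "try-error")
```

Each of these behaviours requires inspecting the inputs before the object is built, which is where the time goes.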
Below, we demonstrate the inefficiencies of as.data.frame() versus efficient ‘data.frame’ constructors from the ‘ichimoku’ package [1], which are coded for performance.
For benchmarking, the ‘microbenchmark’ package will be used. It is usual to compare median times over a large number of runs, and we will use 10,000 in the cases below.
A 100x10 matrix of random data drawn from the normal distribution is created as the object ‘matrix’.
This will be converted into a dataframe using as.data.frame() and ichimoku::matrix_df().
library(ichimoku)
library(microbenchmark)
matrix <- matrix(rnorm(1000), ncol = 10, dimnames = list(1:100, letters[1:10]))
dim(matrix)
[1] 100 10
head(matrix)
           a          b          c          d          e          f
1 -0.1400286  1.1118323  0.4669602 -1.4488988 -0.7541324  0.9637862
2  1.0460964 -0.7047356 -0.4437435  0.7018097  0.4328479  0.9859072
3 -2.0406678  0.2809715 -0.2868613 -1.9068354 -1.2635379 -0.3884103
4  1.4877548 -2.3801003 -1.0285516 -0.4433571 -0.9534238  1.4033954
5  1.5941811 -1.1293828  1.3376018  1.1229029  1.0523098  1.3629215
6  0.3723751 -1.6727550 -1.1478162 -1.7363323  2.0140197 -0.6120398
           g          h          i          j
1  0.9775267 -0.8609284  1.1481851 -0.8444851
2  0.3858858 -0.9267438 -0.7355040 -0.6310779
3 -0.1272062 -1.3729532  1.9566503  1.0197956
4  1.0877965  0.7028592 -1.4809024 -1.8808421
5 -0.2563864 -0.1795008  0.3667372 -1.1918655
6  1.4115147 -0.9282505  0.0256846 -0.6860796
microbenchmark(as.data.frame(matrix), matrix_df(matrix), times = 10000)
Unit: microseconds
                  expr    min      lq     mean  median      uq      max neval
 as.data.frame(matrix) 30.494 32.5955 38.57970 33.5535 35.5940 12755.62 10000
     matrix_df(matrix)  6.353  7.1900 10.21914  7.6450  8.2785 11542.73 10000
identical(as.data.frame(matrix), matrix_df(matrix))
[1] TRUE
As can be seen, the outputs are identical, but ichimoku::matrix_df(), which is designed to be a performant ‘data.frame’ constructor, is over 4x as fast when comparing median times.
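As a rough idea of what a matrix constructor of this kind can look like, here is a minimal sketch in base R. The function name matrix_to_df_sketch() is hypothetical and this is not the ‘ichimoku’ implementation. It assumes the input is a matrix with column names and performs no other checks.

```r
# Hypothetical helper for illustration; not the 'ichimoku' implementation.
# Assumes 'x' is a matrix that has column names; no other checks are made.
matrix_to_df_sketch <- function(x) {
  nc <- ncol(x)
  cols <- vector(mode = "list", length = nc)
  # as.vector() drops the names each extracted column would otherwise carry
  for (i in seq_len(nc)) cols[[i]] <- as.vector(x[, i])
  rn <- rownames(x)
  attributes(cols) <- list(
    names = colnames(x),
    class = "data.frame",
    # Keep existing row names, else use R's compact automatic row names
    row.names = if (is.null(rn)) .set_row_names(nrow(x)) else rn
  )
  cols
}

m <- matrix(rnorm(20), ncol = 4, dimnames = list(NULL, c("a", "b", "c", "d")))
df <- matrix_to_df_sketch(m)
is.data.frame(df)
all.equal(df, as.data.frame(m))
```

The speed comes from skipping everything except the essential steps: splitting the matrix into columns and setting three attributes.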
The ‘xts’ format is a popular choice for large time series data as each observation is indexed by a unique valid timestamp.
As an example, we use the ichimoku() function from the ‘ichimoku’ package, which creates ichimoku objects inheriting the ‘xts’ class. We run ichimoku() on the sample data contained within the package to create an ‘xts’ object ‘cloud’ [2].
This will be converted into a dataframe using as.data.frame() and ichimoku::xts_df().
library(ichimoku)
library(microbenchmark)
cloud <- ichimoku(sample_ohlc_data)
class(cloud) <- c("xts", "zoo")
xts::is.xts(cloud)
[1] TRUE
dim(cloud)
[1] 281 12
print(cloud[1:6], plot = FALSE)
            open  high   low close cd tenkan kijun senkouA senkouB
2020-01-02 123.0 123.1 122.5 122.7 -1     NA    NA      NA      NA
2020-01-03 122.7 122.8 122.6 122.8  1     NA    NA      NA      NA
2020-01-06 122.8 123.4 122.4 123.3  1     NA    NA      NA      NA
2020-01-07 123.3 124.3 123.3 124.1  1     NA    NA      NA      NA
2020-01-08 124.1 124.8 124.0 124.8  1     NA    NA      NA      NA
2020-01-09 124.8 125.4 124.5 125.3  1     NA    NA      NA      NA
           chikou cloudT cloudB
2020-01-02  122.8     NA     NA
2020-01-03  122.9     NA     NA
2020-01-06  123.0     NA     NA
2020-01-07  123.9     NA     NA
2020-01-08  123.6     NA     NA
2020-01-09  122.5     NA     NA
microbenchmark(as.data.frame(cloud), xts_df(cloud), times = 10000)
Unit: microseconds
                 expr     min      lq      mean   median       uq      max neval
 as.data.frame(cloud) 234.909 243.014 270.20903 248.7585 267.1130  7683.47 10000
        xts_df(cloud)  24.201  27.103  40.36881  28.7090  30.5855 59508.19 10000
It can be seen that ichimoku::xts_df(), which is designed to be a performant ‘data.frame’ constructor, is over 8x as fast.
df1 <- as.data.frame(cloud)
is.data.frame(df1)
[1] TRUE
str(df1)
'data.frame': 281 obs. of 12 variables:
$ open : num 123 123 123 123 124 ...
$ high : num 123 123 123 124 125 ...
$ low : num 122 123 122 123 124 ...
$ close : num 123 123 123 124 125 ...
$ cd : num -1 1 1 1 1 1 -1 0 -1 -1 ...
$ tenkan : num NA NA NA NA NA ...
$ kijun : num NA NA NA NA NA NA NA NA NA NA ...
$ senkouA: num NA NA NA NA NA NA NA NA NA NA ...
$ senkouB: num NA NA NA NA NA NA NA NA NA NA ...
$ chikou : num 123 123 123 124 124 ...
$ cloudT : num NA NA NA NA NA NA NA NA NA NA ...
$ cloudB : num NA NA NA NA NA NA NA NA NA NA ...
df2 <- xts_df(cloud)
is.data.frame(df2)
[1] TRUE
str(df2)
'data.frame': 281 obs. of 13 variables:
$ index : POSIXct, format: "2020-01-02" ...
$ open : num 123 123 123 123 124 ...
$ high : num 123 123 123 124 125 ...
$ low : num 122 123 122 123 124 ...
$ close : num 123 123 123 124 125 ...
$ cd : num -1 1 1 1 1 1 -1 0 -1 -1 ...
$ tenkan : num NA NA NA NA NA ...
$ kijun : num NA NA NA NA NA NA NA NA NA NA ...
$ senkouA: num NA NA NA NA NA NA NA NA NA NA ...
$ senkouB: num NA NA NA NA NA NA NA NA NA NA ...
$ chikou : num 123 123 123 124 124 ...
$ cloudT : num NA NA NA NA NA NA NA NA NA NA ...
$ cloudB : num NA NA NA NA NA NA NA NA NA NA ...
The outputs are slightly different as xts_df() preserves the date-time index of ‘xts’ objects as a new first column ‘index’ which is POSIXct in format. The default as.data.frame() constructor converts the index into the row names, which is not desirable as the dates are coerced to type ‘character’.
So it can be seen that in this case, not only is the performant constructor faster, it is also more fit for purpose.
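The difference between the two ways of storing the index can be reproduced with base R alone. The following is a small sketch with made-up sample values, not output from either constructor. It stores the same POSIXct index once as row names and once as a column.

```r
# Made-up sample data for illustration
idx <- as.POSIXct(c("2020-01-02", "2020-01-03", "2020-01-06"), tz = "UTC")
close <- c(122.7, 122.8, 123.3)

# Index stored as row names: the date-times are coerced to character
df_rn <- data.frame(close = close)
row.names(df_rn) <- idx
class(row.names(df_rn))

# Index stored as a column: the POSIXct class is retained
df_col <- data.frame(index = idx, close = close)
class(df_col$index)

# Date arithmetic works directly on the column version
diff(df_col$index)
```

With the row names version, the index would have to be parsed back into date-times before any time-based operation.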
Benchmarking against ggplot2::fortify() as a further alternative, ichimoku::xts_df() is still around 5x as fast.
microbenchmark(as.data.frame(cloud), ggplot2::fortify(cloud), xts_df(cloud), times = 10000)
Unit: microseconds
                    expr     min       lq      mean   median       uq      max neval
    as.data.frame(cloud) 237.257 252.8185 297.82993 263.3445 290.8465 6863.294 10000
 ggplot2::fortify(cloud) 130.149 149.6695 176.68291 160.8120 178.8900 6107.186 10000
           xts_df(cloud)  25.166  29.8525  38.59277  31.8160  34.9770 9592.124 10000
In a context where performance is critical. This is usually in interactive environments such as a Shiny app, perhaps with real-time data, where slow code can reduce responsiveness or cause bottlenecks in execution.
Within packages. It is usually safe to use performant constructors within functions or for internal unexported functions. If programming best practices are followed, the input and output types of functions are kept consistent, so the input to the constructor can be controlled and it behaves predictably. Appropriate unit tests can also catch any issues early.
For user-facing functions. Having no validation or error-checking code means that a performant constructor may behave unpredictably on data that is not intended to be an input. Within a function, there is a specific or at most finite range of objects that a constructor can receive. When that limit is removed, if the input is not the intended input for a constructor then an error can be expected. As long as this is made clear to the user and there are adequate instructions on proper usage, in an environment where the occasional error message is acceptable, then the performant constructor can still be used.
When the constructor needs to handle a range of input types. as.data.frame() is actually an S3 generic with a variety of methods for different object classes. If required to handle a variety of different types of input, it may be easier (if not more performant) to rely on as.data.frame() rather than write code which handles different scenarios.
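One way to combine the two approaches in a user-facing function is to validate the input once, take the fast path when it is of the expected form, and otherwise fall back to as.data.frame() and its S3 dispatch. The sketch below is only an illustration of that idea. The names list_to_df_fast() and to_df() are hypothetical, and the checks chosen are an assumption about what a reasonable guard might look like.

```r
# Hypothetical fast constructor: assumes a named list of equal-length
# atomic vectors and performs no validation of its own
list_to_df_fast <- function(cols) {
  attributes(cols) <- list(
    names = names(cols),
    class = "data.frame",
    row.names = .set_row_names(length(cols[[1L]]))
  )
  cols
}

# Hypothetical user-facing wrapper: check once, then pick a path
to_df <- function(x) {
  is_plain_column <- function(el) is.atomic(el) && is.null(dim(el))
  ok <- is.list(x) && !is.object(x) && length(x) > 0L &&
    !is.null(names(x)) && all(nzchar(names(x))) &&
    all(vapply(x, is_plain_column, logical(1L))) &&
    length(unique(lengths(x))) == 1L
  if (ok) list_to_df_fast(x) else as.data.frame(x)
}

df1 <- to_df(list(a = 1:3, b = c(1.5, 2.5, 3.5)))  # takes the fast path
df2 <- to_df(matrix(1:4, ncol = 2))                 # falls back to as.data.frame()
```

The guard itself costs time, so whether this is worthwhile depends on how often the expected input form occurs in practice.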
First of all, it is possible to directly use the functions matrix_df() and xts_df(), which are exported from the ‘ichimoku’ package.
What lies behind those functions? Some variation on the below:
# Example vectors of equal length to use as columns
vec1 <- 1:3
vec2 <- c("a", "b", "c")
vec3 <- c(TRUE, FALSE, TRUE)
# The structure underlying a data frame is simply a list
df <- list(vec1, vec2, vec3)
# Set the following attributes to turn the list into a data frame:
attributes(df) <- list(names = c("vec1", "vec2", "vec3"),
                       class = "data.frame",
                       row.names = .set_row_names(length(vec1)))
The row names are set using .set_row_names(), which takes a single argument, the number of rows.
Note: .set_row_names() sets row names efficiently using a compact internal notation used by R. Row names can also be assigned an integer sequence, or a series of dates for example. However, if not an integer vector, they are first coerced to type ‘character’.
In conclusion, dataframes are not complicated structures: internally they are represented by lists with a couple of enforced constraints.
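The list structure and the compact row names can be inspected directly in base R. The following short sketch uses a made-up two-column dataframe for illustration.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# A data frame is a list under the hood
typeof(df)

# Removing the class reveals the plain list and its remaining attributes
str(unclass(df))

# Automatic row names are stored internally in the compact form c(NA, -n)
.row_names_info(df, 0L)

# They are expanded to the sequence 1:n when accessed in the usual way
attr(df, "row.names")
```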
Documentation for the performant constructors discussed: https://shikokuchuo.net/ichimoku/articles/utilities.html#performant-dataframe-constructors.
[1] Gao, C. (2021), ichimoku: Visualization and Tools for Ichimoku Kinko Hyo Strategies. R package version 1.2.2, https://CRAN.R-project.org/package=ichimoku.
[2] We then remove the ‘ichimoku’ class from the object, as ‘ichimoku’ has had an efficient ‘as.data.frame’ S3 method since version 1.2.4.
For attribution, please cite this work as
shikokuchuo (2021, July 23). shikokuchuo{net}: Efficient R: Performant data.frame constructors. Retrieved from https://shikokuchuo.net/posts/11-dataframes/
BibTeX citation
@misc{shikokuchuo2021efficient, author = {shikokuchuo, }, title = {shikokuchuo{net}: Efficient R: Performant data.frame constructors}, url = {https://shikokuchuo.net/posts/11-dataframes/}, year = {2021} }