Efficient R: Performant data.frame constructors

R

How and when to use an alternative to as.data.frame

shikokuchuo
2021-07-23

About as.data.frame

data.frame() and as.data.frame() are such ubiquitous functions that we rarely think twice about using them to create dataframes or to convert other objects to dataframes.

However, they are slow. Extremely slow.

This is somewhat surprising given how widely they are used, and given that the ‘data.frame’ object is the de facto standard for tabular data in R.

However, this is the direct result of a lot of error checking and validation code, which is perhaps understandable for something so widely used. These functions need to handle a wide variety of possible inputs, and so try to do their best or fail gracefully.

Below, we demonstrate the inefficiencies of as.data.frame() versus efficient ‘data.frame’ constructors from the ‘ichimoku’ package [1], which are coded for performance.

For benchmarking, the ‘microbenchmark’ package will be used. It is usual to compare median times over a large number of runs, and we will use 10,000 runs in each of the cases below.

Matrix conversion benchmarking

A 100x10 matrix of random data drawn from the normal distribution is created as the object ‘matrix’.

This will be converted into a dataframe using as.data.frame() and ichimoku::matrix_df().

library(ichimoku)
library(microbenchmark)

matrix <- matrix(rnorm(1000), ncol = 10, dimnames = list(1:100, letters[1:10]))

dim(matrix)
[1] 100  10
head(matrix)
           a          b          c          d          e          f
1 -0.1400286  1.1118323  0.4669602 -1.4488988 -0.7541324  0.9637862
2  1.0460964 -0.7047356 -0.4437435  0.7018097  0.4328479  0.9859072
3 -2.0406678  0.2809715 -0.2868613 -1.9068354 -1.2635379 -0.3884103
4  1.4877548 -2.3801003 -1.0285516 -0.4433571 -0.9534238  1.4033954
5  1.5941811 -1.1293828  1.3376018  1.1229029  1.0523098  1.3629215
6  0.3723751 -1.6727550 -1.1478162 -1.7363323  2.0140197 -0.6120398
           g          h          i          j
1  0.9775267 -0.8609284  1.1481851 -0.8444851
2  0.3858858 -0.9267438 -0.7355040 -0.6310779
3 -0.1272062 -1.3729532  1.9566503  1.0197956
4  1.0877965  0.7028592 -1.4809024 -1.8808421
5 -0.2563864 -0.1795008  0.3667372 -1.1918655
6  1.4115147 -0.9282505  0.0256846 -0.6860796
microbenchmark(as.data.frame(matrix), matrix_df(matrix), times = 10000)
Unit: microseconds
                  expr    min      lq     mean  median      uq      max neval
 as.data.frame(matrix) 30.494 32.5955 38.57970 33.5535 35.5940 12755.62 10000
     matrix_df(matrix)  6.353  7.1900 10.21914  7.6450  8.2785 11542.73 10000
identical(as.data.frame(matrix), matrix_df(matrix))
[1] TRUE

As can be seen, the outputs are identical, but ichimoku::matrix_df(), which is designed to be a performant ‘data.frame’ constructor, is over 4x as fast by median time (7.6 versus 33.6 microseconds).
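As a quick check on the comparison, the speed-up can be computed directly from the median column of the benchmark output. This is only arithmetic on the figures reported above; the variable names are illustrative.

```r
# Median timings in microseconds, copied from the microbenchmark output above
median_us <- c(as_data_frame = 33.5535, matrix_df = 7.6450)

# Ratio of medians: how many times faster the performant constructor is
speedup <- median_us[["as_data_frame"]] / median_us[["matrix_df"]]
round(speedup, 2)
#> [1] 4.39
```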

xts conversion benchmarking

The ‘xts’ format is a popular choice for large time series data as each observation is indexed by a unique valid timestamp.

As an example, we use the ichimoku() function from the ‘ichimoku’ package, which creates ichimoku objects inheriting the ‘xts’ class. We run ichimoku() on the sample data contained within the package to create an ‘xts’ object ‘cloud’ [2].

This will be converted into a dataframe using as.data.frame() and ichimoku::xts_df().

library(ichimoku)
library(microbenchmark)

cloud <- ichimoku(sample_ohlc_data)
class(cloud) <- c("xts", "zoo")

xts::is.xts(cloud)
[1] TRUE
dim(cloud)
[1] 281  12
print(cloud[1:6], plot = FALSE)
            open  high   low close cd tenkan kijun senkouA senkouB
2020-01-02 123.0 123.1 122.5 122.7 -1     NA    NA      NA      NA
2020-01-03 122.7 122.8 122.6 122.8  1     NA    NA      NA      NA
2020-01-06 122.8 123.4 122.4 123.3  1     NA    NA      NA      NA
2020-01-07 123.3 124.3 123.3 124.1  1     NA    NA      NA      NA
2020-01-08 124.1 124.8 124.0 124.8  1     NA    NA      NA      NA
2020-01-09 124.8 125.4 124.5 125.3  1     NA    NA      NA      NA
           chikou cloudT cloudB
2020-01-02  122.8     NA     NA
2020-01-03  122.9     NA     NA
2020-01-06  123.0     NA     NA
2020-01-07  123.9     NA     NA
2020-01-08  123.6     NA     NA
2020-01-09  122.5     NA     NA
microbenchmark(as.data.frame(cloud), xts_df(cloud), times = 10000)
Unit: microseconds
                 expr     min      lq      mean   median       uq      max neval
 as.data.frame(cloud) 234.909 243.014 270.20903 248.7585 267.1130  7683.47 10000
        xts_df(cloud)  24.201  27.103  40.36881  28.7090  30.5855 59508.19 10000

It can be seen that ichimoku::xts_df(), which is designed to be a performant ‘data.frame’ constructor, is over 8x as fast.

df1 <- as.data.frame(cloud)

is.data.frame(df1)
[1] TRUE
str(df1)
'data.frame':   281 obs. of  12 variables:
 $ open   : num  123 123 123 123 124 ...
 $ high   : num  123 123 123 124 125 ...
 $ low    : num  122 123 122 123 124 ...
 $ close  : num  123 123 123 124 125 ...
 $ cd     : num  -1 1 1 1 1 1 -1 0 -1 -1 ...
 $ tenkan : num  NA NA NA NA NA ...
 $ kijun  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouA: num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouB: num  NA NA NA NA NA NA NA NA NA NA ...
 $ chikou : num  123 123 123 124 124 ...
 $ cloudT : num  NA NA NA NA NA NA NA NA NA NA ...
 $ cloudB : num  NA NA NA NA NA NA NA NA NA NA ...
df2 <- xts_df(cloud)

is.data.frame(df2)
[1] TRUE
str(df2)
'data.frame':   281 obs. of  13 variables:
 $ index  : POSIXct, format: "2020-01-02" ...
 $ open   : num  123 123 123 123 124 ...
 $ high   : num  123 123 123 124 125 ...
 $ low    : num  122 123 122 123 124 ...
 $ close  : num  123 123 123 124 125 ...
 $ cd     : num  -1 1 1 1 1 1 -1 0 -1 -1 ...
 $ tenkan : num  NA NA NA NA NA ...
 $ kijun  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouA: num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouB: num  NA NA NA NA NA NA NA NA NA NA ...
 $ chikou : num  123 123 123 124 124 ...
 $ cloudT : num  NA NA NA NA NA NA NA NA NA NA ...
 $ cloudB : num  NA NA NA NA NA NA NA NA NA NA ...

The outputs are slightly different as xts_df() preserves the date-time index of ‘xts’ objects as a new first column ‘index’ which is POSIXct in format. The default as.data.frame() constructor converts the index into the row names, which is not desirable as the dates are coerced to type ‘character’.
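The difference between the two approaches can be illustrated in base R without any ‘xts’ object. The sketch below is not how as.data.frame() or xts_df() are implemented; it simply assigns a small, made-up POSIXct index (the first three dates and closing prices from the sample above) once as row names and once as a column, to show what happens to its class.

```r
# Illustrative index and values (taken from the first rows printed above)
idx <- as.POSIXct(c("2020-01-02", "2020-01-03", "2020-01-06"), tz = "UTC")
close <- c(122.7, 122.8, 123.3)

# Dates stored as row names: they are coerced to character on assignment
df_rownames <- data.frame(close = close)
row.names(df_rownames) <- idx
class(attr(df_rownames, "row.names"))
#> [1] "character"

# Dates stored as a column: the POSIXct class is preserved
df_index <- data.frame(index = idx, close = close)
class(df_index$index)
#> [1] "POSIXct" "POSIXt"

# Date-time arithmetic therefore works directly on the column
diff(df_index$index)
```

With the index kept as a column, it can be passed straight to plotting or date-handling functions without first being parsed back from text.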

So it can be seen that in this case, not only is the performant constructor faster, it is also more fit for purpose.

When to use performant constructors

  1. When plotting data that is not already a ‘data.frame’ object using ‘ggplot2’. For example, if you have time series data in the ‘xts’ format, calling a ‘ggplot2’ plot method converts the data into a dataframe behind the scenes, as ggplot() only works with dataframes internally. Fortunately it does not use as.data.frame() but its own conversion function ggplot2::fortify(). Benchmarked below, fortify() is faster than as.data.frame() (about 1.6x by median time), but the performant constructor ichimoku::xts_df() is still around 5x as fast as fortify().
microbenchmark(as.data.frame(cloud), ggplot2::fortify(cloud), xts_df(cloud), times = 10000)
Unit: microseconds
                    expr     min       lq      mean   median       uq      max neval
    as.data.frame(cloud) 237.257 252.8185 297.82993 263.3445 290.8465 6863.294 10000
 ggplot2::fortify(cloud) 130.149 149.6695 176.68291 160.8120 178.8900 6107.186 10000
           xts_df(cloud)  25.166  29.8525  38.59277  31.8160  34.9770 9592.124 10000

  2. In a context where performance is critical. This is usually in interactive environments such as a Shiny app, perhaps with real-time data, where slow code can reduce responsiveness or cause bottlenecks in execution.

  3. Within packages. It is usually safe to use performant constructors within functions or for internal unexported functions. If programming best practices are followed, the input and output types of functions are kept consistent, so the input to the constructor can be controlled and its output is predictable. Appropriate unit tests can also catch any issues early.

When to question the use of performant constructors

  1. For user-facing functions. Having no validation or error-checking code means that a performant constructor may behave unpredictably on data that is not intended to be an input. Within a function, there is a specific or at most finite range of objects that a constructor can receive. When that limit is removed, if the input is not the intended input for a constructor then an error can be expected. As long as this is made clear to the user and there are adequate instructions on proper usage, in an environment where the occasional error message is acceptable, then the performant constructor can still be used.

  2. When the constructor needs to handle a range of input types. as.data.frame() is actually an S3 generic with a variety of methods for different object classes. If required to handle a variety of different types of input, it may be easier (if not more performant) to rely on as.data.frame() rather than write code which handles different scenarios.
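One way to combine the two approaches is sketched below: an S3 generic with a fast method for the one input type that is under your control, and a default method that falls back to as.data.frame() for everything else. The generic fast_df() and its methods are hypothetical names invented for this example; they are not part of base R or ‘ichimoku’. The matrix method is a simplified assumption that discards any matrix row names.

```r
# Hypothetical generic, for illustration only
fast_df <- function(x, ...) UseMethod("fast_df")

# Fast path for matrices: build the list and set the attributes directly.
# Simplification: matrix row names are dropped in favour of automatic ones.
fast_df.matrix <- function(x, ...) {
  nr <- dim(x)[1L]
  nc <- dim(x)[2L]
  cols <- vector(mode = "list", length = nc)
  for (i in seq_len(nc)) cols[[i]] <- as.vector(x[, i])
  nm <- dimnames(x)[[2L]]
  if (is.null(nm)) nm <- paste0("V", seq_len(nc))
  attributes(cols) <- list(names = nm,
                           class = "data.frame",
                           row.names = .set_row_names(nr))
  cols
}

# Fallback: any other input goes through the fully validated base R route
fast_df.default <- function(x, ...) as.data.frame(x, ...)

m <- matrix(1:6, ncol = 2, dimnames = list(NULL, c("a", "b")))
res_fast <- fast_df(m)                           # uses fast_df.matrix
res_fallback <- fast_df(list(a = 1:3, b = 4:6))  # uses as.data.frame()
```

The trade-off is the one described above: the fast method performs no validation, so it is only appropriate where the input is known to be a well-formed matrix.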

What is a performant constructor

First of all, it is possible to directly use the functions matrix_df() and xts_df() which are exported from the ‘ichimoku’ package.

What lies behind those functions? Some variation on the below:

# The structure underlying a data frame is simply a list
# (vec1, vec2 and vec3 are example vectors of equal length)
vec1 <- 1:3
vec2 <- c("a", "b", "c")
vec3 <- c(TRUE, FALSE, TRUE)
df <- list(vec1, vec2, vec3)

# Set the following attributes to turn the list into a data frame:
attributes(df) <- list(names = c("vec1", "vec2", "vec3"),
                       class = "data.frame",
                       row.names = .set_row_names(length(vec1)))
  1. A data.frame is simply a list (where each element must be the same length).
  2. It has an attribute ‘class’ which equals ‘data.frame’.
  3. It must have row names, which can be set by the base R internal function .set_row_names() which takes a single argument, the number of rows.

Note:

  1. The vectors in the list (vec1, vec2, vec3, etc.) must be the same length, otherwise a corrupt data.frame warning will be generated.
  2. If row names are missing then the data will still be present but dim() will show a 0-row dataframe and its print method will not work.
  3. .set_row_names() sets row names efficiently using a compact internal notation used by R. They can also be assigned an integer sequence, or a series of dates for example. However if not an integer vector, they are first coerced to type ‘character’.
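The notes above can be checked with a short base R sketch. The vectors are made-up example values; the point is only to show the compact row-name notation and what happens when the ‘row.names’ attribute is left out.

```r
vec1 <- 1:3
vec2 <- c("a", "b", "c")

# A complete data frame built directly from a list
df <- list(vec1, vec2)
attributes(df) <- list(names = c("vec1", "vec2"),
                       class = "data.frame",
                       row.names = .set_row_names(length(vec1)))
dim(df)
#> [1] 3 2

# The compact notation: NA followed by the negated number of rows
.set_row_names(3L)
.row_names_info(df)    # negative value signals automatic row names
attr(df, "row.names")  # expanded to an integer sequence when accessed

# The same list without row names: the data is present, but dim() reports 0 rows
no_rn <- list(vec1, vec2)
attributes(no_rn) <- list(names = c("vec1", "vec2"), class = "data.frame")
dim(no_rn)
#> [1] 0 2
length(unclass(no_rn)$vec1)
#> [1] 3
```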

In conclusion, dataframes are not complicated structures: internally they are lists with a couple of enforced constraints.

class(df1)
[1] "data.frame"
typeof(df1)
[1] "list"

Further information

Documentation for the performant constructors discussed: https://shikokuchuo.net/ichimoku/articles/utilities.html#performant-dataframe-constructors.


  1. Gao, C. (2021), ichimoku: Visualization and Tools for Ichimoku Kinko Hyo Strategies. R package version 1.2.2, https://CRAN.R-project.org/package=ichimoku.

  2. We then remove the ‘ichimoku’ class from the object, as ‘ichimoku’ has had its own efficient ‘as.data.frame’ S3 method since version 1.2.4.

Citation

For attribution, please cite this work as

shikokuchuo (2021, July 23). shikokuchuo{net}: Efficient R: Performant data.frame constructors. Retrieved from https://shikokuchuo.net/posts/11-dataframes/

BibTeX citation

@misc{shikokuchuo2021efficient,
  author = {shikokuchuo, },
  title = {shikokuchuo{net}: Efficient R: Performant data.frame constructors},
  url = {https://shikokuchuo.net/posts/11-dataframes/},
  year = {2021}
}