Efficient R: Performant data.frame constructors

R

How and when to use an alternative to as.data.frame

shikokuchuo
2021-07-23


About as.data.frame

data.frame() and as.data.frame() are such ubiquitous functions that we rarely think twice about using them to create dataframes or to convert other objects to dataframes.

However, they are slow. Extremely slow.

It is somewhat surprising that these constructors are so inefficient, considering how much they are used and given that the ‘data.frame’ object is the de facto standard for tabular data in R.

However, this is the direct result of a large amount of error-checking and validation code, which is understandable for functions so widely used. They cannot know what is going to be thrown at them, and so need to try to do their best or fail gracefully.
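
To give a feel for the kind of checking involved, the following is a small base R illustration with made-up inputs. It is not an exhaustive list of what data.frame() does, only a few behaviours that are easy to observe from the outside.

```r
# Columns whose lengths cannot be recycled to a common number of rows are rejected
msg <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
msg

# Column names are checked and, by default, made syntactically valid and unique
checked <- data.frame(`my col` = 1, `my col` = 2)
names(checked)
# [1] "my.col"   "my.col.1"

# Shorter columns are recycled where the lengths are compatible
recycled <- data.frame(a = 1:4, b = 1:2)
recycled$b
# [1] 1 2 1 2
```

Each of these conveniences costs time on every call, whether or not the input needs it.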

Below, we demonstrate the inefficiencies of as.data.frame() versus efficient ‘data.frame’ constructors from the ‘ichimoku’ package [1], which are coded for performance.

For benchmarking, the ‘microbenchmark’ package will be used. It is usual to compare the median times over a large number of runs, and we will use 10,000 runs in the cases below.
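
As a minimal sketch of how the medians can be compared programmatically, the summary() method of a microbenchmark result returns a data frame with a ‘median’ column. The two expressions timed here and the labels ‘base’ and ‘alt’ are arbitrary stand-ins chosen for illustration, and a smaller number of runs is used to keep it quick.

```r
library(microbenchmark)

m <- matrix(rnorm(1000), ncol = 10)

# Time two expressions; the labels 'base' and 'alt' are illustrative only
mb <- microbenchmark(
  base = as.data.frame(m),
  alt = data.frame(m),
  times = 1000
)

# summary() reports all rows in the same unit, so the ratio is unit-free
s <- summary(mb)
medians <- setNames(s$median, as.character(s$expr))
ratio <- medians[["base"]] / medians[["alt"]]
ratio
```

The median is preferred to the mean because occasional very slow runs (see the ‘max’ column in the results below) would otherwise distort the comparison.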

Matrix conversion benchmarking

A 100x10 matrix of random data drawn from the normal distribution is created as the object ‘matrix’.

This will be converted into a dataframe using as.data.frame() and ichimoku::matrix_df().

library(ichimoku)
library(microbenchmark)

matrix <- matrix(rnorm(1000), ncol = 10, dimnames = list(1:100, letters[1:10]))

dim(matrix)
[1] 100  10
head(matrix)
           a          b          c           d           e
1 -0.8961569 -1.9361078 -0.3575423 -0.47233456 -1.27145483
2 -0.3640157 -0.7236045  1.4374122  0.87418507  0.97002024
3  0.3025398  0.7944878 -0.9100365  0.02939649  0.36748167
4 -0.3441126 -0.4488198 -2.2665552  1.75116663  0.51858041
5  0.6111886  0.5303838  0.8224153 -1.69407014  0.31327151
6  1.0011440 -0.8571095 -1.1841243  1.57866701  0.06359564
            f          g           h            i           j
1 -0.08769499 -0.3698239 -0.61193931  0.002457263 -0.65207409
2 -0.48725754  0.4600647  0.43107715 -1.032294929  0.52566352
3  1.00486928  0.4786253 -0.07370084 -1.451494851  0.07309418
4 -0.16829176 -0.4877551 -2.75762930 -0.477793273  0.10695395
5  0.68183273 -0.9793648  0.16314497 -1.946877877 -0.44795292
6 -0.67145932 -0.4418492 -0.85477727  1.382608578 -1.74825575
microbenchmark(as.data.frame(matrix), matrix_df(matrix), times = 10000)
Unit: microseconds
                  expr    min      lq     mean  median      uq      max neval
 as.data.frame(matrix) 30.189 32.3255 38.26354 33.2550 34.6285 8989.016 10000
     matrix_df(matrix) 11.717 12.9260 16.87939 13.4805 14.3245 9248.635 10000
identical(as.data.frame(matrix), matrix_df(matrix)) &&
  isTRUE(all.equal(as.data.frame(matrix), matrix_df(matrix)))
[1] TRUE

As can be seen, the outputs are identical, but ichimoku::matrix_df(), which is designed to be a performant ‘data.frame’ constructor, is around 2.5x as fast.

xts conversion benchmarking

The ‘xts’ format is a popular choice for large time series data as each observation is indexed by a valid timestamp.

As an example, we use the ichimoku() function from the ‘ichimoku’ package which creates ichimoku objects inheriting the ‘xts’ class. We run ichimoku() on the sample data contained within the package to create an ‘xts’ object ‘cloud’.

This will be converted into a dataframe using as.data.frame() and ichimoku::xts_df().

library(ichimoku)
library(microbenchmark)

cloud <- ichimoku(sample_ohlc_data)

xts::is.xts(cloud)
[1] TRUE
dim(cloud)
[1] 281  12
print(cloud[1:6], plot = FALSE)
            open  high   low close cd tenkan kijun senkouA senkouB
2020-01-02 123.0 123.1 122.5 122.7 -1     NA    NA      NA      NA
2020-01-03 122.7 122.8 122.6 122.8  1     NA    NA      NA      NA
2020-01-06 122.8 123.4 122.4 123.3  1     NA    NA      NA      NA
2020-01-07 123.3 124.3 123.3 124.1  1     NA    NA      NA      NA
2020-01-08 124.1 124.8 124.0 124.8  1     NA    NA      NA      NA
2020-01-09 124.8 125.4 124.5 125.3  1     NA    NA      NA      NA
           chikou cloudT cloudB
2020-01-02  122.8     NA     NA
2020-01-03  122.9     NA     NA
2020-01-06  123.0     NA     NA
2020-01-07  123.9     NA     NA
2020-01-08  123.6     NA     NA
2020-01-09  122.5     NA     NA
microbenchmark(as.data.frame(cloud), xts_df(cloud), times = 10000)
Unit: microseconds
                 expr     min       lq      mean  median       uq       max neval
 as.data.frame(cloud) 230.196 236.9875 264.12672 240.507 247.5335 54999.799 10000
        xts_df(cloud)  29.491  32.6070  39.02587  34.270  36.3655  6943.992 10000

It can be seen that ichimoku::xts_df(), which is designed to be a performant ‘data.frame’ constructor, is over 7x as fast.

df1 <- as.data.frame(cloud)

is.data.frame(df1)
[1] TRUE
str(df1)
'data.frame':   281 obs. of  12 variables:
 $ open   : num  123 123 123 123 124 ...
 $ high   : num  123 123 123 124 125 ...
 $ low    : num  122 123 122 123 124 ...
 $ close  : num  123 123 123 124 125 ...
 $ cd     : num  -1 1 1 1 1 1 -1 0 -1 -1 ...
 $ tenkan : num  NA NA NA NA NA ...
 $ kijun  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouA: num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouB: num  NA NA NA NA NA NA NA NA NA NA ...
 $ chikou : num  123 123 123 124 124 ...
 $ cloudT : num  NA NA NA NA NA NA NA NA NA NA ...
 $ cloudB : num  NA NA NA NA NA NA NA NA NA NA ...
df2 <- xts_df(cloud)

is.data.frame(df2)
[1] TRUE
str(df2)
'data.frame':   281 obs. of  13 variables:
 $ index  : POSIXct, format: "2020-01-02" ...
 $ open   : num  123 123 123 123 124 ...
 $ high   : num  123 123 123 124 125 ...
 $ low    : num  122 123 122 123 124 ...
 $ close  : num  123 123 123 124 125 ...
 $ cd     : num  -1 1 1 1 1 1 -1 0 -1 -1 ...
 $ tenkan : num  NA NA NA NA NA ...
 $ kijun  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouA: num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouB: num  NA NA NA NA NA NA NA NA NA NA ...
 $ chikou : num  123 123 123 124 124 ...
 $ cloudT : num  NA NA NA NA NA NA NA NA NA NA ...
 $ cloudB : num  NA NA NA NA NA NA NA NA NA NA ...

The outputs are slightly different as xts_df() preserves the date-time index of ‘xts’ objects as a new first column ‘index’, which is of class POSIXct. The default as.data.frame() constructor converts the index into the row names, which is not desirable as the dates are coerced to type ‘character’.
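
The difference can be reproduced in base R without ‘xts’. The sketch below uses three hand-typed observations (the first closing prices shown above) and stores the dates once as row names and once as a column.

```r
dates <- as.POSIXct(c("2020-01-02", "2020-01-03", "2020-01-06"), tz = "UTC")
close <- c(122.7, 122.8, 123.3)

# Dates stored as row names are coerced to character
df_rownames <- data.frame(close = close)
row.names(df_rownames) <- dates
class(attr(df_rownames, "row.names"))
# [1] "character"

# Dates stored as a column keep their class
df_column <- data.frame(index = dates, close = close)
class(df_column$index)
# [1] "POSIXct" "POSIXt"

# Date-time arithmetic therefore still works on the column
diff(df_column$index)
```

With character row names, the dates would have to be parsed again before any time-based operation.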

So it can be seen that in this case, not only is the performant constructor faster, it is also more fit for purpose.

When to use performant constructors

  1. When plotting data that is not already a ‘data.frame’ object using ‘ggplot2’. For example, if you have time series data in the ‘xts’ format, calling a ‘ggplot2’ plot method automatically converts the data into a dataframe behind the scenes, as ggplot() only works with dataframes internally. It does not use as.data.frame() but its own constructor, ggplot2::fortify(). Benchmarked below, fortify() is around 1.6x as fast as as.data.frame(), but the performant constructor ichimoku::xts_df() is still over 4x as fast as fortify().

microbenchmark(as.data.frame(cloud), ggplot2::fortify(cloud), xts_df(cloud), times = 10000)
Unit: microseconds
                    expr     min       lq      mean   median       uq      max neval
    as.data.frame(cloud) 231.736 248.6160 272.84370 254.9245 262.7640 6119.136 10000
 ggplot2::fortify(cloud) 133.140 149.8605 172.57276 157.6810 167.0625 8398.905 10000
           xts_df(cloud)  29.731  34.6615  39.60722  36.6090  38.5510 5104.159 10000

  2. In a context where performance is critical. This is usually in interactive environments such as a Shiny app, perhaps with real time data where slow code can reduce responsiveness or cause bottlenecks in execution.

  3. Within packages. It is usually safe to use performant constructors within functions or for internal unexported functions. If following programming best practices the input and output types for functions are kept consistent, and so the input to the constructor can be controlled and hence its function predictable. Setting appropriate unit tests can also catch any issues early.
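
For the plotting case, one way to act on this is to convert explicitly with the performant constructor and hand the resulting dataframe to ggplot(). This is a sketch only: the choice of a simple line plot of the ‘close’ column is ours for illustration, and it assumes, as we understand to be the case, that ggplot() has no further conversion to do when it already receives a dataframe.

```r
library(ichimoku)
library(ggplot2)

cloud <- ichimoku(sample_ohlc_data)

# Convert once, up front, with the performant constructor
cloud_df <- xts_df(cloud)

# 'index' and 'close' are columns of the converted dataframe
p <- ggplot(cloud_df, aes(x = index, y = close)) +
  geom_line()
```

In an interactive setting where the plot is redrawn often, the converted dataframe can also be kept and reused rather than rebuilt on every draw.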

When to question the use of performant constructors

  1. For user-facing functions. With no validation or error-checking code, a performant constructor may behave unpredictably on input it was not designed for. Within a function, there is a specific or at most finite range of objects that a constructor can receive. Once the constructor is exposed to users that limit is removed, and an error can be expected whenever the input is not the intended one. If this is made clear to the user, with adequate instructions on proper usage, and the environment is one where the occasional error message is acceptable, then the performant constructor can still be used.

  2. When the constructor needs to handle a range of input types. as.data.frame() is actually an S3 generic with a variety of methods for different object classes. If required to handle a variety of different types of input, it may be easier (although not more performant) to rely on as.data.frame() rather than write code which handles each scenario.
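
A middle ground is possible, sketched below under our own assumptions: a small S3 generic (the name to_df() is hypothetical) with a fast method for the one input type that is known and controlled, falling back to as.data.frame() for everything else.

```r
# Hypothetical generic: fast path for known input, as.data.frame() otherwise
to_df <- function(x, ...) UseMethod("to_df")

# Fast path: assumes a non-empty named list of equal-length vectors
# (nothing here checks that assumption)
to_df.list <- function(x, ...) {
  attributes(x) <- list(names = names(x),
                        class = "data.frame",
                        row.names = .set_row_names(length(x[[1L]])))
  x
}

# Any other class is delegated to the validated base R methods
to_df.default <- function(x, ...) as.data.frame(x, ...)

from_list <- to_df(list(a = 1:3, b = c(0.1, 0.2, 0.3)))
from_matrix <- to_df(matrix(1:6, ncol = 2))
```

The trade-off is the one described above: the fast method is only as safe as the guarantee that it receives the input it was written for.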

What is a performant constructor

First of all, it is possible to directly use the functions matrix_df() and xts_df() which are exported from the ‘ichimoku’ package. Given the nature of the R ecosystem, this is indeed encouraged.

However, having seen the advantages of using a performant constructor above, we can now turn to the ‘what’ for the curious.

What lies behind those functions? Some variation on the below:

# Example vectors of equal length (illustrative values)
vec1 <- 1:3
vec2 <- c(1.5, 2.5, 3.5)
vec3 <- c("a", "b", "c")

# The structure underlying a data frame is simply a list
df <- list(vec1, vec2, vec3)

# Set the following attributes to turn the list into a data frame:
attributes(df) <- list(names = c("vec1", "vec2", "vec3"),
                       class = "data.frame",
                       row.names = .set_row_names(length(vec1)))

  1. A data.frame is simply a list (where each element must be the same length).
  2. It has an attribute ‘class’ which equals ‘data.frame’.
  3. It must have row names, which can be set by the base R internal function .set_row_names() which takes a single argument, the number of rows.
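
Applied to a matrix, the three steps above might look like the following. This is our own minimal sketch and not the ‘ichimoku’ source; the function name fast_matrix_df() and the fallback column names are hypothetical choices.

```r
# Hypothetical matrix constructor built from the three steps above
fast_matrix_df <- function(x) {
  stopifnot(is.matrix(x))
  n <- nrow(x)
  p <- ncol(x)

  # Step 1: a list with one element per column (names dropped by as.vector)
  cols <- lapply(seq_len(p), function(j) as.vector(x[, j]))

  # Use the matrix dimnames where present, otherwise simple defaults
  cn <- colnames(x)
  if (is.null(cn)) cn <- paste0("V", seq_len(p))
  rn <- rownames(x)

  # Steps 2 and 3: set the class and the row names in one assignment
  attributes(cols) <- list(
    names = cn,
    class = "data.frame",
    row.names = if (is.null(rn)) .set_row_names(n) else rn
  )
  cols
}

m <- matrix(as.numeric(1:6), ncol = 2, dimnames = list(NULL, c("a", "b")))
m_df <- fast_matrix_df(m)
all.equal(m_df, as.data.frame(m))
# [1] TRUE
```

Beyond the is.matrix() check there is no validation at all, which is where the speed comes from.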

Note:

  1. The vectors in the list (vec1, vec2, vec3, etc.) must be the same length, otherwise a corrupt data.frame warning will be generated.
  2. If row names are missing then the data will still be present but dim() will show a 0-row dataframe and its print method will not work.
  3. .set_row_names() sets row names efficiently using a compact internal notation used by R. They can also be assigned an integer sequence, or a series of dates for example. However if not an integer vector, they are first coerced to type ‘character’.
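
Notes 2 and 3 can be checked directly. The following sketch, using made-up columns, builds the same list with and without row names and then inspects the compact notation.

```r
cols <- list(a = 1:3, b = c(2.5, 3.5, 4.5))

# With row names: a valid 3-row dataframe
ok <- cols
attributes(ok) <- list(names = names(cols),
                       class = "data.frame",
                       row.names = .set_row_names(3L))
dim(ok)
# [1] 3 2

# Without row names: the data is still there, but zero rows are reported
no_rn <- cols
attributes(no_rn) <- list(names = names(cols), class = "data.frame")
dim(no_rn)
# [1] 0 2
length(unclass(no_rn)$a)
# [1] 3

# The compact notation is a length-2 integer vector: NA and minus the row count
.set_row_names(3L)
# [1]  NA  -3

# It is expanded to the full sequence when the attribute is read back
attr(ok, "row.names")
# [1] 1 2 3
```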

In conclusion, dataframes are not complicated structures but essentially lists with a couple of enforced constraints. Indeed you can see that the underlying data type of a dataframe is just a list:

c(class(df1), class(df2))
[1] "data.frame" "data.frame"
c(typeof(df1), typeof(df2))
[1] "list" "list"

Further information

Documentation for the performant constructors discussed: https://shikokuchuo.net/ichimoku/articles/utilities.html#performant-dataframe-constructors.


  1. Gao, C. (2021), ichimoku: Visualization and Tools for Ichimoku Kinko Hyo Strategies. R package version 1.1.6, https://CRAN.R-project.org/package=ichimoku.

Citation

For attribution, please cite this work as

shikokuchuo (2021, July 23). shikokuchuo{net}: Efficient R: Performant data.frame constructors. Retrieved from https://shikokuchuo.net/posts/11-dataframes/

BibTeX citation

@misc{shikokuchuo2021efficient,
  author = {shikokuchuo, },
  title = {shikokuchuo{net}: Efficient R: Performant data.frame constructors},
  url = {https://shikokuchuo.net/posts/11-dataframes/},
  year = {2021}
}