Efficient R: Combinations using expand.grid

R

A faster way to generate combinations for mapply and vapply

shikokuchuo
2021-06-17

                                                            sha256
1 96e41b3b0fa827b7c9c1f4a7765667064026f9448a78327415264112f7f54dbe

Introduction

It seems that there is no base R function to generate exhaustive combinations of two identical vectors, sometimes desired as function inputs to mapply/vapply(). The combn() function from the ‘utils’ package is required.

utils::combn() outputs the unique set of combinations, so for the example below where the first 8 letters of the alphabet are used, the combination {a, b} appears, but {b, a} does not. Similarly the cases where the two elements are identical such as {a, a} also do not feature. It can be seen that there are 28 (or 8 choose 2) unique combinations for the vector of length 8.

x <- letters[1:8]
xlen <- length(x)

combn <- utils::combn(x, 2)
combn
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "a"  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"  "b"   "b"   "b"  
[2,] "b"  "c"  "d"  "e"  "f"  "g"  "h"  "c"  "d"  "e"   "f"   "g"  
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] "b"   "c"   "c"   "c"   "c"   "c"   "d"   "d"   "d"   "d"  
[2,] "h"   "d"   "e"   "f"   "g"   "h"   "e"   "f"   "g"   "h"  
     [,23] [,24] [,25] [,26] [,27] [,28]
[1,] "e"   "e"   "e"   "f"   "f"   "g"  
[2,] "f"   "g"   "h"   "g"   "h"   "h"  

expand.grid

expand.grid() from the base package is a useful function in its own right, most well-known perhaps for its use in generating hyperparameter tuning grids in machine learning models.

expand.grid() produces a data frame in columns rather than a matrix in rows like utils::combn(). Hence just for demonstration purposes to compare like-for-like, a bit of manipulation is done below to make the output exactly the same. In real world usage the output of expand.grid() can be used ‘as is’ without the additional manipulation.

grid <- expand.grid(x, x, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
grid <- t(as.matrix(grid))
grid <- rbind(grid[2,], grid[1,])
rownames(grid) <- NULL
grid
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"   "b"   "b"  
[2,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "a"  "b"   "c"   "d"  
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] "b"   "b"   "b"   "b"   "c"   "c"   "c"   "c"   "c"   "c"  
[2,] "e"   "f"   "g"   "h"   "a"   "b"   "c"   "d"   "e"   "f"  
     [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
[1,] "c"   "c"   "d"   "d"   "d"   "d"   "d"   "d"   "d"   "d"  
[2,] "g"   "h"   "a"   "b"   "c"   "d"   "e"   "f"   "g"   "h"  
     [,33] [,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42]
[1,] "e"   "e"   "e"   "e"   "e"   "e"   "e"   "e"   "f"   "f"  
[2,] "a"   "b"   "c"   "d"   "e"   "f"   "g"   "h"   "a"   "b"  
     [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50] [,51] [,52]
[1,] "f"   "f"   "f"   "f"   "f"   "f"   "g"   "g"   "g"   "g"  
[2,] "c"   "d"   "e"   "f"   "g"   "h"   "a"   "b"   "c"   "d"  
     [,53] [,54] [,55] [,56] [,57] [,58] [,59] [,60] [,61] [,62]
[1,] "g"   "g"   "g"   "g"   "h"   "h"   "h"   "h"   "h"   "h"  
[2,] "e"   "f"   "g"   "h"   "a"   "b"   "c"   "d"   "e"   "f"  
     [,63] [,64]
[1,] "h"   "h"  
[2,] "g"   "h"  

It can be seen that the output of expand.grid() is simply all combinations, of which there are 8^2 = 64 in total.

ichimoku::grid_dup

So how to get from the output of expand.grid() to that of utils::combn()? Well, with the help of a simple algorithm, which has been coded into the grid_dup() function from the ‘ichimoku’ package.1

From the function documentation: ‘create a vector of element positions of duplicates in the output of expand.grid on 2 identical vectors’.

Using the function as per the below, ‘grid1’ contains all unique combinations and also those where the two elements are identical. This is sometimes the desired output if two of the same elements is still considered a unique combination, and simply that the order of appearance does not matter.

library(ichimoku)

grid1 <- grid[, -grid_dup(xlen)]
grid1
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"   "b"   "b"  
[2,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "b"  "c"   "d"   "e"  
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] "b"   "b"   "b"   "c"   "c"   "c"   "c"   "c"   "c"   "d"  
[2,] "f"   "g"   "h"   "c"   "d"   "e"   "f"   "g"   "h"   "d"  
     [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
[1,] "d"   "d"   "d"   "d"   "e"   "e"   "e"   "e"   "f"   "f"  
[2,] "e"   "f"   "g"   "h"   "e"   "f"   "g"   "h"   "f"   "g"  
     [,33] [,34] [,35] [,36]
[1,] "f"   "g"   "g"   "h"  
[2,] "h"   "g"   "h"   "h"  

If the ‘omit.id = TRUE’ argument is set for grid_dup(), identical elements are also removed. ‘grid2’ should then be the same as ‘combn’ obtained above.

Indeed it can be seen that both identical() and all.equal() below return TRUE.

grid2 <- grid[, -grid_dup(xlen, omit.id = TRUE)]
grid2
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "a"  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"  "b"   "b"   "b"  
[2,] "b"  "c"  "d"  "e"  "f"  "g"  "h"  "c"  "d"  "e"   "f"   "g"  
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] "b"   "c"   "c"   "c"   "c"   "c"   "d"   "d"   "d"   "d"  
[2,] "h"   "d"   "e"   "f"   "g"   "h"   "e"   "f"   "g"   "h"  
     [,23] [,24] [,25] [,26] [,27] [,28]
[1,] "e"   "e"   "e"   "f"   "f"   "g"  
[2,] "f"   "g"   "h"   "g"   "h"   "h"  
identical(combn, grid2) && all.equal(combn, grid2)
[1] TRUE

Benchmarking the results

‘Microbenchmark’ can be used to benchmark the performance, where it is usual practice to compare median values.

For small vector lengths, expand.grid() is not as performant. This is somewhat to be expected given the overhead of working with data frames rather than matrices. However the absolute times are also small so any difference would not matter as much.

When the vector length reaches 16, the custom algorithm using expand.grid()/grid_dup() starts to outperform.

By the time the vector length reaches 1,000, this implies total unique combinations of 499,500 and the custom algorithm is already c. 7x faster.

It should be noted that the custom algorithm is tailored for the special case of combn(x, m) where m = 2 and that is most likely why there can be such an outperformance.

fn_combn <- function(x) {
  utils::combn(x, 2)
}

fn_grid <- function(x) {
  expand.grid(x, x, KEEP.OUT.ATTRS = FALSE,
              stringsAsFactors = FALSE)[-grid_dup(length(x), omit.id = TRUE), ]
}

microbenchmark::microbenchmark(fn_combn(1:16), fn_grid(1:16))
Unit: microseconds
           expr    min      lq     mean  median      uq      max
 fn_combn(1:16) 68.291 71.6470 89.04774 73.6770 81.3600 1104.949
  fn_grid(1:16) 55.779 59.8265 82.22014 61.9655 67.0235 1630.001
 neval
   100
   100
microbenchmark::microbenchmark(fn_combn(1:1000), fn_grid(1:1000))
Unit: milliseconds
             expr       min        lq      mean    median        uq
 fn_combn(1:1000) 263.47792 270.79163 275.92440 272.65941 275.49595
  fn_grid(1:1000)  34.04806  37.34776  42.65135  38.44289  40.73221
       max neval
 329.07174   100
  99.61415   100

Use case: mapply() and vapply()

This type of output is suitable for feeding into functions such as mapply() or vapply().

A standard use for mapply is when multiple arguments have to be mapped into a function. Here ‘simplify = FALSE’ is set to have mapply return a list, and fed into do.call() with c() to create a vector. This is a safer and more performant method to create a vector than relying on the built-in simplification.

do.call(c, mapply(function(x, y) paste0(x, "&", y), 
                  grid2[1, ], grid2[2, ],
                  SIMPLIFY = FALSE, USE.NAMES = FALSE))
 [1] "a&b" "a&c" "a&d" "a&e" "a&f" "a&g" "a&h" "b&c" "b&d" "b&e" "b&f"
[12] "b&g" "b&h" "c&d" "c&e" "c&f" "c&g" "c&h" "d&e" "d&f" "d&g" "d&h"
[23] "e&f" "e&g" "e&h" "f&g" "f&h" "g&h"

An equivalent example using vapply() is given below. vapply() is also a safe choice for programming as an output template is explicitly specified, here ‘character(1L)’, hence the returned values are all expected to be of type ‘character’ of length ‘1’ otherwise an error is thrown.

vapply(seq_along(grid2[1L, ]),
       function(i) paste0(grid2[1L, i], "&", grid2[2L, i]),
       character(1L), USE.NAMES = FALSE)
 [1] "a&b" "a&c" "a&d" "a&e" "a&f" "a&g" "a&h" "b&c" "b&d" "b&e" "b&f"
[12] "b&g" "b&h" "c&d" "c&e" "c&f" "c&g" "c&h" "d&e" "d&f" "d&g" "d&h"
[23] "e&f" "e&g" "e&h" "f&g" "f&h" "g&h"

Of the two however, mapply() is marginally faster and should normally be used when iteration is required over multiple arguments.


  1. Gao, C. (2021), ichimoku: Visualization and Tools for Ichimoku Kinko Hyo Strategies. R package version 1.1.6, https://CRAN.R-project.org/package=ichimoku.↩︎

Citation

For attribution, please cite this work as

shikokuchuo (2021, June 17). shikokuchuo{net}: Efficient R: Combinations using expand.grid. Retrieved from https://shikokuchuo.net/posts/10-combinations/

BibTeX citation

@misc{shikokuchuo2021efficient,
  author = {shikokuchuo, },
  title = {shikokuchuo{net}: Efficient R: Combinations using expand.grid},
  url = {https://shikokuchuo.net/posts/10-combinations/},
  year = {2021}
}