Efficient R: Combinations using expand.grid

R

A faster way to generate combinations for mapply and vapply

shikokuchuo
2021-06-17

Introduction

Exhaustive combinations of two identical vectors are often desired as function inputs to mapply/vapply(). Usually, the combn() function from the ‘utils’ package serves this purpose.

utils::combn() outputs the unique set of combinations, so for the example below where the first 8 letters of the alphabet are used, the combination {a, b} appears, but {b, a} does not. Similarly the cases where the two elements are identical such as {a, a} also do not feature. It can be seen that there are 28 (or 8 choose 2) unique combinations for the vector of length 8.

x <- letters[1:8]
xlen <- length(x)

combn <- utils::combn(x, 2)
combn
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "a"  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"  "b"   "b"   "b"  
[2,] "b"  "c"  "d"  "e"  "f"  "g"  "h"  "c"  "d"  "e"   "f"   "g"  
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] "b"   "c"   "c"   "c"   "c"   "c"   "d"   "d"   "d"   "d"  
[2,] "h"   "d"   "e"   "f"   "g"   "h"   "e"   "f"   "g"   "h"  
     [,23] [,24] [,25] [,26] [,27] [,28]
[1,] "e"   "e"   "e"   "f"   "f"   "g"  
[2,] "f"   "g"   "h"   "g"   "h"   "h"  

expand.grid

expand.grid() from the base package is a useful function in its own right, most well-known perhaps for its use in generating hyperparameter tuning grids in machine learning models.

expand.grid() produces a data frame in columns rather than a matrix in rows like utils::combn(). Hence just for demonstration purposes to compare like-for-like, a bit of manipulation is done below to make the output exactly the same. In real world usage the output of expand.grid() can be used ‘as is’ without the additional manipulation.

grid <- expand.grid(x, x, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
grid <- t(as.matrix(grid))
grid <- rbind(grid[2,], grid[1,])
unname(grid)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"   "b"   "b"  
[2,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "a"  "b"   "c"   "d"  
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] "b"   "b"   "b"   "b"   "c"   "c"   "c"   "c"   "c"   "c"  
[2,] "e"   "f"   "g"   "h"   "a"   "b"   "c"   "d"   "e"   "f"  
     [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
[1,] "c"   "c"   "d"   "d"   "d"   "d"   "d"   "d"   "d"   "d"  
[2,] "g"   "h"   "a"   "b"   "c"   "d"   "e"   "f"   "g"   "h"  
     [,33] [,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42]
[1,] "e"   "e"   "e"   "e"   "e"   "e"   "e"   "e"   "f"   "f"  
[2,] "a"   "b"   "c"   "d"   "e"   "f"   "g"   "h"   "a"   "b"  
     [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50] [,51] [,52]
[1,] "f"   "f"   "f"   "f"   "f"   "f"   "g"   "g"   "g"   "g"  
[2,] "c"   "d"   "e"   "f"   "g"   "h"   "a"   "b"   "c"   "d"  
     [,53] [,54] [,55] [,56] [,57] [,58] [,59] [,60] [,61] [,62]
[1,] "g"   "g"   "g"   "g"   "h"   "h"   "h"   "h"   "h"   "h"  
[2,] "e"   "f"   "g"   "h"   "a"   "b"   "c"   "d"   "e"   "f"  
     [,63] [,64]
[1,] "h"   "h"  
[2,] "g"   "h"  

It can be seen that the output of expand.grid() is simply all combinations, of which there are 8^2 = 64 in total.

ichimoku::grid_dup

So how to get from the output of expand.grid() to that of utils::combn()? The answer comes courtesy of a simple algorithm coded into the grid_dup() function from the ‘ichimoku’ package.1

From the function documentation: ‘create a vector of element positions of duplicates in the output of expand.grid on 2 identical vectors’.

Using the function as per the below, ‘grid1’ contains all unique combinations and also those where the two elements are identical. This is sometimes the desired output if two of the same elements is still considered a unique combination, but their order of appearance does not matter.

library(ichimoku)

grid1 <- grid[, -grid_dup(xlen)]
grid1
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"   "b"   "b"  
[2,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "b"  "c"   "d"   "e"  
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] "b"   "b"   "b"   "c"   "c"   "c"   "c"   "c"   "c"   "d"  
[2,] "f"   "g"   "h"   "c"   "d"   "e"   "f"   "g"   "h"   "d"  
     [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
[1,] "d"   "d"   "d"   "d"   "e"   "e"   "e"   "e"   "f"   "f"  
[2,] "e"   "f"   "g"   "h"   "e"   "f"   "g"   "h"   "f"   "g"  
     [,33] [,34] [,35] [,36]
[1,] "f"   "g"   "g"   "h"  
[2,] "h"   "g"   "h"   "h"  

If the ‘omit.id = TRUE’ argument is set for grid_dup(), identical elements are also removed. ‘grid2’ should then be the same as ‘combn’ obtained above, as confirmed by the result of identical() below.

grid2 <- grid[, -grid_dup(xlen, omit.id = TRUE)]
grid2
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "a"  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"  "b"   "b"   "b"  
[2,] "b"  "c"  "d"  "e"  "f"  "g"  "h"  "c"  "d"  "e"   "f"   "g"  
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] "b"   "c"   "c"   "c"   "c"   "c"   "d"   "d"   "d"   "d"  
[2,] "h"   "d"   "e"   "f"   "g"   "h"   "e"   "f"   "g"   "h"  
     [,23] [,24] [,25] [,26] [,27] [,28]
[1,] "e"   "e"   "e"   "f"   "f"   "g"  
[2,] "f"   "g"   "h"   "g"   "h"   "h"  
identical(combn, grid2)
[1] TRUE

Benchmarking the results

‘Microbenchmark’ can be used to benchmark the performance, where it is usual practice to compare median values.

For small vector lengths, expand.grid() is not as performant. However the absolute times are also small so any difference would not matter as much. When the vector length reaches 13, the custom algorithm using expand.grid()/grid_dup() starts to outperform.

By the time the vector length reaches 1,000, this implies total unique combinations of 499,500 and the custom algorithm is already c. 7x faster.

It should be noted that the custom algorithm is tailored for the special case of combn(x, m) where m = 2 and this is the reason why there can be such an outperformance. In programming, where the implementation has already been tuned to be the most efficient possible, it can be useful to think whether the algorithm or code logic can be adapted for the case required.

fn_combn <- function(x) {
  utils::combn(x, 2)
}

fn_grid <- function(x) {
  expand.grid(x, x, KEEP.OUT.ATTRS = FALSE,
              stringsAsFactors = FALSE)[-grid_dup(length(x), omit.id = TRUE), ]
}

microbenchmark::microbenchmark(fn_combn(1:13), fn_grid(1:13))
Unit: microseconds
           expr    min      lq     mean  median      uq      max
 fn_combn(1:13) 38.927 41.6245 51.37336 43.0385 45.0745  686.700
  fn_grid(1:13) 35.883 38.2090 60.74484 39.6740 43.0570 1762.119
 neval
   100
   100
microbenchmark::microbenchmark(fn_combn(1:1000), fn_grid(1:1000))
Unit: milliseconds
             expr       min       lq     mean    median        uq
 fn_combn(1:1000) 201.59798 207.6833 214.4459 210.29602 213.29315
  fn_grid(1:1000)  27.23142  30.1157  35.8723  31.56724  34.42106
       max neval
 272.82774   100
  90.80061   100

Use case: mapply() and vapply()

This type of output is suitable for feeding into functions such as mapply() or vapply().

A standard use for mapply is when multiple arguments have to be mapped into a function. Here ‘simplify = FALSE’ is set to have mapply return a list, and fed into do.call() with c() to create a vector. This is a safer and more performant method to create a vector than relying on the built-in simplification.

do.call(c, mapply(function(x, y) paste0(x, "&", y), 
                  grid2[1, ], grid2[2, ],
                  SIMPLIFY = FALSE, USE.NAMES = FALSE))
 [1] "a&b" "a&c" "a&d" "a&e" "a&f" "a&g" "a&h" "b&c" "b&d" "b&e" "b&f"
[12] "b&g" "b&h" "c&d" "c&e" "c&f" "c&g" "c&h" "d&e" "d&f" "d&g" "d&h"
[23] "e&f" "e&g" "e&h" "f&g" "f&h" "g&h"

An equivalent example using vapply() is given below. vapply() is also a safe choice for programming as an output template is explicitly specified, here ‘character(1L)’, hence the returned values are all expected to be of type ‘character’ of length ‘1’ otherwise an error is thrown.

vapply(seq_along(grid2[1L, ]),
       function(i) paste0(grid2[1L, i], "&", grid2[2L, i]),
       character(1L), USE.NAMES = FALSE)
 [1] "a&b" "a&c" "a&d" "a&e" "a&f" "a&g" "a&h" "b&c" "b&d" "b&e" "b&f"
[12] "b&g" "b&h" "c&d" "c&e" "c&f" "c&g" "c&h" "d&e" "d&f" "d&g" "d&h"
[23] "e&f" "e&g" "e&h" "f&g" "f&h" "g&h"

Of the two however, mapply() is marginally faster and should normally be used when iteration is required over multiple arguments.


  1. Gao, C. (2021), ichimoku: Visualization and Tools for Ichimoku Kinko Hyo Strategies. R package version 1.2.4, https://CRAN.R-project.org/package=ichimoku.↩︎

Citation

For attribution, please cite this work as

shikokuchuo (2021, June 17). shikokuchuo{net}: Efficient R: Combinations using expand.grid. Retrieved from https://shikokuchuo.net/posts/10-combinations/

BibTeX citation

@misc{shikokuchuo2021efficient,
  author = {shikokuchuo, },
  title = {shikokuchuo{net}: Efficient R: Combinations using expand.grid},
  url = {https://shikokuchuo.net/posts/10-combinations/},
  year = {2021}
}