A faster way to generate combinations for mapply and vapply
Exhaustive combinations of two identical vectors are often desired as function inputs to mapply/vapply(). Usually, the combn() function from the ‘utils’ package serves this purpose.
utils::combn() outputs the unique set of combinations, so for the example below where the first 8 letters of the alphabet are used, the combination {a, b} appears, but {b, a} does not. Similarly the cases where the two elements are identical such as {a, a} also do not feature. It can be seen that there are 28 (or 8 choose 2) unique combinations for the vector of length 8.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b" "b"
[2,] "b" "c" "d" "e" "f" "g" "h" "c" "d" "e" "f" "g"
[,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] "b" "c" "c" "c" "c" "c" "d" "d" "d" "d"
[2,] "h" "d" "e" "f" "g" "h" "e" "f" "g" "h"
[,23] [,24] [,25] [,26] [,27] [,28]
[1,] "e" "e" "e" "f" "f" "g"
[2,] "f" "g" "h" "g" "h" "h"
expand.grid() from the base package is a useful function in its own right, most well-known perhaps for its use in generating hyperparameter tuning grids in machine learning models.
expand.grid() produces a data frame in columns rather than a matrix in rows like utils::combn(). Hence just for demonstration purposes to compare like-for-like, a bit of manipulation is done below to make the output exactly the same. In real world usage the output of expand.grid() can be used ‘as is’ without the additional manipulation.
grid <- expand.grid(x, x, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
grid <- t(as.matrix(grid))
grid <- rbind(grid[2,], grid[1,])
unname(grid)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
[2,] "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d"
[,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] "b" "b" "b" "b" "c" "c" "c" "c" "c" "c"
[2,] "e" "f" "g" "h" "a" "b" "c" "d" "e" "f"
[,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
[1,] "c" "c" "d" "d" "d" "d" "d" "d" "d" "d"
[2,] "g" "h" "a" "b" "c" "d" "e" "f" "g" "h"
[,33] [,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42]
[1,] "e" "e" "e" "e" "e" "e" "e" "e" "f" "f"
[2,] "a" "b" "c" "d" "e" "f" "g" "h" "a" "b"
[,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50] [,51] [,52]
[1,] "f" "f" "f" "f" "f" "f" "g" "g" "g" "g"
[2,] "c" "d" "e" "f" "g" "h" "a" "b" "c" "d"
[,53] [,54] [,55] [,56] [,57] [,58] [,59] [,60] [,61] [,62]
[1,] "g" "g" "g" "g" "h" "h" "h" "h" "h" "h"
[2,] "e" "f" "g" "h" "a" "b" "c" "d" "e" "f"
[,63] [,64]
[1,] "h" "h"
[2,] "g" "h"
It can be seen that the output of expand.grid() is simply all combinations, of which there are 8^2 = 64 in total.
So how to get from the output of expand.grid() to that of utils::combn()? The answer comes courtesy of a simple algorithm coded into the grid_dup() function from the ‘ichimoku’ package.1
From the function documentation: ‘create a vector of element positions of duplicates in the output of expand.grid on 2 identical vectors’.
Using the function as per the below, ‘grid1’ contains all unique combinations and also those where the two elements are identical. This is sometimes the desired output if two of the same elements is still considered a unique combination, but their order of appearance does not matter.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
[2,] "a" "b" "c" "d" "e" "f" "g" "h" "b" "c" "d" "e"
[,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] "b" "b" "b" "c" "c" "c" "c" "c" "c" "d"
[2,] "f" "g" "h" "c" "d" "e" "f" "g" "h" "d"
[,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
[1,] "d" "d" "d" "d" "e" "e" "e" "e" "f" "f"
[2,] "e" "f" "g" "h" "e" "f" "g" "h" "f" "g"
[,33] [,34] [,35] [,36]
[1,] "f" "g" "g" "h"
[2,] "h" "g" "h" "h"
If the ‘omit.id = TRUE’ argument is set for grid_dup(), identical elements are also removed. ‘grid2’ should then be the same as ‘combn’ obtained above, as confirmed by the result of identical() below.
grid2 <- grid[, -grid_dup(xlen, omit.id = TRUE)]
grid2
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b" "b"
[2,] "b" "c" "d" "e" "f" "g" "h" "c" "d" "e" "f" "g"
[,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] "b" "c" "c" "c" "c" "c" "d" "d" "d" "d"
[2,] "h" "d" "e" "f" "g" "h" "e" "f" "g" "h"
[,23] [,24] [,25] [,26] [,27] [,28]
[1,] "e" "e" "e" "f" "f" "g"
[2,] "f" "g" "h" "g" "h" "h"
identical(combn, grid2)
[1] TRUE
‘Microbenchmark’ can be used to benchmark the performance, where it is usual practice to compare median values.
For small vector lengths, expand.grid() is not as performant. However the absolute times are also small so any difference would not matter as much. When the vector length reaches 13, the custom algorithm using expand.grid()/grid_dup() starts to outperform.
By the time the vector length reaches 1,000, this implies total unique combinations of 499,500 and the custom algorithm is already c. 7x faster.
It should be noted that the custom algorithm is tailored for the special case of combn(x, m) where m = 2 and this is the reason why there can be such an outperformance. In programming, where the implementation has already been tuned to be the most efficient possible, it can be useful to think whether the algorithm or code logic can be adapted for the case required.
fn_combn <- function(x) {
utils::combn(x, 2)
}
fn_grid <- function(x) {
expand.grid(x, x, KEEP.OUT.ATTRS = FALSE,
stringsAsFactors = FALSE)[-grid_dup(length(x), omit.id = TRUE), ]
}
microbenchmark::microbenchmark(fn_combn(1:13), fn_grid(1:13))
Unit: microseconds
expr min lq mean median uq max
fn_combn(1:13) 38.927 41.6245 51.37336 43.0385 45.0745 686.700
fn_grid(1:13) 35.883 38.2090 60.74484 39.6740 43.0570 1762.119
neval
100
100
microbenchmark::microbenchmark(fn_combn(1:1000), fn_grid(1:1000))
Unit: milliseconds
expr min lq mean median uq
fn_combn(1:1000) 201.59798 207.6833 214.4459 210.29602 213.29315
fn_grid(1:1000) 27.23142 30.1157 35.8723 31.56724 34.42106
max neval
272.82774 100
90.80061 100
This type of output is suitable for feeding into functions such as mapply() or vapply().
A standard use for mapply is when multiple arguments have to be mapped into a function. Here ‘simplify = FALSE’ is set to have mapply return a list, and fed into do.call() with c() to create a vector. This is a safer and more performant method to create a vector than relying on the built-in simplification.
do.call(c, mapply(function(x, y) paste0(x, "&", y),
grid2[1, ], grid2[2, ],
SIMPLIFY = FALSE, USE.NAMES = FALSE))
[1] "a&b" "a&c" "a&d" "a&e" "a&f" "a&g" "a&h" "b&c" "b&d" "b&e" "b&f"
[12] "b&g" "b&h" "c&d" "c&e" "c&f" "c&g" "c&h" "d&e" "d&f" "d&g" "d&h"
[23] "e&f" "e&g" "e&h" "f&g" "f&h" "g&h"
An equivalent example using vapply() is given below. vapply() is also a safe choice for programming as an output template is explicitly specified, here ‘character(1L)’, hence the returned values are all expected to be of type ‘character’ of length ‘1’ otherwise an error is thrown.
vapply(seq_along(grid2[1L, ]),
function(i) paste0(grid2[1L, i], "&", grid2[2L, i]),
character(1L), USE.NAMES = FALSE)
[1] "a&b" "a&c" "a&d" "a&e" "a&f" "a&g" "a&h" "b&c" "b&d" "b&e" "b&f"
[12] "b&g" "b&h" "c&d" "c&e" "c&f" "c&g" "c&h" "d&e" "d&f" "d&g" "d&h"
[23] "e&f" "e&g" "e&h" "f&g" "f&h" "g&h"
Of the two however, mapply() is marginally faster and should normally be used when iteration is required over multiple arguments.
Gao, C. (2021), ichimoku: Visualization and Tools for Ichimoku Kinko Hyo Strategies. R package version 1.2.4, https://CRAN.R-project.org/package=ichimoku.↩︎
For attribution, please cite this work as
shikokuchuo (2021, June 17). shikokuchuo{net}: Efficient R: Combinations using expand.grid. Retrieved from https://shikokuchuo.net/posts/10-combinations/
BibTeX citation
@misc{shikokuchuo2021efficient, author = {shikokuchuo, }, title = {shikokuchuo{net}: Efficient R: Combinations using expand.grid}, url = {https://shikokuchuo.net/posts/10-combinations/}, year = {2021} }