willcrichton / corrset-benchmark

A repository to test different performance optimizations in a sparse matrix computation.
https://willcrichton.net/notes/k-corrset/

Reuse bitmasks between combinations #2

Closed cjcormier closed 10 months ago

cjcormier commented 10 months ago

Summary

One potential improvement is reusing work between combinations. Currently, the design uses itertools' combinations to generate independent combinations of questions. If we instead create the combinations manually, we can calculate and reuse the bitmasks for the earlier subset (prefix) of the combination.

This strategy is implemented in this PR as a separate binary rather than integrated into the existing code. As such, you will probably want to integrate the ideas here more natively into your repo; this PR mainly serves as an example implementation.

The next few sections give an overview of my implementation. If you already have the gist of it and the overview is unnecessary, I have included performance statistics from my machine (Windows, 8-core/16-thread 5800X3D) near the end of this comment.

Manual Combination Calculation

Given a list of N unique elements, we can iterate through all combinations of K elements. Each combination is represented as a sorted list of element indices. With this scheme, the smallest and largest combinations, lexicographically speaking, are [0, 1, 2, ..., k-1] and [n-k, ..., n-3, n-2, n-1]. To begin the iteration, we start with the smallest combination. To get the next combination from any given one, we iterate backwards through its indices to find the last (rightmost) element index that is not at the maximum value for its position. If there is no such index, we have reached the largest combination and we are done; otherwise, we increment the found index by one and set each subsequent position (moving right) to be one larger than the position before it.

Example with N=5 K=3:
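The stepping rule above can be sketched in a few lines of Rust (an illustrative standalone version, not the PR's actual code):

```rust
// Advance `combo` (a sorted list of element indices) to the next combination
// of `combo.len()` elements drawn from `0..n`. Returns false once `combo` is
// already the largest combination [n-k, ..., n-1].
fn next_combination(combo: &mut [usize], n: usize) -> bool {
    let k = combo.len();
    // Find the rightmost position not yet at its maximum value (n - k + i).
    for i in (0..k).rev() {
        if combo[i] < n - k + i {
            combo[i] += 1;
            // Each subsequent position is one larger than the previous one.
            for j in i + 1..k {
                combo[j] = combo[j - 1] + 1;
            }
            return true;
        }
    }
    false
}

fn main() {
    let (n, k) = (5, 3);
    let mut combo: Vec<usize> = (0..k).collect(); // smallest combination [0, 1, 2]
    loop {
        println!("{combo:?}");
        if !next_combination(&mut combo, n) {
            break;
        }
    }
}
```

For N=5, K=3 this prints the 10 combinations in order, from `[0, 1, 2]` through `[2, 3, 4]`.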

Reusing Bitmasks

This method of iteration allows us to keep a list of bitmasks that correspond to intersections of prefixes of the combination. The first bitmask in the list is the bitmask for the first question in the combination; the second is the intersection of the first two questions' bitmasks, and so on. When we step to the next combination, we only update the bitmasks associated with the positions that changed. This reduces the number of intersection calculations per combination, at the cost of more memory used and touched overall. Since these bitmasks are stored longer-term, we need to be able to compute the intersection of two bitmasks and store it into a third, which requires modifying the bitmask library to support this.
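A minimal sketch of the prefix-mask idea, using a toy `Vec<u64>` bitset (the names `intersect_into` and `update_prefixes` are assumptions for illustration, not the PR's actual API):

```rust
// Toy fixed-width bitset over 64-bit words.
#[derive(Clone)]
struct BitSet(Vec<u64>);

impl BitSet {
    // Write the intersection of `a` and `b` into `self`, reusing its buffer.
    // This is the "intersect two bitmasks into a third" operation the
    // bitmask library needs to support.
    fn intersect_into(&mut self, a: &BitSet, b: &BitSet) {
        for ((dst, x), y) in self.0.iter_mut().zip(&a.0).zip(&b.0) {
            *dst = x & y;
        }
    }
}

// prefix[i] holds the intersection of the masks for combo[0..=i].
// When position `from` of the combination changes, only prefix[from..]
// needs to be recomputed.
fn update_prefixes(prefix: &mut [BitSet], masks: &[BitSet], combo: &[usize], from: usize) {
    for i in from..combo.len() {
        if i == 0 {
            prefix[0] = masks[combo[0]].clone();
        } else {
            let (lower, upper) = prefix.split_at_mut(i);
            upper[0].intersect_into(&lower[i - 1], &masks[combo[i]]);
        }
    }
}

fn main() {
    // Three toy question masks over 64 "users".
    let masks = vec![
        BitSet(vec![0b1110]),
        BitSet(vec![0b0111]),
        BitSet(vec![0b0101]),
    ];
    let combo = [0, 1, 2];
    let mut prefix = vec![BitSet(vec![0]); combo.len()];
    update_prefixes(&mut prefix, &masks, &combo, 0);
    assert_eq!(prefix[1].0[0], 0b0110); // users answering questions 0 and 1
    assert_eq!(prefix[2].0[0], 0b0100); // users answering all three
}
```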

Parallel Iteration

Due to the large amount of sequential state, it is difficult to iterate in parallel over individual or batched combinations. Instead, we parallelize over the first position of the combination: each parallel task is given a value for the first position and runs through all of the combinations with that value fixed. This reduces the need to pass buffers between iterations, and I have skipped that optimization in my example implementation.
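The parallel split can be sketched as follows. This uses plain scoped threads to stay self-contained (the actual implementation likely uses rayon), and the function names are illustrative assumptions:

```rust
use std::thread;

// Step `combo` to the next combination while keeping position 0 fixed.
// Positions 1..k follow the same rule as the sequential iteration.
fn next_with_fixed_first(combo: &mut [usize], n: usize) -> bool {
    let k = combo.len();
    for i in (1..k).rev() {
        if combo[i] < n - k + i {
            combo[i] += 1;
            for j in i + 1..k {
                combo[j] = combo[j - 1] + 1;
            }
            return true;
        }
    }
    false
}

// One task per value of the first position; each task owns its own state
// (here just a counter, in the real code the prefix-mask buffers too), so
// no sequential state is shared across threads.
fn count_all_combinations(n: usize, k: usize) -> usize {
    thread::scope(|s| {
        let handles: Vec<_> = (0..=n - k)
            .map(|first| {
                s.spawn(move || {
                    let mut combo: Vec<usize> = (first..first + k).collect();
                    let mut count = 1;
                    while next_with_fixed_first(&mut combo, n) {
                        count += 1;
                    }
                    count
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    // C(6, 3) = 20 combinations, counted across parallel tasks.
    assert_eq!(count_all_combinations(6, 3), 20);
}
```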

Performance statistics

These are some informal statistics after using hyperfine to run each configuration 5 times after a release build. The medium file is the one provided in the repo, while the large file was generated with the suggested `cargo run --release --bin gen-data -- 60000 200 0.2 > data/data-large.json` command.

For the faster test cases, most of the time is taken up by the reading and parsing of the file (my implementation prints that statistic and it seems to take 224ms for the medium file and 1820ms for the large file).

| File | Depth | batched + alloc time | reuse_bin time |
|--------|-------|----------------------|----------------|
| Medium | 2 | 304.3 ms | 315.5 ms |
| Medium | 3 | 332.3 ms | 342.2 ms |
| Medium | 5 | 4.452 s | 2.095 s |
| Large | 2 | 2.584 s | 2.661 s |
| Large | 3 | 3.551 s | 3.824 s |
| Large | 5 | 449.479 s | 229.557 s |

Other considerations

Itertools::combinations creates a Vec for each combination on a single thread, which can be a bottleneck. So the fact that we manually iterate over the combinations within each parallel task is in and of itself a boon: we don't allocate a Vec per combination, and determining the combination indices now also happens in parallel.

As mentioned previously, my implementation uses non-SIMD based bitsets since it was too much effort for me to modify your library in place or to extract the needed portion out. It stands to reason that a SIMD implementation of intersect_into would provide even more benefit.

willcrichton commented 10 months ago

Great suggestion! I implemented this strategy in fused.rs and found a +28% speedup over the previous record. I also gave you a shoutout in an addendum to the blog post.