willcrichton / corrset-benchmark

A repository to test different performance optimizations in a sparse matrix computation.
https://willcrichton.net/notes/k-corrset/

Reuse bitmasks between combinations #2

Closed cjcormier closed 10 months ago

cjcormier commented 10 months ago

Summary

One potential improvement is reusing work between combinations. Currently, the design uses itertools' combinations to generate independent combinations of questions. If we instead create the combinations manually, we can calculate and reuse the bitmasks for the earlier subset (prefix) of the combination.

This strategy is implemented in this PR as a separate binary rather than integrated into the existing code. As such, you will probably want to integrate the ideas here more natively into your repo; this PR mainly serves as an example implementation.

The next few sections give an overview of my implementation. If you already have the gist of it and the overview is unnecessary, I have included performance statistics from my machine (Windows, 8-core/16-thread 5800X3D) near the end of this comment.

Manual Combination Calculation

Given a list of N unique elements, we can iterate through all combinations of K elements. Each combination is represented as a sorted list of element indices. With this scheme, the smallest and largest combinations, lexicographically speaking, are [0, 1, 2, ..., k-1] and [n-k, ..., n-3, n-2, n-1]. To begin the iteration, we start with the smallest combination. To get the next combination from any given one, we iterate backwards through its indices to find the last (rightmost) element index that is not at the maximum value for its position. If there is no such index, we have reached the largest combination and we are done; otherwise, we increment the found index by one and set each subsequent position (moving right) to be one larger than the position before it.

Example with N=5 K=3:
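The stepping rule above can be sketched in a few lines of Rust (an illustrative standalone version, not the PR's actual code):

```rust
// Advance `combo` (a sorted list of element indices) to the next combination
// of `combo.len()` elements drawn from `0..n`. Returns false once `combo` is
// already the largest combination [n-k, ..., n-1].
fn next_combination(combo: &mut [usize], n: usize) -> bool {
    let k = combo.len();
    // Find the rightmost position not yet at its maximum value (n - k + i).
    for i in (0..k).rev() {
        if combo[i] < n - k + i {
            combo[i] += 1;
            // Each subsequent position is one larger than the previous one.
            for j in i + 1..k {
                combo[j] = combo[j - 1] + 1;
            }
            return true;
        }
    }
    false
}

fn main() {
    let (n, k) = (5, 3);
    let mut combo: Vec<usize> = (0..k).collect(); // smallest combination [0, 1, 2]
    loop {
        println!("{combo:?}");
        if !next_combination(&mut combo, n) {
            break;
        }
    }
}
```

For N=5, K=3 this prints the 10 combinations in order, from `[0, 1, 2]` through `[2, 3, 4]`.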

Reusing Bitmasks

This method of iteration allows us to keep a list of bitmasks that correspond to intersections of prefixes of the combination. The first bitmask in the list is the bitmask for the first question in the combination; the second is the intersection of the first two questions' bitmasks, and so on. When we step to the next combination, we only update the bitmasks associated with the positions that changed. This reduces the number of intersection calculations per combination, at the cost of more memory used and touched overall. Since these bitmasks are stored longer-term, we need to be able to compute the intersection of two bitmasks and store it into a third, which requires modifying the bitmask library to support this.
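A minimal sketch of the prefix-mask idea, using a toy `Vec<u64>` bitset (the names `intersect_into` and `update_prefixes` are assumptions for illustration, not the PR's actual API):

```rust
// Toy fixed-width bitset over 64-bit words.
#[derive(Clone)]
struct BitSet(Vec<u64>);

impl BitSet {
    // Write the intersection of `a` and `b` into `self`, reusing its buffer.
    // This is the "intersect two bitmasks into a third" operation the
    // bitmask library needs to support.
    fn intersect_into(&mut self, a: &BitSet, b: &BitSet) {
        for ((dst, x), y) in self.0.iter_mut().zip(&a.0).zip(&b.0) {
            *dst = x & y;
        }
    }
}

// prefix[i] holds the intersection of the masks for combo[0..=i].
// When position `from` of the combination changes, only prefix[from..]
// needs to be recomputed.
fn update_prefixes(prefix: &mut [BitSet], masks: &[BitSet], combo: &[usize], from: usize) {
    for i in from..combo.len() {
        if i == 0 {
            prefix[0] = masks[combo[0]].clone();
        } else {
            let (lower, upper) = prefix.split_at_mut(i);
            upper[0].intersect_into(&lower[i - 1], &masks[combo[i]]);
        }
    }
}

fn main() {
    // Three toy question masks over 64 "users".
    let masks = vec![
        BitSet(vec![0b1110]),
        BitSet(vec![0b0111]),
        BitSet(vec![0b0101]),
    ];
    let combo = [0, 1, 2];
    let mut prefix = vec![BitSet(vec![0]); combo.len()];
    update_prefixes(&mut prefix, &masks, &combo, 0);
    assert_eq!(prefix[1].0[0], 0b0110); // users answering questions 0 and 1
    assert_eq!(prefix[2].0[0], 0b0100); // users answering all three
}
```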

Parallel Iteration

Due to the large amount of sequential state, it is difficult to iterate in parallel over individual or batched combinations. Instead, we parallelize over the first position of the combination: each parallel task is given a value for the first position and runs through all of the combinations with that value fixed. This reduces the need to pass buffers between iterations, and I have skipped that optimization in my example implementation.
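The parallel split can be sketched as follows. This uses plain scoped threads to stay self-contained (the actual implementation likely uses rayon), and the function names are illustrative assumptions:

```rust
use std::thread;

// Step `combo` to the next combination while keeping position 0 fixed.
// Positions 1..k follow the same rule as the sequential iteration.
fn next_with_fixed_first(combo: &mut [usize], n: usize) -> bool {
    let k = combo.len();
    for i in (1..k).rev() {
        if combo[i] < n - k + i {
            combo[i] += 1;
            for j in i + 1..k {
                combo[j] = combo[j - 1] + 1;
            }
            return true;
        }
    }
    false
}

// One task per value of the first position; each task owns its own state
// (here just a counter, in the real code the prefix-mask buffers too), so
// no sequential state is shared across threads.
fn count_all_combinations(n: usize, k: usize) -> usize {
    thread::scope(|s| {
        let handles: Vec<_> = (0..=n - k)
            .map(|first| {
                s.spawn(move || {
                    let mut combo: Vec<usize> = (first..first + k).collect();
                    let mut count = 1;
                    while next_with_fixed_first(&mut combo, n) {
                        count += 1;
                    }
                    count
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    // C(6, 3) = 20 combinations, counted across parallel tasks.
    assert_eq!(count_all_combinations(6, 3), 20);
}
```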

Performance statistics

These are some informal statistics after using hyperfine to run each configuration 5 times after a release build. The medium file is the one provided in the repo, while the large file was generated with the suggested `cargo run --release --bin gen-data -- 60000 200 0.2 > data/data-large.json` command.

For the faster test cases, most of the time is taken up by the reading and parsing of the file (my implementation prints that statistic and it seems to take 224ms for the medium file and 1820ms for the large file).

| File | Depth | batched + alloc time | reuse_bin time |
|--------|-------|----------------------|----------------|
| Medium | 2 | 304.3 ms | 315.5 ms |
| Medium | 3 | 332.3 ms | 342.2 ms |
| Medium | 5 | 4.452 s | 2.095 s |
| Large | 2 | 2.584 s | 2.661 s |
| Large | 3 | 3.551 s | 3.824 s |
| Large | 5 | 449.479 s | 229.557 s |

Other considerations

Itertools::combinations creates a Vec for each combination on a single thread, which can be a bottleneck. So the fact that we manually iterate over the combinations within each parallel task is in and of itself a boon: we don't allocate a Vec per combination, and determining the combination indices now also happens in parallel.

As mentioned previously, my implementation uses non-SIMD based bitsets since it was too much effort for me to modify your library in place or to extract the needed portion out. It stands to reason that a SIMD implementation of intersect_into would provide even more benefit.

willcrichton commented 10 months ago

Great suggestion! I implemented this strategy in fused.rs and found a +28% speedup over the previous record. I also gave you a shoutout in an addendum to the blog post.