The current implementation only runs the accumulation phase in Metal (GPU); the rest of the steps run in arkworks (CPU). As a result, a large number of buckets have to be written back to the CPU and converted back into an arkworks-compatible form after the accumulation.
This PR
Implement the entire MSM in Metal, with the corresponding host code in Rust. The data converted after the MSM is thereby reduced from the whole set of buckets to a single point (the MSM result). The PR also includes the following changes:
Refactor the Rust host code so that the config is initialized only once and can be reused in later MSM instances
Add timing for each step of the MSM (i.e. init, encoding data, init_buckets, accumulation, final_accumulation) for a better view of current performance
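The two changes above (a config that is initialized once and reused, plus per-step timing) could be structured roughly as in the sketch below. This is not the PR's host code: `MsmConfig`, `Timings`, and `msm` are hypothetical names, and multiplication of integers modulo a small prime stands in for the real curve arithmetic and GPU dispatch.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the expensive one-time state
/// (in the PR this would be the Metal device, queue, and pipelines).
pub struct MsmConfig {
    pub modulus: u64,
}

impl MsmConfig {
    /// Built once, then shared by every later MSM call.
    pub fn new() -> Self {
        MsmConfig { modulus: 1_000_000_007 }
    }
}

/// Labelled wall-clock timings collected during one MSM run.
#[derive(Default)]
pub struct Timings(pub Vec<(&'static str, Duration)>);

impl Timings {
    /// Runs `f`, records its elapsed time under `label`, and returns its result.
    pub fn time<T>(&mut self, label: &'static str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.0.push((label, start.elapsed()));
        out
    }
}

/// Toy MSM: sum of scalar * point over integers mod `config.modulus`.
pub fn msm(config: &MsmConfig, scalars: &[u64], points: &[u64], timings: &mut Timings) -> u64 {
    let m = config.modulus as u128;
    // Placeholder for encoding the instance into GPU buffers.
    let pairs: Vec<(u128, u128)> = timings.time("encoding data", || {
        scalars
            .iter()
            .zip(points)
            .map(|(&s, &p)| (s as u128 % m, p as u128 % m))
            .collect()
    });
    // Placeholder for the bucket phases, collapsed here into a direct sum.
    timings.time("accumulation", || {
        pairs.iter().fold(0u128, |acc, &(s, p)| (acc + s * p) % m) as u64
    })
}
```

A caller would build `MsmConfig::new()` once, pass `&config` to each `msm` call, and print the `(label, duration)` pairs from `Timings` after each run.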
running 1 test
Vectors already generated
Init metal (GPU) state...
Done initializing metal (GPU) state in 64.012083ms
Encoding instance to GPU memory...
Done encoding data in 56.580375ms
Init buckets time: 3.687166ms
Accumulation and Reduction time: 3.582889875s
Final accumulation time: 8.6165ms
Average time to execute MSM with 65536 points and 65536 scalars in 1 iterations is: 3.595419125s
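To see where the time goes, the phase timings from the log above can be turned into shares of the reported MSM time. The snippet below is only arithmetic on the logged numbers; the constant and function names are made up for illustration.

```rust
use std::time::Duration;

// Phase timings copied from the log above, in nanoseconds.
pub const INIT_BUCKETS: Duration = Duration::from_nanos(3_687_166);
pub const ACCUMULATION_AND_REDUCTION: Duration = Duration::from_nanos(3_582_889_875);
pub const FINAL_ACCUMULATION: Duration = Duration::from_nanos(8_616_500);
pub const TOTAL_MSM: Duration = Duration::from_nanos(3_595_419_125);

/// Fraction of `total` spent in `phase`.
pub fn share(phase: Duration, total: Duration) -> f64 {
    phase.as_secs_f64() / total.as_secs_f64()
}

/// Share of the reported MSM time taken by each logged phase.
pub fn report() -> Vec<(&'static str, f64)> {
    vec![
        ("init_buckets", share(INIT_BUCKETS, TOTAL_MSM)),
        ("accumulation and reduction", share(ACCUMULATION_AND_REDUCTION, TOTAL_MSM)),
        ("final_accumulation", share(FINAL_ACCUMULATION, TOTAL_MSM)),
    ]
}
```

With these numbers, accumulation and reduction account for more than 99% of the reported MSM time, so that step is the natural target for optimization.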
Next
The performance still needs a lot of improvement. Possible optimizations:
Extract more parallelism: for now the accumulation phase is parallelized window-wise, with fewer than 85 threads dispatched for each window. By changing it to bucket-wise parallelism, we can extract more parallelism by assigning each bucket to a thread.
Introduce more optimization techniques suited to mobile devices (i.e. requiring less memory)
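The difference between window-wise and bucket-wise accumulation can be illustrated with a sequential CPU sketch. This is one possible reading of the idea, not the Metal kernels from this PR: integers modulo a prime under addition stand in for curve points, and the window size `C`, the scalar width `SCALAR_BITS`, and all function names are assumptions chosen for illustration. Comments mark which loop would map to GPU threads.

```rust
const P: u64 = 1_000_000_007; // toy modulus: the "group" is integers mod P under addition
const C: u32 = 4; // window size in bits (assumed)
const SCALAR_BITS: u32 = 16; // toy scalar width (assumed)
const NUM_BUCKETS: usize = 1 << C;
const NUM_WINDOWS: u32 = (SCALAR_BITS + C - 1) / C;

fn add(a: u64, b: u64) -> u64 {
    (a % P + b % P) % P
}

/// The `C`-bit digit of `scalar` that selects the bucket in `window`.
fn digit(scalar: u64, window: u32) -> usize {
    ((scalar >> (window * C)) & ((1u64 << C) - 1)) as usize
}

/// Window-wise: one unit of work per window, each scanning every point.
pub fn accumulate_window_wise(scalars: &[u64], points: &[u64]) -> Vec<Vec<u64>> {
    (0..NUM_WINDOWS) // <- would be one GPU thread per window
        .map(|w| {
            let mut buckets = vec![0u64; NUM_BUCKETS];
            for (&s, &p) in scalars.iter().zip(points) {
                let d = digit(s, w);
                buckets[d] = add(buckets[d], p);
            }
            buckets
        })
        .collect()
}

/// Bucket-wise: group point indices per (window, bucket) first,
/// then every bucket sums only its own points, independently of the others.
pub fn accumulate_bucket_wise(scalars: &[u64], points: &[u64]) -> Vec<Vec<u64>> {
    (0..NUM_WINDOWS)
        .map(|w| {
            let mut members: Vec<Vec<usize>> = vec![Vec::new(); NUM_BUCKETS];
            for (i, &s) in scalars.iter().enumerate() {
                members[digit(s, w)].push(i);
            }
            members
                .iter() // <- would be one GPU thread per bucket
                .map(|idx| idx.iter().fold(0u64, |acc, &i| add(acc, points[i])))
                .collect::<Vec<u64>>()
        })
        .collect()
}

/// Reduces one window: sum of j * bucket[j], computed with running sums.
fn reduce_window(buckets: &[u64]) -> u64 {
    let (mut running, mut total) = (0u64, 0u64);
    for &b in buckets.iter().skip(1).rev() {
        running = add(running, b);
        total = add(total, running);
    }
    total
}

/// Final accumulation: combine windows from the highest down, doubling `C` times per step.
pub fn finalize(all_buckets: &[Vec<u64>]) -> u64 {
    let mut acc = 0u64;
    for buckets in all_buckets.iter().rev() {
        for _ in 0..C {
            acc = add(acc, acc);
        }
        acc = add(acc, reduce_window(buckets));
    }
    acc
}

/// Reference result: direct sum of scalar * point mod P.
pub fn msm_naive(scalars: &[u64], points: &[u64]) -> u64 {
    let sum = scalars
        .iter()
        .zip(points)
        .fold(0u128, |acc, (&s, &p)| (acc + s as u128 * (p % P) as u128) % P as u128);
    sum as u64
}
```

Both accumulation functions fill the same buckets, so `finalize` returns the same result for either. The bucket-wise variant exposes `NUM_WINDOWS * NUM_BUCKETS` independent units of work instead of `NUM_WINDOWS`, at the cost of an extra grouping pass over the scalars.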
Goals
153
Benchmark result: compares Arkworks msm with Metal msm (Current); the log was taken on a MacBook with an M3 chip.