This PR experiments with implementing MSM in the Metal Shading Language, with host code in Rust, to leverage the computational power of the GPU. Our Metal MSM implementation runs 10x faster than lambdaworks/metal-pippenger and is, to the best of our knowledge, the fastest open-source MSM implementation with Metal. However, the result is still not ideal: it remains much slower than Arkworks MSM on the CPU.
To extract as much parallelism as possible, we modified the accumulation phase from a window-wise to a bucket-wise fashion. To achieve this, we implemented extra preparation steps before the accumulation:
1. Prepare bucket indices
2. Sort bucket indices
3. Perform accumulation in a bucket-wise fashion
In this way, the number of threads equals the number of buckets ($\text{numWindow} \times (2^{wsize} - 1)$), which is significantly more than the number of threads in the window-wise fashion (only $\text{numWindow}$). Thus, we extract more parallelism, which is well suited to GPU computation.
Benchmark results (on a MacBook with an M3 chip):

| msm time (ms) / instance size | 2^10 | 2^16 | 2^18 | 2^20 | 2^22 | 2^24 |
|---|---|---|---|---|---|---|
| Arkworks msm (on CPU) | 2.68 | 67.96 | 245.38 | 912.32 | 3641.81 | 14254.36 |
| Window-wise Metal msm (on GPU) | 99.31 | 3845.10 | x | x | x | x |
| Bucket-wise Metal msm (on GPU) | 43.10 | 1730.26 | 10277.70 | 41019.22 | 555877.91 | x |
Future work
We've noticed that there's an approved EF grant proposal working on migrating ICICLE MSM from CUDA to Metal with C++ host code. Once it is completed, we can adapt it to a Rust host and optimize the parameters for the BN254 curve.