This PR experiments with implementing MSM in the Metal Shading Language, with host code in Rust, to leverage the computational power of the GPU. Our Metal MSM implementation runs 10x faster than lambdaworks/metal-pippenger and is, to the best of our knowledge, the fastest open-source MSM implementation with Metal. However, the result is still not ideal: it remains much slower than Arkworks MSM on the CPU.
To extract as much parallelism as possible, we modified the accumulation phase from a window-wise to a bucket-wise fashion. To achieve this, we implemented extra preparation steps before the accumulation:
1. Prepare bucket indices
2. Sort bucket indices
3. Perform accumulation in a bucket-wise fashion
In this way, the number of threads equals the number of buckets ($\text{numWindow} \times (2^{wsize} - 1)$), which is significantly more than the number of threads in the window-wise fashion (only $\text{numWindow}$). Thus, we extract more parallelism, which is well suited to GPU computation.
Benchmark results (on a MacBook with an M3 chip):

| msm time (ms) / instance size | 2^10 | 2^16 | 2^18 | 2^20 | 2^22 | 2^24 |
|---|---|---|---|---|---|---|
| Arkworks msm (on CPU) | 2.68 | 67.96 | 245.38 | 912.32 | 3641.81 | 14254.36 |
| Window-wise Metal msm (on GPU) | 99.31 | 3845.10 | x | x | x | x |
| Bucket-wise Metal msm (on GPU) | 43.10 | 1730.26 | 10277.70 | 41019.22 | 555877.91 | x |
Future work
We've noticed that there's an approved EF grant proposal working on migrating ICICLE MSM from CUDA to Metal with C++ host code. Once it is completed, we can adapt it to a Rust host and optimize the parameters for the BN254 curve.