zkmopro / mopro

Making client-side proving on mobile simple.
https://zkmopro.org
Apache License 2.0

feat(metal): execute whole msm in metal #155

Closed moven0831 closed 3 months ago

moven0831 commented 3 months ago

Goals

Status

The current implementation runs only the accumulation phase in Metal (GPU); the remaining steps run in arkworks (CPU). As a result, a large number of buckets must be written back to the CPU and converted back to an arkworks-compatible form after accumulation.
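For readers unfamiliar with the phases mentioned here, the following is a minimal CPU-only sketch of a bucket-method MSM. It is not the code from this PR: integers modulo a prime stand in for curve points (so "point addition" is modular addition), scalars are plain `u64`, and the names `msm_bucket`, `msm_naive`, and `P` are hypothetical. It only illustrates how bucket initialization, accumulation, reduction, and final accumulation fit together.

```rust
/// Toy group modulus. ASSUMPTION: integers mod P replace elliptic-curve points.
const P: u64 = 1_000_000_007;

/// "Point addition" in the toy group.
fn add(a: u64, b: u64) -> u64 {
    (a % P + b % P) % P
}

/// Bucket-method MSM: computes sum_i scalars[i] * points[i] using c-bit windows.
fn msm_bucket(points: &[u64], scalars: &[u64], c: u32) -> u64 {
    assert_eq!(points.len(), scalars.len());
    assert!(c >= 1 && c < 32);
    let num_windows = (64 + c - 1) / c; // scalars are 64-bit in this toy
    let mask = (1u64 << c) - 1;
    let mut window_sums = Vec::with_capacity(num_windows as usize);

    for w in 0..num_windows {
        // 1. Init buckets: one bucket per non-zero window digit (1..=2^c - 1).
        let mut buckets = vec![0u64; mask as usize];

        // 2. Accumulation: add each point into the bucket chosen by its digit.
        for (p, s) in points.iter().zip(scalars) {
            let digit = (*s >> (w * c)) & mask;
            if digit != 0 {
                let b = &mut buckets[(digit - 1) as usize];
                *b = add(*b, *p);
            }
        }

        // 3. Reduction: sum_j j * bucket_j, computed with a running sum
        //    from the highest bucket down to the lowest.
        let (mut running, mut sum) = (0u64, 0u64);
        for b in buckets.iter().rev() {
            running = add(running, *b);
            sum = add(sum, running);
        }
        window_sums.push(sum);
    }

    // 4. Final accumulation: combine window sums from the highest window down,
    //    doubling c times between windows.
    let mut acc = 0u64;
    for ws in window_sums.iter().rev() {
        for _ in 0..c {
            acc = add(acc, acc);
        }
        acc = add(acc, *ws);
    }
    acc
}

/// Straightforward reference: multiply and add, one term at a time.
fn msm_naive(points: &[u64], scalars: &[u64]) -> u64 {
    points.iter().zip(scalars).fold(0u64, |acc, (p, s)| {
        let term = ((*s as u128) * ((*p % P) as u128) % (P as u128)) as u64;
        add(acc, term)
    })
}
```

In this sketch only step 2 touches every input point; steps 3 and 4 work on buckets and window sums. If only step 2 runs on the GPU, every bucket of every window has to cross back to the CPU, whereas running all four steps on the GPU leaves a single result to transfer.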

This PR

Implement the entire MSM in Metal, with the corresponding host code in Rust. The data conversion after the MSM is thereby reduced from all of the buckets to a single point (the MSM result). The PR also includes some other changes:

Benchmark result

similar to https://github.com/zkmopro/mopro/pull/150

| | 2^16 | 2^18 | 2^20 |
|---|---|---|---|
| Arkworks msm | 82.19 ms | 307.24 ms | 1140.88 ms |
| Metal msm (Current) | 3.59 s | 12.81 s | 48.42 s |

log (on MacBook with M3 chip)

running 1 test
Vectors already generated
Init metal (GPU) state...
Done initializing metal (GPU) state in 64.012083ms
Encoding instance to GPU memory...
Done encoding data in 56.580375ms
Init buckets time: 3.687166ms
Accumulation and Reduction time: 3.582889875s
Final accumulation time: 8.6165ms
Average time to execute MSM with 65536 points and 65536 scalars in 1 iterations is: 3.595419125s

Next

Performance still needs significant improvement. Possible optimizations:

  1. Extract more parallelism: the accumulation phase is currently parallelized window-wise, which dispatches fewer than 85 threads, one for each window. Switching to bucket-wise parallelism would extract more parallelism by assigning each bucket to its own thread.
  2. Introduce more optimization techniques suited to mobile devices (i.e. techniques that require less memory).
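To give a rough sense of the gap between the two dispatch strategies in item 1, here is a small back-of-the-envelope sketch. The helper names are hypothetical, and the scalar width and window size used in the usage note are assumptions chosen for illustration, not values taken from this PR.

```rust
/// Number of c-bit windows needed to cover a scalar of `scalar_bits` bits.
fn num_windows(scalar_bits: u32, c: u32) -> u32 {
    (scalar_bits + c - 1) / c
}

/// Threads dispatched when one thread processes an entire window.
fn window_wise_threads(scalar_bits: u32, c: u32) -> u64 {
    num_windows(scalar_bits, c) as u64
}

/// Threads dispatched when every non-zero bucket of every window
/// gets its own thread (2^c - 1 buckets per window).
fn bucket_wise_threads(scalar_bits: u32, c: u32) -> u64 {
    num_windows(scalar_bits, c) as u64 * ((1u64 << c) - 1)
}
```

Assuming 254-bit scalars and 16-bit windows, `window_wise_threads(254, 16)` is 16 while `bucket_wise_threads(254, 16)` is 1,048,560. The bucket-wise layout also needs the points grouped by bucket beforehand so that threads do not contend on the same bucket, which costs additional memory and is worth weighing against item 2.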