zkp-gravity / 0g-halo2

Proof of inference of Weightless Neural Networks using Halo2

Optimize Performance #14

Open georgwiese opened 1 year ago

georgwiese commented 1 year ago

Depends on #13

I think we should consider some basic performance optimizations, like the choice of floor planner, the number of columns, and the number and degree of gates and lookups.

I think the number of rows is already pretty optimized with #12.

georgwiese commented 1 year ago

I've done a few experiments (including the changes in #17). This is an intermediate report.

Preliminaries

I added a branch on my Halo2 fork that prints out some useful information with regard to performance.
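For the grand total (without the per-phase breakdown), no fork is needed; a minimal caller-side sketch, assuming the zcash `halo2_proofs` 0.3 IPA API over the Pasta curves (all names here are illustrative, and `params`, `pk`, `circuit`, `instance` are assumed to be set up elsewhere):

```rust
use std::time::Instant;

use halo2_proofs::{
    pasta::{EqAffine, Fp},
    plonk::{create_proof, Circuit, ProvingKey},
    poly::commitment::Params,
    transcript::{Blake2bWrite, Challenge255},
};
use rand_core::OsRng;

/// Times a single proof end-to-end; only measures the grand total.
fn time_create_proof(
    params: &Params<EqAffine>,
    pk: &ProvingKey<EqAffine>,
    circuit: impl Circuit<Fp>,
    instance: &[Fp],
) {
    let mut transcript = Blake2bWrite::<_, _, Challenge255<_>>::init(vec![]);
    let start = Instant::now();
    create_proof(params, pk, &[circuit], &[&[instance]], OsRng, &mut transcript)
        .expect("proof generation should not fail");
    println!("create_proof took {:?}", start.elapsed());
}
```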

Here's a run with the "MNIST_small" model (models/model_28input_1024entry_2hash_2bpi.pickle.hdf5):

```
=== Degree overview:
  Minimum degree required by permutation argument: 3
  Minimum degree required by lookup arguments: 6
    bloom filter lookup: 5
    bit_lookup: 5
    byte_decomposition: 5
    x is byte: 5
    diff is byte: 5
    lookup: 6
  Minimum degree required by gates: 4
  Minimum degree required by circuit: 1
  Total degree: 6
```
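As an aside, my reading of where the per-lookup degrees come from (not verified against the code): the grand-product update constraint of halo2's lookup argument multiplies the running product by the compressed input and table expressions under an activity factor, so roughly

```
deg(lookup) ≈ 2 + deg(compressed input) + deg(table expression)
            = 2 + 2 + 1 = 5   (selector-gated advice query)
            = 2 + 3 + 1 = 6   (a degree-3 input, like the range check)
```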

```
----------------------------------------
create_proof Total: 29.679
  Before synthesize: 0.082 (0.3%)
  synthesize: 0.186 (0.6%)
  Compute advice and challenges: 0.388 (1.3%)
  Compute lookups: 2.652 (8.9%)
  Compute permutations: 1.869 (6.3%)
  Compute lookup products: 4.278 (14.4%)
  Compute vanishing argument: 0.604 (2.0%)
  Compute advice polys: 0.282 (1.0%)
  Compute h_poly: 12.958 (43.7%)
  Before create_proof: 3.642 (12.3%)
  create_proof: 2.732 (9.2%)
----------------------------------------

Took: 29.707698257s
```

You can see that `Compute h_poly` alone dominates (43.7% of the proving time), and that the lookup-related phases (`Compute lookups` and `Compute lookup products`) account for roughly another 23%.

Using the V1 floor planner instead of SimpleFloorPlanner

It did not reduce rows enough to reduce k (using the model_28input_256entry_1hash_1bpi.pickle.hdf5 model), but it did increase synthesis time. I think that might be because I didn't implement `Circuit::without_witnesses()` such that the witness is actually None, which might lead to unnecessary witness computation during layout.
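For reference, the fix would look roughly like this (a minimal sketch against the zcash `halo2_proofs` API; `WnnCircuit` and its `input` field are made-up names):

```rust
use halo2_proofs::{
    circuit::{floor_planner::V1, Layouter, Value},
    pasta::Fp,
    plonk::{Circuit, ConstraintSystem, Error},
};

// Made-up circuit struct for illustration.
struct WnnCircuit {
    input: Value<Fp>,
}

impl Circuit<Fp> for WnnCircuit {
    type Config = (); // placeholder; the real config lives in the chips
    // V1 instead of SimpleFloorPlanner:
    type FloorPlanner = V1;

    // Return a copy whose witness really is unknown, so that keygen and the
    // V1 layout passes don't do any witness computation.
    fn without_witnesses(&self) -> Self {
        Self {
            input: Value::unknown(),
        }
    }

    fn configure(_meta: &mut ConstraintSystem<Fp>) -> Self::Config {}

    fn synthesize(
        &self,
        _config: Self::Config,
        _layouter: impl Layouter<Fp>,
    ) -> Result<(), Error> {
        Ok(())
    }
}
```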

In summary, I think it might still be worth doing, but will only help for some models.

Reducing the number of columns

I removed the `selector_accumulator` column in the `ByteSelectorChip` (also removing some gates), which allowed me to reduce the number of columns from 6 to 5. The speed-up was 3.9%.

In summary, I think there's not much to be gained from reducing the number of columns.

Reducing the number of gates and lookups

We currently define 11 gates and 5 table lookups. According to this lecture, a table lookup is at least four times as expensive as a gate.

To get a feel for how much speed-up we can expect, I went to the `GreaterThanChip` and replaced one custom gate and two byte lookups with the range check from halo2_gadgets.

The speed-up was 12%.

Note that for these experiments I had to increase k from 12 to 14! The additional range checks need many rows, so we can only realize this speed-up if we manage to keep the number of rows the same.
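To make the cost concrete: each of the byte checks counted in the degree overview is a table lookup of roughly this shape (a sketch with assumed names, using the zcash `halo2_proofs` API, where `lookup` takes no name argument):

```rust
use halo2_proofs::{
    arithmetic::Field,
    plonk::{Advice, Column, ConstraintSystem, Selector, TableColumn},
    poly::Rotation,
};

// Sketch of a single "x is byte" check expressed as a table lookup.
fn configure_is_byte<F: Field>(
    meta: &mut ConstraintSystem<F>,
    q_range: Selector,
    x: Column<Advice>,
    byte_table: TableColumn,
) {
    meta.lookup(|cells| {
        let q_range = cells.query_selector(q_range);
        let x = cells.query_advice(x, Rotation::cur());
        // With q_range = 0, the looked-up expression is 0, which must be an
        // entry of `byte_table` (it is, since 0 is a byte). The compressed
        // input has degree 2 (selector * advice), giving the lookup degree 5
        // reported in the overview above.
        vec![(q_range * x, byte_table)]
    });
}
```

Each such lookup carries its own grand-product polynomial, which is what shows up in the `Compute lookups` / `Compute lookup products` phases of the profile.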

Reducing the degree bound

Ironically, the range check (implemented by halo2_gadgets) also has the lookup with the highest degree (6; all other gates / lookups have degree at most 5). So, on top of the previous step (removing 1 gate and 2 lookups), I removed the range checks as well (all while keeping k constant, of course).

The speed-up (compared to the previous step) was 27%.
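A plausible explanation for the size of this win (assuming I read halo2's `EvaluationDomain` correctly): the quotient polynomial h is computed over an extended domain whose size is the next power of two that fits a degree-(d − 1)·n polynomial, i.e.

```
extended domain size = 2^(k + ceil(log2(d - 1)))
d = 6:  2^(k + 3) = 8n
d = 5:  2^(k + 2) = 4n
```

so dropping the maximum degree from 6 to 5 halves the extended-domain FFT work, and `Compute h_poly` is exactly the phase that dominates the profile above.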

Summary

I'd say there are two possible areas for improvement: reducing the number of gates and lookups (without paying for it in extra rows), and reducing the circuit's degree bound.