r3ld3v commented 2 years ago

[x] Inner product optimizations or with sum of products optimization like in https://github.com/zkcrypto/bls12_381/pull/84
[x] FFT optimizations?
Reduce in-memory cost of polynomials in poly::Evaluator
- [x] Don't store unrotated polynomials twice #642
- [x] Extract rotated chunks from the unrotated polynomials instead of caching rotated polynomials. #644
[ ] Use smaller cosets during evaluation when possible (for lower degree gates)
[ ] Don't compute cosets for all proof instances but instead defer
[ ] Multiexp optimizations
- [ ] batch affine (https://github.com/arkworks-rs/algebra/issues/60 and https://github.com/ConsenSys/gnark-crypto/pull/249)
- [ ] leverage endomorphism
- [ ] leverage wNAF
- [ ] since we're doing multiple multiexps with the same bases we can perform some precomputation steps on the bases (at the expense of some memory) #641
- [ ] maybe switch to Bos-Coster for smaller multiexps when it's beneficial
- [ ] maybe batch inversions
[ ] IPA optimizations
- [ ] parallel generator collapse could be parallelized better
- [ ] defer early steps of the generator collapse (by doing large multiexps in the early rounds, then beginning a collapse later)
[ ] assembly optimizations for pasta field arithmetic
[x] MSM currently allows duplicate generators to appear in the list (think: multiple proofs being checked in a batch, so the fixed column commitments appear multiple times in the MSM) though we don't do any batch verification right now downstream. #608
[ ] just use barycentric formula to evaluate some column polynomials at points rather than inverse ffts (which means avoiding inverse ffts by doing domain extensions?)
- [ ] can we avoid moving polynomials into the monomial basis entirely? this would require changing the inner product argument
[x] Cache the writing of the VerifyingKey into the transcript when batch-verifying proofs (debug formatting takes up a non-trivial amount of time). #607
[x] Cache the degree of the circuit in VerifyingKey, and use it in the permutation argument. #607
[x] Remove all internal calls to parallelize within the verifier to make it entirely single-threaded, and check the resulting single-proof times. #608
[x] Add a BatchVerifier object (not the strategy) that pushes single-proof verification with the BatchVerifier strategy onto the global threadpool, then accumulates their resulting MSMs. #610

ashWhiteHat commented 2 years ago

In terms of assembly optimizations for pasta field arithmetic, it's necessary to update rust version. The asm feature is stable latter than 1.59.0. If it's okay to update, I would implement asm as following and it's almost 20% speed up. https://github.com/appliedzkp/pairing/pull/5

ashWhiteHat commented 2 years ago

I analysed the functions efficiency of create_proof. https://github.com/appliedzkp/halo2/pull/36#issuecomment-1085418454 I am going to work on h() evaluation.

ashWhiteHat commented 2 years ago

Fft Optimization

implement 2 layer butterfly arithmetic

This allows us to bottom up n / 2 but it's hard to implement this with recursive. The &[G] array divide into 4 parts and join can take double args. I tried but failed.

change par_iter when the n is large enough

It's necessary to do experiment of the relation between overhead and degree.

parallelize last butterfly arithmetic

It's easy but becomes complicated.

ifft divisor

Perform the multiplication on last round of butterfly arithmetic to avoid extra iteration.

ashWhiteHat commented 2 years ago

I think the following is interesting. https://github.com/zkcrypto/bls12_381/pull/84 We can reduce Montgomery reduction for each field array arithmetic. This affects so much.

Do you have an idea how to replace with?

str4d commented 2 years ago

I think the following is interesting. zkcrypto/bls12_381#84 We can reduce Montgomery reduction for each field array arithmetic. This affects so much.
* [eval_poly](https://github.com/zcash/halo2/blob/main/halo2_proofs/src/arithmetic.rs#L269)

* [compute_inner_product](https://github.com/zcash/halo2/blob/main/halo2_proofs/src/arithmetic.rs#L279)

* [ifft_divise](https://github.com/zcash/halo2/blob/main/halo2_proofs/src/poly/domain.rs#L355)
Do you have an idea how to replace with?

Yep, I was thinking about this myself yesterday. The inner product function we use in Halo 2 is precisely the "sum of products" I implemented there, and so I'm currently working to expose this via the ff crate for generic usage (https://github.com/zkcrypto/ff/issues/79). It can have a slow provided implementation in Field, that concrete fields override with efficient limb-aware implementations.

str4d commented 2 years ago

In terms of assembly optimizations for pasta field arithmetic, it's necessary to update rust version. The asm feature is stable latter than 1.59.0. If it's okay to update, I would implement asm as following and it's almost 20% speed up. appliedzkp/pairing#5

I do not want to use raw assembly in any of our field implementations if I can avoid it. Instead, we should look at that assembly and figure out why Rust / LLVM is not generating it, and rework the Rust code to get closer. I also want to implement some proper field arithmetic backends in the pasta_curves crate so we can use vector intrinsics (the x86 / x86_64 intrinsics have been stable for a while, and aarch64 intrinsics were stabilised in 1.59). Once we have backends available, then if it were impossible to get the same performance as with raw assembly, we could conceivably have a feature-flagged asm backend for those who really want it.

ashWhiteHat commented 2 years ago

Hi @str4d Thank you for the answer.

Yep, I was thinking about this myself yesterday. The inner product function we use in Halo 2 is precisely the "sum of products" I implemented there, and so I'm currently working to expose this via the ff crate for generic usage (https://github.com/zkcrypto/ff/issues/79). It can have a slow provided implementation in Field, that concrete fields override with efficient limb-aware implementations

This is amazing. I think this logic can be applied for [F] ⚪︎ [F] or F ⚪︎ [F] operation for any arithmetic ⚪︎ as long as they keep equivalence relation. It reduces n - 1 times calculation of reduce method for additive and, reduce and montgomery reduction for multiplicative. There are many functions which can be applied this arithmetic. Do you have a plan to modularize in ff and apply to halo2?

I do not want to use raw assembly in any of our field implementations if I can avoid it

Copy that.

str4d commented 2 years ago

I think this logic can be applied for [F] ⚪︎ [F] or F ⚪︎ [F] operation for any arithmetic ⚪︎ as long as they keep equivalence relation. It reduces n - 1 times calculation of reduce method for additive and, reduce and montgomery reduction for multiplicative. There are many functions which can be applied this arithmetic. Do you have a plan to modularize in ff and apply to halo2?

The interleaved reductions likely won't have any benefit for additions, because the main speedup is due to reduced register pressure by not maintaining double-width field elements, which can be produced by multiplication but not addition. Also, additions don't use Montgomery reduction.

The sharing of reductions between accumulated ⚪︎ operations is interesting, but I'm not sure there's a good general API for that. In the additive case specifically, [F] + [F] is just sum([F].concat([F]); there is no Montgomery reduction, so the only thing we could batch is the conditional subtraction of the modulus, which would require storing an additional limb. For now, I'm inclined to just add Field::sum_of_products and then see what that gets us. If we find other operations that are common enough to justify, then we can add APIs for those (and then look at how we could perhaps generalise once we have the concrete APIs in place).

ashWhiteHat commented 2 years ago

the main speedup is due to reduced register pressure by not maintaining double-width field elements

Nice.

If we find other operations that are common enough to justify, then we can add APIs for those (and then look at how we could perhaps generalise once we have the concrete APIs in place).

Thank you. I understand. I think it's reasonable to implement poly crate for example zkcrypto/poly. [F] + [F] is nearer to poly arithmetic than field arithmetic, it should use parallelize and it's easier to optimize. The poly arithmetic is performed frequently. I am going to check how sharing of reduction improves performance.

ashWhiteHat commented 2 years ago

memo: par_iter vs normal https://github.com/rayon-rs/rayon/issues/648#issuecomment-476733073

ashWhiteHat commented 2 years ago

Parallelize

Find proper task size and methods for core, thread number and memory. Create condition branch for each environment using thread_num and task_size as following.

/// multiplication = 3 point, addition = 1 point
enum TaskSize {
    // less than 3 point
    Small,
    // more than 4 point and less than 8 point
    Middle,
    // more than 9 point
    Large,
}

fn task_chunk(task_size: TaskSize) -> u32 {
    let chunk = match task_size {
        TaskSize::Small => 8,
        TaskSize::Middle => 4,
        TaskSize::Large => 2,
    };
    chunk / log_threads()
}

Task

Find the most efficient task size, thread and rayon method in order to implement condition branch to split parallel or not.

[ ] par_iter > parallel

Bench

normal vs par_iter vs parallel

one time field arithmetic iteration Add	k	normal	par_iter
3	36.614 ns 36.616 ns 36.617 ns	8.5748 us 8.7767 us 8.9484 us	8.2590 us 8.3402 us 8.4133 us
4	69.030 ns 69.110 ns 69.231 ns	9.6449 us 9.7224 us 9.8196 us	7.5027 us 7.7373 us 7.9515 us
5	146.63 ns 146.79 ns 146.97 ns	10.100 us 10.189 us 10.305 us	7.2004 us 7.4858 us 7.8044 us
6	274.81 ns 274.83 ns 274.87 ns	11.191 us 11.256 us 11.351 us	7.9772 us 8.3716 us 8.7483 us
7	558.26 ns 558.31 ns 558.40 ns	13.600 us 13.662 us 13.730 us	9.2109 us 9.3365 us 9.4365 us
8	1.1079 us 1.1079 us 1.1080 us	15.481 us 15.524 us 15.569 us	10.029 us 10.088 us 10.139 us
9	2.2102 us 2.2104 us 2.2106 us	17.009 us 17.036 us 17.062 us	11.506 us 11.539 us 11.574 us
10	4.4262 us 4.4263 us 4.4266 us	18.809 us 18.837 us 18.861 us	12.331 us 12.356 us 12.382 us
11	8.8253 us 8.8256 us 8.8259 us	21.242 us 21.279 us 21.326 us	14.405 us 14.444 us 14.494 us
12	17.644 us 17.645 us 17.647 us	21.693 us 21.821 us 21.971 us	18.548 us 18.569 us 18.592 us
13	35.285 us 35.294 us 35.305 us	32.041 us 32.153 us 32.291 us	26.574 us 26.614 us 26.659 us
14	70.567 us 70.581 us 70.605 us	53.141 us 53.271 us 53.445 us	43.057 us 44.095 us 45.461 us
15	141.93 us 141.93 us 141.94 us	95.057 us 95.192 us 95.343 us	75.803 us 76.210 us 76.766 us
16	282.98 us 283.03 us 283.08 us	181.69 us 182.11 us 182.64 us	144.24 us 144.79 us 145.59 us
17	569.23 us 569.49 us 569.83 us	351.35 us 352.24 us 353.44 us	276.42 us 279.55 us 284.34 us
18	1.1501 ms 1.1513 ms 1.1529 ms	687.74 us 689.02 us 690.59 us	542.22 us 543.47 us 544.96 us

Mul	k	normal	par_iter
3	227.36 ns 227.38 ns 227.40 ns	9.0132 us 9.1790 us 9.3149 us	8.6094 us 8.8608 us 9.0674 us
4	454.44 ns 454.63 ns 454.89 ns	10.743 us 10.816 us 10.893 us	9.3778 us 9.4564 us 9.5292 us
5	918.76 ns 918.85 ns 918.97 ns	12.749 us 12.851 us 13.017 us	10.056 us 10.117 us 10.184 us
6	1.8231 us 1.8233 us 1.8234 us	14.813 us 14.849 us 14.886 us	11.366 us 11.428 us 11.510 us
7	3.6549 us 3.6554 us 3.6558 us	16.249 us 16.289 us 16.326 us	12.558 us 12.579 us 12.599 us
8	7.3003 us 7.3008 us 7.3016 us	18.171 us 18.199 us 18.233 us	14.454 us 14.474 us 14.499 us
9	14.637 us 14.638 us 14.640 us	21.575 us 21.644 us 21.719 us	18.658 us 18.687 us 18.723 us
10	29.251 us 29.253 us 29.256 us	26.149 us 26.234 us 26.347 us	26.632 us 26.663 us 26.699 us
11	58.909 us 58.914 us 58.919 us	40.919 us 40.985 us 41.073 us	43.039 us 43.164 us 43.326 us
12	117.64 us 117.66 us 117.68 us	71.681 us 71.912 us 72.278 us	75.645 us 75.873 us 76.141 us
13	235.37 us 235.40 us 235.44 us	132.30 us 132.60 us 132.96 us	140.63 us 141.02 us 141.55 us
14	471.41 us 471.47 us 471.55 us	253.99 us 254.66 us 255.57 us	272.04 us 273.20 us 274.69 us
15	943.76 us 943.87 us 943.99 us	493.60 us 494.64 us 496.05 us	530.97 us 532.40 us 534.24 us
16	1.8840 ms 1.8842 ms 1.8845 ms	977.83 us 980.45 us 983.85 us	1.0663 ms 1.0683 ms 1.0706 ms
17	3.7703 ms 3.7711 ms 3.7718 ms	1.9412 ms 1.9449 ms 1.9495 ms	2.1186 ms 2.1261 ms 2.1351 ms
18	7.5475 ms 7.5495 ms 7.5517 ms	3.9131 ms 3.9256 ms 3.9398 ms	4.2483 ms 4.2646 ms 4.2856 ms

Three Mul	k	normal	par_iter
3	740.03 ns 740.06 ns 740.12 ns	10.245 us 10.278 us 10.313 us	9.8069 us 9.8670 us 9.9196 us
4	1.4876 us 1.4877 us 1.4880 us	13.047 us 13.062 us 13.079 us	11.137 us 11.218 us 11.334 us
5	2.9867 us 2.9870 us 2.9874 us	14.695 us 14.729 us 14.768 us	12.249 us 12.342 us 12.473 us
6	6.0087 us 6.0097 us 6.0114 us	16.122 us 16.150 us 16.192 us	14.111 us 14.152 us 14.203 us
7	11.918 us 11.921 us 11.925 us	18.890 us 18.977 us 19.079 us	17.658 us 17.685 us 17.717 us
8	23.729 us 23.730 us 23.732 us	23.109 us 23.144 us 23.197 us	24.855 us 24.884 us 24.917 us
9	47.429 us 47.434 us 47.441 us	35.143 us 35.267 us 35.459 us	39.047 us 39.091 us 39.146 us
10	95.058 us 95.065 us 95.076 us	59.381 us 59.530 us 59.748 us	67.460 us 67.677 us 67.957 us
11	192.36 us 192.36 us 192.37 us	107.73 us 107.95 us 108.29 us	124.63 us 125.47 us 126.66 us
12	384.87 us 384.89 us 384.92 us	206.87 us 208.82 us 211.66 us	240.87 us 241.48 us 242.20 us
13	769.35 us 769.39 us 769.44 us	401.49 us 402.64 us 404.07 us	470.14 us 471.27 us 472.80 us
14	1.5384 ms 1.5386 ms 1.5388 ms	788.91 us 790.20 us 791.79 us	927.12 us 928.16 us 929.33 us
15	3.1031 ms 3.1034 ms 3.1039 ms	1.5759 ms 1.5776 ms 1.5796 ms	1.8491 ms 1.8517 ms 1.8549 ms
16	6.2126 ms 6.2131 ms 6.2136 ms	3.1819 ms 3.1865 ms 3.1911 ms	3.7340 ms 3.7428 ms 3.7531 ms
17	12.453 ms 12.456 ms 12.459 ms	6.3751 ms 6.3899 ms 6.4062 ms	7.4952 ms 7.5719 ms 7.6787 ms
18	24.969 ms 24.975 ms 24.981 ms	12.789 ms 12.827 ms 12.870 ms	14.962 ms 14.987 ms 15.014 ms

Three Add	k	normal	par_iter
3	111.25 ns 111.26 ns 111.26 ns	8.9307 us 9.1100 us 9.2478 us	6.8630 us 6.9690 us 7.0764 us
4	221.65 ns 221.67 ns 221.70 ns	10.531 us 10.599 us 10.680 us	8.3465 us 8.6197 us 8.8761 us
5	443.69 ns 443.72 ns 443.75 ns	11.817 us 11.896 us 11.985 us	9.1735 us 9.2792 us 9.3672 us
6	885.67 ns 885.76 ns 885.90 ns	13.417 us 13.505 us 13.598 us	9.8654 us 9.9676 us 10.067 us
7	1.7744 us 1.7746 us 1.7748 us	15.675 us 15.703 us 15.728 us	11.240 us 11.273 us 11.312 us
8	3.5424 us 3.5434 us 3.5451 us	16.768 us 16.794 us 16.819 us	12.511 us 12.561 us 12.635 us
9	6.9358 us 6.9374 us 6.9399 us	18.731 us 18.754 us 18.776 us	14.208 us 14.283 us 14.409 us
10	14.198 us 14.200 us 14.201 us	20.764 us 21.045 us 21.290 us	17.782 us 17.821 us 17.865 us
11	28.315 us 28.319 us 28.324 us	28.049 us 28.104 us 28.176 us	25.271 us 25.326 us 25.409 us
12	56.602 us 56.605 us 56.609 us	45.301 us 45.395 us 45.524 us	40.283 us 40.412 us 40.582 us
13	113.18 us 113.19 us 113.21 us	79.458 us 79.737 us 80.174 us	69.654 us 69.769 us 69.878 us
14	226.36 us 226.38 us 226.40 us	147.58 us 147.84 us 148.15 us	129.21 us 129.45 us 129.74 us
15	452.97 us 453.01 us 453.07 us	284.25 us 285.09 us 286.38 us	248.64 us 249.59 us 250.81 us
16	906.45 us 906.52 us 906.60 us	559.63 us 561.53 us 564.16 us	485.47 us 487.52 us 490.65 us
17	1.8145 ms 1.8147 ms 1.8150 ms	1.1069 ms 1.1092 ms 1.1120 ms	957.72 us 959.49 us 961.72 us
18	3.6520 ms 3.6529 ms 3.6540 ms	2.2157 ms 2.2272 ms 2.2444 ms	1.9031 ms 1.9068 ms 1.9114 ms

Task Chunk

Mac

Add	k	2	4	8	16	32	64	128	256	512	1024	2048	4096	8192	16384
3	8.4865 us 8.7383 us 8.9984 us	7.6687 us 8.1307 us 8.6264 us	`49.894 ns 50.056 ns 50.264 ns`
4	10.804 us 10.841 us 10.878 us	8.5535 us 8.6013 us 8.6544 us	8.1329 us 8.5694 us 9.0234 us	`79.834 ns 80.323 ns 80.943 ns`
5	17.365 us 18.353 us 19.542 us	11.666 us 12.372 us 13.345 us	9.0639 us 9.3125 us 9.6756 us	7.7428 us 8.1249 us 8.5304 us	`135.33 ns 135.54 ns 135.79 ns`
6	24.028 us 24.127 us 24.234 us	17.746 us 18.126 us 18.658 us	12.443 us 12.695 us 13.119 us	9.6657 us 9.7400 us 9.8262 us	8.3629 us 8.6792 us 9.0522 us	`273.53 ns 277.22 ns 281.66 ns`
7	35.130 us 35.498 us 35.926 us	26.501 us 26.679 us 26.875 us	18.959 us 19.016 us 19.075 us	14.307 us 14.394 us 14.484 us	11.646 us 11.726 us 11.815 us	9.1532 us 9.4182 us 9.7456 us	`481.41 ns 481.88 ns 482.42 ns`
8	40.231 us 40.861 us 41.553 us	35.265 us 36.861 us 38.881 us	27.771 us 27.876 us 27.987 us	22.508 us 22.734 us 23.001 us	18.404 us 18.509 us 18.644 us	15.065 us 15.193 us 15.333 us	11.892 us 11.978 us 12.065 us	`1.0040 us 1.0237 us 1.0510 us`
9	41.950 us 42.430 us 42.989 us	37.774 us 38.013 us 38.288 us	34.655 us 34.976 us 35.353 us	28.869 us 28.979 us 29.100 us	26.027 us 26.172 us 26.334 us	22.043 us 22.285 us 22.533 us	19.511 us 19.763 us 20.026 us	14.208 us 14.354 us 14.491 us	`1.8647 us 1.8722 us 1.8802 us`
10	55.450 us 56.708 us 58.059 us	44.576 us 48.146 us 52.999 us	42.094 us 43.535 us 45.845 us	41.633 us 43.171 us 45.192 us	37.792 us 40.577 us 43.652 us	30.572 us 31.031 us 31.491 us	24.981 us 25.107 us 25.243 us	20.775 us 20.922 us 21.082 us	16.728 us 16.861 us 16.986 us	`3.6868 us 3.6906 us 3.6948 us`
11	57.630 us 58.260 us 58.976 us	53.438 us 54.222 us 54.981 us	51.131 us 52.458 us 54.095 us	43.672 us 44.314 us 45.022 us	38.095 us 38.238 us 38.398 us	34.721 us 34.935 us 35.188 us	29.845 us 30.088 us 30.390 us	26.737 us 26.882 us 27.063 us	24.126 us 24.421 us 24.772 us	19.853 us 20.027 us 20.225 us	`7.3577 us 7.3707 us 7.3843 us`
12	74.006 us 82.503 us 96.651 us	60.490 us 61.703 us 63.196 us	53.367 us 54.647 us 56.422 us	48.560 us 49.146 us 49.862 us	47.054 us 48.111 us 49.286 us	41.027 us 41.742 us 42.582 us	40.344 us 41.569 us 43.145 us	33.403 us 33.867 us 34.372 us	32.113 us 32.943 us 33.895 us	33.725 us 36.196 us 39.246 us	26.128 us 26.535 us 26.968 us	`14.860 us 14.924 us 15.007 us`
13	76.729 us 78.650 us 81.140 us	80.344 us 85.142 us 90.723 us	74.581 us 76.099 us 77.818 us	70.706 us 74.664 us 79.865 us	59.432 us 61.664 us 64.488 us	51.299 us 52.632 us 54.048 us	43.822 us 44.417 us 45.068 us	45.089 us 46.367 us 47.964 us	37.537 us 38.859 us 40.761 us	36.239 us 36.815 us 37.442 us	34.318 us 35.158 us 36.153 us	36.640 us 38.085 us 40.029 us	`31.645 us 31.850 us 32.090 us`
14	96.109 us 99.058 us 103.02 us	82.915 us 85.327 us 88.338 us	74.106 us 75.766 us 77.721 us	65.367 us 66.130 us 67.086 us	67.780 us 69.117 us 70.520 us	65.359 us 66.814 us 68.360 us	65.653 us 68.949 us 72.797 us	64.046 us 68.780 us 74.290 us	52.229 us 59.241 us 68.364 us	43.430 us 45.169 us 47.105 us	`38.416 us 39.108 us 39.886 us`	46.860 us 49.930 us 53.646 us	51.370 us 57.463 us 67.220 us	60.318 us 60.631 us 61.072 us
15	112.70 us 115.39 us 118.70 us	108.91 us 112.34 us 116.15 us	111.99 us 118.11 us 125.32 us	89.571 us 93.145 us 97.821 us	99.211 us 105.54 us 113.43 us	87.888 us 91.208 us 95.005 us	72.750 us 74.216 us 75.886 us	67.757 us 69.687 us 71.764 us	71.626 us 78.614 us 90.932 us	63.923 us 66.347 us 68.898 us	69.236 us 74.135 us 80.049 us	67.671 us 72.314 us 78.030 us	`55.197 us 57.540 us 60.155 us`	82.056 us 85.622 us 90.324 us	121.06 us 121.57 us 122.20 us

Github Actions Add	k	2	4	8	16	32	64	128	256	512	1024	2048	4096	8192	16384
3	8.6546 us 8.8228 us 8.9699 us	8.3995 us 8.6312 us 8.8324 us	53.093 ns 53.124 ns 53.153 ns
4	8.8763 us 9.0172 us 9.0935 us	8.7933 us 8.9154 us 9.0024 us	8.6157 us 8.7592 us 8.8718 us	`88.424 ns 88.434 ns 88.447 ns`
5	9.7176 us 9.7890 us 9.8535 us	9.0014 us 9.0847 us 9.1433 us	9.0339 us 9.0610 us 9.0822 us	8.7167 us 8.8259 us 8.8992 us	`157.66 ns 157.68 ns 157.70 ns`
6	9.9716 us 10.003 us 10.034 us	9.8813 us 9.9691 us 10.070 us	9.1202 us 9.2548 us 9.3663 us	9.0368 us 9.0608 us 9.0907 us	8.6307 us 8.8124 us 8.9441 us	`297.08 ns 297.10 ns 297.12 ns`
7	11.683 us 11.753 us 11.825 us	10.635 us 10.665 us 10.697 us	10.445 us 10.482 us 10.519 us	9.6826 us 9.7893 us 9.9166 us	9.2953 us 9.4191 us 9.5374 us	8.6287 us 8.8341 us 9.0342 us	`583.53 ns 583.60 ns 583.67 ns`
8	15.075 us 15.093 us 15.110 us	13.777 us 13.812 us 13.845 us	12.781 us 12.816 us 12.851 us	12.175 us 12.210 us 12.262 us	10.641 us 10.668 us 10.701 us	10.057 us 10.084 us 10.113 us	9.8548 us 9.9329 us 10.009 us	`1.1400 us 1.1402 us 1.1404 us`
9	16.341 us 16.368 us 16.398 us	15.089 us 15.147 us 15.212 us	14.532 us 14.584 us 14.642 us	13.957 us 14.031 us 14.159 us	13.308 us 13.364 us 13.432 us	12.472 us 12.520 us 12.564 us	11.970 us 12.009 us 12.052 us	11.264 us 11.291 us 11.317 us	`2.2537 us 2.2539 us 2.2543 us`
10	18.331 us 18.403 us 18.504 us	17.327 us 17.363 us 17.398 us	16.829 us 16.862 us 16.903 us	16.357 us 16.404 us 16.455 us	15.387 us 15.421 us 15.452 us	15.018 us 15.048 us 15.082 us	14.510 us 14.552 us 14.599 us	13.661 us 13.689 us 13.715 us	13.258 us 13.306 us 13.354 us	`4.4917 us 4.4922 us 4.4929 us`
11	21.282 us 21.324 us 21.371 us	20.085 us 20.131 us 20.176 us	19.365 us 19.445 us 19.536 us	18.984 us 19.023 us 19.069 us	18.039 us 18.070 us 18.098 us	17.562 us 17.605 us 17.657 us	17.500 us 17.955 us 18.440 us	16.143 us 16.244 us 16.402 us	15.453 us 15.490 us 15.530 us	15.397 us 15.437 us 15.478 us	`8.9466 us 8.9474 us 8.9483 us`
12	24.029 us 24.103 us 24.180 us	24.317 us 24.483 us 24.686 us	23.475 us 23.547 us 23.622 us	23.031 us 23.082 us 23.127 us	22.150 us 22.211 us 22.275 us	22.000 us 22.047 us 22.095 us	21.733 us 21.795 us 21.877 us	21.030 us 21.096 us 21.179 us	20.199 us 20.332 us 20.578 us	20.032 us 20.094 us 20.171 us	19.908 us 19.949 us 20.002 us	`17.864 us 17.866 us 17.867 us`
13	35.482 us 35.561 us 35.669 us	32.658 us 32.715 us 32.792 us	31.259 us 31.329 us 31.417 us	30.324 us 30.387 us 30.470 us	30.093 us 30.197 us 30.368 us	29.818 us 29.857 us 29.896 us	30.086 us 30.138 us 30.202 us	29.723 us 29.827 us 29.983 us	29.879 us 29.943 us 30.018 us	29.552 us 29.583 us 29.616 us	29.390 us 29.543 us 29.760 us	`29.072 us 29.109 us 29.147 us`	35.701 us 35.703 us 35.705 us
14	59.379 us 59.476 us 59.589 us	53.713 us 53.901 us 54.181 us	50.974 us 51.068 us 51.173 us	49.523 us 49.660 us 49.843 us	48.536 us 48.676 us 48.859 us	48.324 us 48.509 us 48.743 us	48.409 us 48.493 us 48.588 us	48.309 us 48.394 us 48.493 us	48.247 us 48.439 us 48.739 us	48.175 us 48.270 us 48.396 us	47.640 us 47.850 us 48.151 us	47.383 us 47.524 us 47.700 us	`47.289 us 47.374 us 47.474 us`	71.345 us 71.348 us 71.352 us
15	107.67 us 107.84 us 108.03 us	95.682 us 96.364 us 97.592 us	90.079 us 90.224 us 90.388 us	87.090 us 87.324 us 87.616 us	85.794 us 86.112 us 86.499 us	85.090 us 85.281 us 85.510 us	85.549 us 85.682 us 85.829 us	84.796 us 85.649 us 87.173 us	84.602 us 84.786 us 85.009 us	84.375 us 84.505 us 84.649 us	84.195 us 84.374 us 84.600 us	84.204 us 84.365 us 84.550 us	83.856 us 84.032 us 84.259 us	`83.751 us 83.971 us 84.246 us`	143.52 us 143.55 us 143.57 us

Summary

Words

Word	Explanation
Turning Degree	The minimum degree k that the parallel starts to get benefit.
Task Size	The task that per thread processes.
Task Point	The point that expresses concrete task size as the addition is 1 and the multiplication is 5~6.
Base Turning Point	The turning degree when we perform 1 task point and degree is 13.
Chunk Size	The task point that per thread processes.

Fact

The thread number is almost nothing to do with turning degree. Benched with thread number 2 and 8.
The bigger task size, the lower turning degree.
The bigger task size, the more the parallel speeds up.
The bigger task size, degree and thread number, the smaller chunk size is efficient.
The task size is 5 times addition < 1 time multiplication < 6 times addition.
The influence rate is Task Point > Degree > Thread Number.

Variables

The turning degree can be computed approximately (Base Turning Degree) - (log(Task Point)). The chunk size can be computed approximately n / 2^{[8 * (log(Task Point)) + 2 *(k - (Base Turning Degree)) + log(thread_number)] / 8}.

Uncertainty

I think the thread number is something to do with turning degree and would like to test with higher core machine.

Environment

Resource	Github Actions	Mac
Thread Number	2	8
CPU	2 core	4 core
Memory	7 GB	8 GB
Storage	14 GB	512GB

Github Actions Mac

Reference

rayon

ashWhiteHat commented 2 years ago

Hi there. I did experiment about parallelize optimization and speeded up 20%.

Bench https://github.com/NoCtrlZ/halo2/pull/4#issuecomment-1134207070 Experiment Result https://github.com/zcash/halo2/issues/548#issuecomment-1120201797

ashWhiteHat commented 2 years ago

The improvement list and status. The bench was done on Github Actions.

Target	Branch	Status	Speed Up
Fft	https://github.com/NoCtrlZ/halo2/pull/5	Done	fft and ifft 20%
pippenger	https://github.com/NoCtrlZ/halo2/pull/6	Draft	multiexp 20%
Parallel	https://github.com/NoCtrlZ/halo2/pull/4	Done	prover 20%
h()	None	None
wnaf	None	None

daira commented 2 years ago

https://github.com/ConsenSys/gnark-crypto/pull/249 is a new implementation of batch affine MSM that claims a 10-20% speedup for gnark-crypto's MSM (which was already a good implementation).

ashWhiteHat commented 1 year ago

I think we can optimize curve scalar by naf with almost zero cost.

We convert field element to 256 length of bit array. https://github.com/zcash/pasta_curves/blob/main/src/curves.rs#L483 If we convert it to bit array with sign, we can reduce $n/6$ times elliptic curve addition where n is field bits length. And we also clean the zero bit and reduce the doubling.

This is the to_nafs implementation. https://github.com/zero-network/zero/pull/60/files#diff-15198a125f029cb977758faf8db9469e6e710d1616b5ceb08eaced67993c715dR48

And scalar arithmetic implementation. https://github.com/zero-network/zero/pull/60/files#diff-bb1d24a6ddcc81d90c6392212f713b15cac6a371be79c044302f0ea11695b02dR89

The extra cost is that to_naf operation and curve negation when bit has sign. to_naf operation performs on $n$ length bit and not heavy and curve negation is also not heavy comparing with curve addition and doubling.

ashWhiteHat commented 1 year ago

I think Karatsuba can be used for field arithmetic. We are using convolution product for limbs arithmetic. The following is example of 4 length limbs.

Naive(M: 16, A: 9)

$x^6: a_3b_3$ $x^5: a_2b_3 + a_3b_2$ $x^4: a_1b_3 + a_2b_2 + a_3b_1$ $x^3: a_0b_3 + a_1b_2 + a_2b_1 + a_3b_0$ $x^2: a_0b_2 + a_1b_1 + a_2b_0$ $x^1: a_0b_1 + a_1b_0$ $x^0: a_0b_0$

Karatsuba(M: 10, A: 6, S: 18)

$u_0v_1 + u_1v_0 = u_1v_1 + u_0v_0 - (u_1 - u_0)(v_1 - v_0)$

$x^6: a_3b_3$ $x^5: a_2b_2 + a_3b_3 - (a_2 - a_3)(b_2 - b_3)$ $x^4: a_1b_1 + a_2b_2 + a_3b_3 - (a_1 - a_3)(b_1 - b_3)$ $x^3: a_0b_0 + a_1b_1 + a_2b_2 + a_3b_3 - (a_0 - a_3)(b_0 - b_3) - (a_1 - a_2)(b_1 - b_2)$ $x^2: a_0b_0 + a_1b_1 + a_2b_2 - (a_0 - a_2)(b_0 - b_2)$ $x^1: a_0b_0 + a_1b_1 - (a_0 - a_1)(b_0 - b_1)$ $x^0: a_0b_0$

In case of 6M < 18S - 3A, we can speed up. but $(a_2 - a_3)(b_2 - b_3)$ will be overflow. $(a_2 - a_3)(b_2 - b_3)$ type should be i128 so when $(a_2 - a_3)$ and $(b_2 - b_3)$ are u64::MAX, i128 positive range is u127 so it's a problem.

ashWhiteHat commented 1 year ago

We would improve wNAF, if we implement quadruple for curve and 4 multiplication for field. In 4 multiplication, we can replace it with 2 right bit shift and skip one montgomery reduction. There are ideas for performance. Please let me know if these is interesting idea. Thank you!

ashWhiteHat commented 1 year ago

precomputing inv * mod may also reduce 4 times mul in montgomery reduction

ashWhiteHat commented 1 year ago

one of curious is one time modular reduction. bitcoin implementation doesn't perform modular reduction and manages number of arithmetic with magnitude. https://github.com/RustCrypto/elliptic-curves/blob/master/k256/src/arithmetic/field/field_impl.rs#L20

zcash / halo2

Implement performance and memory optimizations for halo2_proofs #548

Fft Optimization

implement 2 layer butterfly arithmetic

change par_iter when the n is large enough

parallelize last butterfly arithmetic

ifft divisor

Parallelize

Task

Find the most efficient task size, thread and rayon method in order to implement condition branch to split parallel or not.

Bench

normal vs par_iter vs parallel

Task Chunk

Summary

Words

Fact

Variables

Uncertainty

Environment

Reference