Fast ECC in the zk-SNARK circuit

ebfull commented 7 years ago

In 2015, Ahmed Kosba, Andrew Miller, et al. published a paper that suggested building a Montgomery curve on top of the zk-SNARK scalar field for key exchange. They were able to achieve a fixed-based exponentiation of 6 constraints per bit of the scalar. (See http://eprint.iacr.org/2015/1093 section 6.1.)

I was able to beat their record and achieve a fixed-based exponentiation of ~4.2 constraints per bit. We find an attractive Montgomery curve using the same approach used to select Curve25519. We find a birationally equivalent Edwards curve (which has no point at infinity or special case for doubling) and use a clever (x, y) lookup subcircuit to perform windowed exponentiation. At a window size of 4 this is most optimal.

ebfull commented 7 years ago

Here's is an exerpt from an email I sent Matt/Ian a couple months ago.

Currently relying on per-bit point addition allows each addition to be 5 constraints, except the first (which is just 2) and the last (which is 3, because we end up discarding the x coordinate).

It doesn't immediately seem intuitive that a window table lookup is efficient inside the circuit, both because they scale poorly to large windows and because the general point addition becomes 7 constraints (and 4 constraints for the last addition).

However, for a window size of 4, there is a sweet spot. For input bits w, x, y, z, we precompute the following constraints:

yz = y z xy = x y xz = x z xyz = xy z

Now, a window lookup can be performed for constants a, b, c, ..., o, p, by evaluating the following single gigantic R1CS constraint, which is a polynomial designed to vanish at the constants we aren't selecting:

w * (a(xyz) + a(xy) + a(xz) + a(x) + a(yz) + a(y) + a(z) - a + b(xyz) + b(xz) + b(yz) - b(z) + c(xyz) + c(xy) + c(yz) - c(y) + d(xyz) - d(yz) + e(xyz) + e(xy) + e(xz) - e(x) + f(xyz) - f(xz) + g(xyz) - g(xy) - h(xyz) -i(xyz) - i(xy) - i(xz) - i(x) - i(yz) - i(y) - i(z) + i - j(xyz) - j(xz) - j(yz) + j(z) - k(xyz) - k(xy) - k(yz) + k(y) - l(xyz) + l(yz) - m(xyz) - m(xy) - m(xz) + m(x) - n(xyz) + n(xz) - o(xyz) + o(xy) + p(xyz)) = r + a(xyz) + a(xy) + a(xz) + a(x) + a(yz) + a(y) + a(z) - a + b(xyz) + b(xz) + b(yz) - b(z) + c(xyz) + c(xy) + c(yz) - c(y) + d(xyz) - d(yz) + e(xyz) + e(xy) + e(xz) - e(x) + f(xyz) - f(xz) + g(xyz) - g(xy) - h(xyz)

[Edit by @daira: Note that there are errors in this polynomial, but the idea is valid.]

This means we can lookup the x, y coordinate pair from a window 4 table with just 6 constraints. After this window size, the cost of lookups explodes.

In our CRH, we have to do more than just base point multiplication. We need to unpack some input field elements -- for random field elements, this costs 255 boolean constraints, 1 packing constraint, and 254 constraints for performing a partial boolean subtraction to ensure the bit representation is actually in the field. This means just processing the inputs costs 1020 constraints.

By employing the windowed exponentiation above, the constraint complexity of the final output is:

1020 + 6(510/4) + (510/4 - 1) * 7 + 4 = 2674.5 constraints. (The fraction here is just because 4 does not divide 510, so the actual constraint complexity is 2675 as there are merely less terms in the final window table lookup constraints.)

This is quite a bit better than the (1020 + 5*508 + 3 + 2 = 3565) constraints before.

daira commented 7 years ago

Is it possible to reduce the 1020 constraints to process the inputs by making use of the fact that the field modulus is close to a power of two?

zookozcash commented 7 years ago

There's no reason not to split the Fast ECC Subcircuits and the Pedersen Collision-Resistant Hashes into two separate tickets, is there? I'd like to reference just the latter from a different ticket.

daira commented 7 years ago

Circuit splitting is the same idea as #2171 I believe; fast commitments just allow you to do that more efficiently. Fast ECC allows fast Pedersen commitments, but also other circuit improvements.

zookozcash commented 7 years ago

Moved the part about Pedersen commitments to https://github.com/zcash/zcash/issues/2234

daira commented 7 years ago

@ebfull wrote:

We need to unpack some input field elements -- for random field elements, this costs 255 boolean constraints, 1 packing constraint, and 254 constraints for performing a partial boolean subtraction to ensure the bit representation is actually in the field. This means just processing the inputs costs 1020 constraints.

I wrote:

Is it possible to reduce the 1020 constraints to process the inputs by making use of the fact that the field modulus is close to a power of two?

In fact the field modulus is not close to a power of two, but that's okay; I believe we can still check that an input is in the correct range for an arbitrary modulus p in somewhere around 1.5 * lg(p) R1CS constraints (the constant depending on the bit pattern of p). The trick is to view this as a recursive problem, where each step uses one booleanity constraint, uses another constraint only when the range is odd, and then reduces the range to be checked by a factor of two.

So, to check that x is in [0, c) where c > 0:

if c = 1 then check x = 0;
otherwise, let c' = ceiling(c/2) and let x' = x - c'. Check booleanity of the sign bit s of x', and recursively check that x' + c'.s is in [0, c'). If c is odd then also check x' ≠ c'-1.

To prove correctness of the recursive case, consider that x is in [0, c) iff x' is in R = [-c', c - c'). If c is even, R = [-c', c') and so x' is in R iff x' + c'.s is in [0, c'). If c is odd, R = [-c', c'-1) and so x' is in R iff x' + c'.s is in [0, c') and x' ≠ c'-1.

[Edit: this doesn't work because there's no easy way to check the correctness of the sign bit of x'.]

daira commented 7 years ago

There might be a way to reduce the constant factor from 1.5 to 1, i.e. not depend on the bit pattern of p. I'll think about that later.

daira commented 7 years ago

I see: if we just want to check a range constraint, then it isn't necessary to ensure that the sign bits are a unique representation (e.g. in the case of an odd range, we can allow either sign when x' = 0), so it is indeed possible to do it in ceiling(lg(p)) constraints. I don't know whether this is sufficient for the CRH.

ebfull commented 7 years ago

We need to ensure the bits are a unique representation for the CRH.

I wouldn't be surprised if there's a nicer/trickier way of checking for that than what I've done.

ebfull commented 7 years ago

Further discussion about the CRH unpacking stuff should probably go into #2234.

[Edit: this was later moved to #3366.]

daira commented 7 years ago

BTW, what are we calling the inner curve?

daira commented 7 years ago

Let's see, it should be Ed255-something. Ed255-Jubjub? :-)

"As to temper the Jubjub's a desperate bird, Since it lives in perpetual passion: Its taste in costume is entirely absurd— It is ages ahead of the fashion:

"But it knows any friend it has met once before: It never will look at a bribe: And in charity-meetings it stands at the door, And collects—though it does not subscribe.

"Its flavour when cooked is more exquisite far Than mutton, or oysters, or eggs: (Some think it keeps best in an ivory jar, And some, in mahogany kegs:)

"You boil it in sawdust: you salt it in glue: You condense it with locusts and tape: Still keeping one principal object in view— To preserve its symmetrical shape."

daira commented 7 years ago

So, what's the embedding degree of Ed255-Jubjub? (Relevant to https://github.com/zcash/zcash/issues/570#issuecomment-316607734 .)

daira commented 7 years ago

We can calculate this by factoring r_Ed255-Jubjub - 1.

daira commented 7 years ago

https://www.math.u-bordeaux.fr/~ballombe/talks/bordeaux-20150924.pdf

$ sudo apt-get install libgmp10 pari-gp
$ gp
[...]
                            GP/PARI CALCULATOR Version 2.9.3 (released)
[...]
? ?ellcard
ellcard(E,{p}): computes the order of the group E(Fq) for the elliptic curve E, defined over Q (for 
q=p) or a finite field.

? \\ from https://z.cash/blog/new-snark-curve.html, in decimal
? r = 52435875175126190479447740508185965837690552500527637822603658699938581184513
%1 = 52435875175126190479447740508185965837690552500527637822603658699938581184513

? \\ short Weierstrass, y^2 = x^3 + x + 9
? \\ This gave the smallest cofactor (19) that I could find with a *brief* search.
? Estrawjubjub = ellinit([1, 9], r)
[...]

? s = ellcard(Estrawjubjub)
%3 = 52435875175126190479447740508185965837868056314152911169289196539132690409273

Well that looks plausible for the curve order. Let's do a sanity check:

? P = random(Estrawjubjub)
[...]

? ellorder(Estrawjubjub, P)
%5 = 52435875175126190479447740508185965837868056314152911169289196539132690409273

? factor(s)
%6 = 
[                                                                          19 1]
[2759782903954010025234091605693998201993055595481732166804694554691194232067 1]

? l = 2759782903954010025234091605693998201993055595481732166804694554691194232067
%7 = 2759782903954010025234091605693998201993055595481732166804694554691194232067

? G = ellgenerators(Estrawjubjub)
[...]

? for (i = 1, #G, print(ellorder(Estrawjubjub, G[i])))
52435875175126190479447740508185965837868056314152911169289196539132690409273

? ellorder(Estrawjubjub, ellmul(Estrawjubjub, G[1], l))
%22 = 19

Now we compute the embedding degree (lots of trial and error with Pari-GP omitted):

? factor(l-1)
%8 = 
[                                                   2 1]
[                                                   7 2]
[                                                  29 1]
[                                                  41 1]
[                                                4889 1]
[                                                6359 1]
[                                        157897981757 1]
[4824824808201197987454032457316014046273921544910079 1]

? d = divisors(l-1)
[...]

? for (i = 1, #d, if (Mod(r, l)^d[i] == Mod(1, l), print(d[i])))
1379891451977005012617045802846999100996527797740866083402347277345597116033
2759782903954010025234091605693998201993055595481732166804694554691194232066

That is, the embedding degree for Estrawjubjub is (l-1)/2 = 1379891451977005012617045802846999100996527797740866083402347277345597116033.

daira commented 7 years ago

Oh btw Estrawjubjub has a ~250-bit subgroup order, not 255 bits (and it is not an Edwards or Montgomery curve). We might be able to do better than that; I don't know whether you'd already searched for a suitable curve @ebfull.

daira commented 7 years ago

Here's another nice curve with a cofactor of 10:

? r = 52435875175126190479447740508185965837690552500527637822603658699938581184513
%1 = 52435875175126190479447740508185965837690552500527637822603658699938581184513

? \\ search a, b for y^2 = x^3 + ax + b
? for (a = 0, 16, for (b = 1, 16, E = ellinit([a, b], r); s = ellcard(E); print("a=",a," b=",b," factors ",factor(s))))
[...]
a=3 b=3 factors [2, 1; 5, 1; 5243587517512619047944774050818596583786234134978068165542186739459688700163, 1]
[...]

? l = 5243587517512619047944774050818596583786234134978068165542186739459688700163
%6 = 5243587517512619047944774050818596583786234134978068165542186739459688700163

? d = divisors(l-1)
[...]

? for (i = 1, #d, if (Mod(r, l)^d[i] == Mod(1, l), print(d[i])))
2621793758756309523972387025409298291893117067489034082771093369729844350081
5243587517512619047944774050818596583786234134978068165542186739459688700162

I.e. this one has embedding degree (l-1)/2 = 2621793758756309523972387025409298291893117067489034082771093369729844350081.

ebfull commented 7 years ago

The embedded curve I have proposed is a twisted Edwards curve -x^2 + y^2 = 1 - (10240/10241)x^2y^2 that is birationally equivalent to the Montgomery curve y^2 = x^3 + 40962x^2 + x. The Montgomery curve is parameterized by the smallest A such that various important criteria are satisfied for things like twist security and performance.

The reason why we want this curve to be a twisted Edwards curve is that the group law doesn't have doubling/identity edge cases. I don't believe there is a performance benefit to switching to any other kind of curve. This birational equivalence also comes in handy for things like DH.

The embedded curve has ~252-bit prime subgroup order. It has cofactor 8.

sage: ec Elliptic Curve defined by y^2 = x^3 + 40962x^2 + x over Finite Field of size 52435875175126190479447740508185965837690552500527637822603658699938581184513 sage: factor(ec.order()) 2^3 6554484396890773809930967563523245729705921265872317281365359162392183254199 sage: factor(ec.quadratic_twist().order()) 2^2 * 13108968793781547619861935127046491459433433718519184348571111025184924083859

ebfull commented 7 years ago

If you want to play with Jubjub using sage, here's a starter pack:

r = 52435875175126190479447740508185965837690552500527637822603658699938581184513
Fr = GF(r)
d = -(Fr(10240)/Fr(10241))
s = 6554484396890773809930967563523245729705921265872317281365359162392183254199
h = 8
zero = (Fr(0), Fr(1))

# -x^2 + y^2 = 1 + d x^2 y^2
def is_on_curve(p):
    try:
        x = Fr(p[0])
        y = Fr(p[1])
        return (-(x^2) + (y^2)) == (1 + (d * (x^2) * (y^2)))
    except ValueError:
        return False

# y^2 (1 - d x^2) = 1 + x^2
# y^2 = (1 + x^2) / (1 - d x^2)
def find_for_x(x):
    return sqrt((1 + x^2) / (1 - (d * (x^2))))

# y^2 - 1 = d x^2 y^2 + x^2
# y^2 - 1 = x^2(1 + d y^2)
# (y^2 - 1) / (1 + d y^2) = x^2
def find_for_y(y):
    return sqrt((y^2 - 1) / (1 + d * y^2))

def add(a, b):
    x0 = a[0]
    y0 = a[1]
    x1 = b[0]
    y1 = b[1]
    x2 = ((x0 * y1) + (y0 * x1)) / (1 + d * x0 * x1 * y0 * y1)
    y2 = ((y0 * y1) + (x0 * x1)) / (1 - d * x0 * x1 * y0 * y1)
    return (x2, y2)

def double(a):
    return add(a, a)

def negate(p):
    return (-p[0], p[1])

def mul(a, b):
    acc = zero
    while b > 0:
        if (b % 2) == 1:
            acc = add(acc, a)
        a = double(a)
        b = b >> 1
    return acc

def is_prime_order(a):
    return mul(a, s) == zero

daira commented 7 years ago

Incidentally, Gauss found the addition formulae for a special case of Edwards curves (see section 2 of http://cr.yp.to/newelliptic/S0273-0979-07-01153-6.pdf), so it is easy to imagine an alternate history in which the Edwards form of elliptic curves was considered canonical, rather than the Weierstrass form, and in which Edwards curves would have been used first in cryptography.

daira commented 7 years ago

$ gp
[...]
                            GP/PARI CALCULATOR Version 2.9.3 (released)

The Jubjub curve is defined over this prime:

? r = 52435875175126190479447740508185965837690552500527637822603658699938581184513
%1 = 52435875175126190479447740508185965837690552500527637822603658699938581184513

Safe Curves requires Edwards curves to be complete. Strictly speaking it doesn't seem to cover twisted Edwards curves, but we know from Bernstein and Lange's paper Twisted Edwards Curves that the criterion for such a curve to be complete is that a.d.(a-d) != 0 and d is non-square:

? a = Mod(-1,r)
%2 = Mod(52435875175126190479447740508185965837690552500527637822603658699938581184512, 52435875175126190479447740508185965837690552500527637822603658699938581184513)
? d = -Mod(10240,r)/Mod(10241,r)
%3 = Mod(19257038036680949359750312669786877991949435402254120286184196891950884077233, 52435875175126190479447740508185965837690552500527637822603658699938581184513)
? a*d*(a-d)
%4 = Mod(33385525161591587039874916358367569977515316056912650397779712813249829138473, 52435875175126190479447740508185965837690552500527637822603658699938581184513)
? kronecker(lift(d), r)
%5 = -1

Pari/GP only supports curves in Weierstrass form directly. We're first going to check that the twisted Edwards curve -1.u² + v² = 1 - 10240/10241.u².v² is birationally equivalent to the Montgomery curve y² = x³ + 40962.x² + x.

Theorem 3.2 of "Twisted Edwards Curves" says "The twisted Edwards curve E_E,a,d is birationally equivalent to the Montgomery curve E_M,A,B, where A = 2.(a + d)/(a - d) and B = 4/(a - d)."

? A = 2*(a+d)/(a-d)
%6 = Mod(40962, 52435875175126190479447740508185965837690552500527637822603658699938581184513)
? B = 4/(a-d)
%7 = Mod(52435875175126190479447740508185965837690552500527637822603658699938581143549, 52435875175126190479447740508185965837690552500527637822603658699938581184513)
? B = lift(B)-r
%8 = -40964

So B = -40964 which is obviously not 1. But provided -40964 is a square, we can rescale the y-coordinate.

? kronecker(B, r)
%9 = 1

So it is indeed a square. Now, we can instantiate the Montgomery curve y² = x³ + 40962.x² + x and check its order, which will be equal to the order of the twisted Edwards curve.

? Ejubjub = ellinit([0, 40962, 0, 1, 0], r)
%10 = [...]
? hs = ellcard(Ejubjub)
%11 = 52435875175126190479447740508185965837647370126978538250922873299137466033592
? factor(hs)
%12 = 
[                                                                           2 3]
[6554484396890773809930967563523245729705921265872317281365359162392183254199 1]

This is the same order (and factorisation) @ebfull found with Sage. So we've checked the curve order with two independent computer algebra systems; this supports the claim that the two systems are talking about the same curve up to birational equivalence.

Pari/GP gives us a generator on the Montgomery curve. Let's do some smoke tests of the Montgomery curve arithmetic:

? G = ellgenerators(Ejubjub)
%13 = [[Mod(9599346063476877603959045752087700136767736221838581394374215807052943515113, 52435875175126190479447740508185965837690552500527637822603658699938581184513), Mod(2862382881649072392874176093266892593007690675622305830399263887872941817677, 52435875175126190479447740508185965837690552500527637822603658699938581184513)]]

? for (i = 1, #G, print(ellorder(Ejubjub, G[i])))
52435875175126190479447740508185965837647370126978538250922873299137466033592

? s = 6554484396890773809930967563523245729705921265872317281365359162392183254199
%19 = 6554484396890773809930967563523245729705921265872317281365359162392183254199
? h = 8
%20 = 8
? ellorder(Ejubjub, ellmul(Ejubjub, G[1], s))
%21 = 8
? ellorder(Ejubjub, ellmul(Ejubjub, G[1], h))
%22 = 6554484396890773809930967563523245729705921265872317281365359162392183254199

This is good; multiplying the generator by the subgroup order gives a point of order equal to the cofactor, and vice versa, as expected.

Let's map this generator back to the Edwards curve and check that it satisfies the curve equation. (The main reason for doing this is that we're going to use this as input to a version of safe curves' verify.sage script later, but it also serves as a check that we have the correct mapping.)

The Twisted Edwards Curves paper says (renaming variables to match our convention): "The map (u, v) ↦ (x, y) = ((1 + v)/(1 - v), (1 + v)/((1 - v).u)) is a birational equivalence from E_E,a,d to E_M,A,B, with inverse (x, y) ↦ (u, v) = (x/y, (x - 1)/(x + 1))."

Recall that we also scaled the y-coordinate by sqrt(-40964), so the inverse map becomes "(x, y) ↦ (u, v) = (x/(y/sqrt(-40964)), (x - 1)/(x + 1))".

? x0 = G[1][1]
%23 = Mod(9599346063476877603959045752087700136767736221838581394374215807052943515113, 52435875175126190479447740508185965837690552500527637822603658699938581184513)
? y0 = G[1][2]
%24 = Mod(2862382881649072392874176093266892593007690675622305830399263887872941817677, 52435875175126190479447740508185965837690552500527637822603658699938581184513)
? u0 = x0/(y0/sqrt(Mod(-40964,r)))
%25 = Mod(11076627216317271660298050606127911965867021807910416450833192264015104452986, 52435875175126190479447740508185965837690552500527637822603658699938581184513)
? v0 = (x0-1)/(x0+1)
%26 = Mod(44412834903739585386157632289020980010620626017712148233229312325549216099227, 52435875175126190479447740508185965837690552500527637822603658699938581184513)

And check that this point is on the Edwards curve:

? a*(u0^2) + v0^2 - 1 - d*(u0^2)*(v0^2)
%27 = Mod(0, 52435875175126190479447740508185965837690552500527637822603658699938581184513)

👍

We'll also need a generator for the subgroup of order s:

? P = ellmul(Ejubjub, G[1], h)
%28 = [Mod(37380265172535953876205871964221324158436172047572074969815349807835370906304, 52435875175126190479447740508185965837690552500527637822603658699938581184513), Mod(26055707688826178243212294438612447599848256944592175663688341250454494541524, 52435875175126190479447740508185965837690552500527637822603658699938581184513)]
? x1 = P[1]
%29 = Mod(37380265172535953876205871964221324158436172047572074969815349807835370906304, 52435875175126190479447740508185965837690552500527637822603658699938581184513)
? y1 = P[2]
%30 = Mod(26055707688826178243212294438612447599848256944592175663688341250454494541524, 52435875175126190479447740508185965837690552500527637822603658699938581184513)
? u1 = x1/(y1/sqrt(Mod(-40964,r)))
%31 = Mod(8076246640662884909881801758704306714034609987455869804520522091855516602923, 52435875175126190479447740508185965837690552500527637822603658699938581184513)
? v1 = (x1-1)/(x1+1)
%32 = Mod(13262374693698910701929044844600465831413122818447359594527400194675274060458, 52435875175126190479447740508185965837690552500527637822603658699938581184513)
? a*(u1^2) + v1^2 - 1 - d*(u1^2)*(v1^2)
%33 = Mod(0, 52435875175126190479447740508185965837690552500527637822603658699938581184513)

Now let's check the embedding degree:

? factor(s-1)
%34 = 
[                             2 1]
[                             3 1]
[                         12281 1]
[           1710050753150114629 1]
[      203928654140967434528233 1]
[255074062430788457494141376149 1]

? d = divisors(s-1)
%35 = [...]

? for (i = 1, #d, if (Mod(r, s)^d[i] == Mod(1, s), print(d[i])))
6554484396890773809930967563523245729705921265872317281365359162392183254198

In other words, the embedding degree is s-1 which is maximal.

Further evidence that Jubjub meets the Safe Curves criteria is at https://github.com/daira/jubjub .

daira commented 6 years ago

For the Pedersen commitments (so that they are perfectly hiding), we need fixed-base scalar multiplication [k] P with a scalar k spanning the full range {0..s-1}. This can be done either in Montgomery form for most of the computation and then converting to Edwards to finish off the last few bits, or entirely in Edwards form.

Mostly-Montgomery

Our strategy would be to use 3-bit windows, and compute all but the most significant window (83 of them, for 249 bits) in Montgomery, then convert to Edwards for the last window. In order to avoid zero points, we add 1 to each digit (i.e. add P to each entry) for the Montgomery windows. To compensate for these additions, for each entry in the Edwards window we subtract [adj] P where adj = ∑_{i ∈ {0..82}} 8ⁱ.

This works because {adj .. adj+2²⁴⁹-1} is a subrange of {1 .. (s-1)/2}, and all possible sums-of-indices for the Montgomery part of the computation are distinct, so the distinct-x criterion applies.

The cost is

83 Montgomery and 1 Edwards 3-bit window lookups: 84*3 constraints
Montgomery point additions: 82*3 constraints
Montgomery to Edwards conversion: 2 constraints
Edwards addition: 6 constraints
Total: 506 constraints

Edwards-only

Again with 3-bit windows, the cost is

Edwards 3-bit window lookups: 84*3 constraints
Edwards point additions: 83*6 constraints
Total: 750 constraints

(4-bit windows are not more efficient: they would also require 63*6 + 62*6 = 750 constraints, but the lookups are more complicated.)

So the more complicated mostly-Montgomery approach saves 244 constraints per fixed-base scalar multiplication. There are three of those in the Spend circuit (two for commitments and one for [rsk] G), and one in the Output circuit (for [esk] G). Based on the cost estimates here, this is a saving of 0.79% for the Spend circuit and 3.6% for the Output circuit, which is probably not worth the complexity.

(Note that none of these figures include the cost of boolean-constraining the scalars.)

daira commented 6 years ago

Now let's do a similar analysis for variable-base scalar multiplication. This time the scalar does not need to be full-range. We'll assume that it is clamped to the low-order 249 bits with bit 250 set and bit 249 clear; this allows us to use incomplete addition in the mostly-Montgomery case, and incomplete doubling (5 constraints instead of 6) in the Edwards-only case. (Edit: we don't actually need this in the Edwards case because we can use complete 5-constraint doubling.)

Mostly-Montgomery

It turns out that the best approach for mostly-Montgomery is naive double-and-add. Using 2-bit windows is expensive because the lookups of non-constant table entries require 7 constraints per window (taking into account the fact that we cannot add the zero point due to the incomplete addition formulae).

Double-and-add works like this. Convert the base point P to Montgomery. Set the accumulator to P corresponding to bit 250, and double once. Apply double-and-add 249 times. Then convert the result to Edwards. The clamping ensures that the distinct-x criterion is met.

The cost is:

Convert to Montgomery: 2 constraints
250 Montgomery doublings: 250*4 constraints
249 Montgomery additions: 249*3 constraints
249 selections of either [2] A or [2] A + P: 249*2 constraints
Convert to Edwards: 2 constraints
Total: 2249 constraints

(How to achieve 7-constraint lookups for 2-bit windows: given window index w = 2.w1 + w0, we select Q = w0 ? [3] P : [2] P. Then we select R = w1 ? Q : P. This gives us R = [w] P when w is {1,2,3} and R = junk when w is 0. Then we calculate S = [4] A and T = S + R, and set A := NOR(w1, w0) ? S : T for the next window. But this ends up costing (4*2 + 7 + 3)/2 = 9 constraints/bit overall which is the same as above, but more complicated, so there is no advantage to 2-bit windows.)

Edwards-only

For double-and-add, the cost is:

250 Edwards incomplete doublings: 250*5 constraints
249 Edwards additions: 249*6 constraints
249 selections: 249*2 constraints
Total: 3242 constraints

In order to use Hışıl–Wong–Carter–Dawson formula (2) for the doublings, we need to ensure that the input point is of odd order. This is always the case when P is in the prime-order subgroup and is not the zero point.

For 2-bit windows, the cost is

1 Edwards incomplete doubling to precompute [2] P: 5 constraints
1 Edwards addition to precompute [3] P: 6 constraints
250 Edwards incomplete doublings: 250*5 constraints
125 window lookups (one is cheaper): 124*6 + 2 constraints
125 Edwards additions: 125*6 constraints
Total: 2757 constraints

So the mostly-Montgomery approach saves 508 constraints per variable-base scalar multiplication, relative to Edwards with 2-bit windows. There are two of these in the Spend circuit (for [ivk] g_d and [nr] ak). Based on the cost estimates here, this is a saving of 1.1% for the Spend circuit, which is probably not worth the complexity.

Using double-and-add for Edwards rather than 2-bit windows costs 485 constraints more per variable-base scalar multiplication, which is a 1.0% cost increase for the Spend circuit relative to the current estimate. If we were to also drop the Hışıl–Wong–Carter–Dawson formula and use unified Edwards addition, that would be a total 1.6% cost increase for the Spend circuit relative to the current estimate.

Note that I've rejected using the signed digit technique here, because I don't like that it requires the scalar to be represented in a funky format. 3-bit signed digits would cost roughly 2537 constraints, given that Edwards addition is now 6 constraints and we're not counting boolean-constraining the scalar (which is reasonable because both ivk and nr are already boolean-constrained by being BLAKE2s outputs). The Montgomery ladder would cost 2771 constraints, so it does not beat either mostly-Montgomery or Edwards with 2-bit windows.

For completeness, there's a signed-digit-with-mostly-Montgomery option using digit set [0, 1, 2, -1]. Given window index (w1, w0), we select Q = w0 ? P : [-2] P. Then we select R = w1 ? -Q : Q. This gives us R = [[junk, 1, 2, -1]_w] P. Then we calculate S = [4] A and T = S + R, and set A := NOR(w1, w0) ? S : T for the next window. This ends up costing (4*2 + 6 + 3)/2 = 8.5 constraints/bit overall which is approximately 2159 constraints. Please let's not use this one because it makes my head hurt.

daira commented 6 years ago

I wrote:

In order to use Hışıl–Wong–Carter–Dawson formula (2) for the doublings, we need to ensure that the input point is of odd order. This is always the case when P is in the prime-order subgroup and is not the zero point.

Actually it's not necessary to use those formulae, because there is a complete 5-constraint Edwards doubling method. If we specialize the complete addition method here:

(v₂ - a.u₂) × (u₁ + v₁) = (T) (v₂) × (u₁) = (A) (u₂) × (v₁) = (B) (d.A) × (B) = (C) (1 + C) × (u₃) = (A + B) (1 - C) × (v₃) = (T - A + a.B)

then B = A, so we arrive at:

(v - a.u) × (u + v) = (T) (u) × (v) = (A) (d.A) × (A) = (C) (1 + C) × (u₃) = (2.A) (1 - C) × (v₃) = (T + (a - 1).A)

(This means that if we decide to use an Edwards-only algorithm, we don't need to clamp bit 250 to 1.)

daira commented 6 years ago

Sapling uses Edwards-only fixed and variable base scalar multiplication, except inside the windowed Pedersen hashes. We're not going to change this until the next circuit upgrade, so this ticket can be closed.

daira commented 6 years ago

For completeness, there's a way to do "mostly Montgomery" without clamping:

Let n be the maximum bit size of the scalar (the most significant bit need not be set), such that 2ⁿ is less than the order of the base point P.
Convert the base point P to Montgomery.
Do n-1 doublings to get each of the powers [2ⁱ] P for i = 1..n-1.
Start with P and conditionally add each [2ⁱ] P for i = 1..n-1 when b_i is 1.
Then convert to Edwards, and conditionally subtract P if b₀ is 0.

This costs (2 + (n-1)*4 + (n-1)*(3+2) + 2 + 6 + 2)C = (9n + 3)C.

If we need the result in Montgomery under the assumption that the result is not the zero point, then in the last step we do not convert to Edwards, but instead we check that if b₀ = 0 then we are not subtracting P from P. If the result before the last step is Q, then in practice we need only check that if b₀ = 0 then X(Q) ≠ X(P), since it is not possible that Q could be −P (due to the restriction on n). That check takes 1 constraint, using (X(Q) - X(P)) × Xinv = (1 - b₀).

This variant costs (2 + (n-1)*4 + n*(3+2) + 1 = (9n - 1)C.

zcash / zcash