Adversarial threat model

Hi Lei,

I just came across your RIBLT paper and for the most part I thought it was compelling and clear, and the scheme described has tremendous potential. However, I believe it contains a misunderstanding about the adversarial threat model:

This setting may create an "adversarial workload," where the hash of the symbol representing the user's input does not distribute uniformly. If the user injects into Bob's set a source symbol that hashes to the same value as another source symbol that Alice has, then Bob will never be able to reconcile its set with Alice. This is because Bob will XOR the malicious symbol into the coded symbol stream it receives from Alice, but it will only cancel out the hash of Alice's colliding symbol from the checksum field, and will corrupt the sum field.

If this was the threat in question then it would be trivially solvable by using a cryptographically secure hash function, or at least detectable via the corrupted sum field. Instead, the primary threat we are worried about is a malicious user selecting a set of non-colliding values chosen such that combined they cancel out one or more victim elements, silently censoring them by preventing their reconciliation. These malicious values can be selected at negligible cost using gaussian elimination.

In case it is useful to you, I made a test-case to demonstrate this. In example_test.go, replace the sets to be reconciled with the following:

alice := []item{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32}
bob := []item{3,7,9,13,15,19,21,25,27,29,33,35,43,51,57,59,63,67,69,83,91,95,97,99,101,105,109,111,117,123,127,131}

None of these values result in a SipHash collision, and the sum field is never corrupted. Your algorithm incorrectly signals that these sets are equivalent and that no items need to be transferred. Note that this same attack would be possible even if you replaced SipHash with a secure 256-bit hash function.

Your suggestion to use a keyed hash function and for Alice and Bob to coordinate a session-specific secret-key is valid and does prevent this attack. However, this comes at the cost of breaking universality. As you mention several times, the universal applicability of coded symbols is highly desirable because for large data-sets, re-keying every session would have unacceptable CPU, memory, and I/O costs.

In my opinion, your paper is not clear enough about this trade-off and a casual reader is likely to come away with the impression that RIBLT currently provides both universality and censorship-resistance, when it's actually one or the other.

Last thing: Is the code for your Ethereum synchronisation evaluation available anywhere? I didn't notice a link in the paper and couldn't find it on your github.

Thank you!

Doug

Hi Doug,

Thanks for your interest and the comments!

Let me make sure that I understood the attack correctly. I guess what you did to generate the malicious workload is

Pre-compute the SipHash for a bunch of numbers, say, 1 to 10000.
Generate Alice's set arbitrarily. You used {1, 2, ..., 32}.
Find Bob's set {x1, x2, ..., xn}, a subset of {1, 2, ..., 10000}, such that Hash(x1) ^ Hash(x2) ^ ... ^ Hash(xn) == Hash(1) ^ Hash(2) ^ ... ^ Hash(32). This is a linear system; each bit converts to a linear equation and the coefficients are 0 or 1.

I believe that the gist of the attack is still hash collision––finding a set of items such that the sum of their hashes collides with the hash of the victim. (In your attack the victim is the entire set of Alice.) Maybe I was misunderstanding but this is not fundamentally different from the attack we described in the paper, which is crafting hash collisions? I do understand that such a collision is easier to craft, since you can precompute the candidates and use Gaussian Elimination to find which ones to use.

You correctly pointed out that such attacks are mitigated using a keyed hash function. Indeed, if I change the key for SipHash (line 24) to something else, the malicious workload stops working.

In my opinion, your paper is not clear enough about this trade-off and a casual reader is likely to come away with the impression that RIBLT currently provides both universality and censorship-resistance, when it's actually one or the other.

That's a fair point! I will update the paper accordingly. Thanks for the suggestion!

Last thing: Is the code for your Ethereum synchronisation evaluation available anywhere? I didn't notice a link in the paper and couldn't find it on your github.

No I have not published the code.

On second thought, would you elaborate on how to generalize the attack? For example, is it possible to censor just one item in Alice's set, using a crafted set of items injected into Bob's set?

As you pointed out, for any target item t, it is easy to craft x1, x2, ..., xn such that Hash(x1) ^ Hash(x2) ^ ... ^ Hash(xn) = Hash(t). We can probably even ensure that x1 ^ x2 ^ ... ^ xn = t. However, imagine a coded symbol to which t is mapped and a proper subset of {x1, x2, ..., xn} is mapped. Then, this coded symbol is not affected by the attack––it takes the entirety of the set {x1, x2, ..., xn} to cancel out t, but only a proper subset of {x1, x2, ..., xn} is mapped to the set.

On a high level, to properly cancel out a victim item t, we need to cancel out t in every coded symbol to which t is mapped. I assume this can only be done using a single item that has the same hash as t. Your attack exploits the fact that the algorithm stops as soon as the first coded symbol is decoded. However the attack

Can be mitigated by checking more coded symbols to determine when to stop (or we can just use a proper cryptographic hash over the entire set)
Requires knowing the entirety of Alice's set and controlling the entirety of Bob's set, which is significantly harder than injecting a few items into their workload.

I might be misunderstanding parts of your attack though. Let me know if that's the case!

I guess what you did to generate the malicious workload is Pre-compute the SipHash for a bunch of numbers, say, 1 to 10000.

Yes, that's pretty much right. You need way less than 10000 though. With high probability, just a few more than the number of bits of your hash. My script generates 128 (and that is overkill).

I do understand that such a collision is easier to craft, since you can precompute the candidates and use Gaussian Elimination to find which ones to use.

Great, I'm glad you're familiar with this. I suggest expanding section 4.3 a bit to make it clear this has been considered, since the entire discussion seems to be focused on the security of the hash function itself. It wasn't clear which set reconciliation literature you are referring to, but I'd think typically a cryptographic hash function would be assumed.

I believe that the gist of the attack is still hash collision

Good point! XORing hashes together is an example of an incremental hash function, and you could rightly call this a collision in that context. BTW this construction is called XHASH in this classic paper: https://cseweb.ucsd.edu/~mihir/papers/inc-hash.pdf

Your attack exploits the fact that the algorithm stops as soon as the first coded symbol is decoded. However the attack

Can be mitigated by checking more coded symbols to determine when to stop

Very interesting! I haven't done much analysis beyond just getting the first symbols to match. I'd think it would be possible to target later symbols as well, by additionally filtering the generated elements (and maybe adding some "spacer" elements?). But I am not sure about this, and either way it sounds like it would get computationally expensive.

Requires knowing the entirety of Alice's set and controlling the entirety of Bob's set, which is significantly harder than injecting a few items into their workload.

I don't know about this for sure, but it sounds plausible. In general though, I agree. A lot of these censorship attacks are highly theoretical and it's difficult to imagine them ever being applied in the real-world. (If the protocol is badly impacted by collisions in SipHash though, I would address that, since those can easily be found).

No I have not published the code.

No worries. One thing I am curious about: How do you decide how much of the coded symbols to cache? Do you extend this lazily over time? If so, when does this happen, during a reconciliation as it's generated, or preemptively when more data is added?

Appreciate your time!

PS. I just finished watching your presentation "Why Ethereum Sync Sucks" -- congratulations on your doctorate!

I suggest expanding section 4.3 a bit to make it clear this has been considered

Nice point! Will definitely do when we prepare the next version.

BTW this construction is called XHASH in this classic paper

Interesting. We got the construction from regular IBLTs which we build on––thanks for the pointer! I guess the linearity in this scheme is exactly what your attack exploited.

I'd think it would be possible to target later symbols as well

I think a catch-all solution would be to pre-transmit the cryptographic hash of the entire set and compare that as the stopping condition. On the other hand this would be pretty computationally expensive. We basically need incremental hash functions that an adversary cannot forge collisions for any given image. As you showed, XHASH (XORing hashes) is not such a scheme.

On the Ethereum experiment: I pre-generated a large number of symbols (say, enough for staleness of weeks) and stream them as static files during reconciliation. In a deployed system the symbols can be lazily extended. I do plan to release the code soon!

Same here––really appreciate your time and the discussion!

Hi Lei (and hi Doug, with whom I have discussed the following already in earlier correspondence),

to expand a bit on "We basically need incremental hash functions that an adversary cannot forge collisions for any given image. As you showed, XHASH (XORing hashes) is not such a scheme."

IBLTs do generalize quite nicely to more secure incremental hash functions: instead of considering symbols as bitstrings, insertion as xor and intersection as xor, consider symbols as elements of a finite, commutative, additive group, insertion as group addition, and intersection as element-wise group subtraction. XOR is a particular instance of this, but you can also use more secure options. Section 5B of this paper (disclaimer: I'm the author) lists alternatives to XOR - everything except for the non-commutative Cayley hash functions is a candidate. That paper also expands a bind on the threat model that Doug introduced here. By relying on secure functions, the paper achieves censorship-resistance and universality not of communication itself but of a data structure that can be used to efficiently compute the messages for any possible peer.

Cheers, Aljoscha

yangl1996 / riblt

Adversarial threat model #3