Change collection types in `ArithmeticCircuit` from `Map`s to `Trie`s

TurtlePU commented 1 month ago

Trie is a search tree for string-like datatypes. As variables are now 32-byte SHA2 digests, migrating to tries might give a performance boost.

(Or might not, this needs to be investigated.)

There is a bytestring-trie package on Hackage, but it suffers from bugs and its Semigroup and Monoid instances are a little cumbersome. Maybe I didn't find a better package; maybe we need to roll out our own trie.

echatav commented 1 month ago

I was thinking of MemoTrie but it doesn't have a HasTrie ByteString instance.

TurtlePU commented 1 month ago

I was thinking of MemoTrie but it doesn't have a HasTrie ByteString instance.

Nice package! However it is more useful for memoization and it is not what would help in our case: MemoTries build a memoized total map, so they would not benefit from early returns which can be implemented in efficient tries. We could write down instance HasTrie ByteString as if ByteString was a [Word8], but the implementation for it is not as efficient in MemoTrie as it could be (once again because of totality). At this point, we can roll out our own ByteString :->: a, which would amount to implementing our own trie from scratch.

TurtlePU commented 1 month ago

We need something akin to

data Trie a = TTrie
  { keys :: Set ByteString
  , root :: TrieNode a
  }

data TrieNode a = TNode
  { value :: Maybe a
  , children :: IntMap (TrieNode a) -- used like Map Word8, but is more efficient
  }

Taking into account the constant length of our keys and optimizing the arches of the form singleton x (TNode Nothing (singleton y (TNode Nothing ...))), we arrive at

data Trie a = TTrie
  { keys :: TrieNode ByteString
  , values :: TrieNode a
  }

data TrieNode a = TNode
  { arc :: ByteString
  , children :: Either a (IntMap (TrieNode a))
  }

echatav commented 1 month ago

Can we use hashmaps or hashtables? Either should be more efficient than Ordered Maps.

TurtlePU commented 1 month ago

Can we use hashmaps or hashtables? Either should be more efficient than Ordered Maps.

@echatav, I was discussing this on one of the previous meetings and @zlonast mentioned that actually Maps are often faster because of the big constant overhead hashmaps introduce.

Also hashmaps still need to call hash function on the key which runs at least linearly in the length of a key.

With IntMaps, the running time is practically $O(1)$: https://hackage.haskell.org/package/containers-0.7/docs/Data-IntMap.html

echatav commented 1 month ago

I think IntMap (Patricia tree) is a good option.

IntMap uses machine sized big endian integers which should be big enough.

echatav commented 1 month ago

Another option is a Radix tree.

TurtlePU commented 1 month ago

Oh, that's exactly a ByteString trie though The only performance gain we might have over it is if:

we account for the constant length of keys
we change the "word size" to be equal to size of Int when using IntMaps

Also I don't know if they account for degree-1 branches (they should)

echatav commented 1 month ago

I vote for IntMap then

No new dependency
No new container type code
Very efficient and simple

TurtlePU commented 1 month ago

Sadly, just IntMap wouldn't do because digests are 32 byte, so they wouldn't fit into an Int

echatav commented 1 month ago

Hmmm, what about a critbit Map? Docs

TurtlePU commented 1 month ago

critbit and Radix tree both would do, yeah The question is how efficient are they in comparison with each other, with Map, hashmap and with the container we can roll out ourselves

vlasin commented 1 month ago

Sadly, just IntMap wouldn't do because digests are 32 byte, so they wouldn't fit into an Int

We could truncate the hash but I am guessing there would be a small chance of a collision then.

echatav commented 1 month ago

One thing about all these trie types is that they exhibit performance based on key prefix ordering and degrade when inserts happen using random out-of-order keys. But SHA2 digests will be generated as random out-of-order values. We may want to look into existing merkle tree libraries if only for design inspiration if we can't find a suitable one.

TurtlePU commented 1 month ago

On the contrary, I'd say that tries perform better when keys are random-order which means that there is a good chance that the first bit where they are different is in the beginning of the key

TurtlePU commented 1 month ago

Although looking for existing solutions is never a bad idea, yeah

zkFold / zkfold-base

Change collection types in `ArithmeticCircuit` from `Map`s to `Trie`s #274