Open TurtlePU opened 1 month ago
I was thinking of MemoTrie but it doesn't have a HasTrie ByteString
instance.
I was thinking of MemoTrie but it doesn't have a
HasTrie ByteString
instance.
Nice package! However it is more useful for memoization and it is not what would help in our case: MemoTries build a memoized total map, so they would not benefit from early returns which can be implemented in efficient tries. We could write down instance HasTrie ByteString
as if ByteString
was a [Word8]
, but the implementation for it is not as efficient in MemoTrie
as it could be (once again because of totality). At this point, we can roll out our own ByteString :->: a
, which would amount to implementing our own trie from scratch.
We need something akin to
data Trie a = TTrie
{ keys :: Set ByteString
, root :: TrieNode a
}
data TrieNode a = TNode
{ value :: Maybe a
, children :: IntMap (TrieNode a) -- used like Map Word8, but is more efficient
}
Taking into account the constant length of our keys and optimizing the arches of the form singleton x (TNode Nothing (singleton y (TNode Nothing ...)))
, we arrive at
data Trie a = TTrie
{ keys :: TrieNode ByteString
, values :: TrieNode a
}
data TrieNode a = TNode
{ arc :: ByteString
, children :: Either a (IntMap (TrieNode a))
}
Can we use hashmaps or hashtables? Either should be more efficient than Ord
ered Map
s.
Can we use hashmaps or hashtables? Either should be more efficient than
Ord
eredMap
s.
@echatav, I was discussing this on one of the previous meetings and @zlonast mentioned that actually Map
s are often faster because of the big constant overhead hashmaps introduce.
Also hashmaps still need to call hash function on the key which runs at least linearly in the length of a key.
With IntMap
s, the running time is practically $O(1)$: https://hackage.haskell.org/package/containers-0.7/docs/Data-IntMap.html
I think IntMap
(Patricia tree) is a good option.
IntMap
uses machine sized big endian integers which should be big enough.
Another option is a Radix tree.
Oh, that's exactly a ByteString trie though The only performance gain we might have over it is if:
Int
when using IntMap
sAlso I don't know if they account for degree-1 branches (they should)
I vote for IntMap
then
Sadly, just IntMap
wouldn't do because digests are 32 byte, so they wouldn't fit into an Int
critbit and Radix tree both would do, yeah
The question is how efficient are they in comparison with each other, with Map
, hashmap and with the container we can roll out ourselves
Sadly, just
IntMap
wouldn't do because digests are 32 byte, so they wouldn't fit into anInt
We could truncate the hash but I am guessing there would be a small chance of a collision then.
One thing about all these trie types is that they exhibit performance based on key prefix ordering and degrade when inserts happen using random out-of-order keys. But SHA2 digests will be generated as random out-of-order values. We may want to look into existing merkle tree libraries if only for design inspiration if we can't find a suitable one.
On the contrary, I'd say that tries perform better when keys are random-order which means that there is a good chance that the first bit where they are different is in the beginning of the key
Although looking for existing solutions is never a bad idea, yeah
Trie is a search tree for string-like datatypes. As variables are now 32-byte SHA2 digests, migrating to tries might give a performance boost.
(Or might not, this needs to be investigated.)
There is a
bytestring-trie
package on Hackage, but it suffers from bugs and itsSemigroup
andMonoid
instances are a little cumbersome. Maybe I didn't find a better package; maybe we need to roll out our own trie.