yahoo / HaloDB

A fast, log structured key-value store.
https://yahoodevelopers.tumblr.com/post/178250134648/introducing-halodb-a-fast-embedded-key-value
Apache License 2.0

Supporting keys > 127 bytes long #42

Open scottcarey opened 5 years ago

scottcarey commented 5 years ago

HaloDB encodes the key length as a signed byte, both on disk and in memory, and it rejects keys longer than 127 bytes.

Although most of my string keys across different databases are between 10 and 80 bytes, I have some rare outliers as large as 780 bytes. Every other K/V store I work with (I have an API abstraction that wraps about 10) can support large keys; only one other breaks, at 512 bytes.

There are some options for longer keys:

Easy: read the length byte as an unsigned value, so sizes from 0 to 255 are supported (see the sketch after these options). However, this would make it even harder to support larger keys later.

Harder: Allow for larger sizes, perhaps up to 2KB.
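Here's a minimal sketch of the 'easy' option -- illustrative only, it doesn't use HaloDB's actual record classes and the buffer layout is invented:

```java
import java.nio.ByteBuffer;

// Illustrative only -- not HaloDB's actual record format. It shows how the
// same single length byte can cover keys of 0..255 bytes once it is read
// back as an unsigned value instead of a signed one.
public class UnsignedKeyLength {

    static void writeKey(ByteBuffer buf, byte[] key) {
        if (key.length > 255) {
            throw new IllegalArgumentException("key longer than 255 bytes");
        }
        buf.put((byte) key.length); // lengths 128..255 are stored as negative bytes
        buf.put(key);
    }

    static byte[] readKey(ByteBuffer buf) {
        int length = Byte.toUnsignedInt(buf.get()); // reinterpret as 0..255
        byte[] key = new byte[length];
        buf.get(key);
        return key;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(512);
        writeKey(buf, new byte[200]); // rejected today, since 200 > 127 as a signed byte
        buf.flip();
        System.out.println(readKey(buf).length); // prints 200
    }
}
```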

scottcarey commented 4 years ago

I have a PR available now (PR #49).

The solution I arrived at for SegmentWithMemoryPool differs from my ideas above: I used the existing chain structure within the memory pool slots to store additional 'overflow' key data.
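As a rough illustration of what I mean (the slot layout and the 8-byte capacity are invented, the real first slot and overflow slots may hold different amounts of key data, and this is not the actual SegmentWithMemoryPool code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Invented layout for illustration only. A key longer than fixedKeyLength is
// split across the entry's own slot plus one or more 'overflow' slots that
// are linked into the same hash chain.
public class OverflowKeySketch {

    static final int FIXED_KEY_LENGTH = 8; // hypothetical per-slot key capacity

    // Split a key into per-slot fragments, one fragment per link in the chain.
    static List<byte[]> toSlotFragments(byte[] key) {
        List<byte[]> fragments = new ArrayList<>();
        for (int offset = 0; offset < key.length; offset += FIXED_KEY_LENGTH) {
            int end = Math.min(key.length, offset + FIXED_KEY_LENGTH);
            fragments.add(Arrays.copyOfRange(key, offset, end));
        }
        return fragments;
    }

    public static void main(String[] args) {
        byte[] key = new byte[20];                       // 20-byte key, 8-byte slots
        System.out.println(toSlotFragments(key).size()); // 3 links in the chain
    }
}
```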

An alternate solution might be to 'branch' the chain so that the additional key data sits in a branch off of the main linked list instead of being inlined into it. This, however, requires that 'fixedKeyLength' be at least 4, or else there will be no room in the pool slot for the additional pointer.
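A purely illustrative sketch of that constraint -- the 4 bytes are assumed to be the size of a pool slot address, and none of this is actual HaloDB code:

```java
// Illustrative only -- an invented layout, not HaloDB's slot format. In the
// branch variant the main chain keeps one link per key; a long key stores a
// 4-byte overflow-slot address inside its fixed key area instead of inlining
// the extra key bytes into the chain itself.
public class BranchedSlotSketch {

    static final int SLOT_ADDRESS_SIZE = 4; // assumed size of a pool slot address

    static boolean canBranch(int fixedKeyLength) {
        // the overflow pointer has to fit inside the fixed key area
        return fixedKeyLength >= SLOT_ADDRESS_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(canBranch(2)); // false: no room for the pointer
        System.out.println(canBranch(8)); // true
    }
}
```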

The drawback is that if significantly large keys (> 66 bytes) are in use together with a high load factor, lookups will regularly have to skip past existing keys that take up 3 or more links in the chain, slowing lookups down. But if load factors are 'normal' -- below 0.75 -- it probably won't make much of a difference. At load factor 0.75, the average chain length of a filled table is about 1.5, so about half the lookups will have an additional key to skip over when searching, and half of those times it will be the second key, so about 25% of the time a successful lookup will need to follow 2 more linked-list links. Theoretically that is about a 12% increase in slots scanned; avoiding it with the branch variant would come at the cost of not being able to support large keys unless fixedKeySize is at least 4.
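Restating that arithmetic under the same assumptions (a back-of-the-envelope check, not a measurement):

```java
// Back-of-the-envelope restatement of the numbers above; not a benchmark.
public class ChainSkipEstimate {
    public static void main(String[] args) {
        double pSharedChain = 0.5;  // ~half of lookups land in a chain with another key
        double pOursIsSecond = 0.5; // of those, ~half the time our key is the later one
        double pSkipLargeKey = pSharedChain * pOursIsSecond;           // ~= 0.25
        double extraLinksPerSkip = 2.0; // a large (> 66 byte) key spans 3 links, not 1
        double expectedExtraLinks = pSkipLargeKey * extraLinksPerSkip; // ~= 0.5 per lookup
        System.out.printf("P(skip) = %.2f, expected extra links = %.2f%n",
                pSkipLargeKey, expectedExtraLinks);
    }
}
```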

So I think that is open to discussion. I could also attempt such a variant and see how much better it performs on my large-key workload, but that will take some time to build. (The large-key data set I've been benchmarking with is 7.2GB, using 60 to 600 byte keys, most under 90 bytes; my other datasets are an 8.3GB one with 8 to 12 byte keys and a 2.1GB one with 4 to 70 byte keys, mostly under 20.)

Overall query throughput with my data sets is about 3% (non-pooled) to 6% (pooled) better than before.