Open domhauton opened 5 years ago
Very cool. I will have some time later this week to look this over. Serial interop with another library is very appealing.
As with the previous PR, I'm not a fan of defending against bad input when doing so embeds new assumptions, so the overflow protection will be a hard sell.
I have changed the checks so they are based on the input buffer length. They will always pass unless the data is corrupted, in which case the protection is valid, and it requires no extra configuration from the user.
It's valuable to have this protection to prevent OOM problems and wasted compute when receiving corrupted data. This is fairly likely to happen with t-digest because of the different encodings between histogram types: if you feed an encoded MergingDigest to the AVLTreeDigest decoder, it reads an incorrect (usually very high) centroid count and then either fails or OOMs. When collecting digests from thousands of services written in varying languages, this kind of mistake becomes very likely.
It would be great to save people from production outages due to crossed wires if we can.
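The buffer-length-based check described above can be sketched as follows. This is an illustrative example, not the PR's actual code; the class and method names, and the assumption that each centroid occupies two 8-byte fields, are mine:

```java
import java.nio.ByteBuffer;

public class SafeDigestDecoding {
    // Illustrative assumption: each centroid is encoded as a double mean
    // plus a double weight. The real wire format may pack fields differently.
    private static final int BYTES_PER_CENTROID = Double.BYTES + Double.BYTES;

    /**
     * Reject a declared centroid count that cannot possibly fit in the
     * remaining bytes of the buffer. This catches the crossed-wires case
     * (e.g. a MergingDigest payload fed to the AVLTreeDigest decoder,
     * yielding a garbage, usually huge, count) before any allocation.
     */
    static int readCentroidCount(ByteBuffer buf) {
        int declared = buf.getInt();
        int maxPossible = buf.remaining() / BYTES_PER_CENTROID;
        if (declared < 0 || declared > maxPossible) {
            throw new IllegalArgumentException(
                "corrupt digest: declared " + declared
                    + " centroids but only " + buf.remaining()
                    + " bytes remain in the buffer");
        }
        return declared;
    }
}
```

Because the bound is derived from the buffer itself, it needs no user-supplied limit and can never reject a well-formed payload.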
In order to aggregate t-digests from multiple services, it's necessary to serialise and deserialise the data to be sent.
This change implements the `AVLTreeDigest` serialisation from tdunning's original implementation, with some added safety features. Both libraries can now process and swap data using the small and large encoding types.
https://github.com/tdunning/t-digest/blob/e8ca8479b7c98deb4bfe381a4e465e33e7063262/core/src/main/java/com/tdunning/math/stats/AVLTreeDigest.java#L340
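The two encoding types can be sketched as a tagged wire format: a tag word selects the layout, and the decoder dispatches on it, rejecting unknown tags. This is a simplified, self-contained illustration; the tag values and field layout here are assumptions, not the exact format in the linked `AVLTreeDigest.java`:

```java
import java.nio.ByteBuffer;

public class DigestCodecSketch {
    // Illustrative tag values; the real constants live in AVLTreeDigest.java.
    static final int VERBOSE_ENCODING = 1;
    static final int SMALL_ENCODING = 2;

    /** Large (verbose) layout sketch: tag, count, then a double per mean. */
    static ByteBuffer encodeVerbose(double[] means) {
        ByteBuffer buf = ByteBuffer.allocate(
            Integer.BYTES * 2 + Double.BYTES * means.length);
        buf.putInt(VERBOSE_ENCODING);
        buf.putInt(means.length);
        for (double m : means) buf.putDouble(m);
        buf.flip();
        return buf;
    }

    /** Small layout sketch: tag, count, then a float per mean. */
    static ByteBuffer encodeSmall(double[] means) {
        ByteBuffer buf = ByteBuffer.allocate(
            Integer.BYTES * 2 + Float.BYTES * means.length);
        buf.putInt(SMALL_ENCODING);
        buf.putInt(means.length);
        for (double m : means) buf.putFloat((float) m);
        buf.flip();
        return buf;
    }

    /** Decoder dispatches on the tag and rejects anything unrecognised. */
    static double[] decode(ByteBuffer buf) {
        int tag = buf.getInt();
        if (tag != VERBOSE_ENCODING && tag != SMALL_ENCODING) {
            throw new IllegalArgumentException("unknown encoding tag: " + tag);
        }
        int n = buf.getInt();
        double[] means = new double[n];
        for (int i = 0; i < n; i++) {
            means[i] = (tag == VERBOSE_ENCODING) ? buf.getDouble() : buf.getFloat();
        }
        return means;
    }
}
```

Checking the tag up front is what lets a decoder fail fast on a payload produced by a different digest type instead of misreading its header fields.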
The raw data used to generate the test data with tdunning's library is provided.