welch / tdigest

tdigest: javascript implementation of Dunning's T-Digest for streaming quantile approximation
MIT License
68 stars 10 forks source link

apply to big data? #7

Open maodouchen opened 5 years ago

maodouchen commented 5 years ago

Does this apply to big data? 500 million?

dt-rush commented 5 years ago

Consider how the internal data structure (an RB tree of centroids) will grow with N, M, where

N = number of input values M = range of values

If you look at the implementation or the paper, you'll see that at a simplified level, we can say that if the data point is within a certain distance of an existing centroid, it will add to its weight. Else we create a new centroid or possibly merge two centroids.

So basically the number of centroids scales with M. Given that M is fixed, you have constant storage. But data points are not individually stored.

You can always run a test to confirm your suspicions about a data structure being usable for large input or not ;)

It's generally a safe bet, as well, that online algorithms (processing streams of data) tend to be pursued exactly because storage scaling with input size is a problem at scale, and the online algorithm will have some space guarantees.