apply to big data? - Githubissues

Consider how the internal data structure (an RB tree of centroids) will grow with N, M, where

N = number of input values M = range of values

If you look at the implementation or the paper, you'll see that at a simplified level, we can say that if the data point is within a certain distance of an existing centroid, it will add to its weight. Else we create a new centroid or possibly merge two centroids.

So basically the number of centroids scales with M. Given that M is fixed, you have constant storage. But data points are not individually stored.

You can always run a test to confirm your suspicions about a data structure being usable for large input or not ;)

It's generally a safe bet, as well, that online algorithms (processing streams of data) tend to be pursued exactly because storage scaling with input size is a problem at scale, and the online algorithm will have some space guarantees.

welch / tdigest

apply to big data? #7