maodouchen opened this issue 6 years ago
Consider how the internal data structure (an RB tree of centroids) will grow with N and M, where N = number of input values and M = range of values.
If you look at the implementation or the paper, you'll see that, at a simplified level, when a data point falls within a certain distance of an existing centroid it is absorbed into that centroid (adding to its weight); otherwise a new centroid is created, possibly merging two existing centroids.
So, roughly speaking, the number of centroids scales with M rather than N. Since M is fixed, storage is effectively constant, and the individual data points themselves are never stored.
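To make that rule concrete, here is a minimal, hypothetical sketch of the "absorb or create" step. The fixed `delta` threshold and the plain-list storage are simplifications assumed for illustration only; they are not the library's actual code, which keeps centroids in an RB tree and uses a more sophisticated size criterion.

```python
# Hypothetical sketch of the "absorb into nearest centroid or create a new one"
# rule discussed above. Not the real t-digest implementation.

class Centroid:
    def __init__(self, mean, weight=1):
        self.mean = mean
        self.weight = weight

    def add(self, value):
        # Fold the new value into the running mean and bump the weight.
        self.weight += 1
        self.mean += (value - self.mean) / self.weight


class TinyDigest:
    def __init__(self, delta=1.0):
        self.delta = delta        # assumed absorption distance (illustrative)
        self.centroids = []       # real implementations keep these in an ordered tree

    def update(self, value):
        # Find the closest existing centroid, if any.
        nearest = min(self.centroids,
                      key=lambda c: abs(c.mean - value),
                      default=None)
        if nearest is not None and abs(nearest.mean - value) <= self.delta:
            nearest.add(value)                       # within range: add to its weight
        else:
            self.centroids.append(Centroid(value))   # otherwise: new centroid
```

Because a new centroid is only created when no existing one is close enough, the centroid count is bounded by the value range and the threshold, not by how many points you feed in.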
You can always run a quick test to confirm your suspicions about whether a data structure is usable for large input or not ;) For example, something like the check below.
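Here is one way such a test might look, reusing the hypothetical `TinyDigest` sketch above; the value range and the counts are arbitrary choices for illustration.

```python
# Quick empirical check: feed growing N from a fixed value range M and watch
# the centroid count, which should plateau while N keeps growing.
import random

digest = TinyDigest(delta=1.0)            # TinyDigest from the sketch above
total = 0
for n in (1_000, 10_000, 100_000):
    for _ in range(n):
        digest.update(random.uniform(0.0, 100.0))   # fixed range M = [0, 100]
    total += n
    print(f"N = {total:>7,}: {len(digest.centroids)} centroids")
# Expected: the centroid count levels off (roughly M / delta here) even as N grows.
```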
It's generally a safe bet, as well, that online algorithms (those that process streams of data) are pursued precisely because storage that scales with input size becomes a problem at scale, so they usually come with some space guarantee.
Does this apply to big data, e.g. 500 million values?