world-federation-of-advertisers / cardinality_estimation_evaluation_framework

Evaluation framework and methods for estimating cardinalities of groups of sets
Apache License 2.0
22 stars 9 forks source link

Speed Improvement By Replacing np.random.choice #14

Closed pfhgetty closed 4 years ago

pfhgetty commented 4 years ago

np.random.choice with replace=False allocates large amounts of memory linear with number of items to choose from. This PR replaces np.random.choice with Robert Floyd's No-Replacement Sampling algorithm which is linear with how many items we are choosing (which is generally a much smaller number than the items we can choose from for our use case).

-- This method is nearly 1000x faster when the population is greater than 10^9 -- The runtime of the simulation for vector_of_counts-4096-ln3-sequential on the smoke_test scenarios is about 3 seconds with this fix on a Google Pixelbook. Without this fix, the runtime is over 3 minutes.