The sweet spot seems to be around 10,000 items, which is just 12 MB on a 300d vector model (24 MB on a 600d one).
Since both numbers are considerably smaller than the memory the mmapped files usually take up, this is a no-brainer.
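For reference, the 12 MB figure follows if we assume float32 storage (4 bytes per dimension); a quick sanity check:

```go
package main

import "fmt"

func main() {
	// 10,000 cached vectors at 300 dimensions, 4 bytes per float32
	const items, dims, bytesPerDim = 10000, 300, 4
	fmt.Println(items * dims * bytesPerDim) // 12000000 bytes ≈ 12 MB (doubles to 24 MB at 600d)
}
```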
The performance gain on a local test data set was between 25 and 30%. This isn't fully representative, as something else may by now be the bottleneck, so depending on the use case the gain might be even bigger.
released in xxx-v0.4.10
The assumption is that when importing a lot of texts in a row (e.g. articles), there is going to be a lot of overlap in words. Right now we already save on disk reads by using an mmapped file for the lookup. However, there might be a chance to further speed up vectorization by keeping an in-memory cache. As c11y vectors are immutable, we do not have to worry about cache invalidation.
Experiments with an HNSW vector cache have shown that a viable approach might be to have a cache with a fixed length (to control memory usage) that simply gets purged once it runs full. The same approach might also help speed up imports in the contextionary.
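A minimal sketch of such a purge-on-full cache, assuming a Go implementation; the names (`vectorCache`, `readFromMmap`) and the size constant are illustrative, not the actual contextionary code:

```go
package main

import (
	"fmt"
	"sync"
)

const maxCacheSize = 10000 // the sweet spot found in local tests

// vectorCache holds immutable word vectors. Because entries never change,
// there is no invalidation logic: once full, the whole cache is purged.
type vectorCache struct {
	sync.Mutex
	entries map[string][]float32
}

func newVectorCache() *vectorCache {
	return &vectorCache{entries: make(map[string][]float32, maxCacheSize)}
}

func (c *vectorCache) get(word string) ([]float32, bool) {
	c.Lock()
	defer c.Unlock()
	vec, ok := c.entries[word]
	return vec, ok
}

func (c *vectorCache) put(word string, vec []float32) {
	c.Lock()
	defer c.Unlock()
	if len(c.entries) >= maxCacheSize {
		// purge everything rather than tracking LRU order: cheap, and it
		// keeps memory bounded at ~12 MB for 300d vectors
		c.entries = make(map[string][]float32, maxCacheSize)
	}
	c.entries[word] = vec
}

// lookupVector consults the cache first and falls back to the mmapped
// file lookup (stubbed here) on a miss.
func lookupVector(c *vectorCache, word string) []float32 {
	if vec, ok := c.get(word); ok {
		return vec
	}
	vec := readFromMmap(word) // hypothetical disk-backed lookup
	c.put(word, vec)
	return vec
}

func readFromMmap(word string) []float32 {
	return make([]float32, 300) // placeholder for the real mmap read
}

func main() {
	cache := newVectorCache()
	fmt.Println(len(lookupVector(cache, "article"))) // 300
}
```

Purging the whole cache instead of evicting per entry trades some hit rate for simplicity; with heavily overlapping import batches, the hot words are re-cached almost immediately after a purge.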