The sweet spot seems to be around 10,000 items, which is just 12 MB on a 300d vector model (24 MB on a 600d one).
Since both numbers are considerably smaller than the memory the mmapped files usually take up, this is a no-brainer.
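For reference, the 12 MB figure follows if we assume float32 storage (4 bytes per dimension); a quick sanity check:

```go
package main

import "fmt"

func main() {
	// 10,000 cached vectors at 300 dimensions, 4 bytes per float32
	const items, dims, bytesPerDim = 10000, 300, 4
	fmt.Println(items * dims * bytesPerDim) // 12000000 bytes ≈ 12 MB (doubles to 24 MB at 600d)
}
```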
The performance gain on a local test data set was between 25 and 30%. This isn't fully representative, as something else may by now be the bottleneck, so depending on the use case the gain might be even bigger.
released in xxx-v0.4.10
The assumption is that when importing a lot of texts in a row (e.g. articles), there is going to be a lot of overlap in words. Right now we already save on disk reads by using an mmapped file for the lookup. However, there might be a chance to further speed up vectorization by keeping an in-memory cache. As c11y vectors are immutable, we do not have to worry about cache invalidation.
Experiments with an HNSW vector cache have shown that a viable approach might be to have a cache with a fixed length (to control memory usage) that simply gets purged once it runs full. The same approach might also help speed up imports in the contextionary.
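A minimal sketch of such a purge-on-full cache, assuming a Go implementation; the names (`vectorCache`, `readFromMmap`) and the size constant are illustrative, not the actual contextionary code:

```go
package main

import (
	"fmt"
	"sync"
)

const maxCacheSize = 10000 // the sweet spot found in local tests

// vectorCache holds immutable word vectors. Because entries never change,
// there is no invalidation logic: once full, the whole cache is purged.
type vectorCache struct {
	sync.Mutex
	entries map[string][]float32
}

func newVectorCache() *vectorCache {
	return &vectorCache{entries: make(map[string][]float32, maxCacheSize)}
}

func (c *vectorCache) get(word string) ([]float32, bool) {
	c.Lock()
	defer c.Unlock()
	vec, ok := c.entries[word]
	return vec, ok
}

func (c *vectorCache) put(word string, vec []float32) {
	c.Lock()
	defer c.Unlock()
	if len(c.entries) >= maxCacheSize {
		// purge everything rather than tracking LRU order: cheap, and it
		// keeps memory bounded at ~12 MB for 300d vectors
		c.entries = make(map[string][]float32, maxCacheSize)
	}
	c.entries[word] = vec
}

// lookupVector consults the cache first and falls back to the mmapped
// file lookup (stubbed here) on a miss.
func lookupVector(c *vectorCache, word string) []float32 {
	if vec, ok := c.get(word); ok {
		return vec
	}
	vec := readFromMmap(word) // hypothetical disk-backed lookup
	c.put(word, vec)
	return vec
}

func readFromMmap(word string) []float32 {
	return make([]float32, 300) // placeholder for the real mmap read
}

func main() {
	cache := newVectorCache()
	fmt.Println(len(lookupVector(cache, "article"))) // 300
}
```

Purging the whole cache instead of evicting per entry trades some hit rate for simplicity; with heavily overlapping import batches, the hot words are re-cached almost immediately after a purge.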