yeraydiazdiaz / lunr.py

A Python implementation of Lunr.js 🌖
http://lunr.readthedocs.io
MIT License
188 stars, 16 forks

Minimizing index file size? #80

Closed (chrisspen closed this issue 4 years ago)

chrisspen commented 4 years ago

Can you recommend any ways to minimize the size of the serialized index file?

I'm testing creation of a Lunr index with 30,000 documents, and after only 300 documents are loaded, the serialized index file is already over 40 MB. I was hoping to use lunr.py to build an index file server-side, then load it client-side to implement site-wide text search. However, this file size makes Lunr infeasible: the index for all 30,000 documents would likely be in the gigabytes, which would be impractical to load in the browser.

yeraydiazdiaz commented 4 years ago

Hi @chrisspen, I think you're probably hitting the usability limits of Lunr. As you mention, creating such a large index, serializing it, and serving it to each client is not ideal, so you may want to consider service-based solutions instead.

That being said, and to answer your question, the only way I can think of to reduce the file size would be to tweak the index configuration so that fewer words are indexed. This could be accomplished by extending the list of stop words, which is currently fairly naive. There is currently no easy way of doing this, as there is no customisation support yet. Even with it, the search results might differ, and the benefit in file size might be small given the number of documents you are indexing.
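As a rough illustration of the stop-word idea, one possible workaround is to strip extra stop words from the document text before it is handed to the indexer. The sketch below is plain Python and does not use any lunr.py customisation API. The `EXTRA_STOP_WORDS` list and the helper names `strip_stop_words` and `prefilter_documents` are assumptions made up for this example, and the tokenization is deliberately naive.

```python
# Hypothetical extra stop words; this list is an assumption, tune it per corpus.
EXTRA_STOP_WORDS = frozenset({"also", "however", "therefore", "using"})

# Punctuation stripped from a token before comparing it to the stop-word list.
_PUNCTUATION = ".,;:!?()[]\"'"


def strip_stop_words(text, stop_words=EXTRA_STOP_WORDS):
    """Drop whitespace-separated tokens whose normalized form is a stop word."""
    kept = []
    for token in text.split():
        normalized = token.strip(_PUNCTUATION).lower()
        if normalized not in stop_words:
            kept.append(token)
    return " ".join(kept)


def prefilter_documents(documents, fields, stop_words=EXTRA_STOP_WORDS):
    """Return copies of the documents with stop words removed from `fields`.

    Keys not listed in `fields` (for example the document ref) are copied as-is.
    """
    return [
        {
            key: strip_stop_words(value, stop_words) if key in fields else value
            for key, value in doc.items()
        }
        for doc in documents
    ]


docs = [
    {"id": "1", "title": "Using Lunr", "body": "However, the index is also large."}
]
filtered = prefilter_documents(docs, fields=("title", "body"))
```

The filtered list could then be passed to `lunr(ref="id", fields=("title", "body"), documents=filtered)` in place of the original documents. Words removed this way can no longer be searched for, so results may differ, and the size reduction is not guaranteed to be significant.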

Again, given this and https://github.com/yeraydiazdiaz/lunr.py/issues/79, I would suggest using a different search solution in your case. Good luck!