yeraydiazdiaz / lunr.py

A Python implementation of Lunr.js 🌖
http://lunr.readthedocs.io
MIT License

Incremental index build? #114

Closed chrisspen closed 2 years ago

chrisspen commented 2 years ago

Currently, it looks like the build method works as a batch process, requiring all documents to be added to the builder before the index can be created.

Unfortunately, this limits the scalability of Lunr, since every addition of a new document requires the builder to be re-run.

I'm investigating how to apply Lunr to a large, constantly changing collection (~100k documents) while keeping a search index up to date efficiently.

Is there any existing support in the codebase that I've overlooked that would allow incremental updates to the index?

At first glance, it looks like most of what Builder.add() does could be made to work incrementally and moved directly into the Index class, which would make Lunr tremendously more useful. Is there anything I'm missing?
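
For reference, the batch flow I'm describing looks roughly like this (a sketch; the ref and field names are only illustrative):

from lunr import lunr

docs = [
    {"id": "/a", "title": "First page", "text": "some text"},
    {"id": "/b", "title": "Second page", "text": "more text"},
]

# lunr() tokenizes, stems and indexes the full document set in one pass,
# so every document has to be available before the index exists.
idx = lunr(ref="id", fields=("title", "text"), documents=docs)

# adding a document later means repeating the whole call over the full set
docs.append({"id": "/c", "title": "Third page", "text": "even more text"})
idx = lunr(ref="id", fields=("title", "text"), documents=docs)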

yeraydiazdiaz commented 2 years ago

Hi @chrisspen, good question. There is currently no efficient way to do what you're suggesting, but there is a workaround which may be acceptable depending on your constraints.

The Index mostly holds the search logic, while the Builder does the heavy lifting of actually indexing the documents. The Builder holds the inverted index and passes it to the Index it creates, so you could keep the builder instance in memory, add new documents to it, and call build on the same instance to create new Index instances. Something like the following:

from lunr import get_default_builder

docs = [...]  # the initial documents, each a dict with "id", "title" and "text" keys

builder = get_default_builder()
builder.ref("id")
for field in ("text", "title"):
    builder.field(field)

for doc in docs:
    builder.add(doc)

idx = builder.build()

# later

new_doc = {
    "id": "/new-doc",
    "text": "I'm a new document, definitely not in the index before: foobarbaz",
    "title": "NEW DOCUMENT foobarbaz"
}

# adding to the same builder and rebuilding picks up the new document
builder.add(new_doc)
idx2 = builder.build()

# the original index still knows nothing about the new document...
assert idx.search("foobarbaz") == []
# ...while the rebuilt one finds it
assert idx2.search("foobarbaz")[0]["ref"] == "/new-doc"

There is some overhead in creating the new Index instances, but it's probably small compared to rebuilding the whole thing from scratch. Note, however, that there is no way to de-index a document, so if documents are removed from your set you'd have to rebuild everything; I'm not sure whether that's covered by 'constantly changing'.
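
If deletions do come up, a sketch of that rebuild-everything fallback, continuing the example above, would be to keep your documents in a plain dict keyed by ref and build a fresh builder from it whenever something is removed (rebuild_index and doc_store are just illustrative names, not part of lunr):

from lunr import get_default_builder

def rebuild_index(doc_store):
    # doc_store maps ref -> document dict; a full rebuild is the only way
    # to make a removed document disappear from search results.
    builder = get_default_builder()
    builder.ref("id")
    for field in ("text", "title"):
        builder.field(field)
    for doc in doc_store.values():
        builder.add(doc)
    return builder.build()

doc_store = {doc["id"]: doc for doc in docs}
doc_store[new_doc["id"]] = new_doc

# later, the new document is removed from the collection
del doc_store[new_doc["id"]]
idx3 = rebuild_index(doc_store)
assert idx3.search("foobarbaz") == []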

yeraydiazdiaz commented 2 years ago

I'll go ahead and close this issue; hopefully my suggestion was helpful.

chrisspen commented 2 years ago

Thanks.