whoosh-community / whoosh

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python.

Concurrency Search #554

Closed celsofranssa closed 4 years ago

celsofranssa commented 4 years ago

I have an index (size = 3.5 GB) of 5 million small documents indexed using Whoosh.

My documents have only a name and content, so my schema is very simple, with just two fields: name and content.

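from whoosh.fields import Schema, ID, TEXT

# Both fields are stored so they can be read back from search hits.
schema = Schema(name=ID(stored=True),
                content=TEXT(stored=True))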

To test performance, I'm using a set of 70,000 queries, but Whoosh is taking about 20 seconds to execute each one.

from whoosh import scoring
from whoosh.index import open_dir
from whoosh.qparser import QueryParser

index = open_dir("../data/search/bm25_index/")
query_parser = QueryParser("content", schema=index.schema)
q = query_parser.parse("some query")
with index.searcher(weighting=scoring.TF_IDF()) as searcher:
    results = searcher.search(q)

Since the index is stateless, how can I run searches across multiple threads?

stevennic commented 4 years ago

I think this is more of a client-side issue. Due to the GIL, multi-threading won't give you parallelism for CPU-bound work like this, unless you use Cython. With CPython, all you can do is use multiprocessing to achieve parallelism. That should work for you.
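
A minimal sketch of the multiprocessing approach, assuming the index path and the `content` and `name` fields from the original post. The helpers `init_worker`, `search_one`, and `search_all` are hypothetical names, not part of Whoosh. Each worker process opens its own searcher once and reuses it for every query it is handed, and hits are converted to plain tuples so they can be sent back to the parent process.

```python
from multiprocessing import Pool

INDEX_DIR = "../data/search/bm25_index/"  # path taken from the original post

# Per-process state, filled in by init_worker (hypothetical helper).
_searcher = None
_parser = None


def init_worker(index_dir):
    """Open the index once per process instead of once per query."""
    global _searcher, _parser
    # Imported here so Whoosh is only needed where a searcher is opened.
    from whoosh import scoring
    from whoosh.index import open_dir
    from whoosh.qparser import QueryParser

    ix = open_dir(index_dir)
    _parser = QueryParser("content", schema=ix.schema)
    # Left open for the lifetime of the worker process.
    _searcher = ix.searcher(weighting=scoring.TF_IDF())


def search_one(text, limit=10):
    """Return (query text, [(name, score), ...]) as plain picklable data."""
    hits = _searcher.search(_parser.parse(text), limit=limit)
    return text, [(hit["name"], hit.score) for hit in hits]


def search_all(queries, workers=4, index_dir=INDEX_DIR,
               init=init_worker, search=search_one):
    """Run every query; workers <= 1 stays in-process, which helps debugging."""
    if workers <= 1:
        init(index_dir)
        return [search(q) for q in queries]
    with Pool(workers, initializer=init, initargs=(index_dir,)) as pool:
        # A larger chunksize cuts inter-process overhead for many small queries.
        return pool.map(search, queries, chunksize=256)
```

Call `search_all(list_of_query_strings)` from under an `if __name__ == "__main__":` guard so the script also works on platforms that spawn worker processes. This spreads the 70,000 queries across cores, but it does not make any single query faster.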

orenovadia commented 4 years ago

I don't think one search should take as long as 20 seconds. Do you have an idea why it is so slow?
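
One way to start answering that question is to profile a single search. This is only a sketch using the standard-library `cProfile` and `pstats` modules; `profile_call` is a hypothetical helper name, and it would be called with whatever function runs one query.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=15, **kwargs):
    """Run fn under cProfile; return (result, text report of the slowest calls)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

For example, `results, report = profile_call(searcher.search, q)` followed by `print(report)` would show whether the time goes into query parsing, posting-list reads, or scoring.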

stevennic commented 4 years ago

Not offhand, but in my experience I've also found that Whoosh doesn't scale well. Once you get to substantial data volumes, its performance is quite disappointing. I could benchmark it and look for the bottleneck, but I'm guessing it's just Python itself. If you really wanted to stick with Whoosh, you could try manually sharding it into parallel indexes, but once you've hit this wall, you should really be considering alternatives like Solr.
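
A rough sketch of what manual sharding could look like, independent of Whoosh itself. The helpers `shard_for` and `merge_top_k` are hypothetical names: the first routes each document to one of several smaller indexes at indexing time, and the second merges the per-shard hit lists at query time. It assumes each shard returns `(name, score)` pairs. Note that each shard computes its scores from its own term statistics, so scores from different shards are only approximately comparable.

```python
import heapq
import zlib


def shard_for(name, num_shards):
    """Pick a shard from a stable hash of the document name."""
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    return zlib.crc32(name.encode("utf-8")) % num_shards


def merge_top_k(per_shard_hits, k=10):
    """Merge [(name, score), ...] lists from each shard into one top-k list."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    # nlargest returns the k best hits, highest score first.
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])
```

Each shard can then be searched in its own process and the results combined with `merge_top_k`.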

celsofranssa commented 4 years ago

@orenovadia I guess it's because of the customization which I must use.

celsofranssa commented 4 years ago

@stevennic Indeed, I ended up migrating my project to Lucene, which is incredibly fast.