Closed by celsofranssa 4 years ago
I think this is more a client-side issue. Due to the GIL, multi-threading isn't an option, unless you use Cython. All you can do with CPython is use multiprocessing to achieve parallelism. That should work for you.
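A minimal sketch of the multiprocessing approach, assuming the index lives in a directory that every worker process can open read-only and that the searchable field is named content. The function names (split_into_batches, search_batch, search_parallel) and the worker count are hypothetical and chosen only for illustration. Each worker opens its own searcher and handles one contiguous batch of queries, so nothing has to be shared between processes.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_batches(items, n_batches):
    """Split items into at most n_batches contiguous, roughly equal slices."""
    n_batches = max(1, min(n_batches, len(items) or 1))
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches get one additional item each.
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


def search_batch(index_dir, batch, limit=10):
    """Run every query in `batch` inside one worker process."""
    # Imported here so each worker loads Whoosh and opens the index itself.
    from whoosh import index
    from whoosh.qparser import QueryParser

    ix = index.open_dir(index_dir)
    # Assumption: the field being searched is called "content".
    parser = QueryParser("content", ix.schema)
    out = []
    with ix.searcher() as searcher:
        for text in batch:
            hits = searcher.search(parser.parse(text), limit=limit)
            # Internal document numbers; map them to stored fields as needed.
            out.append((text, [hit.docnum for hit in hits]))
    return out


def search_parallel(index_dir, queries, workers=4):
    """Fan the queries out over `workers` processes and collect the results."""
    batches = split_into_batches(queries, workers)
    results = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(search_batch, [index_dir] * len(batches), batches):
            results.extend(part)
    return results
```

Call search_parallel from under an `if __name__ == "__main__":` guard so that it also works on platforms that spawn rather than fork worker processes. This only spreads the queries over more cores; it does not make an individual query any faster.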
I don't think one search should take as long as 20 seconds. Do you have an idea why it is so slow?
Not offhand, but in my experience Whoosh doesn't scale well. Once you get to substantial data volumes, its performance is quite disappointing. I could benchmark it and look for the bottleneck, but I'm guessing it's just Python itself. If you really want to stick with Whoosh, you could try manually sharding it into parallel indexes, but once you've hit this wall, you should really be considering alternatives like Solr.
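A rough sketch of what manual sharding could involve, written independently of Whoosh itself. The helper names (shard_for, merge_top_k) are hypothetical, and hits are assumed to arrive as (score, doc_id) pairs from each shard. One caveat: relevance scores computed against separate indexes depend on per-shard term statistics, so merging by raw score is only an approximation of what a single index would return.

```python
import heapq
import zlib


def shard_for(doc_id, n_shards):
    """Pick a shard for a document with a stable hash of its id."""
    # zlib.crc32 is deterministic across runs, unlike the built-in hash() for str.
    return zlib.crc32(doc_id.encode("utf-8")) % n_shards


def merge_top_k(per_shard_hits, k):
    """Merge per-shard lists of (score, doc_id) into one global top-k list."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    # Highest score first.
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])
```

At indexing time each document would be written to the index chosen by shard_for; at query time every shard is searched (for example, one process per shard) and the partial results are combined with merge_top_k.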
@orenovadia I guess it's because of the customization which I must use.
@stevennic indeed, I ended up migrating my project to Lucene, which is incredibly fast.
I have an index (size = 3.5 GB) of 5 million small documents indexed using Whoosh.
Since my documents have only a name and content, my Schema is very simple and has only two fields: id and content.
To test performance, I'm using a set of 70,000 queries, but Whoosh is taking about 20 seconds to execute each one.
Since the index is stateless, how could I perform a multi-threaded search?