whiskyechobravo / kerko

A web application component that provides a faceted search interface for bibliographies managed with Zotero.
https://whiskyechobravo.github.io/kerko/
GNU General Public License v3.0

Kerko with a large database - Incremental sync? #26

Open emanjavacas opened 2 months ago

emanjavacas commented 2 months ago

Hi there,

First of all, thanks for this amazing piece of software. For a project I am working on, we need to publish a relatively large Zotero database and make it searchable. Kerko seems to be the best fit for the job, but we may have to index up to 450k items. I am wondering whether you envision any issues deploying Kerko at that scale. I have been syncing some of my own libraries (about 5k items) and the sync already takes considerable time. I assume that syncing 450k items would probably take weeks, which is in principle not a big deal as long as future syncs are incremental. But I am unsure about this.

Looking forward to hearing your opinion on this.

Best regards,

davidlesieur commented 2 months ago

Glad to hear that Kerko is of interest for your project. Regarding the size of your database, I think there might be a few issues:

  1. Zotero: I have never tested it with that many items. Perhaps it can handle them, but with 450k items it is likely to become laggy and unpleasant to use. My understanding is that performance will be improved in Zotero 7 (currently in beta), but 450k is still an order of magnitude more than what usually works comfortably in Zotero.
  2. Incremental sync: Kerko has two databases. The first is a cache that it builds by retrieving items from Zotero; the cache sync from Zotero is incremental (see the sketch after this list). The second is the search index, which Kerko builds from its cache and rebuilds in full whenever the cache has changed. Building the search index is much faster than synchronizing from Zotero, but for a large database it can still take significant time. I plan to make indexing incremental in the future, but this is not a trivial task: the search index is a denormalized database, so incremental indexing has to take dependencies into account (e.g., relations between items, and between items and collections). I do not yet have funding for this work.
  3. Search engine: Kerko's search engine is Whoosh, which is likely to be slow given the size of your database (both at searching and at indexing). The Whoosh project has also become moribund; people with large databases have run into search and indexing issues that have never been addressed. I have a plan for replacing Whoosh with a higher-performance solution in the future, but this needs funding too.
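
To expand a bit on point 2: the incremental cache sync is possible because the Zotero Web API accepts the library version recorded at the previous sync and returns only the objects modified since then. The pattern looks roughly like this (a simplified illustration rather than Kerko's actual code; the group ID and API key below are placeholders):

```python
import requests

API = "https://api.zotero.org"
LIBRARY = "groups/123456"  # placeholder group ID
HEADERS = {"Zotero-API-Version": "3", "Zotero-API-Key": "YOUR_KEY"}  # placeholder key

def fetch_changes(since_version):
    """Return items modified after `since_version`, plus the new library version."""
    start, changed = 0, []
    while True:
        resp = requests.get(
            f"{API}/{LIBRARY}/items",
            headers=HEADERS,
            params={"since": since_version, "format": "json", "limit": 100, "start": start},
        )
        resp.raise_for_status()
        batch = resp.json()
        changed.extend(batch)
        if len(batch) < 100:  # last page reached
            break
        start += 100
    # Store this version and pass it back as `since` on the next sync.
    return changed, int(resp.headers["Last-Modified-Version"])
```

(Deletions are reported separately, through the API's `deleted` endpoint.) The search index cannot take the same shortcut as easily because it is denormalized: a single changed item can affect many index records.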

I'll be happy to work on issues 2 and 3 above when I get sufficient funding, but there is not much we can do about issue 1. It seems to me that your project requires all three to be addressed.

I hope this helps!

emanjavacas commented 2 months ago

Thanks a lot for your quick and thorough reply. I will consult and see what happens, especially considering issue 1, which seems to be the bottleneck.

emanjavacas commented 1 month ago

Hi!

I've been syncing and testing the app with a 200k-item database, and the main issue I can see is that there are over 30k topics. This generates an index HTML page of about 20 MB, which is of course suboptimal. I am trying to see if there is a way to deactivate the facets (or at least the topics facet, since the other facets are fine).
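
To make this concrete, here is the kind of customization I am experimenting with in a small Flask app that wraps Kerko (an untested sketch: the remove_facet() call and the 'facet_tag' key are my assumptions from skimming the docs, so the exact names may differ in the Kerko version you run):

```python
from flask import Flask
from kerko.composer import Composer

app = Flask(__name__)

# Start from Kerko's default composer, then drop the tag/topic facet so the
# ~30k topics are neither indexed as a facet nor rendered on the results page.
composer = Composer()
composer.remove_facet('facet_tag')  # assumed method and facet key; check your Kerko version

app.config['KERKO_COMPOSER'] = composer

# Kerko's blueprint would then be registered as usual, e.g.:
# app.register_blueprint(kerko.blueprint, url_prefix='/bibliography')
```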

I have two questions about it.

Thanks for your work!