wmgeolab / scope


Research other search algorithms/APIs #147

l-zheng24 closed 1 month ago

sigaloid commented 1 year ago

Meilisearch just hit version 1.0. It's a search/ranking engine that would let us insert documents and search them. It's also lightweight, has a free hosted tier (but can easily be self-hosted), and has plugins to integrate with a web framework, so a search bar can be added to the front-end more easily.

A tentative workflow we could have: add every document returned by any existing query (metadata + full text; presumably we could derive a ranking from the metadata based on trustworthiness) to the search database, then serve searches from it. This can be packaged in a Docker container.
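A rough sketch of the "add every document" step, to make the idea concrete. Everything here is an assumption for illustration: the record fields (`id`, `title`, `url`, `text`, `source_type`), the `SOURCE_TRUST` table, and the `trust_score` field are hypothetical and not part of SCOPE today. It only prepares newline-delimited JSON, which is the kind of payload a bulk import would take; it does not talk to any search server.

```python
import json

# Hypothetical trust scores per source type. SCOPE's real metadata and
# scoring rules may look nothing like this.
SOURCE_TRUST = {"government": 90, "news": 60, "blog": 30}
DEFAULT_TRUST = 10  # assumed fallback for unknown source types


def to_search_document(record):
    """Flatten one pulled record into a document for the search index."""
    return {
        "id": str(record["id"]),
        "title": record["title"],
        "url": record["url"],
        "text": record["text"],
        # Insert-time ranking signal derived from the metadata.
        "trust_score": SOURCE_TRUST.get(record.get("source_type"), DEFAULT_TRUST),
    }


def to_jsonl(records):
    """Serialize records as newline-delimited JSON, one document per line."""
    return "\n".join(json.dumps(to_search_document(r)) for r in records)


sample = [
    {"id": 1, "title": "Flood relief", "url": "https://example.org/a",
     "text": "Full text of the article.", "source_type": "government"},
    {"id": 2, "title": "Opinion piece", "url": "https://example.org/b",
     "text": "Another full text.", "source_type": "forum"},
]
payload = to_jsonl(sample)
```

The resulting `payload` string would then be sent to whichever engine we pick; the upload call itself is left out on purpose since it depends on that choice.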

Another alternative is Typesense. It provides query-time ranking options, which is good because we need to handle both insert-time and query-time ranking.

It actually looks like Typesense is the way to go: Meilisearch doesn't scale very well past 100k documents. The rustacean in me is sad, but alas.

sigaloid commented 1 year ago

I've made a short write-up analyzing our options.

Meilisearch vs Typesense - main competitors

These two are our main options for a full-scale search engine/database in which we will store the full text of our pulled data. This allows a fuller search integration with SCOPE than searching only tags, URL, or title.

Open source considerations

Typesense and Meilisearch are both fully open source. The code license affects our ability to embed the engine within SCOPE, which is planned to ship as a single container. Both licenses allow embedding in the container, so either is acceptable.

Fault tolerance

Typesense supports multi-node clustering for high availability. However, since we don't plan on implementing this architecture, this is irrelevant :p

Query field weights/boosting

Typesense supports this, whereas Meilisearch does not. We want this functionality so that we can weight fields differently depending on the query.

Winner: Typesense
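For illustration, here is roughly what query-time field weighting could look like as Typesense search parameters. `q`, `query_by`, `query_by_weights`, and `sort_by` are Typesense search parameters; the field names (`title`, `text`, `trust_score`) and the helper `build_search_params` are hypothetical, and the schema would need a numeric `trust_score` field for the sort to work. This only builds the parameter dict and makes no request.

```python
def build_search_params(query, field_weights, sort_by=None):
    """Build a Typesense-style search parameter dict.

    field_weights maps a field name to an integer weight; a higher weight
    makes matches in that field count for more at query time.
    """
    fields = list(field_weights)  # dicts preserve insertion order
    params = {
        "q": query,
        "query_by": ",".join(fields),
        "query_by_weights": ",".join(str(field_weights[f]) for f in fields),
    }
    if sort_by:
        params["sort_by"] = sort_by
    return params


# Weight title matches above full-text matches, then break ties with a
# hypothetical insert-time trust score.
params = build_search_params(
    "flood relief",
    {"title": 3, "text": 1},
    sort_by="_text_match:desc,trust_score:desc",
)
```

Because the weights are passed with each search, different search boxes (or different kinds of queries) could use different weightings without reindexing.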

Record ID generation

Typesense supports this whereas Meilisearch does not. This functionality is a nice-to-have.

Negative queries/similarity search

Typesense supports this, whereas Meilisearch does not. We want this functionality so that we can weight queries lower, exclude certain text, and find similar results.
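A small sketch of the exclusion part, assuming Typesense's convention of prefixing a word with `-` in the `q` parameter to exclude it. The helper name `build_query` is hypothetical, and this covers only single-word exclusions, not similarity search.

```python
def build_query(terms, excluded=()):
    """Join search terms and append each excluded word with a '-' prefix."""
    for word in excluded:
        if " " in word:
            # Keep the sketch simple: only single-word exclusions are handled.
            raise ValueError("excluded entries must be single words")
    parts = list(terms) + ["-" + word for word in excluded]
    return " ".join(parts)


q = build_query(["flood", "relief"], excluded=["opinion"])
```

The resulting string would be passed as the `q` search parameter.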

Search suggestions

Typesense wins again: it has search suggestions, whereas Meilisearch does not.

Winner: Typesense

Ultimately, the winner here is Typesense. I'm going to work on a write-up describing how we can add to its database and properly integrate with it.