zadam / trilium

Build your personal knowledge base with Trilium Notes
GNU Affero General Public License v3.0

(Feature request) More signals for search & autocomplete ranking #3498

Open agentydragon opened 1 year ago

agentydragon commented 1 year ago

Describe feature

Keyword-match search consistently shows a few bad behaviors in my database. I'd like to solve this by feeding more signals into the ranking than just keyword match. Specifically:

My thinking is to combine all these signals / orderings, for example by Borda count or something.

Additional Information

I've already had someone draw up a prototype of part of this that might be OK to merge, and I might invest a bit of time into it myself later.

I think the end-game would be to build a simple ML system for this.

Ideally I'd also try to be smarter about retrieval, e.g. by fetching the notes with the closest embeddings, so that "dogs" would also match "doggies".
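A rough sketch of what "fetch by closest embeddings" could look like, using cosine similarity. The vectors in `TOY_EMBEDDINGS` are made-up numbers for illustration only; a real setup would get them from some embedding model, which is not chosen here.

```python
import math

# Hypothetical, hand-written vectors. These are NOT the output of any real model.
TOY_EMBEDDINGS = {
    "dogs": [0.9, 0.1, 0.0],
    "doggies": [0.8, 0.2, 0.1],
    "taxes": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_titles(query, titles, k=1):
    """Return the k titles whose embedding is closest to the query's embedding."""
    query_vector = TOY_EMBEDDINGS[query]
    ranked = sorted(
        titles,
        key=lambda title: cosine(query_vector, TOY_EMBEDDINGS[title]),
        reverse=True,
    )
    return ranked[:k]
```

With these toy vectors, `nearest_titles("dogs", ["taxes", "doggies"])` picks `doggies`, even though the two strings are not an exact keyword match.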

zadam commented 1 year ago

I've been noticing problems in the autocomplete, but interestingly, my thinking was heading in a different direction - to reduce the number of "signals" or at least reduce their "scoring". The problem with many signals is that it's difficult to create a scoring model which works well for all use cases, and there are often some really bad degenerate cases.

I think there's value in having a simple but predictable autocomplete method. Being simple means being easy to debug mentally - when asking "why can't I find the note?", it's easier to fix "Because I don't remember the name" than "The keywords I use are correct, but there are some notes with many relations that bury the note I was looking for".

One step already implemented in 0.58 is to give an (even bigger) score to note title matches, and especially to exact matches.
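As an illustration of the general idea only (this is not Trilium's actual scoring code, and the weights are invented), a title-first scorer could look roughly like this:

```python
# Hypothetical weights, chosen only to show the ordering: exact > partial > none.
EXACT_TITLE_SCORE = 100
PARTIAL_TITLE_SCORE = 10


def title_score(query, title):
    """Score a note title against a query, case-insensitively."""
    q = query.strip().lower()
    t = title.strip().lower()
    if not q:
        return 0
    if q == t:
        return EXACT_TITLE_SCORE
    if q in t:
        return PARTIAL_TITLE_SCORE
    return 0


def rank_titles(query, titles):
    """Order titles by descending title score; ties keep their input order."""
    return sorted(titles, key=lambda title: -title_score(query, title))
```

A single rule like this stays easy to reason about: if a note does not show up, the query is simply not contained in its title.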

agentydragon commented 1 year ago

My approach would be to start collecting test cases and evaluate the relevancy model against them. That way we can turn these failure modes into a metric and tweak until we find an algorithm that works well.
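One possible metric for such test cases is mean reciprocal rank (MRR). The sketch below assumes each test case is a `(query, expected_note)` pair and that `search` is any function returning a best-first list of notes; both names are placeholders rather than existing Trilium APIs.

```python
def reciprocal_rank(results, expected):
    """1 / (1-based position of the expected note), or 0.0 if it is absent."""
    for position, note in enumerate(results, start=1):
        if note == expected:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(search, test_cases):
    """Average reciprocal rank of `search` over (query, expected_note) pairs."""
    if not test_cases:
        return 0.0
    total = sum(reciprocal_rank(search(query), expected) for query, expected in test_cases)
    return total / len(test_cases)
```

A score of 1.0 means the expected note is always the top suggestion, and a note that never appears contributes 0. Comparing this number before and after a ranking change shows whether the change helped on the collected cases.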

When I can't find the note, it's often because I'm just wording it slightly differently, e.g. "my computer" vs. "my laptop". A reasonable embedding model should have no problem understanding that those two are related. I'd prefer having it suggested over having to try a couple of different ways of writing "my computer".