Open agentydragon opened 1 year ago
I've been noticing problems in the autocomplete, but interestingly, my thinking was heading in a different direction - to reduce the number of "signals" or at least reduce their "scoring". The problem with many signals is that it's difficult to create a scoring model which works well for all use cases, and there are often some really bad degenerate cases.
I think there's value in having a simple but predictable autocomplete method. Being simple means being easy to debug mentally - "why can't I find the note?" - it's easier to fix "Because I don't remember the name" than "The keywords I use are correct, but there are some notes with many relations that bury the note I was looking for".
One step already implemented in 0.58 is to give an (even bigger) score to note title matches, and especially to exact matches.
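To make the idea of boosting title matches concrete, here is a minimal sketch. It is not the actual 0.58 implementation: the function name `title_score` and the weights are made up for illustration, and the only thing taken from the comment above is that an exact title match should outrank a partial one.

```python
# Hypothetical weights, chosen only so that exact > partial > no match.
EXACT_TITLE_MATCH = 100
PARTIAL_TITLE_MATCH = 10


def title_score(query: str, title: str) -> int:
    """Score a note title against a query, case-insensitively."""
    q = query.strip().lower()
    t = title.strip().lower()
    if not q:
        return 0
    if q == t:
        return EXACT_TITLE_MATCH
    if q in t:
        return PARTIAL_TITLE_MATCH
    return 0
```

With a scheme like this, "why did this note rank first?" has a short answer: its title matched the query, either exactly or partially.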
My approach would be to start collecting test cases and evaluate the relevancy model against them. That way we can get a metric out of these failure modes and tweak until we find an algorithm that works well.
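One way such a test-case metric could look is mean reciprocal rank (MRR): for each query, take 1/position of the note the user actually wanted, and average over all cases. The sketch below assumes a ranker is just a function from a query to an ordered list of titles; the note titles, the test cases and `substring_ranker` are invented for illustration.

```python
from typing import Callable, Dict, List


def mean_reciprocal_rank(
    rank_fn: Callable[[str], List[str]],
    test_cases: Dict[str, str],
) -> float:
    """Average of 1/rank of the expected note; a missing note counts as 0."""
    total = 0.0
    for query, expected in test_cases.items():
        results = rank_fn(query)
        if expected in results:
            total += 1.0 / (results.index(expected) + 1)
    return total / len(test_cases)


# Hypothetical notes and test cases (query -> note the user wanted).
NOTES = ["My laptop", "Laptop bag", "Grocery list"]
CASES = {
    "laptop": "My laptop",
    "grocery": "Grocery list",
    "my computer": "My laptop",  # a known failure mode for keyword search
}


def substring_ranker(query: str) -> List[str]:
    """Baseline ranker: titles containing the query, in stored order."""
    q = query.lower()
    return [title for title in NOTES if q in title.lower()]


score = mean_reciprocal_rank(substring_ranker, CASES)
```

Here the baseline finds two of the three notes at position 1 and misses "my computer" entirely, so the score is 2/3. Any change to the ranking can then be judged by whether that number goes up without breaking cases that used to pass.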
When I can't find the note, it's often because I'm just phrasing it slightly differently, e.g. "my computer" vs. "my laptop". A reasonable embedding model should have no problem understanding that those two are related. I'd prefer having it suggested over having to try a couple of different ways of phrasing "my computer".
Describe feature
Search by keyword match consistently has a few bad behaviors in my database. I'd like to solve this by adding more signals into the ranking than just keyword match. Specifically:
~x, then rank higher pages that have more ~x relations pointing at them.
My thinking is to combine all these signals / orderings, for example by Borda count or something.
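For reference, a Borda count over several orderings could be sketched like this. Each signal produces its own ranking of the candidate notes; a note at position p (0-based) among n candidates gets n - 1 - p points from that ranking, and notes a ranking omits get 0 from it. The function name `borda_combine` and the sample rankings are assumptions for illustration, not part of the prototype mentioned below.

```python
from collections import defaultdict
from typing import Dict, List


def borda_combine(rankings: List[List[str]]) -> List[str]:
    """Merge several best-first rankings into one by Borda count."""
    candidates = {note for ranking in rankings for note in ranking}
    n = len(candidates)
    points: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        for position, note in enumerate(ranking):
            # Top of a ranking earns n - 1 points, the next n - 2, and so on.
            points[note] += n - 1 - position
    # Highest total first; ties broken alphabetically to stay deterministic.
    return sorted(candidates, key=lambda note: (-points[note], note))


# Hypothetical orderings produced by three separate signals.
by_keyword = ["note-a", "note-b", "note-c"]
by_title = ["note-b", "note-a", "note-c"]
by_relations = ["note-b", "note-c", "note-a"]

combined = borda_combine([by_keyword, by_title, by_relations])
```

In this example "note-b" collects 5 points, "note-a" 3 and "note-c" 1, so the combined order is note-b, note-a, note-c. Because only positions are used, the signals do not need comparable score scales, which is the main appeal over summing raw scores.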
Additional Information
I've already had someone draw up a prototype of part of this that might be OK to merge. I might also invest a bit of time into it myself later.
I think the end-game would be to build a simple ML system for this.
Ideally I'd also try to be smarter about retrieval, like fetch by closest embeddings, so that e.g. "dogs" would also match to "doggies".
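A rough sketch of what "fetch by closest embeddings" means: every title and the query are turned into vectors, and candidates are ranked by cosine similarity to the query vector. The vectors below are hand-made toy values, not the output of any real embedding model, and `nearest_titles` is a hypothetical helper; a real system would get the vectors from a model and probably use an index instead of a linear scan.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_titles(
    query_vec: List[float],
    title_vecs: Dict[str, List[float]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k titles most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), title) for title, vec in title_vecs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Toy, hand-written vectors standing in for real embeddings.
TOY_TITLE_VECS = {
    "doggies": [0.9, 0.1, 0.0],
    "cat food": [0.6, 0.7, 0.1],
    "my laptop": [0.0, 0.2, 0.9],
}
TOY_QUERY_DOGS = [1.0, 0.0, 0.0]  # pretend embedding of the query "dogs"

top = nearest_titles(TOY_QUERY_DOGS, TOY_TITLE_VECS, k=2)
```

With these toy values "doggies" comes out first and "cat food" second, even though neither title contains the string "dogs". The resulting order could be fed in as one more ranking alongside the keyword-based ones.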