YourFin opened 3 years ago
Thoughts in no particular order:
I 100% think search, streaming, auth, etc will eventually be broken out into their own projects/packages. Part of the reason I like this project.
I've heard really good things about mlocate; maybe that's a place to start. It would be way better than find at least, and it does some form of indexing.
Have been thinking about data vis, but it ain't gonna look like WinDirStat. I'll probably start with pie charts and gauges and build from there.
Tokenization is an interesting issue. As a first pass, I'd find some package that already does close to what we want and use what they've done. Since I assume none of us are search experts, let's use other people's A/B (or hopefully more insightful) testing to our advantage and tweak if we don't like how it works.
I really like the fun ideas; I think many of them should make it into the MVP.
Agree with all your pet peeves. I think the search function should take the search directory as a parameter. As far as good default search goes: assume the current directory, weight more recently modified files higher, and weight file types differently (.mp4 and .docx ranked higher than .dll or .lock), with the ability to change the defaults in the library. Files inside folders shouldn't be served by default (unless relevant) because that can become very bandwidth-expensive.
As far as using the search indexing in the directory browser, I'm not sure what you mean? It would seem providing a path and retrieving all the files at that path wouldn't benefit from indexing.
Agreed on cross-server search. As we get there, the number of results should also be a parameter. Probably even before that, we should tell the search library/function to stop searching after 50 or so results. Do you think it's reasonable to serve 50 results, wait for the client to ask for more and return the next 50? I can imagine a scenario where a client asks for all of the .txt files and gets returned thousands of results.
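The stop-after-50-then-page idea could look something like this sketch (the page size of 50 and the function name are placeholders; the point is that a lazy result iterator means nothing past the requested page is ever computed):

```python
from typing import Iterable, Iterator

PAGE_SIZE = 50  # assumed cap from the discussion; should become a parameter

def paginate(results: Iterator[str], page_size: int = PAGE_SIZE) -> Iterable[list[str]]:
    """Yield search results in fixed-size pages.

    The underlying search only advances when the client asks for the
    next page, so a query like '*.txt' with thousands of hits never
    materializes more than one page at a time.
    """
    page: list[str] = []
    for item in results:
        page.append(item)
        if len(page) == page_size:
            yield page
            page = []
    if page:  # final, possibly short, page
        yield page
```

The client would then request "next 50" repeatedly, and the server holds (or reconstructs via a cursor) the iterator between requests.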
Opening this issue as a discussion point for search and brain dumping ground.
Doing search at a half-decent clip probably means keeping an index of all the files capable of being served, and we should re-use that functionality for the directory browser. Keeping it up to date, however, will be annoying. We'll probably need to hook into the filesystem events during runtime, and re-validate the old cache on each startup. If we want to provide a truly excellent user experience we may also want to be able to gracefully degrade and verify that folder listings are correct manually per-request on startup until indexing is finished. But that's a whole other problem.
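The startup re-validation step might be a plain diff of the cached index against a fresh walk. This is a sketch under assumptions (cache shape is just path → mtime; runtime updates via filesystem events like inotify/FSEvents are out of scope here):

```python
import os

def diff_cache(cache: dict[str, float], root: str):
    """Compare a cached {path: mtime} index against the on-disk tree.

    Returns (added, removed, changed) so the caller can patch the index
    instead of rebuilding it from scratch on every startup.
    """
    on_disk: dict[str, float] = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            on_disk[path] = os.path.getmtime(path)
    added = set(on_disk) - set(cache)
    removed = set(cache) - set(on_disk)
    changed = {p for p in cache.keys() & on_disk.keys() if cache[p] != on_disk[p]}
    return added, removed, changed
```

The graceful-degradation idea would layer on top: serve directory listings straight from disk while this pass (and the initial index build) is still running.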
There are roughly two camps that I see wrt how to improve search beyond the braindead "does it match `/.*${search_query}.*/i`" approach: the fd/ivy/helm/fzf/IDE "let's get clever about how we do partial character matches across the query string" approach that, for example, strongly matches "FeedMeSeymore" when the query is "memor feed", and the more Google/Elasticsearch-y approach that tries to get more intelligent by acting at a higher level than individual characters.

Personal pet peeves to avoid:
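For reference, a toy version of the first camp, where every space-separated query token must appear as an in-order (not necessarily contiguous) subsequence of the candidate, case-insensitively. This is a sketch of the idea, not any particular library's actual scoring algorithm:

```python
def fuzzy_match(query: str, candidate: str) -> bool:
    """True if each whitespace token of `query` is a case-insensitive,
    in-order subsequence of `candidate` (no scoring, just match/no-match)."""
    hay = candidate.lower()
    for token in query.lower().split():
        pos = 0
        for ch in token:
            pos = hay.find(ch, pos)
            if pos == -1:
                return False  # ran out of haystack for this token
            pos += 1
    return True
```

So `fuzzy_match("memor feed", "FeedMeSeymore")` succeeds where the naive `/.*memor feed.*/i` regex fails outright; real implementations then rank matches by gap size, word boundaries, etc.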
Fun ideas:
Things to ponder: