Improve search quality by adding entity indexes and likelihood counters

wwoast commented 5 years ago

When you type free text into the RPF search box, it returns results for whichever interpretation of your text has the most hits. This method has a ton of problems (more as I think of them):

"ichi" returns Ichikawa, Ichihara, and other place names, since searching by location has the most hits. You probably wanted the panda named Ichi.
It's hard to disambiguate between numeric panda ids and year numbers that overlap.
Searching for "China" matches location addresses, but a better set of results might be the zoo entities that have China as a flag value, rather than detecting "China" as part of the location.
"baby" the keyword should have higher priority than "baby" the panda photo tag, unless it appears in a compound query with the name of a panda.

One way around this problem is to create an entity indexing system. Place names, panda names, zoo names, and other searchable entity text would get classified with three pieces of data: an entity type (what it is), an entity priority per type, and a hit count in the graph for that entity. All of these values can be built into an index at publish-time, and can be used to suggest either subsets of content or alternate content to display.
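A minimal sketch of what such an index could look like, assuming the published graph can be flattened into (text, entity type) pairs. The names `EntityEntry`, `TYPE_PRIORITY`, and `build_entity_index`, the priority values, and the sample records are all hypothetical, not taken from the RPF code:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical per-type priorities; higher wins. Not values from RPF.
TYPE_PRIORITY = {"panda": 3, "zoo": 2, "location": 1}

@dataclass(frozen=True)
class EntityEntry:
    entity_type: str  # what the text is: "panda", "zoo", "location", ...
    priority: int     # priority assigned to that entity type
    hits: int         # how many records carry this text as that type

def build_entity_index(records):
    """Build {lowercased text: [EntityEntry, ...]} from (text, type) pairs."""
    counts = defaultdict(int)
    for text, entity_type in records:
        counts[(text.lower(), entity_type)] += 1
    index = defaultdict(list)
    for (text, entity_type), hits in counts.items():
        entry = EntityEntry(entity_type, TYPE_PRIORITY[entity_type], hits)
        index[text].append(entry)
    return dict(index)

# Made-up sample records standing in for the published graph.
records = [
    ("Ichi", "panda"),
    ("Ichikawa", "location"),
    ("Ichikawa", "location"),
    ("Ichihara", "location"),
]
index = build_entity_index(records)
```

Because the index is a plain dictionary, it could be serialized once at publish-time and loaded by the search page without recounting anything.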

To illustrate how this might work, I would bump the panda entity priority such that exact matches for that type would take precedence over the partial string searches for locations or other things. There would be similar precedence bumps for numbers in a "year" date range, or for country matches outside of location strings.
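One way the precedence bump could be sketched: an exact match is ranked by its entity type's priority, while partial matches fall back to hit count alone. The `INDEX` contents, the priority numbers, and `rank_candidates` are assumptions made for illustration, not the RPF implementation:

```python
# Hypothetical index: text -> (entity type, type priority, hit count).
INDEX = {
    "ichi": ("panda", 3, 1),
    "ichikawa": ("location", 1, 40),
    "ichihara": ("location", 1, 25),
}

def rank_candidates(query, index=INDEX):
    """Order substring matches: exact matches by type priority, then by hits."""
    query = query.lower()
    scored = []
    for text, (entity_type, priority, hits) in index.items():
        if query not in text:
            continue
        # Only an exact match gets the benefit of its type priority.
        exact_priority = priority if text == query else 0
        scored.append((exact_priority, hits, text, entity_type))
    scored.sort(key=lambda row: (row[0], row[1]), reverse=True)
    return [(text, entity_type) for _, _, text, entity_type in scored]
```

With this ordering, `rank_candidates("ichi")` puts the panda first even though both locations have far more hits; the locations then follow in hit-count order.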

I suspect this entity index system may be required for good performance and UX feedback as to what the search is doing, once RPF has a proper parsing strategy for search queries.

wwoast commented 5 years ago

The first post described the backing data needed for better search UX. On the front end, I want this to look like the "assisted search" menu that appears when you look for music or reviews on https://pitchfork.com.

wwoast commented 5 years ago

This feature needs to wait for a better search parser before it can be useful: https://github.com/wwoast/redpanda-lineage/issues/194

wwoast commented 4 years ago

So the initial hack towards this idea is in https://github.com/wwoast/redpanda-lineage/commit/8801cf3cf824a2e3c64875df3c3e84ecc7639477, which created something called polyglots. Currently the only polyglot is the word "baby", which is a keyword when followed by a numeric year, but a tag when followed by a panda name. It took a lot of hacky code to make this polyglot work, including a result set just for baby photos.

To clean up this code, I think what I actually want is for the polyglot system to track entries in both the keyword and tag lists. But I don't see how to implement the scoring that says which interpretation of a polyglot should be chosen. I could track polyglots as a Parse value, and give preferential scores to what a polyglot might be, given how the other terms in the parse tree were classified. The goal is to make search decisions when the input string has potentially conflicting meanings for one or more terms. Examples:

year 1991 versus panda id number 1991
bamboo as a panda name, or bamboo as a panda tag
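A rough sketch of context-based scoring for a polyglot, assuming the parser has already classified the neighboring terms. The `POLYGLOTS` and `CONTEXT_SCORES` tables, their score values, and `resolve_polyglot` are hypothetical; only the "baby" behavior (keyword before a year, tag before a panda name) comes from the description above:

```python
# Hypothetical polyglot table: term -> the readings it may take.
POLYGLOTS = {
    "baby": ("keyword", "tag"),
}

# Hypothetical context scores: (neighbor type, candidate reading) -> score.
CONTEXT_SCORES = {
    ("year", "keyword"): 2,    # "baby 2019" reads as the baby keyword
    ("panda_name", "tag"): 2,  # "baby <panda name>" reads as the photo tag
}

def resolve_polyglot(term, neighbor_types):
    """Return (chosen reading or None, per-reading scores) for one polyglot."""
    scores = {reading: 0 for reading in POLYGLOTS[term]}
    for neighbor in neighbor_types:
        for reading in scores:
            scores[reading] += CONTEXT_SCORES.get((neighbor, reading), 0)
    best = max(scores.values())
    winners = [reading for reading, score in scores.items() if score == best]
    # A tie means the surrounding terms did not settle the meaning.
    chosen = winners[0] if len(winners) == 1 else None
    return chosen, scores
```

Returning `None` on a tie leaves room for a different fallback, such as asking the user, instead of guessing silently.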

Instead of a scoring system whose behavior is highly non-obvious, I could also alert the user when it's unclear how a term should be processed. A good UX for this would be to eventually have a report of what each search parameter was classified as. When a term has more than one classification, RPF can provide an unordered-list prompt where the user disambiguates what the terms represent.
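A small sketch of the report-and-prompt idea, assuming a lookup that returns every classification a term could take. The `KNOWN` table, its sample terms, and both function names are made up for illustration:

```python
# Hypothetical lookup of every classification each term could take.
KNOWN = {
    "1991": ["year", "panda id"],
    "bamboo": ["panda name", "tag"],
    "ueno": ["zoo"],
}

def classification_report(query, known=KNOWN):
    """List (term, possible classifications) for each term in the query."""
    report = []
    for term in query.lower().split():
        report.append((term, known.get(term, ["unclassified"])))
    return report

def disambiguation_prompts(report):
    """Build a plain-text unordered list for each term with several readings."""
    prompts = []
    for term, readings in report:
        if len(readings) > 1:
            lines = [f'What does "{term}" mean?']
            lines += [f"- {reading}" for reading in readings]
            prompts.append("\n".join(lines))
    return prompts

report = classification_report("bamboo 1991 ueno")
prompts = disambiguation_prompts(report)
```

Here `report` covers all three terms, while `prompts` only contains entries for "bamboo" and "1991", since "ueno" has a single classification.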

Improve search quality by adding entity indexes and likelihood counters #188