whoosh-community / whoosh

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python.

Creating a scorer for words #545

Closed Midnighter closed 4 years ago

Midnighter commented 5 years ago

I have a SQL table with 2 M rows of individual 'words' and have built an index using the NGRAM analyzer. I would now like to search this index and produce scores using simple Levenshtein distance (I don't need anything more fancy since these are not full documents).

Is Whoosh overkill here or is this a valid use case? If it is valid, can you guide me in creating the correct scorer? The docs are a bit sparse on this topic. What I have so far unfortunately only computes the weight and all results end up with the same score.

from Levenshtein import ratio
from whoosh import scoring
from whoosh.qparser import QueryParser

def test_fn(searcher, fieldname, text, matcher):
    return ratio(b"glycerol", text)

levenshtein_weighting = scoring.FunctionWeighting(test_fn)
searcher = idx.searcher(weighting=levenshtein_weighting)
parser = QueryParser("name", schema=idx.schema)
parsed = parser.parse("glycerol")
results = searcher.search(parsed, limit=None)

print(len(results))
print([(r["compound_id"], r.score) for r in results[:5]])
print([(r["compound_id"], r.score) for r in results[-5:]])

Output

527,284

[(77627, 3.333333333333333),
 (77628, 3.333333333333333),
 (78036, 3.333333333333333),
 (78036, 3.333333333333333),
 (78038, 3.333333333333333)]

[(75300, 3.333333333333333),
 (7706, 3.333333333333333),
 (7707, 3.333333333333333),
 (7708, 3.333333333333333),
 (7709, 3.333333333333333)]
stevennic commented 4 years ago

It might be overkill for this use case.

Whoosh might still be useful if you expect heavy querying or more complex queries.

It makes sense that all matches are getting the same score because test_fn is hard-coded to compare your query (text) with the word glycerol, regardless of what's in the document. Since both of those are fixed for each query, all results will yield the same score. Instead, you probably intended to compare the query text with the matching terms in each document, which you can retrieve using matcher.matching_terms().

What I'm still unclear about is why you're using the NgramAnalyzer and why the score you're getting is 3.3333.

Midnighter commented 4 years ago

Thank you for your answer @stevennic. That definitely highlights some gaps in my knowledge of using Whoosh.

My idea in using Whoosh was actually to reduce the number of terms that need to be scored to only those where an n-gram matches. I'm still unclear on how to do that exactly. Would the matching_terms iterator simply have no results? I'll give that a try in a moment.

Using more complex queries is a nice-to-have.

I've used the NgramAnalyzer because it is a database of chemical compound names. So I didn't think that typical English language word stemming would be a good fit.

Midnighter commented 4 years ago

I have changed the scoring function to:

def test_fn(searcher, fieldname, text, matcher):
    return max(
        (ratio(text, term) for _, term in matcher.matching_terms()),
        default=0.0,
    )

That worked but now all the results have a score of 5.0.

I changed the function to actually print text and term and found that both contain only the n-grams where a match occurs. So those are always equal and the computed score is 1.0. Thus I will need to change two things:

  1. Instead of modifying just the weighting function, ideally I'd like to influence the score directly.
  2. Instead of computing the distance between an n-gram of my query (text) and the matching n-gram (term), I would like to compute the distance between a full query word and the full string of the matching term (see the sketch below).
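For point 2, something along these lines might work (an untested sketch: it assumes the name field is stored in the schema, QUERY_WORD is a hypothetical stand-in for the full query word, and looking up stored fields for every match adds overhead):

from Levenshtein import ratio
from whoosh import scoring

QUERY_WORD = "glycerol"  # stand-in for the full query word (set per query)

def full_string_fn(searcher, fieldname, text, matcher):
    # Compare the full query word against the full stored value of the
    # matched document rather than n-gram against n-gram.
    # Assumes the "name" field is stored in the schema.
    stored = searcher.stored_fields(matcher.id())
    return ratio(QUERY_WORD, stored["name"])

levenshtein_weighting = scoring.FunctionWeighting(full_string_fn)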
stevennic commented 4 years ago

I believe this solution is influencing the score directly, despite the word "weighting" in the class name. The description of the FunctionWeighting class is: "Uses a supplied function to do the scoring."

As you've figured, Ngram and Levenshtein are conflicting strategies, so I don't think it makes sense to combine them. Have you tried just a regular Ngram search with the standard analyzer? That would give you a score proportional to the number of matching Ngrams. Would that work? Make sure to constrain the n-gram minsize and maxsize to the size range of typical chemical-name fragments.
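A plain n-gram setup with default scoring might look roughly like this (illustrative only; the field names and sizes are placeholders, and minsize/maxsize should be tuned to the fragment lengths that matter for compound names):

import os
from whoosh import index
from whoosh.fields import Schema, ID, NGRAM

# Illustrative schema: n-gram sizes chosen to cover typical name fragments.
schema = Schema(
    compound_id=ID(stored=True),
    name=NGRAM(minsize=4, maxsize=8, stored=True),
)
os.makedirs("indexdir", exist_ok=True)
idx = index.create_in("indexdir", schema)

Searching such an index without a custom weighting falls back to Whoosh's default BM25F scoring, so documents are ranked by how many query n-grams they share with the indexed name.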

Levenshtein is built into fuzzy search, but that would be more commonly used for misspellings. Presumably you're considering Ngrams because there are common roots and suffixes, whereas by the time you increased the scope of fuzzy search to capture common ngrams, you'd also potentially be catching completely irrelevant matches.
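For reference, query-side fuzzy matching looks roughly like this (a sketch only, assuming a field indexed with a regular, non-n-gram analyzer, and the idx and searcher objects from earlier in the thread):

from whoosh.qparser import QueryParser, FuzzyTermPlugin

parser = QueryParser("name", schema=idx.schema)
parser.add_plugin(FuzzyTermPlugin())

# "glycerol~2" matches terms within an edit distance of 2 of "glycerol".
query = parser.parse("glycerol~2")
results = searcher.search(query, limit=10)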

Midnighter commented 4 years ago

The regular analyzer does not yield the desired results. Since it is normally used to rank entire documents by the frequency of the query, the top result for "glycerol" is "glycerophosphoglycerol" and not "glycerol", for example. I created the index with a max N=5, though, so maybe increasing that would improve results.

Presumably you're considering Ngrams because there are common roots and suffixes, whereas by the time you increased the scope of fuzzy search to capture common ngrams, you'd also potentially be catching completely irrelevant matches.

I was mainly considering n-grams because of the stemming issues described above. And I was thinking that building an index of common n-grams would allow me to quickly identify partially matching nodes. However, simply calculating the Levenshtein distance as a score over all 2.2 M names only takes ~2 s compared to over 20 s using Whoosh. It won't allow me to use more complex queries out-of-the-box but I think I can live with that.
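For comparison, the brute-force baseline mentioned above might look roughly like this (names is a hypothetical list holding all compound names loaded from the SQL table):

from Levenshtein import ratio

def best_matches(query, names, limit=5):
    # Score every name against the query and keep the highest-scoring ones.
    scored = ((ratio(query, name), name) for name in names)
    return sorted(scored, reverse=True)[:limit]

# names = load_compound_names()  # hypothetical loader for the 2.2 M names
# print(best_matches("glycerol", names))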

Thank you very much for your insights here.

stevennic commented 4 years ago

You'd have to also increase the minsize for n-grams for that to work for this example, but it could get fiddly if you also have short roots.

You could try to split the index into an n-gram field and an original field and boost the original one so exact matches would count higher than n-gram matches. Good luck!
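That split might look something like this (a sketch only; the field names, the boost value, and the n-gram sizes are placeholders):

from whoosh.fields import Schema, ID, TEXT, NGRAM
from whoosh.qparser import MultifieldParser

schema = Schema(
    compound_id=ID(stored=True),
    # Exact (whole-name) field, boosted so exact matches outrank n-gram hits.
    name=TEXT(stored=True, field_boost=2.0),
    # N-gram field for partial matches; index the same name string here too.
    name_ngram=NGRAM(minsize=3, maxsize=5),
)

# Search both fields at once; hits in the boosted "name" field score higher.
parser = MultifieldParser(["name", "name_ngram"], schema=schema)
query = parser.parse("glycerol")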

Midnighter commented 4 years ago

You could try to split the index into an n-gram field and an original field and boost the original one so exact matches would count higher than n-gram matches. Good luck!

That's a good idea. I might try that with a sample set. Thanks!