Closed: Midnighter closed this 4 years ago
It might be overkill because you're only scoring with the `ratio` function. Whoosh might still be useful if you expect heavy querying or more complex queries.
It makes sense that all matches are getting the same score, because `test_fn` is hard-coded to compare your query (`text`) with the word `glycerol`, regardless of what's in the document. Since both of those are fixed for each query, all results will yield the same score. Instead, you probably intended to compare the query `text` with the matching terms in each document, which you can retrieve using `matcher.matching_terms()`.
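That corrected comparison can be sketched as follows. This is only an illustration: `ratio` here is a stdlib `difflib` stand-in for the Levenshtein `ratio` used in the thread, and `score_fn` is a hypothetical name for the scoring callback:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Stand-in similarity in [0, 1]; the thread presumably uses Levenshtein.ratio.
    return SequenceMatcher(None, a, b).ratio()

def score_fn(searcher, fieldname, text, matcher):
    # Compare the query text against the terms this matcher actually hit,
    # rather than a hard-coded word; score by the best similarity found.
    return max((ratio(text, term) for _, term in matcher.matching_terms()),
               default=0.0)
```

The function would then be handed to Whoosh via `whoosh.scoring.FunctionWeighting(score_fn)` when opening the searcher.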
What I'm still unclear about is why you're using the `NgramAnalyzer` and why the score you're getting is 3.3333.
Thank you for your answer @stevennic. That definitely highlights some gaps in my knowledge of using Whoosh.
My idea in using Whoosh was actually to reduce the number of terms that need to be scored to only those where an n-gram matches. I'm still unclear on how to do that exactly. Would the `matching_terms` iterator simply have no results? I'll give that a try in a moment.
Using more complex queries is a nice-to-have.
I've used the `NgramAnalyzer` because it is a database of chemical compound names, so I didn't think that typical English word stemming would be a good fit.
I have changed the scoring function to:
```python
def test_fn(searcher, fieldname, text, matcher):
    # ratio: similarity function (presumably Levenshtein.ratio);
    # take the best match among the terms the matcher hit.
    return max((ratio(text, term) for _, term in matcher.matching_terms()), default=0.0)
```
That worked but now all the results have a score of 5.0.
I changed the function to actually print `text` and `term` and found that both contain only the n-grams where a match occurs. So those are always equal and the computed score is 1.0. Thus I will need to change two things: instead of comparing the query n-gram (`text`) with the matching n-gram (`term`), I would like to compute the distance between a full query word and the full string of the matching term. In addition, I believe this solution influences the score directly, regardless of the weighting implied by the class name; the description of the `FunctionWeighting` class is simply: "Uses a supplied function to do the scoring."
As you've figured, Ngram and Levenshtein are conflicting strategies, so I don't think it makes sense to combine them. Have you tried just a regular Ngram search with the standard analyzer? That would give you a score proportional to the number of matching Ngrams. Would that work? Make sure to constrain its min and max sizes in accordance with the size range of chemical Ngrams.
Levenshtein is built into fuzzy search, but that would be more commonly used for misspellings. Presumably you're considering Ngrams because there are common roots and suffixes, whereas by the time you increased the scope of fuzzy search to capture common ngrams, you'd also potentially be catching completely irrelevant matches.
The regular analyzer does not yield the desired results. Since it is normally used to rank entire documents by the frequency of the query, the top result for "glycerol" is "glycerophosphoglycerol" and not "glycerol", for example. I created the index with a max N=5, though, so maybe increasing that would improve results.
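The ranking inversion is easy to see with a toy count of raw 5-gram hits (this is only an illustration, not Whoosh's actual BM25F scoring; `ngrams` and `gram_hits` are hypothetical helpers):

```python
def ngrams(word, n):
    # All character n-grams of length n.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def gram_hits(query, name, n=5):
    # Count how often each distinct query n-gram occurs in the candidate name.
    return sum(name.count(g) for g in set(ngrams(query, n)))

for name in ("glycerol", "glycerophosphoglycerol"):
    print(name, gram_hits("glycerol", name))
# glycerol scores 4, but glycerophosphoglycerol scores 7, because its
# repeated "glyce"/"lycer"/"ycero" grams accumulate extra hits.
```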
> Presumably you're considering Ngrams because there are common roots and suffixes, whereas by the time you increased the scope of fuzzy search to capture common ngrams, you'd also potentially be catching completely irrelevant matches.
I was mainly considering n-grams because of the stemming issues described above. And I was thinking that building an index of common n-grams would allow me to quickly identify partially matching nodes. However, simply calculating the Levenshtein distance as a score over all 2.2 M names only takes ~2 s compared to over 20 s using Whoosh. It won't allow me to use more complex queries out-of-the-box but I think I can live with that.
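The brute-force scan is simple enough to sketch. Here `levenshtein` is a pure-Python stand-in for the C-accelerated `Levenshtein.distance`, and `best_matches` is a hypothetical helper name:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute),
    # keeping only one row of the DP table at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def best_matches(query, names, k=3):
    # Score every name against the query and keep the k closest.
    return sorted(names, key=lambda n: levenshtein(query, n))[:k]
```

With ~2.2 M names this is a linear scan, so the ~2 s timing reported above seems plausible for a C implementation; the pure-Python version here would be noticeably slower.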
Thank you very much for your insights here.
You'd have to also increase the minsize for n-grams for that to work for this example, but it could get fiddly if you also have short roots.
You could try to split the index into an n-gram field and an original field and boost the original one so exact matches would count higher than n-gram matches. Good luck!
> You could try to split the index into an n-gram field and an original field and boost the original one so exact matches would count higher than n-gram matches. Good luck!
That's a good idea. I might try that with a sample set. Thanks!
I have a SQL table with 2 M rows of individual 'words' and have built an index using the NGRAM analyzer. I would now like to search this index and produce scores using simple Levenshtein distance (I don't need anything more fancy since these are not full documents).
Is Whoosh overkill here or is this a valid use case? If it is valid, can you guide me in creating the correct scorer? The docs are a bit sparse on this topic. What I have so far unfortunately only computes the weight and all results end up with the same score.