whoosh-community / whoosh

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python.

Phrase query matching partial result #486

Open fortable1999 opened 6 years ago

fortable1999 commented 6 years ago

Original report by Laurent Tramoy (Bitbucket: Lautram, GitHub: Unknown).


Hi,

My query is a simple phrase query, "python library", and I want to find only exact matches of this phrase, without counting stray occurrences of "python" or "library". Here is my code:

#!python

from whoosh import fields, scoring, analysis, query
from whoosh.filedb.filestore import FileStorage

def search(text, q):
    storage = FileStorage("tests/index")
    # This regex is the same as the default, except that it does not split on
    # dashes.
    regex_expr = '\\w+((\\.?|-?)\\w+)*'
    analyzer = analysis.StandardAnalyzer(expression=regex_expr, stoplist=[])
    schema = fields.Schema(
        authors=fields.KEYWORD(commas=True, stored=True),
        description=fields.TEXT(analyzer=analyzer, stored=True)
    )
    index = storage.create_index(schema, indexname="usages")
    w = index.writer()
    w.add_document(authors='unknown', description=text)
    w.commit()
    searcher = index.searcher(weighting=scoring.Frequency)
    return searcher.search(q, terms=True)

If the two terms don't appear next to each other, there is no hit, as expected:

#!python
text1 = "bla bla library bla bla bla python"
q = query.Phrase("description", ["python", "library"])
search(text1, q)
# returns <Top 0 Results for Phrase('description', ['python', 'library'], slop=1, boost=1.000000) runtime=0.000455269000667613>

And if they do, we have a hit:

#!python
text2 = "bla bla python library bla bla"
q = query.Phrase("description", ["python", "library"])
search(text2, q)
# returns <Top 1 Results for Phrase('description', ['python', 'library'], slop=1, boost=1.000000) runtime=0.000455269000667613>

So far, nothing surprising. But my problem is the partial matches when both the phrase and the single terms appear in the document:

#!python
text3 = "bla bla python" + " bla "*100 + "bla bla python library" + " bla "*100 + "library" 
res = search(text3, q)[0]
res.highlights("description")
# returns 'bla bla <b class="match term0">python</b> bla  bla  bla  bla  bla...bla  bla bla bla <b class="match term0">python</b> <b class="match term1">library</b> bla  bla  bla  bla  bla...bla  bla  bla  bla <b class="match term1">library</b>'
res.score()
# returns 4.0

I guess the behavior is expected, but is there a way to highlight, and score, only when the terms are right next to each other?

I believe I read the doc thoroughly, as well as the previous issues, so I apologize if this was already answered somewhere.

thanks

stevennic commented 5 years ago

Unfortunately this option doesn't exist. highlight.py:set_matched_filter() is a module-level function that, given the field tokens and a termset of individual query terms, marks each token as matched if it equals any of the terms. Once the tokens are marked, Highlighter.highlight_hit() calls ContextFragmenter.fragment_tokens(), which builds fragments around the matched tokens; top_fragments() then scores them and picks the top n.
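
To make the per-token marking concrete, here is a minimal sketch in plain Python. It is not Whoosh's code: `mark_term_matches` is a hypothetical stand-in for the kind of marking described above, and the token list is a shortened version of text3 from the report.

```python
def mark_term_matches(tokens, termset):
    """Pair each token with a flag that is True if the token is any query term.

    Hypothetical helper for illustration only; it ignores token positions
    relative to each other, which is why stray terms get marked.
    """
    return [(tok, tok in termset) for tok in tokens]


# Shortened version of text3: a stray "python", the phrase, a stray "library".
tokens = "bla python bla python library bla library".split()
marked = mark_term_matches(tokens, {"python", "library"})

# Indices of every token that would be highlighted.
matched_positions = [i for i, (_, is_match) in enumerate(marked) if is_match]
print(matched_positions)  # [1, 3, 4, 6]
```

Positions 1 and 6 are the stray terms; a purely term-based check has no way to tell them apart from the phrase at positions 3 and 4.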

In my opinion, a feature to accommodate this request would look like this: searching.highlights() takes an additional optional parameter strict_phrase = False. When set to True, it would highlight only phrase matches for phrase queries by inspecting isinstance(results.q, Phrase) and having set_matched_filter() only mark a token as matched if it's part of the whole phrase, so only phrase matches would become fragments. This would take care of scoring too, since it applies only to fragments generated in the previous step.
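
As a rough illustration of that idea, the sketch below marks a token only when it belongs to a complete, adjacent occurrence of the phrase. It is plain Python, not a patch against Whoosh: `mark_phrase_matches` is a hypothetical name, and it assumes the tokens and the phrase are already analyzed into simple lists of strings.

```python
def mark_phrase_matches(tokens, phrase):
    """Return one boolean per token: True only inside a full phrase occurrence.

    Hypothetical helper for illustration; it handles exact adjacency only.
    """
    n = len(phrase)
    matched = [False] * len(tokens)
    # Slide a window of the phrase length over the token list.
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase:
            for i in range(start, start + n):
                matched[i] = True
    return matched


tokens = "bla python bla python library bla library".split()
flags = mark_phrase_matches(tokens, ["python", "library"])

strict_positions = [i for i, is_match in enumerate(flags) if is_match]
print(strict_positions)  # [3, 4]
```

With this kind of marking, only the tokens at positions 3 and 4 would feed into fragment building, so the stray "python" and "library" would no longer be highlighted.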

SpanNear queries and slop should also be considered, as well as queries combining phrases with non-phrase terms.
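
For slop, one possible reading is sketched below in plain Python. It assumes slop is the maximum position gap allowed between consecutive phrase terms, so that the default slop=1 shown in the report's output means strict adjacency. `phrase_spans` is a hypothetical helper, not part of Whoosh, and it expects a non-empty phrase.

```python
def phrase_spans(tokens, phrase, slop=1):
    """Return position lists where the phrase terms occur in order, with each
    term at most `slop` positions after the previous one.

    Hypothetical helper for illustration; `phrase` must be non-empty.
    """
    # Positions at which each phrase term occurs in the token list.
    positions = {
        term: [i for i, tok in enumerate(tokens) if tok == term]
        for term in phrase
    }

    def extend(span, remaining):
        # All phrase terms placed: the span is complete.
        if not remaining:
            yield span
            return
        for pos in positions[remaining[0]]:
            # The next term must follow the previous one within `slop`.
            if 0 < pos - span[-1] <= slop:
                yield from extend(span + [pos], remaining[1:])

    spans = []
    for first in positions[phrase[0]]:
        spans.extend(extend([first], phrase[1:]))
    return spans


tokens = "python fast library bla python library".split()
print(phrase_spans(tokens, ["python", "library"], slop=1))  # [[4, 5]]
print(phrase_spans(tokens, ["python", "library"], slop=2))  # [[0, 2], [4, 5]]
```

Only the positions inside the returned spans would be marked as matched; queries that mix phrases with plain terms would still need the term-based marking for the non-phrase part.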

stevennic commented 5 years ago

Since #528 implemented this feature, I propose we close this issue.