whoosh-community / whoosh

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python.

Facets and Terms #332

Open fortable1999 opened 11 years ago

fortable1999 commented 11 years ago

Original report by Anonymous.


When trying to use both sorting and terms, I get a KeyError when calling hit.matched_terms(). Is it not possible to use both sorting and terms?

#!python

results = searcher.search(query, sortedby='ID', reverse=True, terms=True)
for hit in results:
    print hit.matched_terms()

fortable1999 commented 11 years ago

Original comment by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).


In version 2.4, sorting information was generated at first search and cached on disk. In version 2.5 this was changed (I would say fixed) to be done at indexing time: you need to add sortable=True to the fields you want to be able to sort on. Otherwise, the sorting info will still be generated at the first search but not cached (since the "proper" way to do it is in the index). I recommend you try adding sortable to your schema. I will try to backport the fix to 2.4, though.

fortable1999 commented 11 years ago

Original comment by rholloway (Bitbucket: rholloway, GitHub: rholloway).


Switching to 2.5.1 worked. However, searching the index now seems to be significantly slower when combining sorting and limit.

My previous workaround was to search the index with limit=None and do some python post-processing against the results.

#!python

# get results
results = searcher.search(query, limit=None, terms=True)
# sort
results = sorted(results, key=lambda k: int(k['vid']), reverse=True)

# return (limited) results
return results[0:limit]

My understanding is this should be roughly the equivalent of

#!python

searcher.search(query, limit=limit, sortedby='vid', reverse=True, terms=True)

I initially thought the latter would be better (cleaner, at least) and likely faster. However, that doesn't seem to be the case: with a 10-second timeout, the latter times out, while the first method responds in about 2 seconds.

Is this supposed to be the case? I'm not sure how limiting results on a sorted index should or does work (I would think it needs all results first in order to sort anyway).
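
On the question of whether a sorted, limited search needs every result first: in general, a top-N selection still has to look at each matching hit once, but it does not have to fully sort them. The sketch below illustrates this with plain Python and `heapq.nlargest`. The `hits` list is made-up stand-in data, and this is not a claim about how Whoosh implements `sortedby` with `limit` internally.

```python
import heapq

# Hypothetical stand-in for stored hits; "vid" mirrors the field name above.
hits = [{"vid": v} for v in (17, 3, 42, 8, 23, 15)]
limit = 3


def vid_key(hit):
    """Sort key: the numeric vid of a hit."""
    return int(hit["vid"])


# Approach 1 (the workaround): sort every hit, then slice off the top.
sorted_then_sliced = sorted(hits, key=vid_key, reverse=True)[:limit]

# Approach 2: scan all hits once, keeping only the `limit` largest.
top_n = heapq.nlargest(limit, hits, key=vid_key)

print([h["vid"] for h in sorted_then_sliced])
print([h["vid"] for h in top_n])
```

Both approaches return the same hits in the same order; the second avoids a full sort, which matters mainly when the number of matches is much larger than `limit`.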

In any case, the combination of sorting/limiting with terms=True does appear to be fixed in 2.5.1.

fortable1999 commented 11 years ago

Original comment by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).


Is it possible to try with the current version to see if it works there? Thanks!

fortable1999 commented 11 years ago

Original comment by rholloway (Bitbucket: rholloway, GitHub: rholloway).


Still trying to track down the issue. I'm running 2.4.1.

A test script based on your code above (and similar variations, such as adding a NUMERIC field, which is what I want to sort by) runs successfully. Running against what I have indexed on disk, I get a KeyError on line 1399 of searching.py.

Running results.has_matched_terms() returns True. The key it fails on is "40784", and I am not sure where it comes from (it isn't the value of the NUMERIC sort field for the match, anyway). Modifying searching.py to print self.results.docterms.keys() prints [417], so only one key is listed in there. Again, I'm not sure what that references or what should be in there.

My schema is a bit larger than the one below, but essentially

Schema(vid=NUMERIC(stored=True, unique=True), entered=DATETIME(stored=True), name=TEXT(stored=True, analyzer=ana), ...)

is an example of the types of fields. I want to sort by vid.

The code to test it is as simple as:

#!python

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("index")
with ix.searcher() as searcher:
    query = QueryParser("name", ix.schema).parse(u"whoosh")
    results = searcher.search(query, sortedby='vid', reverse=True, terms=True)
    print(results.has_matched_terms())
    for r in results:
        print(r.matched_terms())  # KeyError is raised here on the first hit

It crashes on the first iteration of the loop. Without trying to print matched terms, everything works fine. Without trying to sort, I can get matched terms with no problem.

fortable1999 commented 11 years ago

Original comment by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).


This works for me. What version are you using?

#!python

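from whoosh import fields, query
from whoosh.compat import u
from whoosh.filedb.filestore import RamStorage


def test_sorted_result_terms():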
    schema = fields.Schema(id=fields.KEYWORD(sortable=True),
                           body=fields.TEXT)
    ix = RamStorage().create_index(schema)
    with ix.writer() as w:
        w.add_document(id=u("one"), body=u("alfa bravo charlie"))
        w.add_document(id=u("two"), body=u("bravo charlie delta"))
        w.add_document(id=u("three"), body=u("charlie delta echo"))
        w.add_document(id=u("four"), body=u("delta echo alfa"))
        w.add_document(id=u("five"), body=u("echo alfa charlie"))

    with ix.searcher() as s:
        q = query.Or([query.Term("body", "charlie"), query.Term("body", "alfa")])
        r = s.search(q, sortedby="id", reverse=True, terms=True)

        assert ([hit["id"] for hit in r]
                == ["two", "three", "one", "four", "five"])

        assert ([hit.matched_terms() for hit in r]
                == [[("body", "charlie")],
                    [("body", "charlie")],
                    [("body", "alfa"), ("body", "charlie")],
                    [("body", "alfa")],
                    [("body", "alfa"), ("body", "charlie")],
                    ])