whoosh-community / whoosh

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python.

Query excluding certain fields seems to resolve incorrectly. #254

Closed fortable1999 closed 12 years ago

fortable1999 commented 12 years ago

Original report by melinath (Bitbucket: melinath, GitHub: melinath).


I have a test [1] that passes fine on haystack's elasticsearch backend, but fails on whoosh. Looking through the code, the problem seems to manifest itself when iterating over search results. Specifically:

-> results = SmartSearchQuerySet().auto_query(query)
(Pdb) n
-> results = dict((unicode(r.pk), r) for r in results)
(Pdb) results.query.build_query()
u'NOT (feed:(1))'
(Pdb) len(results)
26
(Pdb) len([r for r in results])
20

Diving into the code a bit, it looks (to me) like the problem lies in whoosh itself. Specifically, searcher.search(parsed_query, limit=30) always returns the top 20 results rather than the top 30. I am hesitant to dig any deeper, since I don't really understand how whoosh works.

[1] https://github.com/pculture/mirocommunity/blob/develop/localtv/tests/unit/search/query.py#L222

fortable1999 commented 12 years ago

Original comment by melinath (Bitbucket: melinath, GitHub: melinath).


The test case that you posted passes fine for me; I'll go looking for other causes. Thanks!

fortable1999 commented 12 years ago

Original comment by melinath (Bitbucket: melinath, GitHub: melinath).


Yeah, we should do a better job of putting the requirements in setup.py. Here are the current installation instructions: http://readthedocs.org/docs/mirocommunity/en/latest/installation.html

It looks like the test case moved a bit since I posted it. Here's where it should have pointed: https://github.com/pculture/mirocommunity/blob/a86cffde287cc0ac0d601ff6ee4731735cf342e9/localtv/tests/unit/search/query.py#L222

Line 236 is what was failing: self.assertQueryResults('-feed:blender', expected). Essentially, what's going on there is this: the Feed model instance with the name "blender" gets fetched, then a search is done which excludes everything with that instance's pk in the 'feed' field. Hence, NOT (feed:(1)).
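
As a rough illustration of that translation step (a standalone sketch, not the actual mirocommunity or Haystack code), the snippet below resolves a feed name to a pk and wraps the clause in NOT when the token is negated. The FEEDS_BY_NAME mapping and the build_query helper are hypothetical names standing in for the ORM lookup and the query builder.

```python
# Hypothetical stand-in for the Feed table: feed name -> primary key.
FEEDS_BY_NAME = {"blender": 1, "miro": 2}


def build_query(user_query):
    """Turn tokens such as '-feed:blender' into a backend query string."""
    clauses = []
    for token in user_query.split():
        # A leading '-' marks the token as an exclusion.
        negated = token.startswith("-")
        if negated:
            token = token[1:]

        field, sep, value = token.partition(":")
        if sep and field == "feed":
            # Resolve the feed name to its pk (the real code uses the ORM).
            clause = "feed:(%d)" % FEEDS_BY_NAME[value]
        else:
            # Anything else is passed through unchanged in this sketch.
            clause = token

        if negated:
            clause = "NOT (%s)" % clause
        clauses.append(clause)
    return " AND ".join(clauses)


print(build_query("-feed:blender"))
```

With the mapping above, '-feed:blender' resolves to pk 1, so the generated string matches the one shown in the pdb session.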

I'll see how the test case you posted behaves for me.

fortable1999 commented 12 years ago

Original comment by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).


Are the dependencies for mirocommunity listed somewhere? They're not in setup.py. I can't run the tests because I don't have the required packages. I figured out "mptt" and "compressor" but gave up after that.

fortable1999 commented 12 years ago

Original comment by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).


I can't reproduce this with a simple test case:

#!python

from nose.tools import assert_equal
from whoosh import fields, qparser
from whoosh.compat import u, xrange
from whoosh.filedb.filestore import RamStorage

def test_not_feed_1():
    schema = fields.Schema(id=fields.ID(stored=True), feed=fields.NUMERIC)
    ix = RamStorage().create_index(schema)
    with ix.writer() as w:
        # Make 40 documents, with 26 of feed != 1
        for i in xrange(40):
            w.add_document(id=u(str(i)), feed=(0 if i < 26 else 1))

    with ix.searcher() as s:
        qp = qparser.QueryParser("id", schema)
        # Find documents where feed != 1
        q = qp.parse("NOT (feed:(1))")

        # The limit (30) is above the 26 expected matches
        r = s.search(q, limit=30)
        assert_equal(len(r), 26)  # Total number of matched documents
        assert_equal(r.scored_length(), 26)  # Number of docs in the results

The test you linked to makes me wonder... I don't know what your code or Haystack does with an empty query string.

I'll see if I can set up your dev environment so I can run your tests myself. I haven't had much luck in the past building other people's Django projects...

fortable1999 commented 12 years ago

Original comment by Thomas Waldmann (Bitbucket: thomaswaldmann, GitHub: thomaswaldmann).