whoosh-community / whoosh

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python.
Other
240 stars 36 forks source link

Fix #375: Clean up writer for failed document add when exceptions wer… #543

Closed stevennic closed 5 years ago

stevennic commented 5 years ago

Fix for #375 : "Called start_doc when already in a doc"

I encountered this problem when adding to Whoosh from inside a pandas apply() function. Pandas does a first pass that absorbs exceptions.

This has also been observed by others, such as: https://github.com/django-haystack/django-haystack/issues/952

Reproduction:

schema = fields.Schema(id=fields.ID())
st = RamStorage()
ix = st.create_index(schema)

with ix.writer() as w:
    try:
        # Integer value is invalid, but absorbed exception causes silent failure
        w.add_document(id=1)
    except:
        pass

    w.add_document(id=2)
  1. Add a document that will fail (such as with an integer value instead of the expected Unicode or sequence)
  2. Absorb the exception while doing so
  3. Try to add another document that will fail

Expected Result: An appropriate exception for the problem with document 2: Exception: 2 is not unicode or sequence Actual Result: Exception: Called start_doc when already in a doc

When the first document fails, the exception is not allowed to bubble up by the repro code. However, this leaves the PerDocumentWriter in an unfinished state (_indoc==True). When the second document add is attempted, the writer finds itself in an invalid state and throws the Called start_doc when already in a doc exception, which is cryptic and meaningless to the user.

The fix ensures that the PerDocumentWriter is properly cleaned up in the event of a failure by implementing its own try/except block. This allows the correct exception to bubble up for the second problem.

codecov[bot] commented 5 years ago

Codecov Report

Merging #543 into master will increase coverage by 0.01%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #543      +/-   ##
==========================================
+ Coverage   83.01%   83.02%   +0.01%     
==========================================
  Files         132      132              
  Lines       29621    29641      +20     
==========================================
+ Hits        24589    24610      +21     
+ Misses       5032     5031       -1
Impacted Files Coverage Δ
src/whoosh/codec/whoosh3.py 95.09% <100%> (+0.01%) :arrow_up:
tests/test_writing.py 99.36% <100%> (+0.02%) :arrow_up:
src/whoosh/writing.py 86.8% <100%> (+0.09%) :arrow_up:
src/whoosh/fields.py 81.62% <0%> (+0.14%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update f846e4a...e31f17e. Read the comment docs.

fortable1999 commented 5 years ago

thanks for your code!