Change in PDF Extraction Results

TheTechromancer commented 1 week ago

Hi, today I noticed a sudden change in the way text is extracted from PDFs. It seems like a lot of binary content is now being included in the output. This is causing our tests to fail.

We were able to resolve this quickly on our end by downgrading the package version, but just wanted to give you guys a heads-up.

EDIT: On further investigation, it looks like a change in the Python API caused the issue:

Traceback (most recent call last):
  File "/home/bls/Downloads/code/bbot/bbot/modules/extractous.py", line 135, in extract_text
    buffer = reader.read(4096)
             ^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'read'

nmammeri commented 1 week ago

Thanks @TheTechromancer for reporting this. In version 0.2.0, we changed the API to return a tuple of reader and metadata. Unpack both values in your extract call: reader, metadata = extractor.extract_ ... Please see the updated docs.
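
For callers that may run against either return shape, a minimal sketch of a compatibility shim could look like the following. The helper names `split_extract_result` and `read_all` are hypothetical and not part of extractous, and the `io.BytesIO` objects and the metadata dict are stand-ins for whatever the real extract call returns.

```python
import io


def split_extract_result(result):
    """Return (reader, metadata) for either return shape.

    Hypothetical helper: accepts a bare reader (pre-0.2.0 style) or a
    (reader, metadata) tuple (0.2.0 style, per the comment above).
    """
    if isinstance(result, tuple):
        reader, metadata = result
        return reader, metadata
    # Bare reader: there is no metadata to report.
    return result, {}


def read_all(reader, chunk_size=4096):
    """Drain a reader in fixed-size chunks, as in the failing traceback."""
    chunks = []
    while True:
        buffer = reader.read(chunk_size)
        if not buffer:
            break
        chunks.append(buffer)
    return b"".join(chunks)


# Stand-ins for the two return shapes (assumed for illustration only).
old_style = io.BytesIO(b"hello pdf")
new_style = (io.BytesIO(b"hello pdf"), {"source": "example"})

for result in (old_style, new_style):
    reader, metadata = split_extract_result(result)
    text = read_all(reader)
```

Passing the raw tuple straight to `read_all` reproduces the `AttributeError` from the traceback, since a tuple has no `read` method. Unpacking first avoids it.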

TheTechromancer commented 1 week ago

Thanks, yeah, we were able to fix it. Is there a chance there will be another breaking API change without a major version increase? If so, going forward we can pin the version on our side.

nmammeri commented 1 week ago

I don't see any breaking changes coming up. You can pin your version.
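
As a rough illustration of what pinning protects against, here is a small sketch of a version-range check. The function names are hypothetical, the upper bound of 0.3.0 is an assumption rather than anything stated in this thread, and the parser only handles plain numeric versions such as 0.2.1. In a requirements file the same range would typically be written as a specifier like extractous>=0.2.0,<0.3.0.

```python
def parse_version(text):
    """Parse a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def in_pinned_range(installed, minimum="0.2.0", below="0.3.0"):
    """Return True if `installed` lies in the range [minimum, below).

    The default bounds are assumptions for illustration: 0.2.0 is the
    release that introduced the tuple return, and 0.3.0 is a guessed
    ceiling for the next possibly breaking release.
    """
    return parse_version(minimum) <= parse_version(installed) < parse_version(below)
```

A test suite could call `in_pinned_range` on the installed version at startup and fail early with a clear message instead of an `AttributeError` deep inside extraction.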

yobix-ai / extractous

Change in PDF Extraction Results #30