yobix-ai / extractous

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
Apache License 2.0

tests: Tests with different file formats #8

Closed KapiWow closed 1 month ago

KapiWow commented 2 months ago

Issue-ID: 2

The test files were taken from the unstructured repository, and the expected result files were also generated by the unstructured library. Hopefully the library works well with its own test files.

I used cosine_similarity instead of Levenshtein distance because Levenshtein takes about 20 seconds to compute the similarity for the extracted PDF text.
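
For readers unfamiliar with the trade-off, here is a minimal sketch of a token-based cosine similarity check. It is not the code from this PR: the function name `cosine_similarity`, the regex tokenizer, and the handling of empty input are all assumptions made for illustration. The point is that the comparison runs in time roughly linear in the text length, whereas the standard dynamic-programming Levenshtein distance is quadratic in it, which hurts on long PDF extractions.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts.

    Hypothetical helper for illustration only; the tokenizer (lowercased
    runs of word characters) is an assumption, not the PR's choice.
    """
    counts_a = Counter(re.findall(r"\w+", text_a.lower()))
    counts_b = Counter(re.findall(r"\w+", text_b.lower()))

    # Dot product over the words of text_a; Counter returns 0 for missing keys.
    dot = sum(count * counts_b[word] for word, count in counts_a.items())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    if norm_a == 0 or norm_b == 0:
        # Two empty texts are treated as identical; one empty text as unrelated.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)
```

Because this measure only compares word counts, it ignores word order: two texts with the same words in a different order score as identical. That is usually acceptable for checking extracted text against an expected result with a threshold, but it is a weaker check than an edit distance.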

KapiWow commented 1 month ago

All comments have been addressed.