yobix-ai / extractous

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
Apache License 2.0
448 stars, 17 forks

Extracting text from a specific page of the document #6

Closed bm777 closed 1 month ago

bm777 commented 2 months ago

How can I load a specific page and extract its text?

Also, how can I identify and retrieve the paragraph with the highest occurrence of a target word on a multi-paragraph page?

nmammeri commented 2 months ago

There are 2 parts to this question:

  1. Loading a specific PDF page: the short answer is that it is not possible with the current version. However, I'm thinking about how this could be handled in future versions.
  2. Retrieving the paragraph with the highest occurrence of the target word: I think this is out of scope for Extractous and is something that can be performed post-extraction. The extracted text is delimited by spaces; to get the paragraphs, I would split the string, then search all paragraphs and select the one with the highest occurrence of the target word.
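
The post-extraction step described above could look roughly like the following. This is only a minimal sketch, not part of Extractous: the function name `best_paragraph` is hypothetical, and it assumes the extracted text is already available as a plain Python string and that paragraphs are separated by blank lines (adjust the split pattern if the extracted text uses a different delimiter).

```python
import re


def best_paragraph(text: str, target: str) -> tuple[str, int]:
    """Return the paragraph with the most matches of `target`, plus the match count.

    Matching is case-insensitive and on whole words. On a tie, the earliest
    paragraph wins. Returns ("", 0) if the text contains no paragraphs.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return "", 0

    # Escape the target so it is matched literally, not as a regex.
    pattern = re.compile(rf"\b{re.escape(target)}\b", re.IGNORECASE)
    counts = [len(pattern.findall(p)) for p in paragraphs]

    # max() returns the first index with the highest count.
    best = max(range(len(paragraphs)), key=counts.__getitem__)
    return paragraphs[best], counts[best]


if __name__ == "__main__":
    sample = "Rust is fast.\n\nPython is easy. Python is popular.\n\nGo is simple."
    paragraph, count = best_paragraph(sample, "python")
    print(count, paragraph)
```

Since this is a single pass over a string that is already in memory, the extra cost should be small compared with the extraction itself, although that has not been measured here.
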
bm777 commented 2 months ago

Okay, understood. Something for a future release, then.

For the second question, is there no way to use a regex in the preprocessor?

Giving this task to a different library will increase the processing time, no?

nmammeri commented 2 months ago

If I understood correctly, the use cases you want are kind of:

Interesting use case. We'll definitely look into this, but I'm wondering whether unstructured has such functionality.

bm777 commented 2 months ago

Looking forward to playing with Extractous.