Extracting text from a specific page of the document

yobix-ai / extractous

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

Apache License 2.0


Closed bm777 closed 1 month ago

bm777 commented 2 months ago

How can I load a specific page and extract text?

Also, how can I identify and retrieve the paragraph with the highest occurrence of a target word on a multi-paragraph page?

nmammeri commented 2 months ago

There are 2 parts to this question:

Loading a specific PDF page: The short answer is that it is not possible with the current version. However, I'm thinking about how this can be handled in future versions.

One way would be to split the PDF file using a different library, then pass the byte[] stream to Extractous.
The second way would be to add a specific Extractor function that returns a list of pages.

Retrieving a paragraph with the highest occurrence of the target word: I think this is out of scope for Extractous and is something that can be performed post-extraction. The extracted text is delimited by spaces; to get the paragraphs, I would split the string, then search all paragraphs and select the one with the highest occurrence of the target word.
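
For illustration, the post-extraction step described above could be sketched roughly as follows. This is only a sketch that works on an already-extracted string and does not call Extractous at all. The function name `best_paragraph` is hypothetical, and it assumes paragraphs are separated by blank lines, which may not match what Extractous actually emits for a given document.

```python
import re


def best_paragraph(text: str, target: str) -> tuple[str, int]:
    """Return (paragraph, count) for the paragraph with the most
    case-insensitive whole-word matches of `target`.

    Returns ("", 0) when no paragraph contains the word.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    # Match the target as a whole word, ignoring case.
    pattern = re.compile(rf"\b{re.escape(target)}\b", re.IGNORECASE)

    best, best_count = "", 0
    for paragraph in paragraphs:
        count = len(pattern.findall(paragraph))
        # Strict comparison: on a tie, the earliest paragraph is kept.
        if count > best_count:
            best, best_count = paragraph, count
    return best, best_count


if __name__ == "__main__":
    sample = (
        "Rust is fast.\n\n"
        "Rust bindings make Rust usable from Python.\n\n"
        "Python is popular."
    )
    print(best_paragraph(sample, "rust"))
```

If the extracted text uses a different paragraph delimiter, only the `re.split` pattern would need to change.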

bm777 commented 2 months ago

Okay, understood. Future release.

For the second question, is there no way to use regex there, in the preprocessor?

Giving this task to a different library will increase the processing time, no?

I was just suggesting since you guys are using the speed of Rust, why not include it as an option? But by default, it returns just text.
And lastly, people are lazy which is why they use unstructured.

nmammeri commented 2 months ago

If I understood correctly, the use cases you want are kind of:

Interesting use case. We'll definitely look into this, but I'm wondering if Unstructured has such functionality.

bm777 commented 2 months ago

Extracting text from a specific page is available in Unstructured [here]. (I'm not using it; I'm using a different tool.)
Searching for a target word and returning the paragraph with the most occurrences. (Currently I use regex, which is just a naive approach.)

Looking forward to playing with Extractous.