yobix-ai / extractous

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
Apache License 2.0
415 stars, 17 forks

Feature/3 return tika metadata #26

Closed s4zuk3 closed 1 week ago

s4zuk3 commented 1 week ago

Here is the new revised version of the Tika Metadata implementation. Any comments and/or changes are welcome.

Once again, thanks to @nmammeri and @KapiWow for their help with the HashMap configuration. And thank you for the opportunity to contribute a small part to this great project!

nmammeri commented 1 week ago

While reviewing this today I found that Tika metadata is actually a Map<String, List<String>>. For example, the key "pdf:charsPerPage" returns the number of characters per page for every page in the document. The current implementation returns only the first element and ignores the rest.
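To illustrate the data loss described here, a minimal Rust sketch follows. The `Metadata` alias, both function names, and the sample page counts are assumptions made up for illustration, not code from this PR.

```rust
use std::collections::HashMap;

/// Multi-valued metadata: each key maps to every value reported for it.
/// (Alias name is an assumption for this sketch.)
type Metadata = HashMap<String, Vec<String>>;

/// Lossy flattening that keeps only the first value per key,
/// which is the behaviour described above as the problem.
fn first_values_only(metadata: &Metadata) -> HashMap<String, String> {
    metadata
        .iter()
        .filter_map(|(key, values)| values.first().map(|v| (key.clone(), v.clone())))
        .collect()
}

/// Hypothetical sample: per-page character counts for a three-page PDF.
fn sample_metadata() -> Metadata {
    let mut metadata = Metadata::new();
    metadata.insert(
        "pdf:charsPerPage".to_string(),
        vec!["1200".to_string(), "950".to_string(), "30".to_string()],
    );
    metadata
}
```

Calling `first_values_only(&sample_metadata())` keeps only the first page's count, while the multi-valued map retains all three.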

I'm halfway through making the necessary changes: adding apache.tika.Metadata to StringResult and ReaderResult, then converting it to HashMap<String, Vec<String>> on the Rust side. The Java parseMetadata() function is redundant because we can loop through the metadata from the Rust side.
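The Rust-side loop could look roughly like the sketch below. It does not use JNI. The `MetadataSource` trait is a hypothetical stand-in whose two methods mirror Tika's `Metadata.names()` and `Metadata.getValues(String)`, and `FakeTikaMetadata` is an in-memory fake so the conversion can be exercised without a JVM. None of these names come from the PR.

```rust
use std::collections::HashMap;

/// Target type on the Rust side (alias name is an assumption).
type Metadata = HashMap<String, Vec<String>>;

/// Hypothetical stand-in for the Java metadata object; the two methods
/// mirror Tika's `names()` and `getValues(name)`.
trait MetadataSource {
    fn names(&self) -> Vec<String>;
    fn get_values(&self, name: &str) -> Vec<String>;
}

/// Walk every name and collect all of its values, not just the first.
fn to_rust_metadata(source: &impl MetadataSource) -> Metadata {
    source
        .names()
        .into_iter()
        .map(|name| {
            let values = source.get_values(&name);
            (name, values)
        })
        .collect()
}

/// In-memory fake used only to exercise the conversion.
struct FakeTikaMetadata(Vec<(String, Vec<String>)>);

impl MetadataSource for FakeTikaMetadata {
    fn names(&self) -> Vec<String> {
        self.0.iter().map(|(name, _)| name.clone()).collect()
    }

    fn get_values(&self, name: &str) -> Vec<String> {
        self.0
            .iter()
            .find(|(candidate, _)| candidate.as_str() == name)
            .map(|(_, values)| values.clone())
            .unwrap_or_default()
    }
}
```

In the real bindings the two trait methods would be backed by calls into the JVM. Keeping the loop in Rust is what makes a separate Java-side parsing helper unnecessary.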

s4zuk3 commented 1 week ago

Hey! Thanks for the recommendations, I applied the requested changes. I changed the Java step from HashMap to Tika.Metadata, fixed the Rust type to HashMap<String, Vec<String>>, and extended the functionality to all the other extraction methods.

Please let me know if any additional changes are needed. Thanks again @nmammeri and @KapiWow.

nmammeri commented 1 week ago

Thanks again @s4zuk3 for the updates, amazing work. I've just made the checking logic of the metadata tests more stringent.

With the previous code, if I removed some values from a list, the tests did not detect it. Likewise, if I removed some keys from the extracted metadata, that was not detected either.

I changed it so that we check that every key in the expected metadata is found, and that the values for every expected key are similar to the extracted values.
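A check along these lines might be sketched as follows. The function name `metadata_mismatches` and the `Metadata` alias are hypothetical, and "similar" is interpreted here as "every expected value appears in the extracted list", which is an assumption rather than the comparison the tests actually use.

```rust
use std::collections::HashMap;

/// Alias name is an assumption for this sketch.
type Metadata = HashMap<String, Vec<String>>;

/// Returns a description of every mismatch; an empty vector means the
/// extracted metadata covers all expected keys and values.
fn metadata_mismatches(expected: &Metadata, extracted: &Metadata) -> Vec<String> {
    let mut problems = Vec::new();
    for (key, expected_values) in expected {
        match extracted.get(key) {
            // A key dropped from the extracted metadata is reported.
            None => problems.push(format!("missing key: {key}")),
            // A value dropped from a key's list is reported as well.
            Some(actual_values) => {
                for value in expected_values {
                    if !actual_values.contains(value) {
                        problems.push(format!("key {key}: missing value {value}"));
                    }
                }
            }
        }
    }
    // HashMap iteration order is unspecified, so sort for stable output.
    problems.sort();
    problems
}
```

A test can then assert that the returned vector is empty. Removing either a key or a single list value from the extracted side yields a non-empty result, covering both gaps described above.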