yobix-ai / extractous

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
Apache License 2.0

Failed Extraction - cmap font missing #33

Open s4zuk3 opened 1 week ago

s4zuk3 commented 1 week ago

Hello! While trying to extract content from a PDF, I got the following error with very little information:

After modifying the code, I was able to extract the full error, which is as follows:

Stack trace: org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:219)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
    at ai.yobix.TikaNativeMain.parseToStringWithConfig(TikaNativeMain.java:199)
    at ai.yobix.TikaNativeMain.parseFileToString(TikaNativeMain.java:103)
**Caused by: java.io.IOException: Error: Could not find referenced cmap stream Identity-V**
    at org.apache.fontbox.cmap.CMapParser.getExternalCMap(CMapParser.java:508)
    at org.apache.fontbox.cmap.CMapParser.parsePredefined(CMapParser.java:99)
    at org.apache.pdfbox.pdmodel.font.CMapManager.getPredefinedCMap(CMapManager.java:54)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.readEncoding(PDType0Font.java:287)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:204)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:97)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:171)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:959)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:532)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:507)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:151)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:137)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1369)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
    ... 6 more

From what I can see, several CMap resource files from fontbox.cmap (Identity-V in this case) are missing and need to be included in the Tika native image. I only see a few of them listed in the configuration. Would it be possible to include all of them?
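
For anyone wanting to check this locally, here is a minimal sketch of the idea. It assumes the predefined CMaps live under `org/apache/fontbox/cmap/` on the classpath, which matches the package in the stack trace but is not confirmed against the extractous build. The class name `CmapResourceCheck`, the constant `INCLUDE_ALL`, and both helper methods are hypothetical. The sketch shows that one regex-style include pattern would cover every file in that directory, and it lists which CMap names cannot be loaded at runtime.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CmapResourceCheck {
    // Assumption: fontbox ships its predefined CMaps under this classpath directory.
    static final String CMAP_DIR = "org/apache/fontbox/cmap/";

    // Hypothetical single include pattern covering every file in that directory,
    // instead of listing each CMap individually in the resource configuration.
    static final Pattern INCLUDE_ALL = Pattern.compile(Pattern.quote(CMAP_DIR) + ".*");

    // True if the include pattern matches the given resource path.
    public static boolean covered(String resourcePath) {
        return INCLUDE_ALL.matcher(resourcePath).matches();
    }

    // Returns the CMap names that cannot be found on the classpath at runtime.
    public static List<String> missing(List<String> cmapNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : cmapNames) {
            if (ClassLoader.getSystemResource(CMAP_DIR + name) == null) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        List<String> names = List.of("Identity-H", "Identity-V");
        System.out.println("covered by pattern: " + covered(CMAP_DIR + "Identity-V"));
        System.out.println("missing at runtime: " + missing(names));
    }
}
```

Run with fontbox on the classpath, `missing` should come back empty on a regular JVM. Inside a native image it would presumably list whichever CMaps were not registered as resources.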

Thanks!

KapiWow commented 5 days ago

Hello @s4zuk3! Yes, we can add the missing entries to the config. Could you provide a PDF for testing? A short example that reproduces the failure would also help.

s4zuk3 commented 4 days ago

cmap_issue.pdf

Thanks!