yan74 / afplib

Java library for reading and writing AFP (Advanced Function Presentation) files.
Apache License 2.0

Unable to parse seemingly valid AFP file #22

Closed mattcg closed 7 years ago

mattcg commented 7 years ago

Thank you for open-sourcing this library. I'm currently working on an Apache Tika parser for AFP files that wraps afplib.

However, none of the documents in my corpus can be parsed. The parser looks for the 5A magic byte and fails to find it before the threshold constant. If I fast-forward the stream to the first 5A, then it parses the structured field as an unknown structure, then errors out before the end of the stream.

Opening the file in AFPWorx renders nothing - 0 pages and 0 resources. Nothing is logged in the error log.

The file appears to be valid. I can see the magic bytes D3A8 at the beginning and D3A9 at the end.

Is this some other variant of the format which is unsupported?

I've attached one of the files in question, encrypted using your public key as it's semi-confidential. You'll first have to unzip it, then decrypt it.

D1-106.afp.gpg.zip

yan74 commented 7 years ago

@mattcg,

It looks like there are structured fields in your file, but there are no 5a magic bytes in front of them. Every AFP file I have seen so far has a 5a byte in front of each structured field (SF) introducer:

5a <2bytes length><3 bytes id> ....

Some hosts add an additional length field in front of that:

<2 or 4 bytes of length> 5a <2bytes length> ....

Yours looks like this:

<2bytes length><3bytes id>...
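
The three layouts above can be told apart by looking at the first few bytes of the stream. What follows is a minimal sketch, not afplib's actual code: the class `SfFraming`, its `Framing` enum and the `detect` method are hypothetical names made up for illustration. It assumes the 3-byte id always starts with the class byte D3, which matches the D3A8 and D3A9 ids mentioned earlier in this thread.

```java
// Hypothetical helper, NOT part of afplib: guesses how structured fields are framed.
final class SfFraming {
    enum Framing { CC_5A, LENGTH_PREFIXED_5A, BARE, UNKNOWN }

    private static final int CC = 0x5A;       // magic byte in front of the introducer
    private static final int SF_CLASS = 0xD3; // assumed first byte of the 3-byte id

    // Inspects the first bytes of a file and reports which layout they match.
    static Framing detect(byte[] head) {
        if (isIntroducerAt(head, 0, true)) {
            return Framing.CC_5A;               // 5a <2 bytes length><3 bytes id>
        }
        if (isIntroducerAt(head, 2, true) || isIntroducerAt(head, 4, true)) {
            return Framing.LENGTH_PREFIXED_5A;  // <2 or 4 bytes of length> 5a ...
        }
        if (isIntroducerAt(head, 0, false)) {
            return Framing.BARE;                // <2 bytes length><3 bytes id>
        }
        return Framing.UNKNOWN;
    }

    // True if an introducer, optionally preceded by 5a, starts at the given offset.
    private static boolean isIntroducerAt(byte[] b, int offset, boolean expectCc) {
        int idPos = offset + (expectCc ? 1 : 0) + 2; // skip optional 5a and 2 length bytes
        if (idPos >= b.length) {
            return false;                            // not enough bytes to decide
        }
        if (expectCc && (b[offset] & 0xFF) != CC) {
            return false;
        }
        return (b[idPos] & 0xFF) == SF_CLASS;
    }
}
```

Passing the first eight bytes of a file to `SfFraming.detect` is enough for this check. It is only a heuristic on the first field; a real parser would still need to validate the lengths of the fields that follow.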

So, again, I have never seen this before. I also ran your file through PSF (IBM's Print Services Facility) and it errors. I'm not quite sure what to do here. What is the source of your data?

mattcg commented 7 years ago

Thanks for checking it with the PSF! Can you view it using IBM's viewer plugin? It isn't available for Mac so I can't do so myself.

yan74 commented 7 years ago

I will check later this week when I am back home - I only have my MacBook with me :) But I am pretty certain this won't show anything.

yan74 commented 7 years ago

Well, I stand corrected - this is a valid AFP file - I could print it. So I need to change the code to not require 5a magic bytes.

mattcg commented 7 years ago

Interesting. How are you printing it on a Mac?

yan74 commented 7 years ago

I have one virtual machine running PSF and another running an IPDS virtual printer.

yan74 commented 7 years ago

The latest commit can parse your AFP file.

yan74 commented 7 years ago

One more thing: I noticed your AFP file contains only images (BIM), which render OK, but you probably won't get any useful information out of them for Tika unless it can do OCR.

yan74 commented 7 years ago

Having said that, I think your project is a very interesting one. I'd like to see an AFP Tika module - it would be very useful!

mattcg commented 7 years ago

Success! And yes, the Tika standard way to handle images would be to extract the stream, then to delegate parsing of each one to Tika, which would invoke the TesseractOCRParser if the image format is supported.

yan74 commented 7 years ago

That's interesting. Will your module be open source and do you need help writing the image extraction part?