PDF417: encoding issue with one specific character (hex 60)

uwolfer commented 2 years ago

(Extracted this out of #91 because #91 is closed already.)

I have noticed an encoding issue with one specific character: ` (hex 60) gets encoded in the barcode into ? (hex 3f).

Edit: I cannot reproduce this issue with ZXing as used in your test cases, but I'm able to reproduce this issue with "J4L Barcode Vision" for example. You can reproduce the issue with these symbols:

Generated with Okapi: oki Screenshot_20220818_105529

Generated with RBarcode (this symbol contains a few more bytes at the end): Screenshot_20220818_105445

Is the information already useful? Or do you need more context?

Originally posted by @uwolfer in https://github.com/woo-j/OkapiBarcode/issues/91#issuecomment-1219379218

gredler commented 2 years ago

First pass, this seems to be a decoding bug in J4L Barcode Vision. I've checked the output of Pdf417Test.testSetContentBytes() against ZXing, a phone app, and a hardware barcode scanner, and none of them have issues with the Okapi-generated image (i.e. the backtick, hex value 0x60, decodes correctly). Let me know if you disagree or have more information, otherwise I don't think there's anything else we can do here.

uwolfer commented 2 years ago

Thank you for the analysis @gredler.

I assume you are right. This issue is again related to the Swiss tax reporting barcodes. Their reference implementation suggests using J4L Barcode Vision for encoding. If we replaced barcode generation with Okapi, tools that follow the reference implementation for decoding would fail to handle such reports. That makes this issue a blocker, and I would really like to find a workaround. Some ideas that come to mind:

Since the barcode generated by RBarcode can be read by both J4L and ZXing, could we use the same encoding pattern? This would IMO improve Okapi's compatibility, even if it is not required by the spec (or is even the result of a bug in RBarcode?). I lack the basic PDF417 knowledge to see what RBarcode is doing differently from Okapi in this specific case.
Could we encode the input differently, so that it does not contain this one character, before passing it to Okapi? In my use case, we deflate (zip) the text input and encode the result as binary into the PDF417 symbol. I don't think we can work around the issue at this level, but perhaps you have an idea here?
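To illustrate the second idea, here is a minimal sketch, assuming the payload is produced with the JDK's java.util.zip.Deflater (the actual compression setup of the tax solution is not shown in this thread, and the DeflateProbe class and its method names are made up for illustration). It deflates some text, inflates it again as a sanity check, and reports whether the byte 0x60 occurs in the compressed output. Since deflated data is effectively arbitrary binary, any byte value can appear in it, which is why avoiding one specific value at this level looks impractical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of Okapi or J4L.
final class DeflateProbe {

    // Compresses UTF-8 text with the JDK Deflater (zlib wrapper, default level).
    static byte[] deflate(String text) {
        Deflater deflater = new Deflater();
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverses deflate(); used only to confirm the round trip.
    static String inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a valid zlib stream", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Returns true if the given unsigned byte value (0..255) occurs in data.
    static boolean containsByte(byte[] data, int value) {
        for (byte b : data) {
            if ((b & 0xFF) == value) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "some tax report text, some tax report text";
        byte[] payload = deflate(text);
        System.out.println("round trip ok: " + text.equals(inflate(payload)));
        System.out.println("contains 0x60: " + containsByte(payload, 0x60));
    }
}
```

Whether 0x60 shows up depends entirely on the input, so a check like this can detect the problematic byte but cannot prevent it.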

gredler commented 2 years ago

Have you reported the (possible) J4L Barcode Vision decoding bug to the J4L team? I think fixing that defect would be the ideal scenario, rather than trying to add workarounds to the encoder (Okapi). There's probably a tradeoff here between optimal encoding size (which affects all users) and encoding which is compatible with the current J4L Barcode Vision decoding behavior (which affects a smaller number of users).

uwolfer commented 2 years ago

Have you reported the (possible) J4L Barcode Vision decoding bug to the J4L team?

No, I have not yet. But to be honest, I also do not think there is a big chance of a change on their side since the last update to the library was in 2007 (!)...

There's probably a tradeoff here between optimal encoding size...

By this, do you mean how much data can be placed into one symbol? The odd thing is that the PDF417 symbol generated by RBarcode even contains a few more bytes than the Okapi one (you can see it in the two sample barcodes above). It's just a few bytes if I remember correctly, but still.

I would really like to understand what RBarcode does differently from Okapi when encoding the symbol, since that symbol is readable by every scanner I have tested (incl. J4L). I still think it would be worth investigating, and perhaps we could find a solution that works in all cases and has no negative side effects. Unfortunately, I lack knowledge of the PDF417 barcode structure.

Side note: in the long run, it would certainly make sense to replace J4L in the Swiss tax solution with, for example, the open-source ZXing, but that process would take a very long time: many different tax processing applications are in the wild, and I assume many of them are based on the J4L reference implementation. As long as this is the case, I see no chance of replacing the RBarcode PDF417 implementation with the Okapi one. It would be really great if we could replace it with Okapi and have it be fully backwards compatible.

uwolfer commented 9 months ago

@gredler How should we proceed with this issue? I have not continued trying to further replace the barcode generator in the Swiss tax reporting solution because of this blocker. I still think it would be nice to look into this, and understand it, but my knowledge here is quite limited. As far as I remember, this was the last blocking issue for a drop-in replacement.

But I would also understand if you decide to close this issue, because it seems to be a corner case / issue in a different application.

gredler commented 9 months ago

I'm not sure, trying to remember where we left this... do you have the sample content and settings used to generate the test barcode above?

uwolfer commented 9 months ago

I was able to reproduce the above again just now.

Using Okapi:

// Single-byte payload: the backtick (0x60).
byte[] code = new byte[] { (byte) 0x60 };

Pdf417 symbolTemplate = new Pdf417();
symbolTemplate.setPreferredEccLevel(4);
symbolTemplate.setBarHeight(1);
symbolTemplate.setRows(35);
symbolTemplate.setDataColumns(13);
symbolTemplate.setStructuredAppendIncludeSegmentCount(true);
// Note: the stack trace later in this thread shows setContent() calling encode()
// directly, so the structured append setters below may have no effect on this symbol.
symbolTemplate.setContent(code);
String fileName = "file";
symbolTemplate.setStructuredAppendFileName(fileName);
// Random file ID in the range 0..898.
int fileId = (int) (Math.random() * 899);
symbolTemplate.setStructuredAppendFileId(fileId);
// Render the symbol into a 1-bit image at magnification 1.
BufferedImage img = new BufferedImage(symbolTemplate.getWidth(),
        symbolTemplate.getHeight(),
        BufferedImage.TYPE_BYTE_BINARY);
Graphics2D g2d = img.createGraphics();
Java2DRenderer renderer = new Java2DRenderer(g2d, 1, Color.WHITE, Color.BLACK);
renderer.render(symbolTemplate);

Generates the following barcode: oki

Extracting it with java4less.PDF417Reader produces the following (which is different from the input and thus wrong): 0000: 3f ?

When encoding the same data with java4less.BarCode2D, it generates this one: java4l

Extracting it with java4less.PDF417Reader produces the following: 0000: 60 `

Other barcode readers extract the backtick (hex 60) in both cases. But since other barcode readers can also read the java4l-encoded version properly, it would be interesting to understand whether Okapi's encoding could be changed to produce the output generated by java4l. I do not understand in enough detail how the data is encoded or where exactly the two barcodes differ. @gredler Does this information help you to understand the issue better?

gredler commented 9 months ago

Thanks, I'll take a look in the next few days.

Do you have a realistic sample Swiss e-tax return barcode(s) without any PII? It might be good to add something realistic to the test suite if we do end up making any changes here.

gredler commented 9 months ago

The difference between the two is in the compaction mode.

There are a few options for how PDF417 can encode data internally, and the optimal encoding mode is chosen dynamically based on the data (e.g. if part of the data is a long series of numbers, you can switch to numeric compaction mode to represent those numbers more efficiently, and then switch back to a different mode).

In the images above, J4L chooses byte compaction mode, while Okapi chooses text compaction mode. It's a mistake to extrapolate the type of data from the compaction mode chosen, but it sounds like the J4L decoder may be doing this, or at least has some bug related to compaction modes.
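To make the size tradeoff concrete, here is a rough sketch of how many data codewords each compaction mode needs for n input characters. It only uses the standard PDF417 packing ratios (text: 2 characters per codeword; byte: 6 bytes per 5 codewords, leftover bytes 1 codeword each; numeric: up to 44 digits per 15 codewords). It ignores latch, shift, and sub-mode switch codewords, and it is not Okapi's actual mode selection logic; the class and method names are made up for illustration.

```java
// Hypothetical helper for back-of-the-envelope size estimates only.
final class CompactionSizeSketch {

    // Byte compaction: each full group of 6 bytes becomes 5 codewords,
    // and each remaining byte takes one codeword.
    static int byteCodewords(int n) {
        return (n / 6) * 5 + (n % 6);
    }

    // Text compaction: two sub-mode values per codeword, rounded up.
    // Sub-mode latches and shifts (needed e.g. for punctuation) are ignored.
    static int textCodewords(int n) {
        return (n + 1) / 2;
    }

    // Numeric compaction: full groups of 44 digits take 15 codewords,
    // a shorter final group of r digits takes r / 3 + 1 codewords.
    static int numericCodewords(int n) {
        int fullGroups = n / 44;
        int rest = n % 44;
        return fullGroups * 15 + (rest == 0 ? 0 : rest / 3 + 1);
    }

    public static void main(String[] args) {
        int n = 100;
        System.out.println("byte:    " + byteCodewords(n));
        System.out.println("text:    " + textCodewords(n));
        System.out.println("numeric: " + numericCodewords(n));
    }
}
```

For data that qualifies for text or numeric compaction, those modes come out smaller than byte compaction, which is why an encoder normally prefers them when it can.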

We can perhaps adjust the setContent(byte[]) method that we added a while back to also force the use of byte compaction mode, but this is super sketchy and I would at the very least want to add a big warning that this should usually not be necessary.

Here's a sample image representing all 0-255 byte values in forced byte compaction mode, can you check it with the J4L decoder?

bytes
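For anyone unfamiliar with byte compaction, here is a small worked sketch of its core step: six bytes are treated as one base-256 number and rewritten as five base-900 codewords. This is only the packing arithmetic from the PDF417 spec, not Okapi's implementation, and the class name is made up for illustration. Leftover bytes (fewer than six) and the mode latch codewords are not handled here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper showing the 6-bytes-to-5-codewords conversion.
final class ByteCompactionSketch {

    // Packs exactly six bytes into five base-900 codewords (most significant first).
    static int[] packSixBytes(byte[] six) {
        if (six.length != 6) {
            throw new IllegalArgumentException("expected exactly 6 bytes");
        }
        // 256^6 is smaller than 900^5, so the value fits in a long and in 5 codewords.
        long value = 0;
        for (byte b : six) {
            value = (value << 8) | (b & 0xFF);
        }
        int[] codewords = new int[5];
        for (int i = 4; i >= 0; i--) {
            codewords[i] = (int) (value % 900);
            value /= 900;
        }
        return codewords;
    }

    public static void main(String[] args) {
        byte[] sample = "alcool".getBytes(StandardCharsets.US_ASCII);
        // Prints [163, 238, 432, 766, 244]
        System.out.println(Arrays.toString(packSixBytes(sample)));
    }
}
```

Because every byte value from 0 to 255 goes through the same arithmetic, this mode has no per-character lookup tables, which fits the observation that forced byte compaction works for any input.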

uwolfer commented 9 months ago

This is the J4L decoding output of your sample. I think it looks good! Thanks a lot for your support!

0000:  00 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f  ~~~~~~~~~~~~~~~~
0010:  10 11 12 13 14 15 16 17  18 19 1a 1b 1c 1d 1e 1f  ~~~~~~~~~~~~~~~~
0020:  20 21 22 23 24 25 26 27  28 29 2a 2b 2c 2d 2e 2f   !"#$%&'()*+,-./
0030:  30 31 32 33 34 35 36 37  38 39 3a 3b 3c 3d 3e 3f  0123456789:;<=>?
0040:  40 41 42 43 44 45 46 47  48 49 4a 4b 4c 4d 4e 4f  @ABCDEFGHIJKLMNO
0050:  50 51 52 53 54 55 56 57  58 59 5a 5b 5c 5d 5e 5f  PQRSTUVWXYZ[\]^_
0060:  60 61 62 63 64 65 66 67  68 69 6a 6b 6c 6d 6e 6f  `abcdefghijklmno
0070:  70 71 72 73 74 75 76 77  78 79 7a 7b 7c 7d 7e 7f  pqrstuvwxyz{|}~~
0080:  80 81 82 83 84 85 86 87  88 89 8a 8b 8c 8d 8e 8f  ~~~~~~~~~~~~~~~~
0090:  90 91 92 93 94 95 96 97  98 99 9a 9b 9c 9d 9e 9f  ~~~~~~~~~~~~~~~~
00a0:  a0 a1 a2 a3 a4 a5 a6 a7  a8 a9 aa ab ac ad ae af  ~~~~~~~~~~~~~~~~
00b0:  b0 b1 b2 b3 b4 b5 b6 b7  b8 b9 ba bb bc bd be bf  ~~~~~~~~~~~~~~~~
00c0:  c0 c1 c2 c3 c4 c5 c6 c7  c8 c9 ca cb cc cd ce cf  ~~~~~~~~~~~~~~~~
00d0:  d0 d1 d2 d3 d4 d5 d6 d7  d8 d9 da db dc dd de df  ~~~~~~~~~~~~~~~~
00e0:  e0 e1 e2 e3 e4 e5 e6 e7  e8 e9 ea eb ec ed ee ef  ~~~~~~~~~~~~~~~~
00f0:  f0 f1 f2 f3 f4 f5 f6 f7  f8 f9 fa fb fc fd fe ff  ~~~~~~~~~~~~~~~~

I have dug a bit into J4L with your comment in mind, and saw that it lets you choose a PDFMode from PDF_BINARY, PDF_TEXT, and PDF_NUMERIC. The default is PDF_BINARY, which explains the output.

We can perhaps adjust the setContent(byte[]) method that we added a while back to also force the use of byte compaction mode, but this is super sketchy and I would at the very least want to add a big warning that this should usually not be necessary.

What do you think about adding a new method setForcedEncodingMode(EncodingMode) which disables dynamic selection of the encoding mode?

Do you have a realistic sample Swiss e-tax return barcode(s) without any PII? It might be good to add something realistic to the test suite if we do end up making any changes here.

I'm not sure yet what kind of test cases you have in mind, but once I have fully integrated Okapi, I could also share the relevant code for generating such a series of barcodes. Sample PDF

gredler commented 9 months ago

Thanks for the confirmation!

I'm hesitant to add full-fledged compaction mode selection support. First, none of this should be necessary, so keeping the support as minimal as possible is attractive. Second, while forcing byte compaction is simple because byte compaction can be used for any input data, forcing numeric or text compaction is more complicated because not all data can use these compaction modes, so we'd have to detect and report mismatches between requested modes and input data.
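As a rough sketch of the mismatch detection mentioned above (not something Okapi provides; the Mode enum and method names are hypothetical), a check like the following would be needed before honoring a forced mode. It assumes the standard PDF417 character sets: numeric compaction covers only the digits 0-9, text compaction covers printable ASCII (32-126) plus tab, line feed, and carriage return, and byte compaction covers everything.

```java
// Hypothetical validation helper, for illustration only.
final class CompactionModeCheck {

    enum Mode { BYTE, TEXT, NUMERIC }

    private static boolean isDigit(int v) {
        return v >= '0' && v <= '9';
    }

    private static boolean isTextChar(int v) {
        return (v >= 32 && v <= 126) || v == '\t' || v == '\n' || v == '\r';
    }

    // Returns true if every byte in data can be represented in the requested mode.
    static boolean canEncode(Mode mode, byte[] data) {
        for (byte b : data) {
            int v = b & 0xFF;
            if (mode == Mode.NUMERIC && !isDigit(v)) {
                return false;
            }
            if (mode == Mode.TEXT && !isTextChar(v)) {
                return false;
            }
        }
        return true; // BYTE accepts any value
    }

    // Rejects a forced mode that cannot represent the data.
    static void requireEncodable(Mode mode, byte[] data) {
        if (!canEncode(mode, data)) {
            throw new IllegalArgumentException("data cannot be encoded in " + mode + " compaction mode");
        }
    }

    public static void main(String[] args) {
        byte[] backtick = { 0x60 };
        System.out.println("TEXT ok:    " + canEncode(Mode.TEXT, backtick));
        System.out.println("NUMERIC ok: " + canEncode(Mode.NUMERIC, backtick));
        System.out.println("BYTE ok:    " + canEncode(Mode.BYTE, backtick));
    }
}
```

Note that the backtick (0x60) is a valid text compaction character under these rules, which is consistent with Okapi choosing text compaction for it.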

Let's keep the change minimal for now, and if a broader or more generic solution is needed in the future for other use cases then we can revisit. I'll commit this change in the next few minutes, please give it a try and let me know if you run into any issues. I plan to cut a new release of Okapi early next week, so any additional feedback would be best before then.

uwolfer commented 9 months ago

Tests look good so far, thanks a lot!

I now need to rework this one a bit, as it seems not to work in all cases yet: https://github.com/woo-j/OkapiBarcode/issues/92

Another issue: J4L seems to fail to extract StructuredAppendTotal (it cannot find it, while StructuredAppendPosition works). Unfortunately I have not found any other PDF417 reader which shows me this macro value. Do you have an easy way to check whether the total is properly included when using it like this? I have not debugged this in detail yet.

symbol.setStructuredAppendPosition(i);
symbol.setStructuredAppendTotal(structuredAppendTotal);

gredler commented 9 months ago

Tests look good so far, thanks a lot!

Great, thanks for checking.

I have not found any other PDF417 reader which shows me this macro value

ZXing provides this information in the result metadata, and the Okapi tests actually check that the structured append metadata returned by ZXing matches the symbol input (when applicable):

https://github.com/woo-j/OkapiBarcode/blob/master/src/test/java/uk/org/okapibarcode/backend/SymbolTest.java#L455

EDIT: Are you calling setStructuredAppendIncludeSegmentCount(true)?

uwolfer commented 9 months ago

Are you calling setStructuredAppendIncludeSegmentCount(true)?

Thanks - that did the trick! Good news: I was able to produce the first valid tax barcodes for some test cases!

Now a slightly longer tax report failed like this. I have not yet found the time to debug it in detail, but perhaps you can see the issue right away. I have the feeling that the library should not fail like this (I would have expected an OkapiException if my input is not valid).

java.lang.ArrayIndexOutOfBoundsException: Index 2700 out of bounds for length 2700
    at uk.org.okapibarcode.backend.Pdf417.processBytes(Pdf417.java:1670)
    at uk.org.okapibarcode.backend.Pdf417.processPdf417(Pdf417.java:817)
    at uk.org.okapibarcode.backend.Pdf417.encode(Pdf417.java:778)
    at uk.org.okapibarcode.backend.Symbol.setContent(Symbol.java:547)
    at uk.org.okapibarcode.backend.Pdf417.setContent(Pdf417.java:763)

gredler commented 9 months ago

@uwolfer That does indeed look like a bug, can you create a new issue with sample data to reproduce? Thanks!

woo-j / OkapiBarcode

PDF417: encoding issue with one specific character (hex 60) #97