yurelle / Base45Encoder

Standalone Java implementation of the RFC-9285 Base45 Standard.
The Unlicense

Encoded string is not compatible with other base45 decoders #2

Closed hvico closed 1 year ago

hvico commented 1 year ago

The produced string doesn't seem to be compatible with other base45 decoders (for instance the Python base45 lib, or the online base45 decoders). Looking at Python's base45 implementation, the library rejects input where len(input_buffer) % 3 == 1, and some encodings produced by this Java encoder have exactly that length. Beyond that, the output doesn't seem to be compatible at all. Is this really base45, or some custom near-base45 implementation? Thanks
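For readers wondering where that length rule comes from: RFC 9285 turns every 2 input bytes into 3 characters and a single trailing byte into 2 characters, so a valid encoded string never has a length with remainder 1 modulo 3. The following is a minimal sketch of a decoder with that check, not the code of this repository or of the Python library; the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of an RFC 9285 decoder; names are illustrative only.
final class Base45LengthCheckSketch {
    // Alphabet from RFC 9285: a character's index in this string is its value.
    static final String ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:";

    // 2 bytes -> 3 chars, 1 trailing byte -> 2 chars, so length % 3 is 0 or 2.
    static boolean hasValidLength(String encoded) {
        return encoded.length() % 3 != 1;
    }

    static byte[] decode(String encoded) {
        if (!hasValidLength(encoded)) {
            throw new IllegalArgumentException("length % 3 == 1 is never valid Base45");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i += 3) {
            int chars = Math.min(3, encoded.length() - i); // 3, or 2 for the tail
            int value = 0;
            int weight = 1;
            // Characters are stored least significant first: c + d*45 + e*45*45.
            for (int j = 0; j < chars; j++) {
                int digit = ALPHABET.indexOf(encoded.charAt(i + j));
                if (digit < 0) {
                    throw new IllegalArgumentException("character outside the alphabet");
                }
                value += digit * weight;
                weight *= 45;
            }
            if (chars == 3) {
                if (value > 0xFFFF) {
                    throw new IllegalArgumentException("triplet exceeds 2 bytes");
                }
                out.write(value >> 8);   // first source byte
                out.write(value & 0xFF); // second source byte
            } else {
                if (value > 0xFF) {
                    throw new IllegalArgumentException("pair exceeds 1 byte");
                }
                out.write(value);
            }
        }
        return out.toByteArray();
    }
}
```

With this sketch, `decode("BB8")` yields the bytes of `AB`, which is one of the examples given in the RFC.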

yurelle commented 1 year ago

I was not aware that other Base45 encoders were a thing; so, I don't know what their encoding standard is. This encoder was written specifically to cater to an internal Base45 implementation that I stumbled upon inside the ZXING QR Code library source code, here: https://github.com/zxing/zxing/blob/master/core/src/main/java/com/google/zxing/qrcode/encoder/Encoder.java

On line 43, the ALPHANUMERIC_TABLE int array defines the lookup table for conversion between binary & letters. The comment claims that it was copied from "table 5" of some official standard document called "JISX0510:2004" on page 19.

On line 257, in the "chooseMode(...)" function, it detects whether the string content passed into the QR Code fits within the bounds of the ALPHANUMERIC_TABLE. If so, the function "appendAlphanumericBytes(...)" on line 576 compresses each pair of these alphanumeric digits into 11 bits.
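To make the 11-bit packing concrete, here is a rough sketch of how QR alphanumeric mode combines characters: each pair becomes `first * 45 + second` in 11 bits, and an odd trailing character is written in 6 bits. This is not ZXing's code; the class name, the `TABLE` string (a stand-in for the ALPHANUMERIC_TABLE lookup), and the helper names are assumptions for illustration.

```java
// Hypothetical sketch of QR alphanumeric-mode packing; not taken from ZXing.
final class QrAlphanumericPackingSketch {
    // The 45 alphanumeric-mode characters; index in the string = character value.
    static final String TABLE = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:";

    // Two characters fit in 11 bits because 44 * 45 + 44 = 2024 < 2048.
    static int packPair(char first, char second) {
        int a = TABLE.indexOf(first);
        int b = TABLE.indexOf(second);
        if (a < 0 || b < 0) {
            throw new IllegalArgumentException("not an alphanumeric-mode character");
        }
        return a * 45 + b;
    }

    // Returns the packed data bits as a string of '0' and '1' for inspection.
    static String toBits(String text) {
        StringBuilder bits = new StringBuilder();
        int i = 0;
        for (; i + 1 < text.length(); i += 2) {
            appendBits(bits, packPair(text.charAt(i), text.charAt(i + 1)), 11);
        }
        if (i < text.length()) {
            int value = TABLE.indexOf(text.charAt(i));
            if (value < 0) {
                throw new IllegalArgumentException("not an alphanumeric-mode character");
            }
            appendBits(bits, value, 6); // odd trailing character uses 6 bits
        }
        return bits.toString();
    }

    private static void appendBits(StringBuilder sb, int value, int width) {
        for (int bit = width - 1; bit >= 0; bit--) {
            sb.append((value >> bit) & 1);
        }
    }
}
```

For example, `toBits("AC-42")` packs `AC` as 462, `-4` as 1849, and the trailing `2` on its own, giving 11 + 11 + 6 = 28 bits.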

Again, my library was designed specifically to convert binary data into something that this internal ZXING code would recognize and accept, and that would be consistent with itself, encoding & decoding. I called it Base45, just because it was 45 possible values. I had never seen a base45 implementation before, so I didn't know I was clobbering the SEO. Maybe I should have called it something more specific.

When you say that my encoded string doesn't match the python/etc. implementation, do you mean that mine uses different characters than they do, to represent the 45 possible values? Or that we're both using the same letters/symbols, but in a different order (i.e. when I encode some value, like 34, they decode it as 16 or something)? Or am I getting the byte order wrong?

Actually, now that I think about it, it might be the byte order. I didn't think about the byte order when I was writing this, since I was doing both the encoding & decoding, so I could just be consistent, and I didn't know that other implementations existed, so I wasn't trying to be compatible with anything. But, I'm pushing the bytes into the long buffer sort of like a First-In-Last-Out stack. So, I guess it would jumble the bytes, if another implementation is expecting the bytes to be in the same order. It works if you use my code to do both encoding & decoding, but I guess you're saying that if you use my code to encode something, python decodes it into gibberish?
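As an illustration of why byte order matters here, the sketch below packs the same chunk into a long in two ways: keeping source order, and reversing it. It is only a guess at the general effect described above, not the library's actual buffer code; both method names are hypothetical.

```java
// Hypothetical illustration of byte order in a long buffer; not the library's code.
final class ByteOrderSketch {
    // Keeps source order: the first byte ends up in the most significant position.
    static long packInOrder(byte[] chunk) {
        long buffer = 0;
        for (byte b : chunk) {
            buffer = (buffer << 8) | (b & 0xFF);
        }
        return buffer;
    }

    // Reverses the chunk: each later byte is placed above the ones already stored.
    static long packReversed(byte[] chunk) {
        long buffer = 0;
        for (int i = 0; i < chunk.length; i++) {
            buffer |= (long) (chunk[i] & 0xFF) << (8 * i);
        }
        return buffer;
    }

    public static void main(String[] args) {
        byte[] chunk = {0x41, 0x42}; // the ASCII bytes of "AB"
        System.out.println(packInOrder(chunk));  // 16706 (0x4142)
        System.out.println(packReversed(chunk)); // 16961 (0x4241)
    }
}
```

The two numbers differ, so the base-45 digits derived from them differ too. An encoder and decoder that agree with each other round-trip fine, but a decoder that assumes the other order produces different bytes.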

I'm also just settling up the leftover slack at the end of each long buffer into its own byte, rather than trying to keep the data flowing continuously. If other implementations do something more elegant, then I would imagine that that would also cause problems.

hvico commented 1 year ago

Hi. Thanks for your response. What I mean is that this encoding doesn't match what is defined as base45 in the standard. There has been some recent development around the term base45 and the encoding it refers to, because the COVID vaccination certificates used in the EU encoded their data using that standard.

But your response is clear: your implementation is not base45 as defined by RFC 9285 (https://datatracker.ietf.org/doc/html/rfc9285), but rather a custom 45-character mapping that is optimized for a particular QR generation lib. Really useful BTW; I just need to reference this library instead of the standard so that others can decode it.

Some refs:
https://dencode.com/string/base45
https://www.dcode.fr/base45-encoding
https://pypi.org/project/base45/
https://www.researchgate.net/publication/367218172_Improving_data_embedding_capacity_into_Base45_encoded_strings

Thanks again for the response and for sharing your code!

yurelle commented 1 year ago

Interesting. So, I looked into that standard document you linked, and apparently my code actually IS compatible with the standard, except that I was processing the bytes in FILO stack order (i.e. reverse order within each chunk), and I was processing the data in 7-byte chunks. The standard maintains source byte order, and only processes in 2-byte chunks.

After making those 2 changes to my code, my output matches the examples in the standard.
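For reference, the scheme in RFC 9285 can be sketched in a few lines: read 2 bytes in source order as one 16-bit number, write it as 3 base-45 digits with the least significant digit first, and encode a single leftover byte as 2 digits. This is a minimal sketch written from the RFC, not the code shipped in this repository; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of RFC 9285 encoding; not the repository's implementation.
final class Rfc9285EncodeSketch {
    // Alphabet from RFC 9285: a value's character is at that index in the string.
    static final String ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:";

    static String encode(byte[] data) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        for (; i + 1 < data.length; i += 2) {
            // Source order is kept: the first byte is the high byte.
            int n = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            out.append(ALPHABET.charAt(n % 45));        // c
            out.append(ALPHABET.charAt((n / 45) % 45)); // d
            out.append(ALPHABET.charAt(n / 2025));      // e, where 2025 = 45 * 45
        }
        if (i < data.length) {
            // A single trailing byte becomes two characters.
            int n = data[i] & 0xFF;
            out.append(ALPHABET.charAt(n % 45));
            out.append(ALPHABET.charAt(n / 45));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Both expected outputs are examples from RFC 9285.
        System.out.println(encode("AB".getBytes(StandardCharsets.US_ASCII)));      // BB8
        System.out.println(encode("Hello!!".getBytes(StandardCharsets.US_ASCII))); // %69 VD92EX0
    }
}
```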

However, only processing in 2-byte chunks makes the algorithm much simpler. I had actually written my original code with the intention of making it base agnostic, and able to convert to & from any arbitrary base, but then abandoned that mid-way through. So, it actually turned out to be more complicated than it needed to be.

So, instead of just modifying my existing functions to match the standard, I decided to rewrite them from scratch to make them cleaner. The code's a lot simpler now, and apparently it also increased the storage efficiency. My old algorithm (with 7-byte chunks) had a storage overhead of 8% compared to raw binary, but the new code only has an overhead of 3%. I guess those open standards guys really know what they're doing. Or maybe my efficiency benchmark code is wrong. lol IDK.
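One back-of-the-envelope reading that reproduces those two figures: count how many base-45 characters a chunk needs, then charge each character 5.5 bits, which is what QR alphanumeric mode spends (11 bits per pair). The benchmark itself is not shown in this thread, so the 5.5-bit cost and everything named below are assumptions, and the old code's handling of leftover slack is ignored.

```java
import java.math.BigInteger;

// Hypothetical estimate only; the thread does not show the actual benchmark.
final class Base45OverheadSketch {
    // ASSUMPTION: cost per character in QR alphanumeric mode (11 bits per pair).
    static final double BITS_PER_CHAR = 11.0 / 2.0;

    // Smallest number of base-45 digits that can represent every chunk value.
    static int charsForChunk(int chunkBytes) {
        BigInteger values = BigInteger.valueOf(256).pow(chunkBytes);
        BigInteger capacity = BigInteger.ONE;
        int chars = 0;
        while (capacity.compareTo(values) < 0) {
            capacity = capacity.multiply(BigInteger.valueOf(45));
            chars++;
        }
        return chars;
    }

    // Extra bits spent relative to the raw chunk, as a fraction.
    static double overhead(int chunkBytes) {
        double encodedBits = charsForChunk(chunkBytes) * BITS_PER_CHAR;
        return encodedBits / (8.0 * chunkBytes) - 1.0;
    }

    public static void main(String[] args) {
        System.out.printf("7-byte chunks: %d chars, overhead %.1f%%%n",
                charsForChunk(7), 100 * overhead(7));
        System.out.printf("2-byte chunks: %d chars, overhead %.1f%%%n",
                charsForChunk(2), 100 * overhead(2));
    }
}
```

Under these assumptions a 7-byte chunk needs 11 characters (60.5 bits for 56 bits of data, about 8% extra), while a 2-byte chunk needs 3 characters (16.5 bits for 16 bits, about 3% extra).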

I pushed a new release; v2.0.0, since this is a breaking change. Version 2 should be compatible with the Base45 Standard. Let me know if you have any issues.

yurelle commented 1 year ago

By the way. Sorry if I spammed you with repo notifications; not sure how quickly git sends them out. I kept screwing up creating a new release, and had to do it & delete it & redo it, like 10 times. I forgot that github doesn't delete the release tag when you delete the release. oops.

hvico commented 1 year ago

Wow, thanks a lot! I'll try the new release ASAP and let you know!

hvico commented 1 year ago

Hello. I've tested the new release and it worked like a charm. I was able to decode the produced string using this site: https://www.dcode.fr/base45-encoding. I downloaded the file and recovered the data payload. Great job, thanks!

yurelle commented 1 year ago

Thanks. Glad I could help.