xem / JSciissors

A "different" JS code packer
http://xem.github.io/JSciissors

Challenge: store 4000 characters in 4000 bytes #5

Closed xem closed 10 years ago

xem commented 10 years ago

Yesterday, @p01 noticed a (huge) problem about JSciissors:

the compression reduces the number of characters but increases the final size in bytes.

Of course, this side effect was not intentional.

In fact, we already made compressors based on the number of characters, which are way more efficient than JSciissors:

https://gist.github.com/xem/7086007 => 2 chars in one
https://gist.github.com/xem/7584765 => 2.85 chars in one

But the goal of JSciissors wasn't to do that. It was only to strip the eighth bit of ASCII chars (which is always "0"), concatenate their binary codes, cut the result into chunks of 8 bits, and make an 8-bit character from each chunk.
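The scheme described above can be sketched like this (a minimal illustration, not the actual JSciissors source; pack7 and unpack7 are hypothetical names):

```javascript
// Pack 7-bit ASCII codes into 8-bit bytes: concatenate the 7 low
// bits of each character, then cut the bit string into 8-bit chunks.
function pack7(str) {
  let bits = "";
  for (const c of str) {
    bits += c.charCodeAt(0).toString(2).padStart(7, "0");
  }
  const bytes = [];
  for (let i = 0; i < bits.length; i += 8) {
    // pad the last chunk with zeros so it is a full byte
    bytes.push(parseInt(bits.slice(i, i + 8).padEnd(8, "0"), 2));
  }
  return bytes;
}

// Reverse: rebuild the bit string from the bytes, then read it back
// in 7-bit chunks, one per ASCII character.
function unpack7(bytes) {
  const bits = bytes.map(b => b.toString(2).padStart(8, "0")).join("");
  let out = "";
  for (let i = 0; i + 7 <= bits.length; i += 7) {
    out += String.fromCharCode(parseInt(bits.slice(i, i + 7), 2));
  }
  return out;
}
```

With this packing, 10 ASCII characters ("alert(123)", 70 bits) fit into 9 byte values — the gain the compressor is after, provided each byte value really is stored on one byte in the output file.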

Turns out, this failed: if we save the compressed code in a UTF-8 file, all non-ASCII characters are stored on 2 bytes (or more).

But all hope is not lost: the encoded file can be saved as ANSI, a.k.a. latin-1, a.k.a. iso-8859-1, and then each character really is stored on 8 bits.

Unfortunately, the decoding is broken, because the codes of the 8-bit characters differ between UTF-8 and ANSI.

So the challenge is to generate an encoded string which, when read as ANSI, has the right characters (and the right binary code), so that it can be correctly decoded.

Challenge accepted.

cc. @aemkei @subzey

xem commented 10 years ago

So, I kept investigating what was going on...

If you encode alert(123), the encoded string is 'ó/.Š²f¤S and the decoded string is still alert(123).

The problem is that if you save this in UTF-8, this string takes more bytes than its number of characters.

So the idea is to save it in an iso-8859-1 (latin-1) file, since both charsets share the same first 256 characters, and latin-1 stores each of them on 8 bits.

Unfortunately, when we decode 'ó/.Š²f¤S in latin-1, the decoded string is alers�123).

So close! But what happened?

In the binary code, the 5th block of 8 bits has changed during the conversion:

encoded: 11000011 10110011 00101111 00101110 10001010 00011000 10110010 01100110 10100101...
decoded: 11000011 10110011 00101111 00101110 01100000 00011000 10110010 01100110 10100101...

The big question is: why did the character 0x8A become the character 0x60 (01100000)?

...

xem commented 10 years ago

Maybe @bananascript can help with that, he had a similar issue... :)

michaelbeckius commented 10 years ago

I don't have a solution unfortunately. For high ascii characters, charCodeAt() returns the unicode value for the character, not the latin1 value. This totally messes up the bits and is why decompression fails. For example, for the character with ascii code 0x8a, charCodeAt will return the value 352.
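The mismatch is easy to reproduce: the byte 0x8A, read as ANSI (in practice windows-1252), becomes the character Š, whose Unicode code point is 352 (0x160) — and the low 8 bits of 352 are 0x60, which would explain the corrupted 01100000 block in the dump above:

```javascript
// \u0160 (Š) is what the browser hands us for the file byte 0x8A
const c = "\u0160".charCodeAt(0);
console.log(c);          // 352, not 138
console.log(c & 0xFF);   // 96, i.e. 0x60 = 01100000
```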

michaelbeckius commented 10 years ago

If anyone wants to try their luck: http://bay16.com/charcodeat/ view source for more.

xem commented 10 years ago

@bananascript, thanks for your answer! I understand that the charCodeAt function can only output unicode. Still, I was hoping to fix that, here are some ideas I have that could maybe work, but I need help to implement them. Can you please tell me what you think about them?

Solution 1: fix the encoder to handle latin-1

In the encoder, we have the binary code 0x8a to store in a character. Unfortunately, if that character is written in a string in a latin-1 file, it is not decoded as 0x8a. So the idea would be to output another character that will really be decoded as 0x8a. The big challenge is to find that other character, and to generalize this fix for the 256 characters that can possibly exist in the string.

Solution 2: fix the decoder

The decoder gets the character 0x60 instead of 0x8A. So the idea would be to find all the 0x60 chars in the encoded string and replace them with 0x8A (in binary) during the decoding (and generalize that for all the buggy characters).

Solution 3: Use directly binary instead of a string

Instead of a string, which suffers from these charset limitations, we could store the encoded JS in a binary file. The decoder would just have to read this binary file with AJAX and decode it to retrieve the original JS code. I actually made a demo of that, and it works: this 10.6kb HTML file reads its own binary code with AJAX, decodes it, and executes the decoded JS. The result is console.log("...(a 12000 chars string)..."). That's a compression ratio near 10%! Here it is: http://xem.github.io/JSciissors/test.html
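A minimal sketch of the byte-level decoding step behind this approach (an assumed shape, not the demo's actual code; bytesToJs is a hypothetical name):

```javascript
// Given raw bytes — e.g. a Uint8Array from an ArrayBuffer AJAX
// response — no charset fix is needed: each value is already a clean
// 0-255 byte, so we can unpack straight back to 7-bit ASCII.
function bytesToJs(bytes) {
  let bits = "";
  for (const b of bytes) bits += b.toString(2).padStart(8, "0");
  let js = "";
  for (let i = 0; i + 7 <= bits.length; i += 7) {
    js += String.fromCharCode(parseInt(bits.slice(i, i + 7), 2));
  }
  return js;
}
```

For instance, the two bytes [195, 136] unpack to the ASCII string "ab" — the bytes never pass through a string encoding, so nothing gets remapped.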

xem commented 10 years ago

@bananascript here's my implementation of the solution 2 for your test file:

// map Unicode code points back to latin-1 byte values
// (only 352 => 138 shown here; the full table has ~28 entries)
var fixes = {352: 138};
// `a` is the encoded string, `debug` is a textarea on the test page
for (i in a) {
    b = a.charCodeAt(i);
    if (fixes[b]) {
      b = fixes[b];
    }
    debug.value += "\r\n" + b;
}

The ideal would be to implement solution 1 (if it's possible), to keep the decoder as short as possible.

xem commented 10 years ago

Update: I made a full decoder corresponding to solution 2. It fixes all the differences between latin-1 and Unicode.

The source contains all latin1 chars from 0 to 255 and the decoder. The results are in the console:

http://xem.github.io/JSciissors/fix.html

This solution works, but it adds about 250 chars to the decoder present on the JSciissors homepage...

michaelbeckius commented 10 years ago

How about this, just a little bit shorter (32 bytes or so):

http://bay16.com/charcodeat/fix2.html

I skipped the 0x00 char. It couldn't possibly be an issue?

xem commented 10 years ago

Nice job, but please take a look at my new version:

http://xem.github.io/JSciissors/fix.html

I managed to fit the unicode-to-latin1 conversion in just 156 chars instead of 271 (see line 20).

Now THAT's code golfing :D

Feel free to do better though :)

PS: the char 0x00 is also taken care of. That's mandatory, because JSciissors can output it just as any other char.

michaelbeckius commented 10 years ago

Awesome!

xem commented 10 years ago

And 8 more bytes removed: http://xem.github.io/JSciissors/fix.html

michaelbeckius commented 10 years ago

Impressive. By the way, that issue that I had with Opera seems to be gone. Maybe it disappeared when they switched to webkit. At least one less problem to worry about. :)

xem commented 10 years ago

So... after a lot of maths, my final word is: 139 chars. Less than a tweet!

http://xem.github.io/JSciissors/fix.html

@p01 @aemkei @subzey @rlauck @bburky, I invite you to golf it even more if you can :)

As a reminder, the goal is to write a function that returns the charCode of a character in a latin-1 string.

Here are the differences with the regular charCodeAt function:

65533 => 0
8364 => 128
8218 => 130
402 => 131
8222 => 132
8230 => 133
8224 => 134
8225 => 135
710 => 136
8240 => 137
352 => 138
8249 => 139
338 => 140
381 => 142
8216 => 145
8217 => 146
8220 => 147
8221 => 148
8226 => 149
8211 => 150
8212 => 151
732 => 152
8482 => 153
353 => 154
8250 => 155
339 => 156
382 => 158
376 => 159
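A straightforward, ungolfed sketch of such a function, built from the table above (latin1CharCodeAt and U2L are illustrative names — this is not the golfed 139-char version):

```javascript
// Unicode code point -> latin-1/ANSI byte value, from the table above
const U2L = {
  65533: 0,  8364: 128, 8218: 130, 402: 131,  8222: 132, 8230: 133,
  8224: 134, 8225: 135, 710: 136,  8240: 137, 352: 138,  8249: 139,
  338: 140,  381: 142,  8216: 145, 8217: 146, 8220: 147, 8221: 148,
  8226: 149, 8211: 150, 8212: 151, 732: 152,  8482: 153, 353: 154,
  8250: 155, 339: 156,  382: 158,  376: 159
};

// charCodeAt for a string that was read from a latin-1 file:
// remap the handful of divergent code points, pass everything
// else through unchanged.
function latin1CharCodeAt(s, i) {
  const c = s.charCodeAt(i);
  return c in U2L ? U2L[c] : c;
}
```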

xem commented 10 years ago

@bananascript, was the bug you're talking about present in Opera 12? I think it's still important to support that version.

michaelbeckius commented 10 years ago

For several years, up until November 2013, a test for this ran for every visitor at http://bananascript.com, reporting the results back to the server. I checked the logs, and the last version of Opera that had this problem was version 12.00. From version 12.01 onward, there have been no more problems.

The problem was that Opera could not distinguish the characters with ascii codes 129, 141, 143, 144 and 157. Opera saw them all as the same character, with the same ascii code.

xem commented 10 years ago

Oh, cool! If everything works on Opera 12.X (X>0), then it's fine for me. :)

xem commented 10 years ago

The issue is now fixed (but another one appeared...)

http://xem.github.io/JSciissors/

If you encode "alert(1)", and save the result in a latin-1 JS file, it is executed correctly!

If you encode jquery.min.js, it's screwed up... to be continued in #6.