zxing-js / text-encoding

Polyfill for the Encoding Living Standard's API. Implemented TextEncoder and TextDecoder with TypeScript and JavaScript.

"iso-8859-1" (latin1) not "windows1252" #5

Open username1565 opened 4 years ago

username1565 commented 4 years ago

See the changes and tests for the latin-1 encoding in this commit: https://github.com/zxing-js/text-encoding/commit/c1193bd1cc3fe52f826256a6b4985869526c50b9

Here https://github.com/username1565/text-encoding/blob/dc7b6481e47e731d3ddae0fb0f4cffe876b1efa9/src/encoding/encodings.ts#L313 latin-1 (and synonyms) is switched to windows-1252.

Test:

<script src="https://unpkg.com/@zxing/text-encoding@0.8.2/umd/encoding-indexes.js"></script>
<script src="https://unpkg.com/@zxing/text-encoding@0.8.2/umd/encoding.js"></script>

<script>

var s = ''; for(var i = 0; i<256; i++){s+= String.fromCharCode(i);} console.log('s: \n'+s); //generate string with all latin-1 characters

var latin1_bytes = new TextEncoding.TextEncoder('iso-8859-1', { NONSTANDARD_allowLegacyEncoding: false }).encode(s);    //try to encode this
console.log('latin1_bytes', latin1_bytes);  //but receive 384 bytes, not 256 bytes.

var allBytes = new Uint8Array(256); for(var i = 0; i<256; i++){allBytes[i] = i;} console.log('allBytes', allBytes);     //generate all consecutive bytes

var latin1 = new TextEncoding.TextDecoder('iso-8859-1', { NONSTANDARD_allowLegacyEncoding: true }).decode(allBytes);    //try to decode this as latin-1 string
console.log('latin1: ', latin1, '(latin1 === s)', (latin1 === s));  //show the string and compare it with previous string. ---> received windows-1252 string.

</script>
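
A possible explanation for the 384 bytes in the test above: with NONSTANDARD_allowLegacyEncoding set to false, the encoder presumably ignores the 'iso-8859-1' label and produces UTF-8. That is an assumption about the polyfill, not something verified here. The arithmetic itself can be checked with the built-in TextEncoder, which always encodes to UTF-8. The helper name allLatin1Chars is only illustrative.

```javascript
// Build a string containing the code points U+0000..U+00FF.
function allLatin1Chars() {
  let s = '';
  for (let i = 0; i < 256; i++) s += String.fromCharCode(i);
  return s;
}

// The built-in TextEncoder always produces UTF-8, whatever label one has in mind.
const utf8Bytes = new TextEncoder().encode(allLatin1Chars());

// U+0000..U+007F take 1 byte each, U+0080..U+00FF take 2 bytes each:
// 128 * 1 + 128 * 2 = 384.
console.log(utf8Bytes.length); // 384
```

So a result of 384 bytes suggests UTF-8 output rather than a one-byte-per-character encoding.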

UPD: It seems I have already fixed this in these commits:
https://github.com/username1565/text-encoding/commit/61d721dfaeb1de0cdc706b1525d079f24dc91df6
https://github.com/username1565/text-encoding/commit/d438e5326995276193cea701d1ab5b7b70e6804f
https://github.com/username1565/text-encoding/commit/f2d46a7a2e34d628fcf97e4ac89b056c732311a7
https://github.com/username1565/text-encoding/commit/82e412c5c0f1e1a5bd21e0c53f0e3374e512de29
https://github.com/username1565/text-encoding/commit/728f622b709b20d325cc0073a89c58abb4d37b6a
https://github.com/username1565/text-encoding/commit/30e36c01dc3b33a572272d60ff0f567df26ddf68
https://github.com/username1565/text-encoding/commit/dbfbc88450710fc915fa1211e1c170b8057f510f

You can see the full differences by comparing across forks: https://github.com/zxing-js/text-encoding/compare/master...username1565:master

username1565 commented 4 years ago

Also, you can enable GitHub Pages for your master branch in the settings of your repository, to see the results of the browser tests in the browser.

username1565 commented 4 years ago

Also, can you remove the TextEncoding namespace, so that instead of new TextEncoding.TextEncoder() and new TextEncoding.TextDecoder() we can call just the usual new TextEncoder() and new TextDecoder()? The polyfill should not override these when they are already defined in the browser, but override them only through this option: https://github.com/zxing-js/text-encoding/blob/master/README.md#non-standard-behavior
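
A minimal sketch of the "do not override when already defined" behavior asked for above. Everything here is hypothetical: installIfMissing, the force flag, and the stand-in classes are illustrative names, not part of the polyfill's API, and a plain object plays the role of the browser's global scope.

```javascript
// Install `implementation` as target[name] only when nothing is defined there,
// unless `force` is true. Returns true when the value was installed.
function installIfMissing(target, name, implementation, force = false) {
  if (force || typeof target[name] === 'undefined') {
    target[name] = implementation;
    return true;
  }
  return false;
}

// Hypothetical stand-ins for the polyfill classes.
class PolyfillTextEncoder {}
class PolyfillTextDecoder {}

// A plain object stands in for the global scope; in a page this would be
// `globalThis` (or `window`). Here it already has a "native" TextEncoder.
const fakeGlobal = { TextEncoder: function NativeTextEncoder() {} };

console.log(installIfMissing(fakeGlobal, 'TextEncoder', PolyfillTextEncoder)); // false: native kept
console.log(installIfMissing(fakeGlobal, 'TextDecoder', PolyfillTextDecoder)); // true: polyfill installed
```

Passing force = true would replace an existing implementation, which is one way an explicit opt-in override could look.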

odahcam commented 4 years ago

> latin-1 (and synonyms) is switched to windows-1252.

Awesome! In fact, I created this repo based on these changes, but I had to step back to the original implementation to be able to fix all the tests here. I do plan to update the Latin1 and ISO-8859-1 indexes, and I'm sure this will help. Thanks.

> Also, you can enable GitHub Pages for your master branch in the settings of your repository, to see the results of the browser tests in the browser.

I do see them when running the project locally, so I'm not very interested right now. I will create some pages in the future.

> And can you remove TextEncoding, and do not override this when this already defined in browser,

Yeah, definitely. Also, I thought I had already published a version where this was done. Here's a script that checks for the polyfill in the latest version: https://codepen.io/odahcam/pen/abvepmQ?editors=1010 Edit: fixed.


It took me a little while to answer; I happen to be a little busy right now, but I'll get back to this work soon.

username1565 commented 4 years ago

https://github.com/zxing-js/text-encoding/issues/5#issuecomment-633007528

> Also, you can enable GitHub Pages for your master branch in the settings of your repository, to see the results of the browser tests in the browser.

You can see the changes, and the comment, in this commit: https://github.com/username1565/text-encoding/commit/5eb0906fa093989db9a8ea66b47ea0efec38ca23 I compiled the TypeScript to JavaScript and uploaded the result into the ./lib folder to make it compatible with @sinonjs's repository, then enabled GitHub Pages for the master branch, and created another page, ./test/browser/libTEST.html, which uses the paths of the already compiled and uploaded ./lib/*.js files for testing. After all this, the tests are available online here, and as you can see, all tests pass.

After all this, I think we can open a Pull Request for @sinonjs, whose JS files live in the ./lib folder.

Also, I added some more tests there. You can see all the commits here: https://github.com/username1565/text-encoding/network

> It took me a little while to answer; I happen to be a little busy right now, but I'll get back to this work soon.

No problem. I think we should not rush anywhere. But we should think about the quality of the code, because we are leaving this code for posterity, for centuries, and maybe, as the standardized polyfill of a reference library, forever! One that is already included in many, many browsers. Heh-heh.

So, as I said here: https://github.com/zxing-js/text-encoding/pull/1#issuecomment-626975416 that code contains many other encodings that can be encoded, decoded and tested using TextEncoder/TextDecoder, to make this code full, complete, and reversible.

And of course, you can fix it and add to it only in your free time, at your own pace, and just for fun.


P.S.: I added your changes to my fork, fixed "CRLF", drafted a new release, then published the NPM package. Everything works fine! After this, I created a new branch, changed @username1565 to @zxing-js, and opened this Pull Request for you with minimal changes. All conflicts are resolved, and you can merge these changes after reviewing the differences there.

odahcam commented 4 years ago

FYI: I'm a little away right now; I intend to come back next semester.

odahcam commented 4 years ago

I'm trying to better understand the key differences between Windows-1252 and ISO-8859-1. I ran into this answer, which is pretty straightforward and interesting: https://stackoverflow.com/a/31800761/4367683

Also, I found this very elegant table, which compares the character differences between both encodings: https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

[image: screenshot of the character comparison table]

I'd like to leave it here for documentation purposes.

Is there something else you'd like to add?

username1565 commented 4 years ago

Both of these encodings are extended ASCII. So the first 128 characters (0x00-0x7F, i.e. 0-127) are ASCII characters in both iso-8859-1 (latin1) and Windows-1252. The second half is different, and you can see the differences in those charset tables. Also, as you can see, latin-1 is the older encoding, and some characters were not included in the first version of windows-1252.

In your picture, I see windows-1252 chars represented as char codes in iso-8859-15, Unicode, UTF-8 bytes, and NCR. But iso-8859-15 is not iso-8859-1; moreover, some characters from the iso-8859-1 charset table are not contained in windows-1252.

Also, you can compare the differences this way, with compare_latin1_and_cp1252.html:

<script src="https://unpkg.com/@username1565/text-encoding@0.8.12/umd/encoding-indexes.js"></script>
<script src="https://unpkg.com/@username1565/text-encoding@0.8.12/umd/encoding.js"></script>

<script>
// Build a string with all 256 latin-1 characters (char codes 0..255).
var s = '';
for (var i = 0; i < 256; i++) { s += String.fromCharCode(i); }
console.log('s: \n' + s);

// Encode the string as iso-8859-1; this should give the bytes 0..255.
var latin1_bytes = new TextEncoding.TextEncoder('iso-8859-1', { NONSTANDARD_allowLegacyEncoding: true }).encode(s);
console.log('latin1_bytes', latin1_bytes);

// Build all 256 consecutive byte values as a Uint8Array.
var allBytes = new Uint8Array(256);
for (var i = 0; i < 256; i++) { allBytes[i] = i; }
console.log('allBytes', allBytes);

// Decode the bytes as an iso-8859-1 string and compare it with s (expected: true).
var latin1 = new TextEncoding.TextDecoder('iso-8859-1', { NONSTANDARD_allowLegacyEncoding: true }).decode(allBytes);
console.log('latin1: ', latin1, '(latin1 === s)', (latin1 === s));

// Decode the same bytes as a windows-1252 string.
var windows1252 = new TextEncoding.TextDecoder('windows-1252', { NONSTANDARD_allowLegacyEncoding: true }).decode(allBytes);
console.log('\n\n' + 'windows1252: ', windows1252);

// Encode that string back to bytes, then decode the bytes again to check the round trip.
var bytes = new TextEncoding.TextEncoder('windows-1252', { NONSTANDARD_allowLegacyEncoding: true }).encode(windows1252);
console.log('bytes', bytes, '(windows1252 === decoded): ', (windows1252 === new TextEncoding.TextDecoder('windows-1252').decode(bytes)));

// Collect every position where the two decoded strings differ.
var diff = [];
for (var i = 0; i < allBytes.length; i++) {
    if (latin1[i] !== windows1252[i]) {
        // Record the byte value and the character each encoding assigns to it.
        diff.push({ 'i': i, 'latin-1 char': latin1[i], 'windows1252 char': windows1252[i] });
    }
}
console.log("diff: ", JSON.stringify(diff, null, 1));   // Print the differences as indented JSON.
</script>

As a result, there are 27 different characters (those from your image); here are the char codes and the chars for both encodings:

diff:  [
 {
  "i": 128,
  "latin-1 char": "\u0080",
  "windows1252 char": "€"
 },
 {
  "i": 130,
  "latin-1 char": "\u0082",
  "windows1252 char": "‚"
 },
 {
  "i": 131,
  "latin-1 char": "\u0083",
  "windows1252 char": "ƒ"
 },
 {
  "i": 132,
  "latin-1 char": "\u0084",
  "windows1252 char": "„"
 },
 {
  "i": 133,
  "latin-1 char": "\u0085",
  "windows1252 char": "…"
 },
 {
  "i": 134,
  "latin-1 char": "\u0086",
  "windows1252 char": "†"
 },
 {
  "i": 135,
  "latin-1 char": "\u0087",
  "windows1252 char": "‡"
 },
 {
  "i": 136,
  "latin-1 char": "\u0088",
  "windows1252 char": "ˆ"
 },
 {
  "i": 137,
  "latin-1 char": "\u0089",
  "windows1252 char": "‰"
 },
 {
  "i": 138,
  "latin-1 char": "\u008a",
  "windows1252 char": "Š"
 },
 {
  "i": 139,
  "latin-1 char": "\u008b",
  "windows1252 char": "‹"
 },
 {
  "i": 140,
  "latin-1 char": "\u008c",
  "windows1252 char": "Œ"
 },
 {
  "i": 142,
  "latin-1 char": "\u008e",
  "windows1252 char": "Ž"
 },
 {
  "i": 145,
  "latin-1 char": "\u0091",
  "windows1252 char": "‘"
 },
 {
  "i": 146,
  "latin-1 char": "\u0092",
  "windows1252 char": "’"
 },
 {
  "i": 147,
  "latin-1 char": "\u0093",
  "windows1252 char": "“"
 },
 {
  "i": 148,
  "latin-1 char": "\u0094",
  "windows1252 char": "”"
 },
 {
  "i": 149,
  "latin-1 char": "\u0095",
  "windows1252 char": "•"
 },
 {
  "i": 150,
  "latin-1 char": "\u0096",
  "windows1252 char": "–"
 },
 {
  "i": 151,
  "latin-1 char": "\u0097",
  "windows1252 char": "—"
 },
 {
  "i": 152,
  "latin-1 char": "\u0098",
  "windows1252 char": "˜"
 },
 {
  "i": 153,
  "latin-1 char": "\u0099",
  "windows1252 char": "™"
 },
 {
  "i": 154,
  "latin-1 char": "\u009a",
  "windows1252 char": "š"
 },
 {
  "i": 155,
  "latin-1 char": "\u009b",
  "windows1252 char": "›"
 },
 {
  "i": 156,
  "latin-1 char": "\u009c",
  "windows1252 char": "œ"
 },
 {
  "i": 158,
  "latin-1 char": "\u009e",
  "windows1252 char": "ž"
 },
 {
  "i": 159,
  "latin-1 char": "\u009f",
  "windows1252 char": "Ÿ"
 }
]
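
The count of 27 can be cross-checked without the library. The sketch below hard-codes the code points that Windows-1252 assigns to bytes 0x80-0x9F (as listed in the WHATWG index, where the five otherwise unassigned bytes map to the control characters with the same value) and compares them with latin-1, where every byte maps to the code point of the same value. The names CP1252_HIGH, cp1252CodePoint and differingBytes are illustrative only.

```javascript
// Code points for Windows-1252 bytes 0x80..0x9F, in byte order.
const CP1252_HIGH = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

// Windows-1252: identical to latin-1 outside 0x80..0x9F.
function cp1252CodePoint(byte) {
  return byte >= 0x80 && byte <= 0x9F ? CP1252_HIGH[byte - 0x80] : byte;
}

// Collect every byte value whose code point differs between the two encodings
// (in latin-1 the code point is simply the byte value itself).
const differingBytes = [];
for (let b = 0; b < 256; b++) {
  if (cp1252CodePoint(b) !== b) differingBytes.push(b);
}

console.log(differingBytes.length); // 27
console.log(differingBytes[0]);     // 128
```

The five bytes in 0x80-0x9F that do not appear in the diff (129, 141, 143, 144, 157) are the ones that map to the same control characters in both tables.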

Also, as you can see, some chars of the iso-8859-1 encoding are not copyable: they are non-printable control characters, so they cannot be shown as glyphs in this diff. All the windows-1252 chars in this diff are copyable. But windows-1252 contains non-copyable chars too:

(null-byte skipped) 

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

Anyway, both encodings are reversible: (latin1 === s) and (windows1252 === decoded) return true after encoding and decoding between strings and bytes.

Also, as you can see on your image, some windows-1252 chars have Unicode code points that need two bytes, while every iso-8859-1 char has a code point that fits in 1 byte. That is because Unicode is ASCII-compatible and, moreover, iso-8859-1-compatible: the first 128 code points (00-7F) are the ASCII chars, and the first 256 code points (00-FF) are the latin-1 chars.

So it is better to use iso-8859-1 instead of windows-1252 to encode bytes as a string and to decode the string back into a byte array. In this case n bytes convert to n symbols and back, because each byte value from 0 to 255 converts to exactly 1 char whose char code equals the byte value, and that char converts back to the same byte. The encoded string then has the same length as the byte array, and no character gets a char code above 255, unlike some windows-1252 chars.

This makes it possible to work with byte arrays as strings without the encoded strings exceeding the byte length, and to pass these byte arrays as strings to methods and functions that accept only strings as arguments and return only strings. Inside those methods and functions, the strings can be converted back into byte arrays, the bytes processed, and the result encoded into a string and returned, again without exceeding the byte length. In this case there is no need to write other methods or add optional arguments to accept bytes directly, and no need to work with binary data. Input and output of iso-8859-1-encoded strings, as text, is enough.
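
A minimal sketch of this idea, written without the polyfill: plain String.fromCharCode and charCodeAt stand in for the iso-8859-1 decoder and encoder, since for latin-1 the char code equals the byte value. The function names (bytesToLatin1String, latin1StringToBytes, xorString) are hypothetical, and the XOR step is only a placeholder for "process these bytes".

```javascript
// Bytes -> string: each byte becomes the character with the same char code.
function bytesToLatin1String(bytes) {
  let s = '';
  for (let i = 0; i < bytes.length; i++) s += String.fromCharCode(bytes[i]);
  return s;
}

// String -> bytes: each char code must fit in one byte.
function latin1StringToBytes(s) {
  const bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    if (code > 0xFF) throw new RangeError('Not a latin-1 character at index ' + i);
    bytes[i] = code;
  }
  return bytes;
}

// A string-in, string-out function that works on bytes internally:
// it XORs every byte with a one-byte key and returns a string of equal length.
function xorString(input, key) {
  const bytes = latin1StringToBytes(input);
  for (let i = 0; i < bytes.length; i++) bytes[i] ^= key;
  return bytesToLatin1String(bytes);
}

// Round trip over every possible byte value.
const allBytes = new Uint8Array(256);
for (let i = 0; i < 256; i++) allBytes[i] = i;

const asString = bytesToLatin1String(allBytes);
const roundTripped = latin1StringToBytes(asString);

console.log(asString.length === allBytes.length);               // true
console.log(roundTripped.every((b, i) => b === allBytes[i]));   // true
console.log(xorString(xorString('abc', 0x5A), 0x5A) === 'abc'); // true
```

A character such as "€" (char code 0x20AC) makes latin1StringToBytes throw, which is the case the windows-1252 mapping would run into.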