zxcvbn-ts / zxcvbn

Low-Budget Password Strength Estimation
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/wheeler
MIT License

Proposal: Reduce the dictionary size footprint #254

Open lifthrasiir opened 6 months ago

lifthrasiir commented 6 months ago

To my knowledge, the size of the word lists remains a major pain point for the original zxcvbn (e.g. dropbox/zxcvbn#196, dropbox/zxcvbn#268, dropbox/zxcvbn#285, dropbox/zxcvbn#338). Even though Brotli Content-Encoding allows for more compression than gzip, the footprint is still significant enough to be worth fixing in the library itself as well.

I have written two concrete proposals that can significantly reduce the size footprint of the built-in dictionary. I believe they are easy to review and apply without any specialized knowledge about compression. Both also present some decisions for maintainers to make, so I tried to give enough information to aid that process.

Rank quantization

For example, packages/languages/en/src/commonWords.json starts with the following:

["you","the","to","'s","it","and","that","'t","of","is","in","what","we","me","this",...

...and ends with the following:

...,"battalions","motorcade","kelley","beaks","tomás","t-minus","sleek","vividly","koch"]

This list of 26,758 words is automatically generated from the frequency list of the top 50K words. The cutoff is currently set to at least 500 occurrences, so the final entry koch should have occurred exactly 500 times. But there are many other words with the exact same number of occurrences; in fact, that is the case for all the words shown above. Data compression algorithms don't and can't realize this, and so try to reproduce the exact order; you need to preprocess the data into a form the algorithms can exploit more easily.

Furthermore, while the public API does return the exact rank as a part of Feedback, it is probably meaningless for the aforementioned reason. We can, for example, imagine that the ranks of the first 100 words are exact, but the next 100 words all share the same rank (displayed as the "101st–200th most common word"), and so on. This kind of rank quantization allows arbitrary reordering within the same rank and can greatly improve compression without any additional code.
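To make the idea concrete, here is a minimal sketch of how such a quantized rank could be computed for display (the helper name and bucket size are illustrative, not part of the library):

```js
// Hypothetical helper: exact ranks up to `bucket`, then buckets of `bucket`
// (e.g. ranks 101-200 all display as "101st-200th most common word").
const quantizeRank = (rank, bucket = 100) => {
  if (rank <= bucket) return { lo: rank, hi: rank }
  const lo = Math.floor((rank - 1) / bucket) * bucket + 1
  return { lo, hi: lo + bucket - 1 }
}

quantizeRank(42)  // { lo: 42, hi: 42 }
quantizeRank(137) // { lo: 101, hi: 200 }
```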

Benchmarks

I've experimented with several quantization schemes. I assumed that the relative rank of the first 100 words is significant enough, and never altered those. Beyond that point, each tested scheme quantizes the ranks at a different granularity, from fixed groups of 100 up to keeping zero significant digits of the rank; the scheme names appear in the tables below.

Words with the same quantized rank are simply sorted in Unicode code point order, so that similar words end up next to each other. Other orderings are possible, but this is good enough for our purpose and also helps the next proposal.

I also wanted to make sure that this optimization holds up across different compression algorithms and parameters, so I measured gzip (levels 1, 6 and 9), ECT (levels 3 and 9) and Brotli (quality 11), as shown in the columns below:

Results for @zxcvbn-ts/language-en:

| Quantization | Uncomp. | Gzip -1 | Gzip -6 | Gzip -9 | ECT -3 | ECT -9 | Brotli -11 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 1,216,551 | 617,735 | 566,285 | 565,783 | 519,082 | 513,281 | 488,580 |
| Groups of 100 | 0.0% | -0.2% | -0.1% | -0.1% | -0.6% | -0.3% | -1.0% |
| 2 sig digits | 0.0% | -1.8% | -1.2% | -1.1% | -2.9% | -2.9% | -2.8% |
| 1 sig digit | 0.0% | -8.0% | -8.8% | -8.8% | -12.7% | -12.5% | -11.0% |
| ½ sig digits | 0.0% | -11.9% | -13.2% | -13.1% | -18.8% | -18.5% | -17.2% |
| 0 sig digits | 0.0% | -16.9% | -19.8% | -19.7% | -26.5% | -26.4% | -24.5% |

(The baseline row is in bytes; the other rows are changes relative to it.)

Combined results for all @zxcvbn-ts/language-* packages:

| Quantization | Uncomp. | Gzip -1 | Gzip -6 | Gzip -9 | ECT -3 | ECT -9 | Brotli -11 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 7,739,269 | 3,945,996 | 3,503,060 | 3,496,903 | 3,325,270 | 3,291,158 | 2,993,904 |
| Groups of 100 | 0.0% | -0.6% | -0.4% | -0.4% | -0.7% | -0.7% | -1.5% |
| 2 sig digits | 0.0% | -2.9% | -1.7% | -1.7% | -4.2% | -4.1% | -3.9% |
| 1 sig digit | 0.0% | -10.3% | -10.1% | -10.1% | -14.7% | -14.5% | -13.0% |
| ½ sig digits | 0.0% | -15.2% | -14.4% | -14.3% | -20.9% | -20.6% | -19.5% |
| 0 sig digits | 0.0% | -19.8% | -20.6% | -20.4% | -26.4% | -27.3% | -25.9% |

The uncompressed size remains identical because the only change here is reordering. Still, a reasonable rank quantization can reduce the compressed size by 10%, and that reduction is more or less consistent across all tested algorithms and parameters. (There is a reason for this consistency, but it's way too technical to explain here.)

Delta coding

It should be noted that packages/languages/en/src/firstnames.json, among others, is already sorted:

["aaren","aarika","abagael","abagail","abbe","abbey","abbi","abbie","abby","abbye","abigael",...

...and many consecutive words share the same prefix (for example, six words start with abb). We can exploit this with simple delta coding like this:

["aaren","___ika","_bagael","_____il","__be","____y","___i","____e","___y","____e","__igael",...

Here the number of leading underscores represents the number of initial characters to retain from the previous word. And it's not necessary to repeat each underscore; we can use one reserved character for a single underscore, another for two consecutive underscores, and so on. Since uppercase letters are removed during list generation, let's use them in place of the underscore runs (A for one, B for two, and so on):

["aaren","Cika","Abagael","Eil","Bbe","Dy","Ci","De","Cy","De","Bigael",...

At this point we don't even need the commas, because those reserved characters can only occur at the beginning of each word. So we can simplify the scheme further by instead replacing a lone , with A, ,_ with B, ,__ with C and so on; each letter now absorbs the comma, so A means "retain nothing", B "retain one character", etc.:

"AaarenEikaCbagaelGilDbeFyEiFeEyFeDigael...

While compression algorithms can pick up this kind of repetition by themselves, we already know the list is sorted, so we can exploit that fact much more efficiently in this case.

Benchmarks

Results for @zxcvbn-ts/language-en:

| Scheme (vs. baseline) | Uncomp. | Gzip -1 | Gzip -6 | Gzip -9 | ECT -3 | ECT -9 | Brotli -11 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Delta coding | -20.1% | -12.6% | -12.8% | -12.8% | -8.4% | -7.9% | -9.0% |
| + Groups of 100 | -25.7% | -17.1% | -16.4% | -16.4% | -11.7% | -11.5% | -11.6% |
| + 2 sig digits | -29.6% | -21.3% | -20.4% | -20.3% | -20.3% | -15.3% | -11.5% |
| + 1 sig digit | -39.8% | -32.1% | -31.0% | -30.9% | -30.8% | -26.5% | -23.1% |
| + ½ sig digits | -45.0% | -38.0% | -37.0% | -37.0% | -33.4% | -33.2% | -33.0% |
| + 0 sig digits | -51.3% | -45.4% | -44.6% | -44.5% | -41.1% | -41.3% | -41.2% |

Combined results for all @zxcvbn-ts/language-* packages:

| Scheme (vs. baseline) | Uncomp. | Gzip -1 | Gzip -6 | Gzip -9 | ECT -3 | ECT -9 | Brotli -11 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Delta coding | -7.1% | -3.2% | -3.0% | -2.9% | -1.6% | -1.5% | -1.3% |
| + Groups of 100 | -18.3% | -11.9% | -9.9% | -9.8% | -8.1% | -8.1% | -6.2% |
| + 2 sig digits | -25.0% | -18.5% | -16.2% | -16.1% | -20.6% | -13.9% | -6.1% |
| + 1 sig digit | -39.8% | -33.5% | -30.8% | -30.7% | -34.5% | -28.7% | -22.3% |
| + ½ sig digits | -45.5% | -39.6% | -37.2% | -37.1% | -35.8% | -35.9% | -34.2% |
| + 0 sig digits | -51.6% | -46.6% | -44.6% | -44.6% | -43.5% | -43.5% | -42.3% |

In general, this delta coding scheme removes another 1–20% of the compressed size on top of rank quantization. The best case is a 50% total reduction!

Caveats

It should be noted that such a transformation has to be inverted every time the script is loaded. The following is the actual delta coding code I've used; you can see that it returns an immediately invoked function expression that does the job.

```js
const prefixCompress = (parsed) => {
  if (!Array.isArray(parsed)) {
    return
  }
  if (!parsed.every((e) => typeof e === 'string' && isCompactDoubleQuotedString(e))) {
    // Should be rare enough, so don't bother escaping them.
    return
  }

  const deltas = []
  let last = ''
  for (const e of parsed) {
    // Find the length of the common prefix with the previous word, capped at
    // 25 so that the marker character stays within A-Z ('A' + 25 === 'Z').
    let prefixLen = 0
    const maxPrefixLen = Math.min(e.length, last.length, 25)
    while (prefixLen < maxPrefixLen && e.charAt(prefixLen) === last.charAt(prefixLen)) {
      ++prefixLen
    }
    // 65 is the char code of 'A': 'A' means "retain 0 chars", 'B' "retain 1", etc.
    deltas.push(String.fromCharCode(65 + prefixLen) + e.slice(prefixLen))
    last = e
  }

  return `((compressed) => {
  const parts = compressed.split(/([A-Z])/g)
  const recons = []
  let last = '', i
  for (i = 1; i < parts.length; i += 2) {
    last = last.slice(0, parts[i].charCodeAt(0) - 65) + parts[i + 1]
    recons.push(last)
  }
  return recons
})("${deltas.join('')}")`
}

const isCompactDoubleQuotedString = (s) => {
  // True if the string can be embedded verbatim in a double-quoted literal.
  return !s.match(/[\x00-\x1f\u2028\u2029\\"]/)
}
```
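For context, a hypothetical build step applying prefixCompress could look like this (the paths and module wrapper are illustrative; this is not the actual generator):

```js
import { readFileSync, writeFileSync } from 'fs'

// Turn a sorted JSON word list into a self-decompressing JS module.
const parsed = JSON.parse(readFileSync('src/firstnames.json', 'utf8'))
const compressed = prefixCompress(parsed)
if (compressed) {
  writeFileSync('src/firstnames.js', `export default ${compressed}\n`)
}
```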

In my testing the inverse transformation is not very slow, but not extremely quick either. For @zxcvbn-ts/language-en, both browsers I've tested (Firefox and Chrome) took 5–10 ms to undo the transformation, even with some additional optimization to avoid memory allocation.

Amazingly enough, Chrome's V8 does seem to handle String.split specially, and it was 10,000x faster than any other tested code, while String.split was only ~3x faster in Firefox. I haven't tested a direct character-wise iteration, but I doubt it would be as fast as String.split in V8, so that's one thing to take into account.
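For anyone who wants to reproduce this, a rough micro-benchmark sketch (assuming `compressed` holds a delta-coded string as produced above; the character-wise variant here is only illustrative, not the code I benchmarked):

```js
const timeIt = (label, fn, runs = 100) => {
  const start = performance.now()
  for (let i = 0; i < runs; i++) fn()
  console.log(label, ((performance.now() - start) / runs).toFixed(3), 'ms/run')
}

// The String.split-based decoding step...
timeIt('String.split', () => compressed.split(/([A-Z])/g))

// ...versus a plain character-wise scan for the uppercase markers.
timeIt('char-wise scan', () => {
  const parts = []
  let start = 0
  for (let i = 1; i < compressed.length; i++) {
    const c = compressed.charCodeAt(i)
    if (c >= 65 && c <= 90) {
      parts.push(compressed.slice(start, i))
      start = i
    }
  }
  parts.push(compressed.slice(start))
  return parts
})
```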

Raw data for benchmarks

No delta coding:

```
Baseline:
orig gz1 gz6 gz9 ect3 ect9 br11
ar 6,862 3,172 2,978 2,978 2,908 2,712 2,496
common 465,160 253,525 232,312 232,228 220,834 220,155 204,319
cs 599,591 301,578 269,458 269,247 254,141 251,068 210,749
de 816,936 420,148 366,005 365,077 350,751 349,062 308,770
en 1,216,551 617,735 566,285 565,783 519,082 513,281 488,580
es-es 745,200 385,911 340,519 339,767 328,460 324,498 287,752
fi 475,403 234,329 202,773 202,091 193,452 191,702 178,976
fr 614,101 309,871 271,975 271,536 257,617 253,922 233,795
id 175,088 90,338 81,153 80,870 76,898 76,160 70,747
it 641,884 320,221 281,083 280,840 268,337 264,190 239,377
ja 215,448 105,875 94,967 94,710 89,440 88,576 85,722
nl-be 426,557 221,474 196,483 195,992 188,467 186,850 174,940
pl 839,810 424,912 371,342 370,328 357,855 354,141 316,316
pt-br 500,678 256,907 225,727 225,456 217,028 214,841 191,365
(total) 7,739,269 3,945,996 3,503,060 3,496,903 3,325,270 3,291,158 2,993,904

Groups of 100:
orig gz1 gz6 gz9 ect3 ect9 br11
ar 6,862 3,166 2,974 2,974 2,900 2,710 2,492
common 465,160 251,590 231,024 230,942 219,245 217,998 200,476
cs 599,591 299,168 269,894 269,505 255,800 252,421 216,448
de 816,936 418,006 364,545 363,699 348,722 346,532 306,308
en 1,216,551 616,410 565,279 564,799 515,573 511,732 483,680
es-es 745,200 382,802 338,713 338,043 325,261 321,401 280,483
fi 475,403 232,636 201,296 200,551 191,040 189,392 174,582
fr 614,101 307,248 270,857 270,430 254,427 252,001 230,718
id 175,088 89,101 80,549 80,288 75,736 75,030 68,763
it 641,884 317,731 279,667 279,425 264,140 261,822 233,248
ja 215,448 104,799 94,625 94,384 88,449 87,589 83,864
nl-be 426,557 219,271 194,869 194,470 187,976 184,267 171,667
pl 839,810 421,867 369,324 368,332 354,910 350,759 307,957
pt-br 500,678 254,855 224,509 224,250 214,618 212,695 185,902
(total) 7,739,269 3,918,650 3,488,125 3,482,092 3,298,797 3,266,349 2,946,588

2 significant digits:
orig gz1 gz6 gz9 ect3 ect9 br11
ar 6,862 3,169 2,973 2,973 2,916 2,710 2,516
common 465,160 243,823 225,843 225,777 207,165 205,044 191,448
cs 599,591 291,512 265,358 265,045 244,986 243,077 211,266
de 816,936 404,747 356,866 356,063 335,343 332,622 299,353
en 1,216,551 606,514 559,486 559,136 503,519 498,395 474,734
es-es 745,200 371,497 333,565 333,053 310,601 307,862 271,463
fi 475,403 229,256 199,767 199,020 187,205 185,512 172,704
fr 614,101 299,844 267,411 267,017 245,319 243,280 224,533
id 175,088 88,606 80,296 80,061 75,071 74,134 68,562
it 641,884 308,820 276,025 275,888 254,654 251,230 225,678
ja 215,448 101,746 93,473 93,234 85,159 84,040 81,378
nl-be 426,557 216,906 194,020 193,668 183,182 181,464 170,471
pl 839,810 412,888 365,058 364,230 343,009 339,877 299,569
pt-br 500,678 248,487 222,080 221,853 206,600 204,692 180,523
(total) 7,739,269 3,827,815 3,442,221 3,437,018 3,184,729 3,153,939 2,874,198

1 significant digit:
orig gz1 gz6 gz9 ect3 ect9 br11
ar 6,862 3,166 2,974 2,974 2,900 2,710 2,492
common 465,160 227,374 206,990 206,899 185,039 183,093 172,964
cs 599,591 263,767 234,211 233,893 210,779 208,834 186,302
de 816,936 377,582 329,738 329,035 304,288 302,713 278,473
en 1,216,551 567,736 515,950 515,651 452,781 449,091 434,440
es-es 745,200 339,659 300,275 299,749 271,897 269,173 242,292
fi 475,403 213,047 185,951 185,624 169,979 168,462 159,369
fr 614,101 276,255 243,559 243,241 218,588 216,815 205,086
id 175,088 85,358 77,808 77,594 70,999 70,281 65,289
it 641,884 284,432 251,137 250,908 224,151 222,013 203,944
ja 215,448 94,879 86,035 85,927 75,757 74,826 73,295
nl-be 426,557 205,356 183,666 183,415 169,521 168,037 160,520
pl 839,810 374,975 330,014 329,432 301,064 298,535 260,761
pt-br 500,678 224,974 197,883 197,672 178,285 176,650 159,036
(total) 7,739,269 3,538,560 3,146,191 3,142,014 2,836,028 2,811,233 2,604,263

1/2 significant digits:
orig gz1 gz6 gz9 ect3 ect9 br11
ar 6,862 3,166 2,974 2,974 2,903 2,710 2,501
common 465,160 211,645 191,940 191,909 165,063 162,380 156,620
cs 599,591 249,818 223,776 223,796 198,139 196,313 174,689
de 816,936 348,022 304,559 304,435 275,453 274,445 256,759
en 1,216,551 543,654 491,415 491,459 421,118 418,104 404,452
es-es 745,200 316,712 281,992 282,006 248,449 246,718 222,356
fi 475,403 199,913 177,060 176,997 156,336 155,294 146,301
fr 614,101 263,578 234,472 234,486 206,445 205,173 192,309
id 175,088 81,267 75,076 75,149 65,913 65,024 60,588
it 641,884 264,977 234,664 234,739 203,738 202,171 185,148
ja 215,448 92,055 86,051 86,015 73,110 72,138 70,275
nl-be 426,557 194,692 176,509 176,475 157,189 156,193 148,774
pl 839,810 357,031 319,144 319,153 283,590 282,204 235,176
pt-br 500,678 217,628 195,786 195,859 172,797 171,644 151,244
(total) 7,739,269 3,344,158 2,995,418 2,995,452 2,630,243 2,610,511 2,407,192

0 significant digits:
orig gz1 gz6 gz9 ect3 ect9 br11
ar 6,862 3,166 2,974 2,974 2,903 2,710 2,501
common 465,160 203,789 181,763 181,750 155,077 152,311 146,830
cs 599,591 234,179 204,668 204,705 178,522 176,590 157,874
de 816,936 322,684 275,616 275,568 246,098 244,849 234,118
en 1,216,551 513,335 453,962 454,070 381,236 377,510 368,680
es-es 745,200 298,468 260,406 260,532 226,392 224,656 202,361
fi 475,403 190,923 166,539 166,634 146,463 145,153 136,588
fr 614,101 251,444 220,591 220,602 192,802 191,794 181,129
id 175,088 77,785 71,029 71,038 61,701 60,530 55,979
it 641,884 253,397 221,070 221,179 221,179 188,405 173,049
ja 215,448 88,616 81,396 81,426 68,110 67,293 65,623
nl-be 426,557 186,540 166,964 166,992 148,094 146,594 140,863
pl 839,810 335,182 294,333 294,394 259,041 257,091 214,149
pt-br 500,678 203,970 179,437 179,505 156,609 155,620 137,027
(total) 7,739,269 3,163,478 2,780,748 2,781,369 2,444,227 2,391,106 2,216,771
```

With delta coding:

```
Delta coding:
orig gz1 gz6 gz9 ect3 ect9 br11
ar 5,952 2,995 2,844 2,844 2,844 2,703 2,395
common 427,961 244,548 225,573 225,483 217,631 216,629 202,722
cs 512,849 280,210 248,352 248,230 240,148 236,534 201,686
de 808,393 420,810 367,975 367,145 354,163 351,625 311,635
en 970,809 539,374 493,416 493,075 475,386 472,508 444,471
es-es 738,536 389,363 344,042 343,294 331,028 328,165 292,054
fi 462,127 232,893 203,051 202,597 194,327 192,454 180,247
fr 562,489 297,784 263,446 263,069 253,626 251,115 230,025
id 168,036 89,687 80,900 80,638 77,353 76,525 71,125
it 596,056 311,698 276,478 276,233 266,748 263,404 238,464
ja 203,599 103,849 93,402 93,109 89,780 88,051 85,346
nl-be 414,537 220,930 197,559 197,279 189,950 188,161 176,431
pl 817,934 424,504 372,420 371,583 358,786 355,708 322,545
pt-br 495,596 259,299 228,047 227,637 219,558 217,401 194,258
(total) 7,184,874 3,817,944 3,397,505 3,392,216 3,271,328 3,240,983 2,953,404

Delta coding + Groups of 100:
orig gz1 gz6 gz9 ect3 ect9 br11
ar 5,951 2,989 2,837 2,837 2,837 2,701 2,412
common 378,884 222,848 209,194 209,115 201,781 201,006 190,410
cs 461,079 256,644 233,166 233,166 227,708 223,532 200,127
de 713,140 386,208 343,947 343,545 332,838 329,836 298,950
en 902,873 511,549 472,945 472,798 457,868 453,897 431,870
es-es 632,710 347,429 315,228 315,145 305,944 302,234 275,502
fi 403,824 211,224 187,439 187,248 180,436 178,561 169,254
fr 492,253 270,056 244,804 244,599 236,360 234,107 218,788
id 140,303 78,424 72,244 72,161 69,382 68,828 65,240
it 516,372 280,249 254,771 254,735 246,357 244,029 225,286
ja 169,893 90,168 82,812 82,680 79,200 78,460 76,080
nl-be 360,453 199,783 181,979 181,905 175,889 174,255 165,529
pl 716,865 385,419 345,635 345,186 334,366 331,351 302,916
pt-br 423,393 231,215 209,117 209,056 201,898 200,290 183,343
(total) 6,317,993 3,474,205 3,156,118 3,154,176 3,052,864 3,023,087 2,805,707

Delta coding + 2 significant digits:
orig gz1 gz6 gz9 ect3 ect9 br11
ar 5,955 2,995 2,841 2,841 2,420 2,841 2,701
common 338,426 200,621 189,422 189,352 173,131 183,933 182,262
cs 415,302 232,802 213,280 213,269 186,939 205,946 204,353
de 641,897 352,267 315,595 315,253 277,877 304,325 302,651
en 855,734 486,084 450,663 450,517 413,547 434,286 432,291
es-es 567,100 314,419 286,879 286,799 254,932 277,032 275,112
fi 383,173 201,630 180,539 180,368 164,521 173,715 171,918
fr 448,506 247,724 225,763 225,598 204,630 217,808 216,154
id 138,549 77,372 71,462 71,395 64,584 68,628 68,068
it 463,062 254,057 232,231 232,178 207,856 223,683 222,541
ja 155,846 82,927 76,398 76,307 70,434 72,989 72,384
nl-be 347,330 193,080 176,312 176,256 161,401 170,355 168,738
pl 654,522 355,302 321,305 320,876 284,271 309,553 307,606
pt-br 383,912 211,434 192,464 192,406 170,865 185,586 184,205
(total) 5,799,314 3,212,714 2,935,154 2,933,415 2,637,408 2,830,680 2,810,984

Delta coding + 1 significant digits:
orig gz1 gz6 gz9 ect3 ect9 br11
ar 5,951 2,989 2,837 2,837 2,412 2,837 2,701
common 278,349 165,438 157,906 157,847 143,465 152,699 151,100
cs 320,429 182,852 169,702 169,694 150,797 164,382 163,110
de 518,285 291,627 264,803 264,558 235,696 255,815 254,384
en 732,162 419,187 390,543 390,423 358,769 376,886 375,261
es-es 439,746 248,185 229,008 228,934 203,920 221,240 219,741
fi 301,417 162,183 147,033 146,927 134,901 141,588 140,413
fr 359,584 202,790 187,098 187,002 170,907 180,521 179,309
id 117,900 66,716 62,170 62,117 56,190 59,570 59,155
it 363,387 203,342 187,989 187,941 169,044 181,921 180,498
ja 127,283 68,009 62,735 62,697 57,608 60,041 59,463
nl-be 286,597 162,999 150,597 150,551 138,592 145,621 144,205
pl 506,053 280,206 257,472 257,040 217,449 248,274 246,616
pt-br 298,716 167,513 154,134 154,069 137,090 152,447 147,831
(total) 4,655,859 2,624,036 2,424,027 2,422,637 2,176,840 2,343,842 2,323,787

Delta coding + 1/2 significant digits:
orig gz1 gz6 gz9 ect3 ect9 br11
ar 5,949 2,988 2,836 2,836 2,836 2,702 2,395
common 252,431 149,594 141,318 141,259 137,072 135,083 129,016
cs 288,685 164,360 152,036 152,018 147,651 145,843 135,056
de 459,440 260,720 237,803 237,606 230,221 228,732 213,399
en 668,994 382,902 356,459 356,319 345,330 342,618 326,910
es-es 391,075 221,192 203,774 203,700 197,195 195,446 181,332
fi 270,546 146,063 132,571 132,471 127,718 126,495 120,898
fr 329,572 186,803 172,519 172,438 166,727 165,391 156,983
id 108,650 61,639 57,650 57,588 55,227 54,810 51,846
it 322,857 181,067 167,010 166,984 161,967 160,491 149,620
ja 120,103 63,994 58,826 58,802 56,679 55,877 53,955
nl-be 261,392 149,850 138,994 138,912 135,104 133,130 127,684
pl 458,568 254,462 233,488 233,043 226,774 223,675 191,412
pt-br 277,916 156,547 143,891 143,831 141,306 137,989 127,034
(total) 4,216,178 2,382,181 2,199,175 2,197,807 2,131,807 2,108,282 1,967,540

Delta coding + 0 significant digits:
orig gz1 gz6 gz9 ect3 ect9 br11
ar 5,949 2,988 2,836 2,836 2,836 2,702 2,395
common 234,476 137,901 130,470 130,404 125,944 124,730 118,382
cs 250,061 139,348 127,060 127,045 123,310 121,864 112,799
de 386,356 220,600 202,084 201,972 195,969 194,579 182,401
en 592,195 337,236 313,632 313,493 305,343 301,280 286,992
es-es 342,790 192,403 176,264 176,194 170,197 169,016 155,128
fi 247,194 133,099 120,543 120,460 115,908 114,976 109,059
fr 297,438 168,573 155,579 155,522 150,607 149,326 141,729
id 99,359 56,475 53,048 53,018 50,948 50,467 46,941
it 292,112 162,726 149,216 149,171 144,397 143,259 133,073
ja 110,296 58,480 53,484 53,439 51,077 50,634 48,654
nl-be 239,907 138,243 128,757 128,712 124,790 123,206 117,931
pl 404,003 220,682 201,161 200,657 194,919 192,415 162,160
pt-br 242,596 135,444 124,146 124,082 121,224 118,980 107,927
(total) 3,744,732 2,104,198 1,938,280 1,937,005 1,877,469 1,857,434 1,725,571
```
MrWook commented 6 months ago

Hey @lifthrasiir,

First off, thank you very much for your thorough analysis of the issue with the dictionaries. It would greatly benefit this package to reduce the size of these dictionaries.

Since you've already conducted benchmarks, could you please share the code for Rank Quantization with a repository, a random JavaScript site, or any other resource? This would greatly enhance my understanding of the process and how to implement it into this package.

The delta coding already has a code example, which makes it quite easy to grasp. However, it's important to note that the implementation needs to support languages with special characters, such as Polish or German, which may not be covered by the current regex pattern /([A-Z])/g.

Additionally, we should not assume that the dictionaries are already sorted alphabetically, as they ideally should be sorted by occurrence. Unfortunately, not every government provides a neat open-source list of such data online, so we've sourced some of the data from other places, which is then hosted on GitHub.

You don't need to worry about the inverse transformation during startup, as the dictionaries are already built at that time, which only takes a few milliseconds. Adding one more small script wouldn't make much of a difference, and it would justify this step to some extent.

This would be the perfect opportunity to implement such changes, as I'm currently working on the next major version, and this would definitely be a breaking change if we transition from JSON files to JavaScript files containing a long string that needs to be parsed. A good place for the implementation would be directly inside the List Generator. The inversion part could be handled within setRankedDictionaries in the Options file. The only downside is that we would no longer provide simple JSON files for others to import.

I'm always happy to have more contributors on board!

lifthrasiir commented 6 months ago

> Since you've already conducted benchmarks, could you please share the code for Rank Quantization with a repository, a random JavaScript site, or any other resource? This would greatly enhance my understanding of the process and how to implement it into this package.

I didn't want to alter the generator itself for testing, so I naively did the rank quantization in the prefixCompress function. For example, quantization to ½ significant digits would need the following code:

```diff
   if (!parsed.every((e) => typeof e === 'string' && isCompactDoubleQuotedString(e))) {
     // Should be rare enough, so don't bother escaping them.
     return
   }
 
+  for (let step = 100; step < parsed.length; step *= 10) {
+    // Take at most `M * step` items starting at the offset `N * step`, sort
+    // them lexicographically, and splice the sorted run back in at the same
+    // offset. The inner splice removes the run first, so the outer splice
+    // inserts with a delete count of 0; `Array.splice` gracefully handles
+    // out-of-bound ranges, so no additional check is needed.
+    parsed.splice(1 * step, 0, ...parsed.splice(1 * step, 1 * step).sort())
+    parsed.splice(2 * step, 0, ...parsed.splice(2 * step, 3 * step).sort())
+    parsed.splice(5 * step, 0, ...parsed.splice(5 * step, 5 * step).sort())
+  }
 
   const deltas = []
   let last = ''
   for (const e of parsed) {
```

You'd probably want to streamline this more in the actual code, e.g. by using a generator.
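For instance, a generator could centralize the 1-2-5 splice logic, along these lines (a sketch of that streamlining; the names are illustrative):

```js
// Yield [offset, count] pairs for the 1-2-5 progression used above.
function* quantizationRanges(length, start = 100) {
  for (let step = start; step < length; step *= 10) {
    yield [1 * step, 1 * step]
    yield [2 * step, 3 * step]
    yield [5 * step, 5 * step]
  }
}

for (const [offset, count] of quantizationRanges(parsed.length)) {
  parsed.splice(offset, 0, ...parsed.splice(offset, count).sort())
}
```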

> The Delta coding already has a code example, which makes it quite easy to grasp. However, it's important to note that the implementation needs to support different kinds of written languages, such as Polish or German special characters, which may not be covered by the current regex pattern /([A-Z])/g.

You may have thought of capitalizing each word and removing the commas (foo,bar,quux becoming FooBarQuux)? Here those characters replace a comma and serve as separators carrying additional information, so you can use any set of characters that does not appear in the word list.

I originally used control characters /([\x01-\x09])/g, which are also unlikely to appear and don't have to be escaped in string literals (the next character, \x0a, i.e. \n, would have to be), until I realized that every current word list is converted to lowercase, so [A-Z] is guaranteed to be always available. The original pattern can still be used as an alternative for lists that do contain uppercase Latin letters.
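For such lists, only the marker range would change (a sketch; maxPrefixLen would then be capped at 8 so the markers stay within \x01–\x09):

```js
// Encoder: \x01 means "retain 0 chars", \x02 "retain 1", ..., \x09 "retain 8".
deltas.push(String.fromCharCode(1 + prefixLen) + e.slice(prefixLen))

// Decoder: split on the control-character markers instead.
const parts = compressed.split(/([\x01-\x09])/g)
// ...and the retained prefix length becomes parts[i].charCodeAt(0) - 1.
```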

> Additionally, we should not assume that the dictionaries are already sorted alphabetically, as they ideally should be sorted by occurrence.

That's probably true for the first hundred or thousand words, but the long-tailed nature of such a dictionary limits the accuracy of any further ranks. For example, most would agree that the difference between 500 and 600 occurrences is not worth preserving. (For comparison, there are 2,734 such English words in the aforementioned frequency list.) It's more debatable for 500 vs. 1,000 occurrences, though; that would have to be decided eventually.

While I didn't do so in my testing, I agree the actual quantization should be done in the list generator, in order to take the actual frequencies into account.

> You don't need to worry about the inverse transformation during startup, as the dictionaries are already built at that time, which only takes a few milliseconds. Adding one more small script wouldn't make much of a difference, and it would justify this step to some extent.

I think so too, but I was surprised to learn that String.split was much faster than I imagined, so I couldn't ignore the minuscule possibility that some users may have unconsciously relied on this performance behavior. A passing mention in the release notes may be enough.

> This would be the perfect opportunity to implement such changes, as I'm currently working on the next major version, and this would definitely be a breaking change if we transition from JSON files to JavaScript files with a long string inside that needs to be parsed. [...] The only downside is that we wouldn't provide simple JSON files for others to import anymore.

Good to hear, though I think the only necessary change here is to expose a function instead of an array, so that the decompression can happen lazily. I have also briefly considered converting every field in the dictionary object to a property, so that such lazy decompression happens automatically:

```js
// commonWords now exports a function so that decompression can happen lazily.
import lazyCommonWords from './commonWords'

const cache = {}
const dictionary = {
  get ['commonWords-en']() {
    return (cache.commonWords ??= lazyCommonWords())
  },
}
```

The pre-compressed file would be placed in commonWords.js, so the existing commonWords.json might still be retained. (But I'm not sure how many users directly depend on those JSON files.)

> The inversion part could be handled within setRankedDictionaries in the Options file.

I would rather put the inversion outside of setRankedDictionaries, because otherwise you have to document the compression format, and it can no longer be customized for each input. My proposed schemes are extremely general and applicable to most human text, but you can do better if you know exactly which inputs to expect.

As an extreme example, I once wrote a self-decompressing JavaScript packer named Roadroller. While it is too computation-intensive for general use, Roadroller combined with gzip performs much better than even Brotli, partly because it can optimize various parameters for the given input, which makes up for the added size of a built-in decompressor. But Roadroller can't (yet) see the underlying human assumptions, so it doesn't outperform the rank quantization and delta coding described above.

Of course, only a small number of compression strategies would be in use at any point in time, so the shared decompression code can still be refactored into its own module if needed. In my testing that code was duplicated every time, but I made sure that the duplicated code is always identical, so it can't significantly worsen the compression (an additional function expression was sufficient for that).
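If the decoder were shared, each generated file could shrink to a call into a common module, along these lines (a sketch; the module name is illustrative):

```js
// decompress.js: shared prefix-delta decoder, same logic as the IIFE above.
export const prefixDecompress = (compressed) => {
  const parts = compressed.split(/([A-Z])/g)
  const recons = []
  let last = ''
  for (let i = 1; i < parts.length; i += 2) {
    last = last.slice(0, parts[i].charCodeAt(0) - 65) + parts[i + 1]
    recons.push(last)
  }
  return recons
}

// A generated list file would then reduce to:
// import { prefixDecompress } from './decompress'
// export default () => prefixDecompress('Aaaren...')
```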

mimi89999 commented 1 day ago

Also, I think it would be better to ship the dictionaries pre-compressed in the packages. Web servers often do the compression on the fly and therefore use low compression levels; pre-compressing would make it possible to use the maximum gzip compression level. The dictionaries could then be decompressed in the browser using a built-in function: https://developer.mozilla.org/en-US/docs/Web/API/Compression_Streams_API
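For example, a pre-gzipped dictionary could be fetched and inflated like this (a sketch; the .json.gz path is illustrative):

```js
// Fetch a pre-compressed dictionary and inflate it with the built-in
// DecompressionStream, available in all current major browsers.
const loadDictionary = async (url) => {
  const response = await fetch(url)
  const stream = response.body.pipeThrough(new DecompressionStream('gzip'))
  return new Response(stream).json()
}

loadDictionary('/dicts/commonWords.json.gz').then((words) => console.log(words.length))
```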

mimi89999 commented 1 day ago

By using gzip -9 and encoding the compressed content with base64, I still got a compression ratio of 1.6. That's quite impressive.
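Something along these lines reproduces that in Node (a sketch; the file name is illustrative):

```js
import { readFileSync } from 'fs'
import { gzipSync, constants } from 'zlib'

// Compare the raw JSON size against its gzip -9 + base64 encoding.
const raw = readFileSync('commonWords.json')
const b64 = gzipSync(raw, { level: constants.Z_BEST_COMPRESSION }).toString('base64')
console.log('ratio:', (raw.length / b64.length).toFixed(2)) // ~1.6
```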