lifthrasiir opened this issue 9 months ago
Hey @lifthrasiir,
First off, thank you very much for your thorough analysis of the issue with the dictionaries. It would greatly benefit this package to reduce the size of these dictionaries.
Since you've already conducted benchmarks, could you please share the code for Rank Quantization with a repository, a random JavaScript site, or any other resource? This would greatly enhance my understanding of the process and how to implement it into this package.
The Delta coding already has a code example, which makes it quite easy to grasp. However, it's important to note that the implementation needs to support different written languages, such as Polish or German with their special characters, which may not be covered by the current regex pattern `/([A-Z])/g`.
Additionally, we should not assume that the dictionaries are already sorted alphabetically, as they ideally should be sorted by occurrence. Unfortunately, not every government provides a neat open-source list of such data online, so we've sourced some of the data from other places, which is then hosted on GitHub.
You don't need to worry about the inversion transformation during startup, as the dictionaries are already being built at that time, which only takes a few milliseconds. Adding one more small script wouldn't make much of a difference, and it would justify this step to some extent.
This would be the perfect opportunity to implement such changes, as I'm currently working on the next major version, and this would definitely be a breaking change if we transition from JSON files to JavaScript files with a long string inside that needs to be parsed. A good place for the implementation would be directly inside the List Generator. The inversion part could be handled within `setRankedDictionaries` in the Options file. The only downside is that we wouldn't provide simple JSON files for others to import anymore.
I'm always happy to have more contributors on board!
> Since you've already conducted benchmarks, could you please share the code for Rank Quantization with a repository, a random JavaScript site, or any other resource? This would greatly enhance my understanding of the process and how to implement it into this package.
I didn't want to alter the generator itself for testing, so I've naively done the rank quantization in the `prefixCompress` function. For example, a quantization to ½ significant digits would need the following code:
```diff
 if (!parsed.every((e) => typeof e === 'string' && isCompactDoubleQuotedString(e))) {
   // Should be rare enough, so don't bother escaping them.
   return
 }
+for (let step = 100; step < parsed.length; step *= 10) {
+  // Remove at most `M * step` items starting from the offset `N * step`,
+  // sort them lexicographically, and reinsert them at the same offset.
+  // (The inner splice removes the chunk, so the outer splice only inserts.)
+  // `Array.splice` gracefully handles out-of-bound ranges, so no additional check is needed.
+  parsed.splice(1 * step, 0, ...parsed.splice(1 * step, 1 * step).sort())
+  parsed.splice(2 * step, 0, ...parsed.splice(2 * step, 3 * step).sort())
+  parsed.splice(5 * step, 0, ...parsed.splice(5 * step, 5 * step).sort())
+}
 const deltas = []
 let last = ''
 for (const e of parsed) {
```
You'd probably want to streamline this more in the actual code, e.g. by using a generator.
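For instance, the chunk boundaries could come from a generator so the sorting loop stays flat. A rough sketch (the names are mine, not the package's; the ranges mirror the ½-significant-digit scheme above):

```js
// Yield [start, length] pairs for the chunks sorted above:
// [100, 200), [200, 500), [500, 1000), [1000, 2000), [2000, 5000), ...
function* quantizedChunks(length, exactRanks = 100) {
  for (let step = exactRanks; step < length; step *= 10) {
    yield [1 * step, 1 * step]
    yield [2 * step, 3 * step]
    yield [5 * step, 5 * step]
  }
}

function quantizeRanks(words) {
  for (const [start, count] of quantizedChunks(words.length)) {
    // Sort each chunk lexicographically in place; ranges past the end of
    // the array simply produce shorter (or empty) chunks.
    const sorted = words.slice(start, start + count).sort()
    words.splice(start, sorted.length, ...sorted)
  }
  return words
}
```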
> The Delta coding already has a code example, which makes it quite easy to grasp. However, it's important to note that the implementation needs to support different written languages, such as Polish or German with their special characters, which may not be covered by the current regex pattern `/([A-Z])/g`.
You may have thought of this as capitalizing each word and removing the commas (`foo,bar,quux` → `FooBarQuux`)? Here the uppercase letters replace commas and act as separators carrying additional information, so you can use any set of characters that does not appear in the word list.
I originally used control characters `/([\x01-\x09])/g`, which are also unlikely to appear and don't have to be escaped in string literals (the next character, `\x0a`, i.e. `\n`, is the first that does), until I realized that every current word list is converted to lowercase, so `[A-Z]` is guaranteed to always be available. The original pattern can be used as an alternative for lists that do contain uppercase Latin letters.
> Additionally, we should not assume that the dictionaries are already sorted alphabetically, as they ideally should be sorted by occurrence.
That's probably true for the first hundred or thousand words, but the long-tailed nature of such dictionaries limits the accuracy of any further ranks. For example, most would agree that the difference between 500 and 600 occurrences is not worth preserving. (There are 2,734 English words in the aforementioned frequency list, for comparison.) It's more debatable for 500 vs. 1,000 occurrences though, and that would eventually have to be decided.

While I didn't do so in my testing, I agree the actual quantization should be done in the list generator in order to take the actual frequencies into account.
> You don't need to worry about the inversion transformation during startup, as the dictionaries are already being built at that time, which only takes a few milliseconds. Adding one more small script wouldn't make much of a difference, and it would justify this step to some extent.
I think so too, but I was surprised to learn that `String.split` was much faster than I imagined, so I couldn't ignore the minuscule possibility that some users may have unconsciously relied on this performance behavior. A passing mention in the release notes may be enough.
> This would be the perfect opportunity to implement such changes, as I'm currently working on the next major version, and this would definitely be a breaking change if we transition from JSON files to JavaScript files with a long string inside that needs to be parsed. [...] The only downside is that we wouldn't provide simple JSON files for others to import anymore.
Good to hear, though I think the only necessary change here is to expose a function instead of an array, so that the decompression can happen lazily. I have also briefly considered converting every field in the `dictionary` object into a getter property, so that such lazy decompression happens automatically:
```js
import lazyCommonWords from './commonWords'

const cache = {}
const dictionary = {
  get ['commonWords-en']() {
    return cache.commonWords ??= lazyCommonWords()
  }
}
```
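Usage would then be transparent; the first access decompresses and later accesses hit the cache:

```js
// First access runs lazyCommonWords() and caches the result;
// subsequent accesses return the cached array unchanged.
const words = dictionary['commonWords-en']
console.log(words === dictionary['commonWords-en']) // true
```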
The pre-compressed file would be placed in `commonWords.js`, so the existing `commonWords.json` might still be retained. (But I'm not sure how many users directly depend on those JSON files.)
> The inversion part could be handled within `setRankedDictionaries` in the Options file.
I would rather put the inversion outside of `setRankedDictionaries`, because otherwise you have to document the compression format and it can no longer be customized for each input. My proposed schemes are extremely general and applicable to most human text, but you can do better if you know exactly which inputs to expect.
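For instance, each language package could keep its format private and hand the core a plain array. A sketch, where `decodePackedList` and `packedCommonWords` are hypothetical names and the options shape assumes the current `zxcvbnOptions.setOptions` API:

```js
import { zxcvbnOptions } from '@zxcvbn-ts/core'
// Hypothetical exports: the packed string and its matching decoder live
// together in the language package, so the format never leaks out of it.
import { decodePackedList, packedCommonWords } from './commonWords'

zxcvbnOptions.setOptions({
  dictionary: {
    'commonWords-en': decodePackedList(packedCommonWords),
  },
})
```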
As an extreme example, I once wrote a self-decompressing JavaScript packer named Roadroller. While it is too computation-intensive and never suitable for general use, Roadroller combined with gzip performs much better than even Brotli, partly because it can optimize various parameters for the given input, and that makes up for the added size of a built-in decompressor. But Roadroller can't (yet) see the underlying human assumptions, so it doesn't outperform the rank quantization and delta coding described above.
Of course, only a small number of compression strategies would be in use at any point in time, so the shared decompression code can still be refactored into its own module if needed. In my testing that code was duplicated every time, but I made sure that the duplicated code always stays the same and can't significantly worsen the compression (an additional function expression was sufficient for that).
Also, I think it would be better to ship the dictionaries pre-compressed in the packages. Web servers often do the compression on the fly and therefore use low compression levels. Pre-compressing would make it possible to use the maximum gzip compression level. The dictionaries could then be decompressed in the browser using a built-in function: https://developer.mozilla.org/en-US/docs/Web/API/Compression_Streams_API
By using `gzip -9` and encoding the compressed content with base64, I still got a compression ratio of 1.6. That's quite impressive.

By using gzip at level 9 with base85 encoding, I managed to achieve a compression ratio of around 2.0, which means halving the file size.

`CompressionStream` is supported by about 90% of browsers in use (https://caniuse.com/mdn-api_compressionstream), and its counterpart `DecompressionStream` can decompress gzip, so no additional library is required. The base85 implementation is about 4 KB unminified.
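For the decompression side, a minimal browser sketch (the function name is hypothetical, and it assumes a base64-encoded gzip payload; for base85 you would swap in the corresponding decoder):

```js
// Decode base64 to bytes, then gunzip via the Compression Streams API.
async function decompressDictionary(base64) {
  const bytes = Uint8Array.from(atob(base64), (c) => c.charCodeAt(0))
  const stream = new Blob([bytes])
    .stream()
    .pipeThrough(new DecompressionStream('gzip'))
  return JSON.parse(await new Response(stream).text())
}
```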
Here is a simple POC for compressing the dictionaries:
```diff
diff --git a/data-scripts/_helpers/runtime.ts b/data-scripts/_helpers/runtime.ts
index 4340aea..67f2687 100644
--- a/data-scripts/_helpers/runtime.ts
+++ b/data-scripts/_helpers/runtime.ts
@@ -1,5 +1,7 @@
 import * as fs from 'fs'
+import * as fflate from 'fflate'
 import * as path from 'path'
+import { encodeBase85 } from '@alttiri/base85'
 import { execSync } from 'child_process'
 
 export interface LooseObject {
@@ -58,6 +60,12 @@ export default class ListHandler {
         path.join(folder, `${options.filename}.json`),
         JSON.stringify(data),
       )
+      const buf = fflate.strToU8(JSON.stringify(data))
+      const compressed = fflate.compressSync(buf, { level: 9, mem: 12 })
+      const compressedString = fflate.strFromU8(compressed, true)
+
+      console.log(buf.length / btoa(compressedString).length)
+      console.log(buf.length / encodeBase85(compressed).length)
     }
     // eslint-disable-next-line no-console
     console.timeEnd(options.filename)
```
I implemented it in https://github.com/zxcvbn-ts/zxcvbn/pull/276
To my knowledge, the size of word lists remains a major pain point for the original zxcvbn (e.g. dropbox/zxcvbn#196, dropbox/zxcvbn#268, dropbox/zxcvbn#285, dropbox/zxcvbn#338). Even though Brotli `Content-Encoding` does allow for more compression than gzip, the footprint is significant enough that it is worthwhile to fix in the library itself as well.

I have written two concrete proposals that can significantly reduce the size footprint of the built-in dictionaries. I believe they are easy to review and apply without any specialized knowledge about compression. Both also present some decisions for maintainers to make, so I have tried to give enough information to aid that process.
## Rank quantization
For example, `packages/languages/en/src/commonWords.json` starts with the following:

...and ends with the following:

This list of 26,758 words is automatically generated from the frequency list of the top 50K words. The cutoff is currently set to at least 500 occurrences, so the final entry `koch` should have occurred exactly 500 times. But there are many other words with exactly the same number of occurrences; in fact that's the case for all of the words shown above. Data compression algorithms don't and can't realize that, so they try to reproduce the exact order; you need to preprocess the data so that the algorithms get easier data to work with.

Furthermore, while the public API does return the exact rank as part of `Feedback`, it is probably meaningless for the aforementioned reason. We can for example imagine that the ranks of the first 100 words are exact, but the next 100 words all share the same rank (to be displayed as "the 101st–200th most common word"), and so on. This kind of rank quantization allows arbitrary reordering within the same rank and can greatly improve the compression without any additional code; a sketch of the idea follows.
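As a rough illustration of that example (a hypothetical helper, not package code; the real bucket boundaries would be a maintainer decision):

```js
// Quantize a 1-based rank: the first 100 ranks stay exact, later ranks
// collapse into buckets of 100 (e.g. ranks 101-200 all report 200).
function quantizeRank(rank, exact = 100) {
  if (rank <= exact) return rank
  return Math.ceil(rank / exact) * exact
}

console.log(quantizeRank(42))  // 42
console.log(quantizeRank(101)) // 200
console.log(quantizeRank(200)) // 200
```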
### Benchmarks

I've experimented with several quantization schemes. I assumed that the relative ranks of the first 100 words are significant enough and never altered those. After that point, each tested scheme has the following quantized ranks:

Words with the same quantized rank have simply been sorted in Unicode codepoint order, so that similar words are placed much closer together. While other shufflings are possible, this is good enough for our purpose and also helps the next proposal.

I also wanted to make sure that this optimization is visible across different compression algorithms and parameters, so I have used the following methods:
Results for `@zxcvbn-ts/language-en`:

Combined results for all `@zxcvbn-ts/language-*` packages:

The uncompressed size remains identical because the only change here is reordering. Still, it is possible to reduce the compressed size by 10% with a reasonable rank quantization, and that reduction is more or less consistent across all tested parameters. (There is a reason for this consistency, but that's way too technical to explain here.)
## Delta coding
It should be noted that `packages/languages/en/src/firstnames.json`, among others, is already sorted:

...and many consecutive words share the same prefix (for example, 6 words start with `abb`). We can exploit this with simple delta coding like this:

Here the number of leading underscores represents the number of first characters to retain from the previous word. And it's not necessary to repeat each underscore; we can use one reserved character for two consecutive underscores, another for three, and so on. Since uppercase letters are removed during the generation, let's use them in place of two or more underscores:

At this point we don't even need any comma, because those reserved characters can only occur at the beginning of each word. So we can further simplify the scheme by instead replacing a lone `,` with `A`, `,_` with `B`, `,__` with `C`, and so on:

While compression algorithms can pick up this kind of repetition by themselves, we already know the list is sorted, so we can make use of that fact much more efficiently in this case; a decoding sketch follows.
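To make that final scheme concrete, here is a minimal decoding sketch (a hypothetical helper written for this discussion, not the actual code from my testing):

```js
// Decode the packed form where 'A' marks a word boundary sharing 0 leading
// characters with the previous word, 'B' shares 1, 'C' shares 2, and so on.
function decodeDeltaList(encoded) {
  // Every word except the first starts with an uppercase marker.
  const [first, ...rest] = encoded.split(/(?=[A-Z])/)
  const words = [first]
  let last = first
  for (const chunk of rest) {
    const shared = chunk.charCodeAt(0) - 'A'.charCodeAt(0)
    last = last.slice(0, shared) + chunk.slice(1)
    words.push(last)
  }
  return words
}

// 'abbey,abbie,abbot' delta-codes to 'abbey,___ie,___ot', which packs into:
console.log(decodeDeltaList('abbeyDieDot')) // ['abbey', 'abbie', 'abbot']
```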
### Benchmarks
Results for `@zxcvbn-ts/language-en`:

Combined results for all `@zxcvbn-ts/language-*` packages:

In general this delta coding scheme removes another 1–20% of the compressed data on top of rank quantization. The best case results in a 50% reduction!
## Caveats
It should be noted that such a transformation has to be inverted every time the script is loaded. The following is the actual code for delta coding I've used, and you can see that it returns an immediately invoked function that does the job.
In my testing the inversions are not very slow, but not extremely quick either. For `@zxcvbn-ts/language-en`, both browsers I've tested (Firefox and Chrome) took 5–10 ms to undo the transformation, even with some additional optimization to avoid memory allocation.

Amazingly enough, Chrome's V8 does seem to handle `String.split` with a string argument specially: it was 10,000x faster than any other tested code, while `String.split` was only ~3x faster in Firefox. I haven't tested a direct character-wise iteration, but I doubt it would be as fast as `String.split` in V8, so that's one thing to take into account.

## Raw data for benchmarks