Tokenizer does not split sentence on spaces

CarstenGrohmann commented 3 months ago

The tokenizer does not split a sentence as passphrase on a all spaces.

Example: 'Today the sun 1s shining and it's very hot'

Unsplit dictionary words:

' the sun 1s '
' and it's '
' hot'

The unsplit pattern is treated as a bruteforce pattern instead of the faster dictionary pattern. This leads to an overestimation of the password strength.

MrWook commented 3 months ago

Hey @CarstenGrohmann, the "tokenizer" is validating the spaces and even other separator. There is even a separator matcher which gives them a value. But the sequences you see in the result are only a small subset of found sequences. Actually your example sentence finds 125 sequences which are validated with an algorithm.

But if you want another handling you could "normalize" your password that you give into zxcvbn-ts and remove all spaces beforehand

BenKennish commented 3 months ago

Came here to report this problem too. I was actually getting it with the final version of the original zxcvbn project and was waiting for them to fix the bug before I realised that the project died so long ago! 😆

@MrWook - I can see that I can remove spaces from the passwords and that zxcvbn-ts then correctly identifies "strategiccreateshaftdiaryparishheadless" as 6 dictionary words but I don't understand why I would need to strip the spaces. Isn't it up the zxcvbn-ts to correctly identify any words without "normalizing" the password in any way? Why is it seemingly harder to identify the words when there are spaces around them than no spaces? That doesn't make sense to me and I'd suggest it's a bug. Sure we could strip the spaces, but what if the separator was a "."? or a "-"? The entropy of the password shouldn't really differ and yet...

guessesLog10 for "strategiccreateshaftdiaryparishheadless" (without spaces): 22.657512646035222 guessesLog10 for "strategic create shaft diary parish headless" (with spaces): 31.322201551220438

MrWook commented 3 months ago

hey @BenKennish thanks for your input!

Removing spaces beforehand can be useful if someone prefers a different sequence selection method. However, keeping spaces between words actually strengthens the process, leading to a higher score. This library is designed to do much more than just finding words within strings. In fact, it only identifies words from specific dictionaries containing common words. Therefore, if a password contains words not in these dictionaries, it won't be matched.

Using your example "strategic create shaft diary parish headless" and eliminating the brute force matcher (which you mentioned affects the scoring), we achieve an even higher guessLog10 score of 40. The sequences used include every individual word and each separator within the string.

I hope that this makes this clearer. But if you have any questions feel free to ask.

BenKennish commented 3 months ago

Oooh yes I see that does make things clearer for me. Thanks. Although perhaps one might argue that space is such a common separator that its use wouldn't actually affect the true entropy/strength much in real terms. If I were writing a password cracker to crack passwords based on passphrases, I'm almost certain I would check for "strategic create shaft diary parish headless" when checking for "strategiccreateshaftdiaryparishheadless" (probably not checking any other separators) which would mean the real world strength of "strategic create shaft diary parish headless" would be quite a bit less than ones using uncommon separators (e.g. "¬" or "b") to create "strategic¬create¬shaft¬diary¬parish¬headless" or "strategicbcreatebshaftbdiarybparishbheadless".

CarstenGrohmann commented 3 months ago

Hi @MrWook,

Thank you for the explanation. It was unclear to me that the displayed result was only one of 125 with my example. It makes sense to me, as this is just a fast and reasonably accurate estimate of the effort required to crack passwords.

Many thanks for this useful project.

You may close this issue.

BenKennish commented 2 months ago

Oooh yes I see that does make things clearer for me. Thanks. Although perhaps one might argue that space is such a common separator that its use wouldn't actually affect the true entropy/strength much in real terms. If I were writing a password cracker to crack passwords based on passphrases, I'm almost certain I would check for "strategic create shaft diary parish headless" when checking for "strategiccreateshaftdiaryparishheadless" (probably not checking any other separators) which would mean the real world strength of "strategic create shaft diary parish headless" would be quite a bit less than ones using a uncommon separators (e.g. "¬" or "b") and making "strategic¬create¬shaft¬diary¬parish¬headless" or "strategicbcreatebshaftbdiarybparishbheadless".

Any comment on this though? Using common separators for a passphrase would seem a useful feature for a future version.

BenKennish commented 2 months ago

Any plans that zxcvbn-ts would more cleverly handle separators for passphrases in the future?

MrWook commented 2 months ago

@BenKennish i'm sorry i wrote you back but it seems like i never clicked on the comment button to send it out 🤦

We already check for different separators which are configured in packages/libraries/main/src/data/const.ts. Which currently is:

export const SEPERATOR_CHARS = [
  ' ',
  ',',
  ';',
  ':',
  '|',
  '/',
  '\\',
  '_',
  '.',
  '-',
]

The scoring for each separator is the same but we could extend this and give each separator a unique value or atleast a different one. Space and - have probably the same weight. Or we could order this array of seperators and the first entry is the most common one like a space and has the lowest score.

We can always improve this library so feel free to open an issue for this and try to improve the separator matcher :) The scoring and matching function are here

BenKennish commented 2 months ago

@BenKennish i'm sorry i wrote you back but it seems like i never clicked on the comment button to send it out 🤦

Whoops haha np

We can always improve this library so feel free to open an issue for this and try to improve the separator matcher :) The scoring and matching function are here

Thanks for that. I will see what I can do if I get a bit of time 👍

zxcvbn-ts / zxcvbn

Tokenizer does not split sentence on spaces #265