web-mech / badwords

A JavaScript filter for badwords
MIT License

support for accented chars #107

Closed zzgab closed 3 years ago

zzgab commented 3 years ago

Some languages, such as French, use accented characters in words. For example, assécher means "to dry up" (skin, hair, etc.). The native JS RegExp \b assertion is ASCII-only: it treats just A-Z, a-z, 0-9, and underscore as word characters, so é is interpreted as a word separator. The word is therefore split into ass, é, and cher, and ass gets censored.

So we end up with ***écher, which is nonsense in French.
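
The behaviour is easy to reproduce in plain JavaScript without the library. The snippet below is only an illustration: the pattern `/\bass\b/gi` is an assumed stand-in for what a `\b`-based filter would build for one list entry, not the library's actual regexp.

```javascript
// Stand-in for a \b-delimited blocklist pattern (assumption, not the library's code).
const naivePattern = /\bass\b/gi;

// "\u00e9" is the precomposed character é.
const french = "ass\u00e9cher";

// \b sees a boundary between "s" (ASCII word char) and "é" (not a word char for \b),
// so the first three letters are matched as a standalone word.
const censoredFrench = french.replace(naivePattern, "***");
console.log(censoredFrench); // "***écher"

// For comparison, an ASCII-only word is left alone because "e" is a word char.
const censoredEnglish = "assess".replace(naivePattern, "***");
console.log(censoredEnglish); // "assess"
```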

This PR improves on the previous, incomplete attempt to support French through a user-provided word separator.

The new option enhancedWordSep, which defaults to false, uses a word-separation regexp that works for languages with accented characters.
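
The regexp used by this PR is not reproduced here. As a rough sketch of one way an accent-aware separator could work, the hypothetical helper `censorWord` below (not part of the badwords API) replaces `\b` with Unicode-aware lookarounds, so a match counts only when no letter, combining mark, digit, or underscore touches it on either side. It assumes a runtime that supports lookbehind and Unicode property escapes, such as a recent Node.js.

```javascript
// Hypothetical helper for illustration only; not the PR's implementation.
function censorWord(text, word, placeholder = "***") {
  // Escape regexp metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

  // Characters that count as "inside a word": any Unicode letter (\p{L}),
  // combining mark (\p{M}), number (\p{N}), or underscore.
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";

  // Lookbehind/lookahead play the role of \b, but with Unicode awareness.
  const pattern = new RegExp(`(?<!${wordChar})${escaped}(?!${wordChar})`, "giu");

  return text.replace(pattern, placeholder);
}

// é is a letter, so "ass" is no longer seen as a standalone word.
const keptFrench = censorWord("ass\u00e9cher", "ass");

// A genuinely standalone occurrence is still censored.
const censoredStandalone = censorWord("what an ass!", "ass");

console.log(keptFrench);         // "assécher"
console.log(censoredStandalone); // "what an ***!"
```

Including `\p{M}` also covers decomposed input, where é is stored as e followed by a combining accent.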