you-dont-need / You-Dont-Need-Lodash-Underscore

List of JavaScript methods which you can use natively + ESLint Plugin

Replacing lodash string functions with native ones requires special care for Unicode strings with non-BMP symbols #321

Open laithshadeed opened 3 years ago

laithshadeed commented 3 years ago

For example, the native split function '😀-hi-🐅'.split('') will break your string, unlike lodash's _.split('😀-hi-🐅', ''), because it does not recognize an emoji as a single symbol and instead splits each surrogate pair into two pieces. This is the same reason why '😀'.length returns 2 instead of 1.
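As a minimal sketch of what "splitting a surrogate pair" means, the snippet below looks at the UTF-16 code units behind the grinning-face emoji. The variable names are only illustrative, and the emoji is written as an escape so the example does not depend on file encoding.

```javascript
// U+1F600 (grinning face) lies outside the BMP, so UTF-16 stores it as two code units.
const emoji = '\u{1F600}';

// split('') cuts between code units, so each half of the surrogate pair becomes its own element.
const codeUnits = emoji.split('').map((unit) => unit.charCodeAt(0).toString(16));

// codePointAt reads the whole pair back as one code point.
const codePoint = emoji.codePointAt(0).toString(16);

console.log(emoji.length); // 2
console.log(codeUnits);    // [ 'd83d', 'de00' ]
console.log(codePoint);    // 1f600
```

Neither half of the pair is a valid character on its own, which is why the pieces show up as escapes such as "\ud83d" when printed.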

Lodash takes special care when your string contains non-BMP symbols such as emojis. To split '😀-hi-🐅' correctly with native code, you can use the spread operator: [...'😀-hi-🐅']
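The spread operator works here because string iteration walks code points rather than UTF-16 code units. A small sketch of the difference, using the same string written with escapes (variable names are illustrative only):

```javascript
// Same string as '😀-hi-🐅', written with code point escapes.
const text = '\u{1F600}-hi-\u{1F405}';

// split('') works on UTF-16 code units and cuts each emoji in half.
const byCodeUnit = text.split('');

// Array.from uses the string iterator, exactly like [...text], and yields whole code points.
const byCodePoint = Array.from(text);

console.log(byCodeUnit.length);  // 8
console.log(byCodePoint.length); // 6
console.log(byCodePoint[0] === '\u{1F600}'); // true
```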

But even the spread operator does not handle grapheme clusters, where several code points render as one visible character. For that, you need the Unicode Text Segmentation algorithm (UAX #29). Chrome has shipped this algorithm as Intl.Segmenter since version 87. You can use it like this:

[...(new Intl.Segmenter).segment('😀-hi-🐅')].map(x => x.segment)
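Since Intl.Segmenter is not available in every runtime, one possible approach is to wrap it with feature detection. The helper below is a hypothetical sketch, not part of lodash or this repository: splitGraphemes is an invented name, and the fallback only splits by code point, so grapheme clusters can still be broken apart where Intl.Segmenter is missing.

```javascript
// Hypothetical helper: prefer Intl.Segmenter, fall back to code point splitting.
function splitGraphemes(text, locale) {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    // 'grapheme' is the default granularity; it is spelled out here for clarity.
    const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
    return Array.from(segmenter.segment(text), (part) => part.segment);
  }
  // Fallback (assumption: a degraded result is acceptable): whole code points only.
  return Array.from(text);
}

// Man + ZWJ + woman + ZWJ + girl: five code points that render as one family emoji.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

// Logs 1 where Intl.Segmenter exists and 5 when the fallback is used.
console.log(splitGraphemes(family).length);
```

If the degraded fallback is not acceptable, a polyfill or a dedicated library is the safer choice.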

More about Unicode issues in JavaScript: https://mathiasbynens.be/notes/javascript-unicode

Happy passing emojis around 😀

mrienstra commented 1 year ago

Comparison of some methods: https://stackblitz.com/edit/stackblitz-typescript-lrag9u?devToolsHeight=90&file=index.ts

const str = '🐅-👨‍👩‍👧-நி-깍-葛󠄀';

naive, split

str.split('');
// (20) ["\ud83d", '\udc05', '-', '\ud83d', '\udc68', '‍', '\ud83d', '\udc69', '‍', '\ud83d', '\udc67', '-', 'ந', 'ி', '-', '깍', '-', '葛', '\udb40', '\udd00']

slightly better, spread operator

[...str]
// (15) ["🐅", '-', '👨', '‍', '👩', '‍', '👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '󠄀']

In supported browsers, Intl.Segmenter

[...new Intl.Segmenter().segment(str)].map((g) => g.segment);
// (9) ["🐅", '-', '👨‍👩‍👧', '-', 'நி', '-', '깍', '-', '葛󠄀']

graphemer 1.4.0

import Graphemer from 'graphemer';
const splitter = new Graphemer();
splitter.splitGraphemes(str);
// (9) ["🐅", '-', '👨‍👩‍👧', '-', 'நி', '-', '깍', '-', '葛󠄀']

lodash 4.17.10

import _ from 'lodash';
_.split(str, '');
// (11) ["🐅", '-', '👨‍👩‍👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '󠄀']

fabric.js v6.0.0-beta10 graphemeSplit (internal function)

import { graphemeSplit } from './fabric_graphemeSplit';
graphemeSplit(str);
// (15) ["🐅", '-', '👨', '‍', '👩', '‍', '👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '󠄀']

@formatjs Intl.Segmenter 11.4.2 polyfill

await import('@formatjs/intl-segmenter/polyfill-force');
[...new Intl.Segmenter().segment(str)].map((g) => g.segment);
// (9) ["🐅", '-', '👨‍👩‍👧', '-', 'நி', '-', '깍', '-', '葛󠄀']