Open laithshadeed opened 3 years ago
Comparison of some methods: https://stackblitz.com/edit/stackblitz-typescript-lrag9u?devToolsHeight=90&file=index.ts
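// Invisible characters are written as escapes: \u200d is the zero-width joiner, \u{e0100} a variation selector.
const str = '🐅-👨\u200d👩\u200d👧-நி-깍-葛\u{e0100}';
// String.prototype.split('') cuts at every UTF-16 code unit, so surrogate pairs are torn apart
str.split('');
// (20) ['\ud83d', '\udc05', '-', '\ud83d', '\udc68', '\u200d', '\ud83d', '\udc69', '\u200d', '\ud83d', '\udc67', '-', 'ந', 'ி', '-', '깍', '-', '葛', '\udb40', '\udd00']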
// Spread iterates by code point: surrogate pairs survive, grapheme clusters do not
[...str]
// (15) ['🐅', '-', '👨', '\u200d', '👩', '\u200d', '👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '\u{e0100}']
// Native Intl.Segmenter: splits by grapheme cluster
[...new Intl.Segmenter().segment(str)].map((g) => g.segment);
// (9) ['🐅', '-', '👨\u200d👩\u200d👧', '-', 'நி', '-', '깍', '-', '葛\u{e0100}']
// graphemer package: same result as Intl.Segmenter
import Graphemer from 'graphemer';
const splitter = new Graphemer();
splitter.splitGraphemes(str);
// (9) ['🐅', '-', '👨\u200d👩\u200d👧', '-', 'நி', '-', '깍', '-', '葛\u{e0100}']
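// lodash: keeps the ZWJ emoji sequence together, but splits the Tamil cluster and the variation selector
import _ from 'lodash';
_.split(str, '');
// (11) ['🐅', '-', '👨\u200d👩\u200d👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '\u{e0100}']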
// graphemeSplit imported from a local file in the StackBlitz: same result as spread
import { graphemeSplit } from './fabric_graphemeSplit';
graphemeSplit(str);
// (15) ['🐅', '-', '👨', '\u200d', '👩', '\u200d', '👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '\u{e0100}']
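// @formatjs polyfill, forced over the native implementation: same result as native Intl.Segmenter
await import('@formatjs/intl-segmenter/polyfill-force');
[...new Intl.Segmenter().segment(str)].map((g) => g.segment);
// (9) ['🐅', '-', '👨\u200d👩\u200d👧', '-', 'நி', '-', '깍', '-', '葛\u{e0100}']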
For example, the native split function
'😀-hi-🐠'.split('')
breaks the string, unlike lodash's _.split('😀-hi-🐠', ''),
because it does not treat an emoji as a single symbol and instead splits its surrogate pair into two pieces. It is the same reason why calling length on an emoji returns two instead of one: '😀'.length
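The arithmetic behind a surrogate pair can be sketched as follows. The helper name toSurrogatePair is hypothetical and U+1F600 (grinning face) is just a convenient sample; any code point above U+FFFF behaves the same way.

```javascript
// Hypothetical helper: split a non-BMP code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const grin = '\u{1F600}'; // grinning face, outside the BMP

console.log(grin.length);                      // 2 (UTF-16 code units, not symbols)
console.log(grin.charCodeAt(0).toString(16));  // d83d (high surrogate)
console.log(grin.charCodeAt(1).toString(16));  // de00 (low surrogate)
console.log(grin.codePointAt(0).toString(16)); // 1f600 (the full code point)
console.log(toSurrogatePair(0x1f600).map((unit) => unit.toString(16))); // [ 'd83d', 'de00' ]
```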
Lodash takes special care when your string contains non-BMP symbols such as emojis. To split '😀-hi-🐠' correctly without lodash, you can use the spread operator, which iterates over code points rather than UTF-16 code units:
[...'😀-hi-🐠']
But even the spread operator does not handle grapheme clusters, where several code points combine into one visible character. For that, you need the Unicode Text Segmentation algorithm (UAX #29). Chrome has shipped it as Intl.Segmenter since version 87. You can use it like this:
[...new Intl.Segmenter().segment('😀-hi-🐠')].map(x => x.segment)
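To see where the spread operator and Intl.Segmenter diverge, here is a minimal sketch using two clusters that each span several code points. The helper names byCodePoint and byGrapheme are made up for illustration, and the code assumes a runtime that ships Intl.Segmenter.

```javascript
// Man + ZWJ + woman + ZWJ + girl: five code points, one visible family emoji.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
// "e" followed by a combining acute accent: two code points, one visible letter.
const accented = 'e\u0301';

// Hypothetical helpers for the comparison.
const byCodePoint = (text) => [...text];
const byGrapheme = (text) =>
  [...new Intl.Segmenter(undefined, { granularity: 'grapheme' }).segment(text)]
    .map((part) => part.segment);

console.log(byCodePoint(family).length);   // 5
console.log(byGrapheme(family).length);    // 1
console.log(byCodePoint(accented).length); // 2
console.log(byGrapheme(accented).length);  // 1
```

Counting user-visible characters (for an input length limit, say) would therefore use byGrapheme(text).length rather than text.length.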
More about Unicode issues in JavaScript: https://mathiasbynens.be/notes/javascript-unicode
Happy passing emojis around 😀