Myanmar language (Burmese) README
Syllable segmentation is an important preprocessing step for many natural language processing (NLP) tasks such as romanization, transliteration and grapheme-to-phoneme (g2p) conversion.
"sylbreak" is a syllable segmentation tool for Myanmar language (Burmese) text encoded in Unicode (e.g. text in the Myanmar3 or Padauk fonts). It uses only one short regular expression (RE), as follows:
$line =~ s/((?<!$ssSymbol)[$myConsonant](?![$aThat$ssSymbol])|[$enChar$otherChar])/$sep$1/g;
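The key idea is that a break is inserted before a consonant that is not preceded by a subscript symbol AND is not followed by an a-That character or a subscript symbol. The second alternative in the RE inserts a break before every character in $enChar and $otherChar.
The variables are declared as follows: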
my $myConsonant = "က-အ";
my $enChar = "a-zA-Z0-9";
my $otherChar = "ဣဤဥဦဧဩဪဿ၌၍၏၀-၉၊။!-\/:-\@\[-`{-~\\s";
my $ssSymbol = "္";
my $aThat = "်";
Fig. Visualization of sylbreak RE
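For readers who want to try the RE from Python, it can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the contents of sylbreak.py: the names BREAK_PATTERN and syllable_break and the default separator "|" are made up for this example, and the Myanmar characters are written as \u escapes of the same characters that appear in the Perl variables.

```python
import re

# Same character classes as the Perl variables, written as code point escapes.
MY_CONSONANT = "\u1000-\u1021"   # consonant range from the Perl $myConsonant
EN_CHAR = "a-zA-Z0-9"
OTHER_CHAR = (
    "\u1023\u1024\u1025\u1026\u1027\u1029\u102a\u103f"  # independent vowels and great sa
    "\u104c\u104d\u104f"                                # three Myanmar symbol characters
    "\u1040-\u1049\u104a\u104b"                         # Myanmar digits and the two section marks
    "!-/:-@\\[-`{-~\\s"                                 # ASCII punctuation and whitespace
)
SS_SYMBOL = "\u1039"  # subscript (stacking) symbol
A_THAT = "\u103a"     # a-That

# Hypothetical name: a direct transcription of the one-line Perl RE.
BREAK_PATTERN = re.compile(
    "((?<!" + SS_SYMBOL + ")[" + MY_CONSONANT + "](?![" + A_THAT + SS_SYMBOL + "])"
    "|[" + EN_CHAR + OTHER_CHAR + "])"
)


def syllable_break(text, sep="|"):
    """Insert sep before every character that the RE treats as a syllable start."""
    return BREAK_PATTERN.sub(lambda m: sep + m.group(1), text)


# Example call on a two-syllable word (ma, ya-yit, na, a-That, ma, aa).
print(syllable_break("\u1019\u103c\u1014\u103a\u1019\u102c"))
```

In this sketch the consonant followed by a-That stays attached to the preceding syllable, and a consonant joined by the subscript symbol stays attached to the consonant above it, which is the behavior the RE description above calls for.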
The shell (sylbreak.sh), Perl (sylbreak.pl) and Python (sylbreak.py) scripts can be used directly; no installation is needed.
Enjoy syllable breaking!
Ye@Lab
In the paper titled "An Algorithm for Myanmar Syllable Segmentation based on the Official Standard Myanmar Unicode Text" presented at the ICCA-2023 conference, the authors make the following statement in Section VI, Performance Evaluation:
Furthermore, we compared the correctness of our algorithm with an existing algorithm, sylbreak3. As stated in Section II, the drawback of the sylbreak3 algorithm is that it cannot correctly segment syllables that contain consonants, ‘်’ and ‘့’. To evaluate this, we tested another set of 165 common syllables in 8 random Myanmar sentences shown Table IX. The results obtained should be seen in the Table X.
According to this experiment, it can be clearly seen that the sylbreak3 algorithm can correctly segment all Myanmar syllables including Parli and digits but it fails in detecting the boundary of syllables composed of ‘်’ and ‘့’.
The statement that sylbreak "fails in detecting the boundary of syllables composed of ‘်’ and ‘့’" is wrong. When I read their paper carefully, I found that the test data was not typed in the correct character order for Unicode Myanmar text. In detail, they typed Auk-ka-myit ("့") and then A-that ("်") instead of A-that ("်") and then Auk-ka-myit ("့"). I assume they got wrong segmentation results because of this. In fact, the sylbreak tool works well if the user provides Myanmar text typed in the correct order according to the Unicode standard.
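The effect of the typing order can be reproduced with a small Python sketch. It uses only the consonant branch of the RE, which is enough to show the difference. The function names segment and reorder_dot_below are hypothetical, and the reordering step is an assumption about how wrongly ordered input could be repaired before segmentation; it is not part of sylbreak.

```python
import re

DOT_BELOW = "\u1037"  # Auk-ka-myit
A_THAT = "\u103a"     # A-that
SS_SYMBOL = "\u1039"  # subscript symbol

# Consonant branch of the sylbreak RE only (enough for this demonstration).
CONSONANT_BREAK = re.compile(
    "(?<!" + SS_SYMBOL + ")([\u1000-\u1021])(?![" + A_THAT + SS_SYMBOL + "])"
)


def segment(text, sep="|"):
    """Insert sep before each consonant that starts a syllable."""
    return CONSONANT_BREAK.sub(lambda m: sep + m.group(1), text)


def reorder_dot_below(text):
    """Hypothetical pre-processing: turn Auk-ka-myit + A-that into A-that + Auk-ka-myit."""
    return text.replace(DOT_BELOW + A_THAT, A_THAT + DOT_BELOW)


# One syllable: kha, wa-hswe, nga, then the two marks in each order.
expected_order = "\u1001\u103d\u1004\u103a\u1037"  # A-that, then Auk-ka-myit
reversed_order = "\u1001\u103d\u1004\u1037\u103a"  # Auk-ka-myit, then A-that

print(segment(expected_order))                     # kept as one syllable
print(segment(reversed_order))                     # wrongly split before nga
print(segment(reorder_dot_below(reversed_order)))  # one syllable again
```

With the reversed order, the consonant is followed directly by Auk-ka-myit rather than A-that, so the negative lookahead no longer blocks the break and the final consonant is split off as if it started a new syllable.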
Here is a video in which I explain this by comparing the example words from their paper. Although the explanation is in the Myanmar language, I hope everyone can follow it.
Video Link: https://vimeo.com/864665740?share=copy
Thanks to Swan Htet Aung, who reported my typo in $otherChar: ဥဥ ---> ဥဦ
The sylbreak RE example programs for Java and JavaScript were written by Chan Mrate Ko Ko.