yl4579 / PL-BERT

Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

Preprocessing code for Chinese #14

Open TinaChen95 opened 1 year ago

TinaChen95 commented 1 year ago

Do you have any suggestions for Chinese data preprocessing? For example, text normalization, g2p, etc. From your experience, will the accuracy of the g2p model have a great impact on the model performance?

TinaChen95 commented 1 year ago

These are what I'm going to try:

  1. using text normalization to preprocess all the text
  2. using jieba to replace TransfoXLTokenizer
  3. using pypinyin + IPA to replace phonemize
  4. training a BPE tokenizer; I'm also wondering about using initials and finals as sup-phonemes

Any other suggestions? Thanks!
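
For concreteness, here's a rough sketch of what I mean by steps 2 and 3 (just an illustration assuming jieba and pypinyin are installed; the pinyin-to-IPA mapping is left out):

```python
# Sketch of the proposed pipeline: word segmentation with jieba, then
# pinyin via pypinyin. The pinyin -> IPA conversion is left as a stub.
import jieba
from pypinyin import lazy_pinyin, Style

def segment_and_pinyin(text):
    words = jieba.lcut(text)  # word-level tokenization, replacing TransfoXLTokenizer
    # TONE3 puts the tone number after each syllable, e.g. "了解" -> ["liao3", "jie3"]
    pinyin = [lazy_pinyin(w, style=Style.TONE3) for w in words]
    return words, pinyin

words, py = segment_and_pinyin("这是一句中文文本")
print(words)  # exact segmentation depends on jieba's dictionary, e.g. ['这是', '一句', '中文', '文本']
print(py)
```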

yl4579 commented 1 year ago

When I trained the multilingual PL-BERT (English, Japanese, Chinese), I tried two preprocessing methods for Chinese and didn't notice any difference in quality for the downstream TTS tasks (possibly also because the AiShell dataset is simple, much like VCTK, with no clear context or emotion).

The simplest way is character-level P2G, i.e., you treat each character as a grapheme. You should also take into account characters whose pronunciation changes with context (polyphonic characters); for example, "了" can be read as both "liao" and "le" depending on the context.

Another more complicated way is to represent graphemes at the word level. For example, you treat "了" (particle) as a grapheme, but you treat "了解" as another grapheme (instead of two graphemes, "了" and "解"). This probably helps for Japanese too, as a lot of graphemes can be shared between Chinese and Japanese.
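
To make the two schemes concrete, here is a toy illustration (token boundaries written by hand just to show the idea; in practice a word-level tokenizer decides them):

```python
# The sentence "我了解了" ("I understood") under the two grapheme schemes.
char_level = ["我", "了", "解", "了"]   # every character is its own grapheme
word_level = ["我", "了解", "了"]        # "了解" is one grapheme; the final particle "了" stays separate

# Note the pronunciation change: "了" in "了解" is read "liao3",
# while the sentence-final "了" is read "le".
```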

For me, I used tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall") and used pypinyin with the conversion table here to convert characters into IPA. Luckily, Chinese pronunciation is easier than Japanese: it depends only on the word, not on the wider context. Once you know the word, the pronunciation is always the same, unlike in Japanese. So once you have the tokenized words, you can convert them individually to pinyin and then to IPA.
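
Roughly, the conversion looks like this (a simplified sketch of the approach described above, not the exact code; the pinyin-to-IPA table is not reproduced here, so that step is a placeholder):

```python
# Simplified sketch: tokenize with the uer/gpt2-chinese-cluecorpussmall
# vocabulary, then convert each token to pinyin with pypinyin.
# The pinyin -> IPA table is not included here.
from transformers import BertTokenizer
from pypinyin import lazy_pinyin, Style

tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")

def tokens_to_pinyin(text):
    tokens = tokenizer.tokenize(text)  # mostly character-level graphemes for Chinese
    # non-Chinese tokens (punctuation, subwords like '##3') pass through unchanged
    return [(t, lazy_pinyin(t, style=Style.TONE3)) for t in tokens]
```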

I don't think there's any need to train a BPE tokenizer. Not sure what it is for.

I will leave this issue open in case someone else needs to train a PL-BERT in Chinese.

TinaChen95 commented 1 year ago

I've tried tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall") and got some wrong output:

tokenizer.tokenize('这是一句中文文本,时间是2023年7月12日。') output: ['这', '是', '一', '句', '中', '文', '文', '本', ',', '时', '间', '是', '202', '##3', '年', '7', '月', '12', '日', '。']

Digits like '2023', '7', and '12' should be read out (spelled as words), and '2023' probably shouldn't be split into '202' and '##3', so maybe I need to use a text normalization module first.
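
As a first pass, something like the toy normalization below might work for digit runs (just an illustration; a real TN module also needs to handle cardinals such as "12日" → "十二日"):

```python
# Toy digit normalization: map each digit to its Chinese reading.
import re

DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def normalize_digits(text):
    return re.sub(r"\d", lambda m: DIGITS[m.group(0)], text)

print(normalize_digits("时间是2023年7月12日。"))
# -> 时间是二零二三年七月一二日。  ("12日" should really be "十二日",
#    which is exactly why a proper TN module is needed)
```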

Also, you mentioned treating "了解" as a single grapheme, which requires a Chinese word tokenizer. May I know which tokenizer you used?

I'm quite surprised that there is no difference in quality for the downstream TTS tasks, because when we input wrong graphemes it hurts naturalness a lot. May I know how you evaluated the output Chinese speech? Does the result outperform a baseline that is not pretrained on a large-scale corpus?

Thanks!

yihuitang commented 1 year ago

Hi @TinaChen95 ,

How is your attempt going? Could you please share the desired input and output format for preprocessing Chinese? A few examples would be very helpful. Thank you.

yl4579 commented 1 year ago

@TinaChen95 You can use the tokenizers here https://fengshenbang-doc.readthedocs.io/zh/latest/index.html, which offer word-level tokenization instead of character-level. It is true that you will need to normalize dates and numbers to their spoken form.

As for the performance on Mandarin, I only tested on the AiShell dataset, which is similar to VCTK in that it has no emotion or context, so the difference is probably not that big. I could not find any Chinese audiobook or emotional speech dataset with context, like LJSpeech or LibriTTS, so if you know of one I can test on, please let me know.

Also, since PL-BERT is eventually fine-tuned with the TTS model, as long as the phonemes are correct in the TTS dataset, incorrect phonemization during pre-training has little effect. This has only been confirmed for English datasets, however, but I believe the same should hold for Chinese.