For your first question, all you need is the word (grapheme) and its corresponding IPAs (phonemes). Please check the dataset format in the preprocess.py notebook. The dataset size depends on how large your TTS dataset is. If your TTS dataset is small (only a few hours), you will probably need more text data to get good performance, and vice versa.
For your second question, the phonemizer does support stress for English. I'm not sure how well it works for Slavic languages, but for English you can see in the data list I prepared for StyleTTS that the stresses are specified with ˈ (primary stress) and ˌ (secondary stress), in line with IPA. To my knowledge of phonetics, the IPAs for Slavic languages should follow the same rules as English (for example, the IPA of the word "Россия" is /rɐˈsʲijə/, where the primary stress is on the "ссия" part). All you need is to set with_stress to True when initializing the phonemizer.
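For reference, a minimal sketch of what that looks like with the phonemizer package's espeak backend (the language code and example word here are just placeholders):

```python
from phonemizer.backend import EspeakBackend

# Enable IPA stress marks (ˈ primary, ˌ secondary) in the espeak output.
backend = EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True)

# phonemize() takes a list of utterances and returns a list of IPA strings,
# e.g. ['fətˈɑːɡɹəfi'] for 'photography'.
print(backend.phonemize(['photography'], strip=True))
```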
Thank you so much for the quick and detailed response!
Let me clarify a little: I meant manual stress control at inference time, for example Alibab'a / Alib'aba, Росс'ия / Р'оссия.
It is quite important for correct phonemization of some proper names in English and for invented/fictional words, and for Slavic languages it is critically important: the same character sequence can have a different meaning depending on the stress.
Does the model take this into account?
And another little question, if you have time and don't mind answering:
How can we use PL-BERT for a multilingual TTS model? Is it possible to train one PL-BERT model for several languages?
Or the most obvious method: we mark up the input TTS text with language marks and just swap models for different languages, getting the corresponding output.
@NikitaKononov there's no need to mark the language, you can simply mix the graphemes (i.e., your grapheme vocabulary includes both "Росія" and "Россия" in Ukrainian and Russian). The model will figure out the rest of it. I have trained a multilingual model in English, Japanese, and Chinese. Some graphemes are the same for Chinese and Japanese, but the model can still distinguish them based on the context.
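To illustrate the joint-vocabulary idea, here is a rough sketch of training one subword tokenizer on text from all target languages at once with the Hugging Face tokenizers library (this is not necessarily the tokenizer used in the PL-BERT repo, and the corpus paths are placeholders):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# One BPE tokenizer trained on raw text from every target language,
# so graphemes from all scripts share a single vocabulary.
tokenizer = Tokenizer(models.BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(special_tokens=['[UNK]', '[PAD]', '[MASK]'])
files = ['corpus.en.txt', 'corpus.ru.txt', 'corpus.uk.txt']  # placeholder corpora
tokenizer.train(files, trainer)
tokenizer.save('joint_grapheme_tokenizer.json')
```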
Thanks a lot for the answer! So the model will be able to handle mixed-language sentences?
For example: "Hello, my name is Кирилл Картаполов, J'adore lire des livres und Fahrrad fahren" (English, Russian, French, German).
And how exactly should I train the model? Load the wiki datasets (".en", ".ru", ".de", etc.), preprocess them separately, and then concatenate them into one big dataset without mixing fragments/sentences?
And should I change the parameter vocab_size: 178 in model_params? What does this number mean? It's not clear to me; could you please elaborate on this?
@yl4579 And can you please tell me how you processed the datasets? Separately?
I have, for example: wikipedia_20220301.en.processed, wikipedia_20220301.de.processed, wikipedia_20220301.fr.processed, wikipedia_20220301.ru.processed, wikipedia_20220301.it.processed.
When and how should I concatenate them? When initializing training? Should I load them one by one and feed the concatenated data to the dataloader? Wouldn't my RAM blow up? Maybe there's a smarter way to do this?
Also, I see that normalize_text is adapted only for English. Do you have any advice on adapting it to other languages? For example, I don't have any knowledge of Japanese, so I wouldn't be able to construct something similar for that language.
Thanks
Hello! Would you be so kind as to give me some advice on the questions above, if you have time? Thank you.
Sorry I just saw this.
For your first question, you have to first make sure your processed datasets use the same tokenizer to get the grapheme tokens. Assuming that is done (i.e., you trained a tokenizer on joint en, de, fr, ru, it), you can simply build one big dataset by loading them and concatenating them. When you use the Hugging Face datasets library, it never loads the entire dataset into memory, only the indices on disk, so it will not blow up your RAM. You can just save the concatenated dataset and train your model with that.
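A minimal sketch of that workflow with the datasets library, assuming the per-language splits were saved with save_to_disk under the names mentioned above:

```python
from datasets import concatenate_datasets, load_from_disk

# Per-language splits, all preprocessed with the same joint tokenizer.
paths = [
    'wikipedia_20220301.en.processed',
    'wikipedia_20220301.de.processed',
    'wikipedia_20220301.fr.processed',
    'wikipedia_20220301.ru.processed',
    'wikipedia_20220301.it.processed',
]

# Datasets are memory-mapped from disk, so this does not load everything into RAM.
combined = concatenate_datasets([load_from_disk(p) for p in paths])

# Shuffle at the example level (each sentence stays intact) and save for training.
combined = combined.shuffle(seed=42)
combined.save_to_disk('wikipedia_20220301.multilingual.processed')
```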
For your second question, you do need some knowledge of that language to do these kinds of things. You can refer to normalize_text for English and see the rules I used (note that those were not discovered by me either; I just copied the code from other open-source repos) and replace them with the rules for other languages.
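As an illustration only, here is a hypothetical rule table in the same spirit; the function name and the German rules below are placeholders, not the actual rules from the repo:

```python
import re

# Hypothetical language-specific replacements (German examples), applied in order,
# so the phonemizer only ever sees plain spelled-out words.
_GERMAN_RULES = [
    (re.compile(r'\bz\.B\.'), 'zum Beispiel'),
    (re.compile(r'\busw\.'), 'und so weiter'),
    (re.compile(r'(\d+)\s*%'), r'\1 Prozent'),
]

def normalize_text_de(text: str) -> str:
    """Expand abbreviations and symbols before phonemization."""
    for pattern, replacement in _GERMAN_RULES:
        text = pattern.sub(replacement, text)
    return text
```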
Thank you very much for such a detailed answer! I think I've understood how things should be done.
I have only two little questions left, if you would be so kind as to answer them:
- I don't really understand the parameter vocab_size: 178 in model_params. What does it represent? The tokenizer's unique token count, or something else?
- I have mispronunciation problems in my TTS model training that are connected to espeak phonemization. espeak assigns stress to vowel phonemes from its dictionary without any connection to the text context, so I hope that proper preprocessing and training of PL-BERT can solve this problem.
But I don't have any good ideas... Maybe I should disable stress marks during espeak phonemization? But in that case PL-BERT won't focus on stresses either, because there will be no stresses...
Either I'm missing something, or I've encountered a genuine dilemma.
Thank you a lot for your contribution!
Did you finish your training? With the release of StyleTTS2, I am looking for a way to support European languages. Your list sounded like a great start.
Hello!
Thank you, you have done incredible and very useful work for the community.
I would like to train PL-BERT for Slavic languages: Polish, Russian, Ukrainian
I don't really understand how much data will be required for training. Could you please elaborate on this?
Also, I see that you use the espeak phonemizer in your work. It has a significant drawback: we can't manually control the stress, since espeak places it using its own dictionary. So the final PL-BERT model can't react to stress if we denote it with + or ' symbols before vowels? Is that stress problem solved in some way, or did you not focus on it for English?
Looking forward to your answer, thank you!