Closed rajdeep1337 closed 10 months ago
Sorry for the late reply, I have been quite busy recently. The format is basically a pair of (phonemes, grapheme token). The goal is to predict the masked phonemes and, for each phoneme, the corresponding grapheme token. You can refer to this multilingual PL-BERT dataset as an example: https://huggingface.co/datasets/styletts2-community/multilingual-pl-bert
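To make the pairing concrete, here is a rough sketch of how one training example could be built from (phonemes, grapheme token) pairs. It is only an illustration, not the actual PL-BERT preprocessing: the `MASK` symbol, the `build_example` helper, the token IDs, the 15% masking rate, and the choice to mask all phonemes of a word together are assumptions made for this example. Check the linked dataset for the real column names and encoding.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the PL-BERT code

# Each entry pairs a grapheme (word-level) token ID with that word's phonemes.
# The IDs and phoneme strings are made up for illustration.
sample = [
    (1037, "ðɪs"),
    (2003, "ɪz"),
    (1015, "ə"),
    (3231, "tɛst"),
]


def build_example(pairs, mask_prob=0.15, rng=None):
    """Expand (token_id, phonemes) pairs into per-phoneme inputs and targets.

    Returns three lists of equal length:
      inputs          - phonemes, with every phoneme of a masked word replaced by MASK
      phoneme_targets - the original phoneme at each position
      token_targets   - the grapheme token ID each phoneme belongs to
    """
    rng = rng or random.Random()
    inputs, phoneme_targets, token_targets = [], [], []
    for token_id, phonemes in pairs:
        # Assumption: masking is decided once per word, so all of its
        # phonemes are masked together.
        masked = rng.random() < mask_prob
        for ph in phonemes:
            inputs.append(MASK if masked else ph)
            phoneme_targets.append(ph)
            token_targets.append(token_id)
    return inputs, phoneme_targets, token_targets


inputs, phoneme_targets, token_targets = build_example(
    sample, mask_prob=0.5, rng=random.Random(0)
)
for row in zip(inputs, phoneme_targets, token_targets):
    print(row)
```

The model would then see `inputs` and be trained to predict both `phoneme_targets` (at masked positions) and `token_targets` (at every position), which matches the two objectives described above.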
I want to train my own PL-BERT model, but I am unsure what format the dataset needs to be in. Could you please shed some light on this? Thanks!