yl4579 / PL-BERT

Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions
MIT License

What is a reasonable phoneme prediction accuracy? #16

Closed sos1sos2Sixteen closed 11 months ago

sos1sos2Sixteen commented 11 months ago

Thanks for your contribution!

I am currently toying with my own implementation of phoneme-level BERT for Chinese, and I have noticed a gap between the phoneme prediction accuracy over all tokens (top-1 correct predictions / n_tokens, ~88%) and over tokens whose inputs were randomly masked (top-1 correct predictions on masked tokens / n_masks, ~25%).

Since the published checkpoint does not include a language-modeling head for phonemes, I trained a simple linear prediction layer on top of your published model using 10,000 lines from Wikipedia and used another 10,000 lines for testing (though these are probably all part of your training data). Masking 15% of the words, I measured an overall accuracy of 87.42% and a masked accuracy of 24.86%, which is essentially the same as my Chinese model, even though mine uses a different phone set.
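
Roughly, I compute the two accuracies like this (a simplified sketch, not my exact code; `encoder` stands in for the frozen PL-BERT-style phoneme encoder, and the dimensions and mask ID are placeholders):

```python
import torch
import torch.nn as nn

HIDDEN_DIM, N_PHONEMES, MASK_ID = 768, 178, 0    # placeholder values, not from this repo

# simple linear phoneme-prediction head trained on top of the frozen encoder
head = nn.Linear(HIDDEN_DIM, N_PHONEMES)

@torch.no_grad()
def phoneme_accuracies(encoder, head, phoneme_ids, masked_ids):
    """
    phoneme_ids: (B, T) ground-truth phoneme token IDs.
    masked_ids:  (B, T) the same sequence with ~15% of the words replaced by MASK_ID.
    encoder:     assumed to return (B, T, HIDDEN_DIM) hidden states for the masked input.
    """
    hidden = encoder(masked_ids)                  # (B, T, HIDDEN_DIM)
    pred = head(hidden).argmax(dim=-1)            # top-1 phoneme prediction per position

    correct = pred == phoneme_ids
    mask_pos = masked_ids == MASK_ID

    overall_acc = correct.float().mean().item()           # ~88% in my runs
    masked_acc = correct[mask_pos].float().mean().item()  # ~25% in my runs
    return overall_acc, masked_acc
```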

  1. I am curious whether these numbers are roughly consistent with your findings.
  2. Which accuracy does the grapheme prediction accuracy reported in Table 3 of the paper refer to?
  3. What is your opinion on how these prediction accuracies relate to actual downstream task performance (TTS)?

I'm very much looking forward to your kind reply!

yl4579 commented 11 months ago
  1. I didn't compute the prediction accuracy for masked tokens in my experiments, as it is not relevant to the problem I'm interested in. Unfortunately, I couldn't find the model with the projection head because the project was done more than a year ago. I still have the training log of the pre-trained model, which shows that the token loss was close to 1.
  2. I only computed the vocabulary prediction accuracy (not the token accuracy), and I computed it over all tokens (not just the masked ones); it was around 68% in my experiment. This is the number shown in Table 3 (see the sketch after this list).
  3. I would say they are related, but the correlation isn't that high. The entire point of PL-BERT is to learn a language model at the phoneme level, so the findings for BERT should apply to PL-BERT as well. In BERT, pretext-task performance does not necessarily translate into downstream performance, which depends heavily on the task type and your training data. I trained the model for a month even though the loss stopped decreasing after about a week; this seems to be the case for most BERT models, and you still want the model to somewhat "overfit" the data so it memorizes as much information as possible to transfer to the downstream task. The entire point of (large) language models is a form of data compression, so the general rule of thumb is to train for as long as you can afford.
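
To make the distinction in point 2 concrete, here is a rough sketch of the vocabulary (grapheme) accuracy, under the same assumptions as the snippet above (this is not the actual training code; the head, dimensions, and encoder interface are placeholders). Each phoneme token is aligned to the whole-word label of the word it belongs to, and the accuracy is averaged over all positions, not only the masked ones:

```python
import torch
import torch.nn as nn

HIDDEN_DIM, VOCAB_SIZE = 768, 30000              # placeholder values, not from this repo

# linear grapheme (whole-word) prediction head
word_head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

@torch.no_grad()
def vocabulary_accuracy(encoder, word_head, masked_ids, word_ids):
    """
    word_ids: (B, T) grapheme label of the word each phoneme token belongs to.
    """
    hidden = encoder(masked_ids)                      # (B, T, HIDDEN_DIM), assumed interface
    pred_words = word_head(hidden).argmax(dim=-1)     # top-1 grapheme prediction per token
    return (pred_words == word_ids).float().mean().item()  # averaged over ALL tokens
```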