yl4579 / PL-BERT

Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions
MIT License

Is this better than PnG-BERT? #18

Closed dutchsing009 closed 11 months ago

dutchsing009 commented 11 months ago

@yl4579 Hello author, two quick questions. 1. Is PL-BERT actually better than PnG-BERT? 2. PnG-BERT pushed NAT to a score of 4.47 against a ground truth (GT) of 4.47, but PL-BERT didn't push StyleTTS to that level. Is that because StyleTTS is already inferior to NAT? Thanks in advance.

yl4579 commented 11 months ago
  1. The authors of MP-BERT claim it is comparable to PnG-BERT, and we found that PL-BERT is better than MP-BERT, so it is reasonable to assume that PL-BERT is better than PnG-BERT even though we have not compared the two directly. Another way to see it: StyleTTS 2, which uses PL-BERT, is better than NaturalSpeech, which uses MP-BERT, by a large margin.
  2. PnG-BERT was only compared to the ground truth in terms of MOS, while in our paper we used CMOS, which is more accurate because raters are asked only to compare two samples side by side (in MOS experiments each sample is rated on its own, with no other sample to compare against). VITS is also human-level in MOS but clearly not human-level in CMOS. Also, PnG-BERT was tested on a proprietary dataset with 343 hours of data, while we only tested our model on the 24-hour LJSpeech dataset. The results are not really comparable since we don’t have their dataset, so this doesn’t mean StyleTTS is worse than NAT.
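
To make the MOS vs. CMOS point concrete, here is a minimal sketch of how the two scores are aggregated. It is not code from the PL-BERT or PnG-BERT evaluations: the function names, the rating scales, and all numbers are made-up assumptions for illustration only. It shows how a system can tie the ground truth in MOS and still lose in a side-by-side comparison.

```python
from statistics import mean

def mos(ratings):
    """Mean opinion score: each rating is an absolute score (assumed 1-5)
    given to a single sample heard on its own."""
    return mean(ratings)

def cmos(pairwise_scores):
    """Comparative MOS: each score rates the synthesized sample relative to
    a reference for the same utterance on a signed scale (the exact range
    is an assumption here); negative means the synthesized sample is worse."""
    return mean(pairwise_scores)

# Hypothetical toy ratings, not results from any paper.
synth_abs = [5, 4, 5, 4, 5]         # synthesized speech, rated in isolation
truth_abs = [4, 5, 5, 5, 4]         # ground-truth recordings, rated in isolation
side_by_side = [-1, 0, -1, -2, 0]   # synthesized vs. ground truth, same utterances

print("MOS (synth):", mos(synth_abs))
print("MOS (ground truth):", mos(truth_abs))
print("CMOS (synth vs. ground truth):", cmos(side_by_side))
```

With these toy numbers the two MOS values are identical, yet the CMOS is negative. The direct comparison exposes a preference that isolated absolute ratings hide, which is the same pattern described above for VITS.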