Is this better than PnG-BERT?

The authors of MP-BERT claim it is comparable to PnG-BERT, and we found PL-BERT to be better than MP-BERT, so it is reasonable to assume PL-BERT is also better than PnG-BERT even though we have not compared against it directly. Another way to see it: StyleTTS 2, which uses PL-BERT, is better than NaturalSpeech, which uses MP-BERT, by a large margin.
PnG-BERT was only compared to the ground truth in terms of MOS, while in our paper we used CMOS, which is more accurate because raters are only asked to compare two samples against each other (in MOS experiments each sample is rated on its own, with no other sample to compare against). VITS is also human-level in MOS but clearly not human-level in CMOS. Also, PnG-BERT was tested on a proprietary dataset with 343 hours of data, while we only tested our model on the 24-hour LJSpeech dataset. The results are not really comparable since we don’t have their dataset, so they don’t really mean StyleTTS is worse than NAT.
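
To make the MOS vs. CMOS point concrete, here is a rough sketch, not the evaluation code from any of the papers. All ratings are invented for illustration, and the 1 to 5 absolute scale and -6 to +6 comparative scale are assumptions about a typical setup. It shows how two systems can tie on MOS while a side-by-side comparison still shows a gap.

```python
from statistics import mean

def mos(ratings):
    """Mean opinion score: each rating is an absolute score (assumed 1-5)
    given to one sample heard in isolation."""
    return mean(ratings)

def cmos(pairwise_ratings):
    """Comparative MOS: each rating is a signed score (assumed -6..+6)
    for the model sample relative to the reference heard alongside it."""
    return mean(pairwise_ratings)

# Hypothetical ratings, made up for this example only.
model_abs = [4, 5, 4, 4, 5]          # model samples rated on their own
truth_abs = [5, 4, 4, 5, 4]          # ground-truth samples rated on their own
model_vs_truth = [-1, 0, -2, -1, 0]  # negative = model judged worse than ground truth

mos_gap = mos(model_abs) - mos(truth_abs)  # difference of two absolute averages
cmos_score = cmos(model_vs_truth)          # average of direct comparisons

print(f"MOS gap: {mos_gap:+.2f}, CMOS: {cmos_score:+.2f}")
```

With these made-up numbers the MOS gap is zero, so the model looks "human-level", but the CMOS is negative because raters who hear both samples together consistently prefer the ground truth.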

yl4579 / PL-BERT