Closed nhanhttrong closed 1 year ago
I believe the problem lies in the duration loss, it somehow fluctuates between 1 and 0.6. I think the text aligner is probably fine. Could you check if the first stage of the model sounds good?
Can you provide code to test stage 1? and I have another question, while preprocessing data i use phonemizer to convert text in data to IPA, but all of most IPA which was create by phonemes, aren't exist in your code, so during training i get this status and it print text line which mismatch and loss still working but i wonder how it do that
and is raise keyerror in this code is effect to train ?
Thank all your reply, hope you have a good day
You can use the following code for testing (under the inference notebook) from the validation loss computation code:
from meldataset import build_dataloader
train_path = config.get('train_data', None)
val_path = config.get('val_data', None)
train_list, val_list = get_data_path_list(train_path, val_path)
train_dataloader = build_dataloader(train_list,
batch_size=batch_size,
num_workers=8,
dataset_config={},
device=device)
val_dataloader = build_dataloader(val_list,
batch_size=batch_size,
validation=True,
num_workers=2,
device=device,
dataset_config={})
_, batch = next(enumerate(train_dataloader)) # can also be val_dataloader
batch = [b.to(device) for b in batch]
texts, input_lengths, mels, mel_input_length = batch
with torch.no_grad():
mask = length_to_mask(mel_input_length // (2 ** model.text_aligner.n_down)).to('cuda')
m = length_to_mask(input_lengths)
ppgs, s2s_pred, s2s_attn_feat = model.text_aligner(mels, mask, texts)
s2s_attn_feat = s2s_attn_feat.transpose(-1, -2)
s2s_attn_feat = s2s_attn_feat[..., 1:]
s2s_attn_feat = s2s_attn_feat.transpose(-1, -2)
with torch.no_grad():
text_mask = length_to_mask(input_lengths).to(texts.device)
attn_mask = (~mask).unsqueeze(-1).expand(mask.shape[0], mask.shape[1], text_mask.shape[-1]).float().transpose(-1, -2)
attn_mask = attn_mask.float() * (~text_mask).unsqueeze(-1).expand(text_mask.shape[0], text_mask.shape[1], mask.shape[-1]).float()
attn_mask = (attn_mask < 1)
s2s_attn_feat.masked_fill_(attn_mask, -float("inf"))
if TMA_CEloss:
s2s_attn = F.softmax(s2s_attn_feat, dim=1) # along the mel dimension
else:
s2s_attn = F.softmax(s2s_attn_feat, dim=-1) # along the text dimension
# get monotonic version
with torch.no_grad():
mask_ST = mask_from_lens(s2s_attn, input_lengths, mel_input_length // (2 ** model.text_aligner.n_down))
s2s_attn_mono = maximum_path(s2s_attn, mask_ST)
s2s_attn = torch.nan_to_num(s2s_attn)
# encode
t_en = model.text_encoder(texts, input_lengths, m)
asr = (t_en @ s2s_attn_mono)
# get clips
mel_len = int(mel_input_length.min().item() / 2 - 1)
en = []
gt = []
for bib in range(len(mel_input_length)):
mel_length = int(mel_input_length[bib].item() / 2)
random_start = np.random.randint(0, mel_length - mel_len)
en.append(asr[bib, :, random_start:random_start+mel_len])
gt.append(mels[bib, :, (random_start * 2):((random_start+mel_len) * 2)])
en = torch.stack(en)
gt = torch.stack(gt).detach()
with torch.no_grad():
F0_real, _, F0 = model.pitch_extractor(gt.unsqueeze(1))
F0 = F0.reshape(F0.shape[0], F0.shape[1] * 2, F0.shape[2], 1).squeeze()
# reconstruct
s = model.style_encoder(gt.unsqueeze(1))
real_norm = log_norm(gt.unsqueeze(1)).squeeze(1)
mel_rec = model.decoder(en, F0_real, real_norm, s)
mel_rec = mel_rec[..., :gt.shape[-1]]
synthesized = []
for idx in range(mel_rec.size(0)):
with torch.no_grad():
# synthesize into waveforms
c = mel_rec[idx].squeeze()
y_g_hat = generator(c.unsqueeze(0))
y_out = y_g_hat.squeeze().cpu().numpy()
synthesized.append(y_out)
import IPython.display as ipd
for wave in synthesized:
display(ipd.Audio(wave, rate=24000))
As for the IPA error, it seems like only a few characters are not in the dictionary and they are automatically ignored by the text cleaner. You can do print(char)
instead of print(index)
to see which specific character.
So does the IPA error affect training? If yes, do I need to retrain with the more extensive IPA set through text-aligner?
Thank for your reply
@christopherohit If these characters are not essential for reconstruction then it doesn’t matter, because the text aligner is eventually finetuned with your new dataset and unseen characters of pretrained models will be learned during TMA training. But if these characters are important for reconstruction (like pauses, breaths, laughs etc.) then you need to retrain.
I'm pretraining your model on the vivo dataset (Vietnamese) but the results are not what I expected. Here is the original audio: https://drive.google.com/file/d/12mZdg8yVhgQj35Vt3thoxIK44_jWWaCJ/view?usp=sharing and here is the result: https://drive.google.com/file/d/1UOuUHrxiR5DvF1MrpccMZ6bdwKyfO2AE/view?usp=sharing
This is the loss during training stage 1 and stage 2
p/s: I used the ASR of the original article to train Vietnamese. I wonder if it has any problems because during the training stage 1 there were quite a lot of keyerror errors. Thank you very much