Regarding the loss of the classification discriminator: why does it classify only the fake samples and not the real samples as well? Wouldn't classifying both help it better learn the speaker's timbre characteristics?
I also tried feeding the generated mel spectrogram back through the style encoder to produce a style vector, and computing an L1 loss between it and the target speaker's style vector, but this did not give better results.
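For clarity, the loss described above can be sketched like this (a minimal numpy sketch with a hypothetical `style_encoder` standing in for the model's real one, not the repository's actual implementation):

```python
import numpy as np

def style_reconstruction_loss(style_encoder, generated_mel, target_style):
    # Re-encode the converted mel spectrogram, then take the L1 distance
    # between the recovered style vector and the target speaker's style vector.
    pred_style = style_encoder(generated_mel)
    return np.mean(np.abs(pred_style - target_style))

# Toy stand-ins: a fake "encoder" that averages over the time axis,
# and random tensors shaped like (mel_bins, frames) / (style_dim,).
rng = np.random.default_rng(0)
style_encoder = lambda mel: mel.mean(axis=-1)
generated_mel = rng.normal(size=(80, 64))
target_style = rng.normal(size=(80,))
loss = style_reconstruction_loss(style_encoder, generated_mel, target_style)
```

The loss is zero only when the re-encoded style matches the target exactly, so it directly penalizes timbre mismatch in the converted output.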
You can classify the real samples too, but since the real samples are not produced by the generator, doing so will not help the generator correct its mistakes.
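The point above can be illustrated with a small sketch of the classification term (a hypothetical numpy version, assuming the discriminator has a speaker-classification head that outputs `cls_logits`; names are illustrative, not the repository's API):

```python
import numpy as np

def speaker_cls_loss(cls_logits, target_speaker):
    # Cross-entropy of the classification head, computed on a *converted*
    # (fake) sample. Because the fake sample came out of the generator,
    # this loss carries a gradient back into the generator and pushes it
    # toward the target speaker's timbre. A real sample never passes
    # through the generator, so classifying it yields no such signal.
    z = cls_logits - cls_logits.max()            # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_speaker]

# Toy example: logits over 3 candidate speakers for one converted sample.
fake_logits = np.array([2.0, 0.5, -1.0])
loss = speaker_cls_loss(fake_logits, target_speaker=0)
```

Classifying real samples can still help train the classifier itself, but only the fake-sample term gives the generator a corrective gradient.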
I've tried this too, but it gave me even worse results: the reconstruction loss pushes the generator to preserve the input's characteristics rather than convert them, which lowers the speaker similarity even further.
Hi, I have two questions about the loss.