Closed arishov1 closed 3 months ago
From what I understand, section 2.3.2 of the STREAMVC paper cites MelGan as the base for the adversarial loss, which uses the formulation from its section 2.3.
Oh, I see. But if you use the formulation from that source, does that mean the discriminator loss is incorrect?
In the formulation, it uses min(0, 1−D_k(x)), but in the implementation, we use `F.relu(1 - scale[-1])`, which is equivalent to max(0, 1−scale[−1]).
Our implementation is based on MelGan's official implementation. You can see similar doubts expressed in an issue in the melgan repo; as per the discussion there, the error is in the paper.
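For concreteness, here is a sketch of the hinge discriminator loss as MelGan-style code typically implements it (my own illustration, not the repo's actual code; plain floats stand in for the final discriminator output, what the thread calls `scale[-1]`):

```python
def relu(x):
    # scalar stand-in for F.relu
    return max(0.0, x)

def hinge_d_loss(d_real, d_fake):
    # max(0, 1 - D(x)) + max(0, 1 + D(G(z)))
    return relu(1.0 - d_real) + relu(1.0 + d_fake)

# A confident discriminator (real -> +1, fake -> -1) pays zero loss:
print(hinge_d_loss(1.5, -1.5))  # -> 0.0
# An uncertain one is penalized:
print(hinge_d_loss(0.0, 0.0))   # -> 2.0
```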
Let's follow the trail of citations to dig a little deeper.
STREAMVC actually cites two papers as the source for the adversarial/feature/reconstruction losses: MelGan and Streaming-SEANet.
MelGan's paper shows the adversarial losses as: $\min\limits_{D_k} E_x \left[ \min (0, 1-D_k(x)) \right] + E_{s,z} \left[ \min (0, 1+D_k(G(s,z))) \right], \forall k=1,2,3$ $\min\limits_{G} E_{s,z}\left[ \sum\limits_{k=1,2,3} - D_k(G(s,z)) \right]$ However, it cites two papers: Geometric GAN (to which the GAN hinge loss is usually attributed) and Spectral Normalization.
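As a quick sanity check (my own illustration, not from any of the papers): if the discriminator objective really were to minimize min(0, 1−D(x)), it would be unbounded below, since pushing D(x) ever higher keeps decreasing the loss; the inner term has to be max for the objective to have a minimum:

```python
# The "min" form printed in the MelGan paper is degenerate as a minimization
# target, while the "max" (hinge) form bottoms out at zero.
for d_real in [0.0, 1.0, 2.0, 10.0]:
    min_form = min(0.0, 1.0 - d_real)   # as printed in the paper
    max_form = max(0.0, 1.0 - d_real)   # hinge, as implemented
    print(d_real, min_form, max_form)
# min_form: 0.0, 0.0, -1.0, -9.0  (keeps decreasing -> no minimum)
# max_form: 1.0, 0.0, 0.0, 0.0    (saturates at 0 once D(x) >= 1)
```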
Geometric GAN has the following loss: $\min\limits_{D} E_x \left[ \max (0, 1-D(x)) \right] + E_{z} \left[ \max (0, 1+D(G(z))) \right]$ $\min\limits_{G} -E_{z}\left[ D(G(z)) \right]$ which is similar to our implementation.
The Spectral Normalization paper has the following loss: $\max\limits_{D} E_x \left[ \min (0, -1+D(x)) \right] + E_{z} \left[ \min (0, -1-D(G(z))) \right]$ $\min\limits_{G} -E_{z}\left[ D(G(z)) \right]$ which is equivalent to the Geometric GAN formulation (which it also cites as inspiration): maximizing this objective is the same as minimizing its negation.
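To spell out that equivalence (my own check): min(0, a) = −max(0, −a), so with a = −1 + D(x), maximizing E[min(0, −1+D(x))] is the same as minimizing E[max(0, 1−D(x))]:

```python
# Numeric check that the Spectral Normalization max-form and the
# Geometric GAN min-form are term-by-term negatives of each other.
for d in [-2.0, 0.0, 0.5, 1.0, 3.0]:
    spectral_term = min(0.0, -1.0 + d)   # maximized in Spectral Norm paper
    geometric_term = max(0.0, 1.0 - d)   # minimized in Geometric GAN
    assert spectral_term == -geometric_term
print("equivalent")
```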
Finally, Streaming-SEANet used a similar loss to SoundStream: $\min\limits_{D} E_x \left[ \max (0, 1-D(x)) \right] + E_{z} \left[ \max (0, 1+D(G(z))) \right]$ $\min\limits_{G} E_{z}\left[ \max(0, 1- D(G(z))) \right]$ Streaming-SEANet cites SEANet as inspiration, and neither SEANet nor SoundStream cites a source for the adversarial loss; they simply call it a hinge loss.
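Note that the two generator objectives above differ only in saturation (again, my own illustration): the hinge generator loss max(0, 1−D(G(z))) matches MelGan's −D(G(z)) up to a constant of 1 while D(G(z)) < 1, but goes flat once the discriminator is already fooled:

```python
# Compare the two generator objectives on a few discriminator scores.
for d_fake in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    melgan_g = -d_fake                 # MelGan: -D(G(z))
    hinge_g = max(0.0, 1.0 - d_fake)   # SoundStream / Streaming-SEANet
    print(d_fake, melgan_g, hinge_g)
# For d_fake < 1 the two differ by the constant 1 (same gradient);
# for d_fake >= 1 the hinge version is flat at 0.
```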
So I hope it makes sense why we chose this formulation.
Okay, got it. Thank you for your detailed answer!
Hi,
Thank you for your implementation.
As I was reading the StreamVC paper, I noticed that they use the adversarial loss as described here. If we implement the loss exactly as described in that paper, the code would change as follows:
Before:
After:
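Roughly, the change being asked about (a sketch of my own, with a plain float standing in for the final discriminator output, which the repo calls `scale[-1]`):

```python
def relu(x):
    # scalar stand-in for F.relu
    return max(0.0, x)

def current_term(score):
    # before: F.relu(1 - scale[-1])  ->  max(0, 1 - D(x))
    return relu(1.0 - score)

def paper_term(score):
    # after: the StreamVC paper's literal min(0, 1 - D(x))
    return min(0.0, 1.0 - score)

print(current_term(2.0), paper_term(2.0))  # -> 0.0 -1.0
```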
My question is, is there any specific reason you decided to change this loss?