yuval-reshef / StreamVC

An unofficial PyTorch implementation of "STREAMVC: REAL-TIME LOW-LATENCY VOICE CONVERSION".
MIT License
54 stars · 7 forks

Adversarial Loss Implementation Mismatch from StreamVC #2

Closed arishov1 closed 3 months ago

arishov1 commented 3 months ago

Hi,

Thank you for your implementation.

As I was reading the StreamVC paper, I noticed that they use the adversarial loss as described here. If we implement the loss as described in that paper, the code will change as follows:

Before:

class GeneratorLoss(nn.Module):
    def forward(self, fake: list[list[torch.Tensor]], mask_ratio: torch.Tensor):
        # Generator minimizes -E[D_k(G(s, z))], summed over discriminator scales.
        loss = torch.tensor(
            0., device=fake[0][0].device, dtype=fake[0][0].dtype)
        for scale in fake:
            loss += -masked_mean_from_ratios(scale[-1], mask_ratio)
        return loss

After:

class GeneratorLoss(nn.Module):
    def forward(self, fake: list[list[torch.Tensor]], mask_ratio: torch.Tensor):
        # Generator minimizes the hinge form E[max(0, 1 - D_k(G(s, z)))] instead.
        loss = torch.tensor(
            0., device=fake[0][0].device, dtype=fake[0][0].dtype)
        for scale in fake:
            loss += masked_mean_from_ratios(F.relu(1 - scale[-1]), mask_ratio)
        return loss
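For illustration, the two generator-loss variants can be compared on made-up discriminator scores. This is a minimal sketch, not the repo's code: it substitutes a plain mean for `masked_mean_from_ratios` and ignores masking.

```python
import torch
import torch.nn.functional as F

# Dummy final-layer discriminator scores for 3 scales (batch=2, time=4).
fake = [[torch.full((2, 4), v)] for v in (0.5, 2.0, -1.0)]

# Repo / Geometric GAN style: generator minimizes -E[D_k(G(s, z))].
loss_repo = sum(-scale[-1].mean() for scale in fake)

# StreamVC-paper / SoundStream style: E[max(0, 1 - D_k(G(s, z)))].
loss_hinge = sum(F.relu(1 - scale[-1]).mean() for scale in fake)

print(loss_repo.item())   # -(0.5 + 2.0 + (-1.0)) = -1.5
print(loss_hinge.item())  # 0.5 + 0.0 + 2.0 = 2.5
```

Both push the discriminator outputs up, but the hinge form stops contributing gradient once a scale's score exceeds 1, while the unbounded form keeps rewarding larger scores.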

My question is, is there any specific reason you decided to change this loss?

yuval-reshef commented 3 months ago

From what I understand, section 2.3.2 of the STREAMVC paper cites MelGan as the base for the adversarial loss, which uses this formulation (in section 2.3).

arishov1 commented 3 months ago

Oh, I see. But if you use the formulation from that source, does that mean the discriminator loss is incorrect?

In the formulation, it uses min(0, 1 − D_k(x)), but in the implementation, we use F.relu(1 - scale[-1]), which is equivalent to max(0, 1 − scale[-1]).

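The discrepancy is easy to check numerically; a quick sketch with arbitrary scores:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 0.5, 3.0])     # made-up discriminator scores

relu_form = F.relu(1 - x)                    # what the code computes
max_form = torch.clamp(1 - x, min=0)         # max(0, 1 - x)
min_form = torch.clamp(1 - x, max=0)         # min(0, 1 - x), as printed in the paper

assert torch.equal(relu_form, max_form)      # relu IS the max form
assert not torch.equal(relu_form, min_form)  # and differs from the paper's min form
```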

yuval-reshef commented 3 months ago

Our implementation is based on MelGan's official implementation; you can see similar doubts expressed in an issue in the melgan repo. However, as per the discussion there, the error is in the paper.

Let's follow the trail of citations to dig a little deeper.

STREAMVC actually cites two papers as the source for the adversarial/feature/reconstruction losses: MelGan and Streaming-SEANet.

MelGan's paper shows the adversarial losses as:

$\min\limits_{D_k} E_x \left[ \min(0, 1-D_k(x)) \right] + E_{s,z} \left[ \min(0, 1+D_k(G(s,z))) \right], \forall k=1,2,3$

$\min\limits_{G} E_{s,z}\left[ \sum\limits_{k=1,2,3} - D_k(G(s,z)) \right]$

However, it cites two papers: Geometric GAN (to which the GAN hinge loss is usually attributed) and Spectral Normalization.

Geometric GAN has the following loss:

$\min\limits_{D} E_x \left[ \max(0, 1-D(x)) \right] + E_{z} \left[ \max(0, 1+D(G(z))) \right]$

$\min\limits_{G} -E_{z}\left[ D(G(z)) \right]$

which is similar to our implementation.

The Spectral Normalization paper has the following loss:

$\max\limits_{D} E_x \left[ \min(0, -1+D(x)) \right] + E_{z} \left[ \min(0, -1-D(G(z))) \right]$

$\min\limits_{G} -E_{z}\left[ D(G(z)) \right]$

which is the same objective as the Geometric GAN formulation (which is also cited as inspiration), since $\min(0, -1+D(x)) = -\max(0, 1-D(x))$ and maximizing a quantity is minimizing its negation.
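That the Geometric GAN and Spectral Normalization discriminator objectives coincide can be checked numerically; a quick sketch with arbitrary scores, using plain means in place of expectations:

```python
import torch

d_real = torch.tensor([0.3, 1.5, -0.2])   # made-up D(x) values
d_fake = torch.tensor([-1.2, 0.4, 2.0])   # made-up D(G(z)) values

# Geometric GAN: minimize E[max(0, 1 - D(x))] + E[max(0, 1 + D(G(z)))]
geo = torch.clamp(1 - d_real, min=0).mean() + torch.clamp(1 + d_fake, min=0).mean()

# Spectral Normalization: maximize E[min(0, -1 + D(x))] + E[min(0, -1 - D(G(z)))]
sn = torch.clamp(-1 + d_real, max=0).mean() + torch.clamp(-1 - d_fake, max=0).mean()

# The two objectives are negatives of each other, so the optima coincide.
assert torch.allclose(geo, -sn)
```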

Finally, Streaming-SEANet used a loss similar to SoundStream's:

$\min\limits_{D} E_x \left[ \max(0, 1-D(x)) \right] + E_{z} \left[ \max(0, 1+D(G(z))) \right]$

$\min\limits_{G} E_{z}\left[ \max(0, 1-D(G(z))) \right]$

Streaming-SEANet cites SEANet as inspiration, and neither SEANet nor SoundStream cites anyone for the adversarial loss; they just call it a hinge loss.
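Those SoundStream-style hinge losses can be sketched in a few lines; this is an illustrative stand-in (plain means, single scale, hypothetical function names) rather than the repo's actual code:

```python
import torch
import torch.nn.functional as F

def discriminator_hinge_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # min_D  E[max(0, 1 - D(x))] + E[max(0, 1 + D(G(z)))]
    return F.relu(1 - d_real).mean() + F.relu(1 + d_fake).mean()

def generator_hinge_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # min_G  E[max(0, 1 - D(G(z)))]   (SoundStream / Streaming-SEANet form)
    return F.relu(1 - d_fake).mean()

d_real = torch.tensor([1.2, 0.8])
d_fake = torch.tensor([-1.5, 0.5])
print(discriminator_hinge_loss(d_real, d_fake))  # (0 + 0.2)/2 + (0 + 1.5)/2 = 0.85
print(generator_hinge_loss(d_fake))              # (2.5 + 0.5)/2 = 1.5
```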

So I hope this makes it clear why we chose this formulation.

arishov1 commented 3 months ago

Okay, got it. Thank you for your detailed answer!