yukara-ikemiya / friendly-stable-audio-tools

Refactored / updated version of `stable-audio-tools` which is an open-source code for audio/music generative models originally by Stability AI.
MIT License

Question about how CFG is being applied. #3

Open sakura-nyaa opened 4 months ago

sakura-nyaa commented 4 months ago

The cfg_scale is being passed into the DiffusionTransformer class and applied directly to the output of the DiT. But the DiT is trained to output v, not the noise, so we're taking a weighted combination of the conditional and unconditional v.

The model still generates reasonable outputs, but is this how it's supposed to be? As far as I know, CFG is defined in terms of the score function, which is a scaled version of the noise.

yukara-ikemiya commented 4 months ago

I see your point. I hadn't thought deeply about the implementation of CFG in v-diffusion, but I think it can be interpreted as follows. Please note that I'm not very familiar with the mathematical theory, so I apologize if there are any theoretical errors.

From the v-diffusion paper [1], the prediction of $\mathbf{v}$ corresponds to a simultaneous prediction of $\mathbf{\epsilon}$ and $\mathbf{x}$.

$$\mathbf{v} = \alpha \mathbf{\epsilon} - \sigma \mathbf{x}$$

In this context, just like the usual noise objective, considering $\mathbf{\epsilon}$ and $\mathbf{x}$ as score functions, we can apply CFG to each as follows.

$$\mathbf{v}_{cfg} = \alpha \left( (1+\omega)\mathbf{\epsilon} - \omega\mathbf{\epsilon}_u \right) - \sigma \left( (1+\omega)\mathbf{x} - \omega\mathbf{x}_u \right)$$

($\mathbf{\epsilon}_u$ and $\mathbf{x}_u$ are unconditional outputs)

Here, we can further transform the above expression as follows.

$$
\begin{aligned}
\mathbf{v}_{cfg} &= (1 + \omega)( \alpha \mathbf{\epsilon} - \sigma \mathbf{x}) - \omega ( \alpha \mathbf{\epsilon}_u - \sigma \mathbf{x}_u) \\
&= (1+\omega) \mathbf{v} - \omega \mathbf{v}_u
\end{aligned}
$$

Therefore, this ultimately means that the usual implementation of CFG can also be applied in V-diffusion.
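This equivalence is easy to sanity-check numerically. A minimal sketch (hypothetical, not code from the repo), using arbitrary values for $\alpha$, $\sigma$, and $\omega$:

```python
import numpy as np

# Sanity check: applying CFG to the eps and x components separately
# gives the same result as applying it directly to v = alpha*eps - sigma*x.
rng = np.random.default_rng(0)
alpha, sigma, omega = 0.8, 0.6, 2.0  # arbitrary noise level and guidance scale
eps_c, eps_u = rng.normal(size=4), rng.normal(size=4)  # conditional / unconditional eps
x_c, x_u = rng.normal(size=4), rng.normal(size=4)      # conditional / unconditional x

# CFG applied to each component, then combined into v
v_componentwise = alpha * ((1 + omega) * eps_c - omega * eps_u) \
                  - sigma * ((1 + omega) * x_c - omega * x_u)

# CFG applied directly to the v outputs
v_c = alpha * eps_c - sigma * x_c
v_u = alpha * eps_u - sigma * x_u
v_direct = (1 + omega) * v_c - omega * v_u

assert np.allclose(v_componentwise, v_direct)
```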

[1] https://arxiv.org/abs/2202.00512

sakura-nyaa commented 4 months ago

Hello, thanks for the reply. I was playing around with this model some more and did some derivations as well. I tried implementing CFG the way it's usually done and the generations were slightly different, but not significantly so.

If there are any differences, I suspect they would come from scaling the x0 embedded within the v:

```python
# Use phi for the noise level:
#   cos := cos(phi) = alpha
#   sin := sin(phi) = sigma
# Name the noisy latent z:
#   z := cos*signal + sin*noise
# Definition of v:
#   v := cos*noise - sin*signal

# _u indicates unconditional, _c indicates conditional.

# Start from the usual CFG combination:
v_cfg = v_u + w*(v_c - v_u)
# Substitute the definition of v:
v_cfg = (cos*noise_u - sin*signal_u) + w*((cos*noise_c - sin*signal_c) - (cos*noise_u - sin*signal_u))
# Distribute w twice to get a flat sum of terms:
v_cfg = cos*noise_u - sin*signal_u + w*cos*noise_c - w*sin*signal_c - w*cos*noise_u + w*sin*signal_u
# Group the noise terms together and the signal terms together:
v_cfg = (cos*noise_u + w*cos*noise_c - w*cos*noise_u) + (-sin*signal_u - w*sin*signal_c + w*sin*signal_u)
# Factor out cos and -sin:
v_cfg = cos*(noise_u + w*noise_c - w*noise_u) - sin*(signal_u + w*signal_c - w*signal_u)
# Factor out w:
v_cfg = cos*(noise_u + w*(noise_c - noise_u)) - sin*(signal_u + w*(signal_c - signal_u))

# The signal_u + w*(signal_c - signal_u) term is what might introduce differences;
# it depends on what the sampler does with the output it receives from the model,
# because noise_u + w*(noise_c - noise_u) is the part we usually use in CFG.
```
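The algebra above can also be checked numerically. A quick sketch (hypothetical values, variable names matching the pseudocode, with `c`/`s` standing in for cos/sin):

```python
import math
import numpy as np

# Numeric check of the derivation: CFG on v directly equals the
# grouped form with CFG applied inside the noise and signal terms.
rng = np.random.default_rng(1)
phi = 0.7                              # arbitrary noise level
c, s = math.cos(phi), math.sin(phi)    # alpha and sigma
w = 3.0                                # guidance scale
noise_c, noise_u = rng.normal(size=4), rng.normal(size=4)
signal_c, signal_u = rng.normal(size=4), rng.normal(size=4)

v_c = c * noise_c - s * signal_c
v_u = c * noise_u - s * signal_u

lhs = v_u + w * (v_c - v_u)                              # CFG on v directly
rhs = c * (noise_u + w * (noise_c - noise_u)) \
      - s * (signal_u + w * (signal_c - signal_u))       # grouped form

assert np.allclose(lhs, rhs)
```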

I don't think this is really a problem with your repo; it's just something interesting I found while reading it. I might read more and see if I can derive that everything works out as usual in the end. I'd need to convert the v-pred formulation into the Karras formulation used inside k-diffusion.
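For that last step, here's a sketch of what the conversion could look like (hypothetical; `vpred_to_denoised` is not a function from this repo or k-diffusion). Inverting the two linear equations defining z and v recovers the clean sample and the noise:

```python
import math

def vpred_to_denoised(z, v, phi):
    """Recover the clean sample (signal) and the noise from a v-prediction.

    Assumes the angular parameterization used above:
        z = cos(phi)*signal + sin(phi)*noise
        v = cos(phi)*noise  - sin(phi)*signal
    Inverting these two linear equations gives:
        signal = cos(phi)*z - sin(phi)*v
        noise  = sin(phi)*z + cos(phi)*v
    """
    c, s = math.cos(phi), math.sin(phi)
    signal = c * z - s * v
    noise = s * z + c * v
    return signal, noise

# Round-trip check with scalar values
phi = 0.9
c, s = math.cos(phi), math.sin(phi)
signal, noise = 1.5, -0.3
z = c * signal + s * noise
v = c * noise - s * signal
sig_hat, n_hat = vpred_to_denoised(z, v, phi)
assert abs(sig_hat - signal) < 1e-9 and abs(n_hat - noise) < 1e-9
```

If I'm reading the parameterization right, the corresponding Karras noise level would be the noise-to-signal ratio sigma = tan(phi), with the latent rescaled by 1/cos(phi), but I'd want to double-check that against the k-diffusion wrappers.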

Thanks for your time. Have a nice day. :^)