Confusion about the residual modeling

Nice work!

In the paper, the residual clean speech x_0 and the residual noisy speech y_0 are used as the inputs to the stochastic model S_θ.

However, in the CVPR 2022 paper 'Deblurring via Stochastic Refinement', the stochastic model takes the blurry image y and the clean residual x_0 - gθ(y) as input, where x_0 is the clean image and gθ(·) is the deterministic model applied to the blurry image.

Here is my confusion. You use the residual noisy speech y_0 as the condition of the diffusion model, while the CVPR paper uses the blurry image y directly as the condition. Since the diffusion process operates on the residual, I think your solution is more straightforward.
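To make the comparison concrete, here is a minimal sketch of how I understand the two conditioning schemes. It is only an illustration under my own assumptions: I assume both residuals are taken relative to the deterministic estimate gθ(y), and `deterministic_model`, `residual_conditioning`, and `direct_conditioning` are hypothetical names of mine, not code from either paper or repository.

```python
import numpy as np

def deterministic_model(y):
    # Hypothetical stand-in for g_θ: a 5-tap moving-average smoother.
    kernel = np.ones(5) / 5.0
    return np.convolve(y, kernel, mode="same")

def residual_conditioning(x, y, g):
    # Scheme as I read it in this paper (assumption):
    # both the target and the condition are residuals w.r.t. g(y).
    est = g(y)
    x0 = x - est  # residual clean signal (diffusion target)
    y0 = y - est  # residual noisy signal (condition)
    return x0, y0

def direct_conditioning(x, y, g):
    # Scheme as I read it in the deblurring paper (assumption):
    # the target is the residual, the condition is the degraded input itself.
    est = g(y)
    return x - est, y

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 256)
x = np.sin(2.0 * np.pi * 5.0 * t)            # toy "clean" signal
y = x + 0.1 * rng.standard_normal(t.shape)   # toy "noisy" signal

target_a, cond_a = residual_conditioning(x, y, deterministic_model)
target_b, cond_b = direct_conditioning(x, y, deterministic_model)
```

Under these assumptions the diffusion target is identical in both schemes, and the two conditions differ only by the deterministic estimate gθ(y).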

I'm not sure if my understanding is correct, and I would like to hear your insights.

zhibinQiu / SRTNet
