[ICCV 2023] VPD is a framework that leverages the high-level and low-level knowledge of a pre-trained text-to-image diffusion model for downstream visual perception tasks.
Hello, thank you for sharing this interesting work.
Did you use the cross-attention maps when training the depth estimation model?
In the code linked below, cross-attention is disabled for depth estimation: https://github.com/wl-zhao/VPD/blob/main/depth/models_depth/model.py#L57
When I set 'use_attn' to True, a runtime error occurs because the channel sizes do not match. Could you confirm whether my understanding is correct? Please correct me if I am wrong.
Thank you.