wl-zhao / VPD

[ICCV 2023] VPD is a framework that leverages the high-level and low-level knowledge of a pre-trained text-to-image diffusion model for downstream visual perception tasks.
https://vpd.ivg-research.xyz
MIT License

Question about "S" #40

Open BlingHe opened 1 year ago

BlingHe commented 1 year ago

Hi,

Thanks for sharing this great work.

I have some questions regarding "S". In section 3.2, you mentioned that "S" contains all category names associated with the task, and in section 3.4, you indicated that "S" varies across different tasks.

My understanding is that for tasks such as referring image segmentation and depth estimation, |S| = 1, representing the given text or a specific category. For semantic segmentation, however, I am unsure what "S", which is said to contain all category names relevant to the task, actually covers. Does it refer to all category names present in the current image? If so, how is the loss function designed to link the textual information to the image content?

wl-zhao commented 1 year ago

Hi,

For semantic segmentation, we construct S from all the categories in the dataset, not only those present in the current image. In our experiments, we use the ADE20K dataset, which contains 150 categories, and thus |S| = 150.
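
To illustrate the point, here is a minimal sketch of building S once per dataset rather than per image. The helper `build_text_set`, the prompt template, and the placeholder class names are assumptions made for illustration; they are not taken from the VPD code base.

```python
from typing import List


def build_text_set(class_names: List[str], template: str = "a photo of a {}") -> List[str]:
    """Return one prompt per dataset category; the result plays the role of S.

    The template is an assumed example, not necessarily the one VPD uses.
    """
    return [template.format(name) for name in class_names]


# Hypothetical stand-in for the 150 ADE20K category names.
ade20k_classes = [f"class_{i}" for i in range(150)]

# S is built from the full category list, so it is identical for every image.
S = build_text_set(ade20k_classes)
print(len(S))
```

Because S depends only on the dataset's category list, it can be computed once and reused for every image in training and inference.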