Question about "S"

Hi,

Thanks for sharing this great work.

I have a question about "S". Section 3.2 states that "S" contains all category names associated with the task, and Section 3.4 indicates that "S" varies across tasks.

As I understand it, for tasks such as referring image segmentation and depth estimation, |S| = 1, representing the given text or a specific category. For semantic segmentation, however, I am unsure what "S" contains when it is described as holding all category names relevant to the task. Does it refer to all category names present in the current image? If so, how is the loss function designed to link the textual information to the image content?
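
To make the question concrete, here is a minimal sketch of the two readings I have in mind for semantic segmentation, next to the |S| = 1 case. The function names and the category list are made up for illustration and are not taken from the VPD code or the paper.

```python
from typing import List, Sequence


def build_s_single(text: str) -> List[str]:
    """|S| = 1: one referring expression or one specific category (my assumption)."""
    return [text]


def build_s_semantic(
    dataset_categories: Sequence[str],
    image_categories: Sequence[str],
    per_image: bool,
) -> List[str]:
    """Two hypothetical readings of "S" for semantic segmentation.

    per_image=False: S is every category name of the task, the same for all images.
    per_image=True:  S is only the category names present in the current image.
    """
    if per_image:
        present = set(image_categories)
        # Keep the dataset order so class indices stay comparable across images.
        return [c for c in dataset_categories if c in present]
    return list(dataset_categories)


# Hypothetical category list and image annotation, for illustration only.
DATASET_CATEGORIES = ["wall", "floor", "ceiling", "bed"]
IN_THIS_IMAGE = ["bed", "floor"]

s_ref = build_s_single("the bed next to the window")
s_all = build_s_semantic(DATASET_CATEGORIES, IN_THIS_IMAGE, per_image=False)
s_img = build_s_semantic(DATASET_CATEGORIES, IN_THIS_IMAGE, per_image=True)

print(len(s_ref), len(s_all), len(s_img))
```

Under the first reading, "S" is fixed per task; under the second it changes with every image, which is the case where I do not see how the loss would connect the text to the image content.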

wl-zhao/VPD, issue #40