A question about network architecture

ymLeiFDU / CLIP-Lung

CLIP-Lung (MICCAI 2023)


A question about network architecture #3

Open Dijkstra111111 opened 5 months ago

Dijkstra111111 commented 5 months ago

Hello, I would like to ask why you use ViT-B/16 as the text encoder. Why not use an NLP model as the text encoder instead? Thank you very much.

ymLeiFDU commented 5 months ago

Hi, an NLP text encoder is also feasible. However, the text encoders in vision-language pretrained models such as CLIP generate text embeddings that are more closely aligned with the visual information, because they were trained jointly with an image encoder. This benefits cross-modal learning.
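To make the alignment point concrete, here is a minimal sketch of a CLIP-style symmetric contrastive objective in plain Python. It is not the CLIP-Lung code: the toy 3-dimensional embeddings, the function names, and the temperature of 0.07 are all assumptions chosen for illustration. It only shows that when each text embedding lies close to its paired image embedding, the loss is lower than when the pairing is mismatched, which is the property a text encoder pretrained together with an image encoder already has.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity_matrix(image_embs, text_embs):
    """Entry [i][j] is the cosine similarity between image i and text j."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    The temperature value is an assumption for this toy example.
    """
    sims = cosine_similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))


# Toy embeddings (hypothetical): two images and their paired texts.
image_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
aligned_text = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.1]]    # text k is near image k
unaligned_text = [[0.1, 0.8, 0.1], [0.9, 0.0, 0.1]]  # same texts, pairing swapped

aligned_loss = clip_style_loss(image_embs, aligned_text)
unaligned_loss = clip_style_loss(image_embs, unaligned_text)
print("aligned:", aligned_loss, "unaligned:", unaligned_loss)
```

A text-only NLP encoder starts out more like the unaligned case, since nothing has tied its embedding space to image features, so that alignment would have to be learned during training.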

Pang-b0 commented 2 weeks ago

Hello, have you managed to reproduce the results of the paper? I would also like to ask how the data preprocessing was done.