Dear sir,
Thanks for your re-implementation. I have a few questions about it.
a) Does this code reproduce the experimental results reported in the original paper?
b) I also have some questions about specific parts of the implementation.
Thanks for your interest in the code. To answer your questions:
I tested on Caltech101 and ImageNet, and the results can be reproduced.
I changed the code from the original CLIP because of an ambiguity that was not addressed in the paper: how to align the vector dimensions. I used simple average pooling; you can try other methods.
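For illustration, here is a minimal sketch of what aligning dimensions by average pooling looks like. The function name `avg_pool_align` is hypothetical (the repo itself calls PyTorch's `adaptive_avg_pool1d`); this NumPy version just shows the idea of averaging groups of adjacent channels, e.g. reducing 2048-d features to 1024-d by averaging pairs:

```python
import numpy as np

def avg_pool_align(x, target_dim):
    """Reduce the last dimension of x to target_dim by averaging
    equal-size groups of adjacent channels (hypothetical helper;
    same effect as adaptive average pooling when the source
    dimension is an exact multiple of the target dimension)."""
    d = x.shape[-1]
    assert d % target_dim == 0, "source dim must be a multiple of target dim"
    group = d // target_dim
    return x.reshape(*x.shape[:-1], target_dim, group).mean(axis=-1)

feat = np.arange(8.0)            # tiny stand-in for a 2048-d feature vector
print(avg_pool_align(feat, 4))   # averages adjacent pairs: [0.5 2.5 4.5 6.5]
```

Whether this preserves the information the paper's (unspecified) alignment step intends is exactly the open question; a learned linear projection would be a natural alternative.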
The outputs from CLIP are normalized, but I could not find whether the authors of the paper normalize the intermediate features, so I did not add normalization. If you find out, I'd appreciate it if you posted your modification here.
https://github.com/yuranusduke/CALIP/blob/b56911ed6d8ea180e73befa9b5904894dabb9ef8/clip/model.py#L143-L151 It seems that you use adaptive_pooling_1d to merge the dimension from 2048 down to 1024. This part looks strange to me. Can the dimensions really be merged this way?
https://github.com/yuranusduke/CALIP/blob/b56911ed6d8ea180e73befa9b5904894dabb9ef8/trainer/calip.py#L80-L98 The features Fs, Fv, and Ft are normalized, but the generated features Fva, Fta, and Fsa are not, and you directly take the inner (dot) product of them to get the classification score. I think Fva, Fta, and Fsa should be normalized too.
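To make the concern concrete, here is a small sketch of the difference, with toy stand-ins for the visual features (`fv`) and text features (`ft`); the helper name `l2_normalize` and the example values are mine, not from the repo. Without L2 normalization, the dot product mixes feature magnitude into the score; with it, the score is a pure cosine similarity, which is what CLIP's logits are:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """L2-normalize features along axis so that a subsequent dot
    product computes cosine similarity (hypothetical helper)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# toy features: 2 image samples and 3 class text embeddings, 4-d each
fv = np.array([[3.0, 4.0, 0.0, 0.0],
               [0.0, 0.0, 6.0, 8.0]])
ft = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 1.0]])

raw_logits = fv @ ft.T                           # magnitude-sensitive scores
cos_logits = l2_normalize(fv) @ l2_normalize(ft).T  # cosine similarities in [-1, 1]
print(raw_logits)
print(cos_logits)
```

If Fva, Fta, and Fsa are left un-normalized, their raw magnitudes can dominate the blended score against the normalized Fs/Fv/Ft terms, so normalizing them before the dot product does seem like the safer choice.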