xuguohai / X-CLIP

An official implementation for "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval"
https://arxiv.org/abs/2207.07285
MIT License

SeqTransf & meanP #1

Open celestialxevermore opened 1 year ago

celestialxevermore commented 1 year ago

Dear Author,

I really appreciate and am fascinated by your work, and I'm thankful that you released your code.

I know that CLIP4Clip + meanP has the best performance among CLIP4Clip + seqTransf, seqLSTM, and tightTransf,

but I found that in your scripts, seqTransf is always the recommended setting in the .sh files.

Is there any special reason why "sim_header == seqTransf" is the default setting?

I looked at your Table 2 on MSVD, where your model records an X-CLIP (ViT-B/32) R@1 score of 47.1. Does this mean that X-CLIP with seqTransf is better than the other modes (meanP, tightTransf)? I cannot find which sim_header produced the scores in that table.

If X-CLIP + seqTransf is recommended anyway, is there any special reason why seqTransf outperforms meanP, unlike in CLIP4Clip?

Sincerely,

xuguohai commented 1 year ago

We propose a temporal encoder to model the temporal relationship by setting "sim_header == seqTransf" (as shown in Figure 2). The ablation study of the temporal encoder is shown in Table 8.
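(For anyone comparing the two settings, here is a minimal sketch of the difference between meanP and a seqTransf-style temporal encoder. Names such as `visual_output`, `video_mask`, and `SeqTransfHeader` are illustrative placeholders in the spirit of CLIP4Clip-style code, not the repository's exact implementation.)

```python
import torch
import torch.nn as nn

def mean_pooling(visual_output, video_mask):
    """meanP: average the per-frame CLIP features over the valid frames."""
    mask = video_mask.unsqueeze(-1).float()        # (B, T, 1)
    summed = (visual_output * mask).sum(dim=1)     # (B, D)
    counts = mask.sum(dim=1).clamp(min=1.0)        # (B, 1)
    return summed / counts                         # video-level feature

class SeqTransfHeader(nn.Module):
    """seqTransf: a small Transformer models temporal relations across frames
    before pooling, with a residual back to the original CLIP features."""
    def __init__(self, dim=512, num_layers=4, num_heads=8, max_frames=64):
        super().__init__()
        self.frame_position_embeddings = nn.Embedding(max_frames, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, visual_output, video_mask):
        B, T, _ = visual_output.shape
        positions = torch.arange(T, device=visual_output.device).unsqueeze(0).expand(B, T)
        x = visual_output + self.frame_position_embeddings(positions)
        x = self.temporal_encoder(x, src_key_padding_mask=~video_mask.bool())
        x = x + visual_output                      # residual connection
        return mean_pooling(x, video_mask)         # pooled video-level feature
```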

celestialxevermore commented 1 year ago

Thank you for replying. As far as I know, the temporal encoder (a Transformer) is randomly initialized, which can cause sub-optimal results because the randomly initialized weights of seqTransf can harm the CLIP pre-trained weights. Am I wrong? Or do you have any thoughts on this?

Thx.

xuguohai commented 1 year ago

I agree with you. If seqTransf is randomly initialized (in fact, it is initialized from CLIP, as shown in line 116 of modules/modeling.py), it may cause some sub-optimal behavior. That is why CLIP4Clip + meanP is better than CLIP4Clip + seqTransf on most datasets.
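(A rough sketch of the kind of initialization being referred to: seeding the temporal encoder from CLIP's text-transformer blocks rather than from random weights. The checkpoint key names follow the public CLIP layout, and `transformerClip` / `frame_position_embeddings` mirror CLIP4Clip-style naming; treat this as an illustration, not the exact code at line 116.)

```python
def init_temporal_encoder_from_clip(clip_state_dict, num_temporal_layers):
    """Build an initial state_dict for the seqTransf temporal encoder by reusing
    the first few blocks of CLIP's text transformer instead of random weights."""
    state_dict = {}
    for key, val in clip_state_dict.items():
        # CLIP's text positional embedding seeds the frame position embeddings.
        if key == "positional_embedding":
            state_dict["frame_position_embeddings.weight"] = val.clone()
        # Reuse the first `num_temporal_layers` text-transformer blocks.
        elif key.startswith("transformer.resblocks."):
            layer_id = int(key.split(".")[2])
            if layer_id < num_temporal_layers:
                state_dict[key.replace("transformer.", "transformerClip.")] = val.clone()
    return state_dict
```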

Therefore, in our paper, we recommend using the original CLIP features as the frame-level visual features, as shown in line 298 of modules/modeling_xclip.py. The temporal encoder helps obtain the global video-level visual representation.
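(A simplified sketch of that split, assuming `temporal_encoder` is any module that returns per-frame contextualized features; the variable names mirror the thread but this is not a verbatim copy of modeling_xclip.py.)

```python
import torch

def encode_video(visual_output: torch.Tensor, video_mask: torch.Tensor, temporal_encoder):
    """Keep CLIP's per-frame outputs as the frame-level features; use the
    temporal encoder (plus a residual) only for the video-level feature."""
    # The untouched CLIP outputs serve as the frame-level visual features.
    visual_output_original = visual_output

    # The temporal encoder contextualizes frames; the residual preserves CLIP features.
    contextualized = temporal_encoder(visual_output, video_mask)
    contextualized = contextualized + visual_output_original

    # Video-level representation: masked mean over frames.
    mask = video_mask.unsqueeze(-1).float()
    video_feature = (contextualized * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    return video_feature, visual_output_original  # video-level, frame-level
```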

celestialxevermore commented 1 year ago

Oh, thank you very much for your kind and fast reply.

I didn't notice that line 116 of modules/modeling.py initializes seqTransf from CLIP.

Q1. Then, what about the Cross model used in tightTransf?

Q2. Also, as a novice in deep learning, I cannot understand exactly why the seemingly simple 'copy' of visual_output in line 298 of modules/modeling_xclip.py can be interpreted as using the original CLIP to obtain the frame-level visual features. My guess is that visual_output_original just keeps the features produced by the CLIP parameters, before the temporal encoder changes them.

Q3. Then, if I build a new model using other layers such as seqTransf, seqLSTM, or tightTransf, is there no need to freeze any layers? Is doing what you did in line 298 of modules/modeling_xclip.py enough to improve performance? Could you explain this?

Thanks, you're very kind.

willyfh commented 1 year ago

I ran an experiment on another language (Indonesian) with MSVD using X-CLIP, and I found that X-CLIP + meanP performs best compared to the others. I haven't tried it on the English version, though. But my experiment indicates that X-CLIP + seqTransf, i.e., the proposed temporal encoder, doesn't always perform best on a dataset with different characteristics, such as MSVD-Indonesian. I will share my experiment results later.