Clarification needed - Githubissues

CarmelaCalabrese commented 11 months ago

Hi! Thank you for your amazing job!

I would like to use it by myself but there are a couple of things that I don't understand and I would ask for clarifications:

1- in SlowFast/models folder there are different model class definitions like Contrastive, ClipImage, BasicClipVideo, TemporalClipVideo. For me, it's unclear what they represent because there isn't a 1 to 1 matching with the paper. Could you clarify this, please?

2- what is masked?

3- both in BasicClipVideo and TemporalClipVideo, you commented "Clip visual encoder for space feature extraction. Adding various temporal fusion type" but I suppose they don't do the same. Could you fix this?

4- in TemporalVisionTransformer there are different "temporal_transformation" that can be defined, as "expand_temporal_view", as done in the testing example. Where do I find the list of the available transformations and why should we apply them? I understood from the paper that you simply change the self-attention layers for Open-VCLIP. Probably I am missing something.

5- last but not least, I think that naming the folder "slowfast" when inside you can find different models, is a bit misleading.

Thank you in advance for you understanding and help.

wengzejia1 commented 11 months ago

Hi CarmelaCalabrese,

Thank you for your interest in our project. Let me address each of your concerns:

1. Model Class Definitions: We build this project based on the SlowFast and the "Contrastive"you mentioned is the original class defined in the original repo. Here we clarify the model class definition of "ClipImage, BasicClipVideo, TemporalClipVideo" as bellow:

The "ClipImage" model is used for testing on image tasks, for example, testing on ImageNet. The main difference between "ClipImage" and "BasicClipVideo" is the definition of the forward function, in which the "BasicClipVideo" forward function will deal with the temporal dimension of the video data.
The "BasicClipVideo" will compute features of each frame individually, while the "TemporalClipVideo" contains more complex temporal modeling methods that you can choose by passing the "temporal_modeling_type" parameter. In the file "customize_visiontransformer.py", you will find that we implement different kinds of temporal modeling method in the "TSTransformer" class and our paper choose to use the "expand_temporal_view" as our temporal modeling method. In particular, if you pass the "temporal_modeling_type" as "None", then the "TemporalClipVideo" is doing the same thing as "BasicClipVideo".

2. what is masked? I am not sure whether you are referring to the 'masked.py' file in the 'slowfast/models' directory. This file is part of the original content from the SlowFast repository, and we have not utilized it in our model."

3. Comments in BasicClipVideo and TemporalClipVideo: I appreciate your observation. I'll review and update the comments in BasicClipVideo and TemporalClipVideo to provide clearer descriptions of their functionalities. Thank you for bringing this to our attention.

4. TemporalVisionTransformer: You can find the list of different temporal modeling method in the "TSTransformer" class in "customize_visiontransformer.py". Our paper choose to use the "expand_temporal_view" as our temporal modeling method. In particular, you can find the implementation of "expand_temporal_view" in "slowfast/models/torch_utils/functional.py", which relates to your mentioned "simply change the self-attention layers".

5. Naming of the "slowfast" Folder: I understand your concern about the naming of the "slowfast" folder. We'll consider improving the folder structure and naming conventions to better reflect the diverse models within.

Thank you for your valuable feedback and patience. If you have any further questions or if there's anything else you'd like clarification on, please feel free to ask.

Best.

CarmelaCalabrese commented 11 months ago

Hi @wengzejia1 ,

thank you for your very detailed answer, now it is much clearer to me.

One final silly doubt but it could help someone else looking at the code. In the Example: Testing on UCF-101 (B/16) in the README.md of the project, you call "--cfg configs/Kinetics/TemporalCLIP_vitb16_8x16_STAdapter.yaml". By comparing this config file with "--cfg configs/Kinetics/TemporalCLIP_vitb16_8x8.yaml", following what you said, it seems to me that this last one actually applies BasicClipVideo and that "--cfg configs/Kinetics/TemporalCLIP_vitb16_8x16_STAdapter.yaml" doesn't apply STAdapter anywhere but it uses the "expand_temporal_view" instead.

Is it just a matter of wrong naming or am I loosing any other configuration option that I should take into account?

Thank you again!

wengzejia1 commented 11 months ago

Yes you're right. In the initial stages of this project, I wanted to remind myself that the hyperparameters are based on the settings outlined in the STAdapter article. But I think the current naming is inappropriate. Thank you for bringing this to my attention, and I plan to update it accordingly.

wengzejia1 / Open-VCLIP

Clarification needed #3