yabufarha / ms-tcn

Where did the features of the datasets come from? #39

Open Youthfeng123 opened 2 years ago

Youthfeng123 commented 2 years ago

Hi there, after reading the paper and code, I found that MS-TCN takes video features as input. I loaded one feature file into a numpy array and saw that it is a matrix of shape (2048, n). Here is my confusion: can the features in the datasets be transformed back into video, or are they features extracted by some backbone? If so, what is that extractor? Looking forward to your reply.

Youthfeng123 commented 2 years ago

I read the paper again and learned that you use video features extracted with an I3D model. Could you please tell me which extractor you used: two-stream I3D with Kinetics pretraining or with miniKinetics pretraining?

shenjiyuan123 commented 2 years ago

You can check this issue; it may have the answer you want: https://github.com/yabufarha/ms-tcn/issues/34. From my perspective, I think they used the two-stream I3D model pretrained on Kinetics.

Youthfeng123 commented 2 years ago

Hi~ thanks for your reply. I've skimmed the issue you pointed to and noticed that one of the responders posted a repo for a feature extractor. Have you tried it before? If I use a video shot by myself, can I obtain a feature of shape (2048, n)? If not, that's okay; thanks a lot anyway~

shenjiyuan123 commented 2 years ago

Hi. To be honest, I've just recorded the data and am about to try this feature extractor. I'm also concerned about whether the feature extractor repo still works, because the last commit was about 5 years ago (cry). I hope we can keep in touch once either of us has some results.

Youthfeng123 commented 2 years ago

Wow, it seems we are in the same situation! Of course I hope we can discuss this problem often too. I'm going to try that repo as well; good luck to both of us.

shenjiyuan123 commented 2 years ago

Fighting~

bqdeng commented 2 years ago

WOW! I'm also facing this problem now. If you two have a breakthrough, please come back and share it! Thank you very much!

shenjiyuan123 commented 2 years ago

Hi! So glad to hear that you are trying to do the same thing as me. I'd like to share my latest progress. I have tried a new repo that can also extract features. The only difference is that the output shape is [n, 768], so you need to transpose it to satisfy the input requirement of ASFormer. If you want to use this method, see https://github.com/ttlmh/Bridge-Prompt/issues/3 for more detail. Hope this really helps you!
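
In case it helps, the transpose step is just something like this (file names are made up; the segmentation models load features laid out as (dim, n_frames)):

```python
# Minimal sketch: transpose Bridge-Prompt style [n, 768] features into the
# [dim, n] layout that the action segmentation code expects.
import numpy as np

feat = np.load("video_0001.npy")     # hypothetical file; shape (n, 768)
np.save("video_0001_T.npy", feat.T)  # -> (768, n): one column per frame
```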

Youthfeng123 commented 2 years ago

Hi there, it seems we are not alone. I did try https://github.com/ahsaniqbal/Kinetics-FeatureExtractor, but I failed when building that C++ project; if any of you has installed it, please tell me how, thanks! Then I found this repo: https://github.com/VividLe/ExtractVideoFeature, and I extracted RGB and flow features successfully with it. However, this code downsamples in time, so the output shape is [n_frames - chunk_size, 1024], which means you have to pad the feature with zero tensors to match the video length. Glad to hear that you also got features; I will try your method as well.
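
Something like this works for the padding (sizes are just an example):

```python
# Minimal sketch: zero-pad features of shape [n_frames - chunk_size, 1024]
# along the time axis so the feature length matches the labeled frame count.
import numpy as np

n_frames, chunk_size = 1000, 16  # hypothetical values
feat = np.random.randn(n_frames - chunk_size, 1024)  # stand-in for extractor output
pad = np.zeros((chunk_size, 1024), dtype=feat.dtype)
feat_padded = np.concatenate([feat, pad], axis=0)    # -> (n_frames, 1024)
```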

Youthfeng123 commented 2 years ago

By the way, how did you transform your output? By adding a fully connected layer? But doesn't that mean the model needs to be trained again?

shenjiyuan123 commented 2 years ago

Regarding your first comment: to be frank, I didn't try that repo any more. From your description, I think you can keep the same downsample rate during training so that the dimensions match; if the video is long, downsampling is needed anyway to keep the feature matrix small.

Regarding your second comment: yes, the model needs to be trained from scratch. I simply modified the action segmentation model's input dimension. Alternatively, you could add a projection layer at the front, load the pretrained parameters for the rest of the model, and fine-tune.
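
The projection idea is just a learned layer in front of the pretrained network, roughly like this (shapes and names are illustrative, not real code from this repo):

```python
# Rough sketch of the projection idea: map 768-d features to the 2048-d
# input a pretrained segmentation model expects, then fine-tune.
import torch
import torch.nn as nn

class ProjectedModel(nn.Module):
    def __init__(self, backbone: nn.Module, in_dim: int = 768, out_dim: int = 2048):
        super().__init__()
        # a 1x1 convolution acts as a per-frame linear projection on (B, C, T) input
        self.proj = nn.Conv1d(in_dim, out_dim, kernel_size=1)
        self.backbone = backbone  # pretrained model expecting (B, 2048, T)

    def forward(self, x):  # x: (B, 768, T)
        return self.backbone(self.proj(x))
```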

Hope to hear your feedback and success!

habakan commented 2 years ago

Hi! I have the same problem.
The following repository is later research on action segmentation and cites this repository: https://github.com/yiskw713/asrf
According to its README, that repo uses the same features, and we can extract them ourselves:

> Dataset: GTEA, 50Salads, Breakfast. You can download features and G.T. of these datasets from this repository. Or you can extract their features by yourself using this repository.

I'll try https://github.com/yiskw713/video_feature_extractor.
Has any of you tried it?

Youthfeng123 commented 2 years ago

Hi, thanks for your suggestion! To be honest, I didn't know about this feature extractor until I read your reply. I've been running experiments with the extractor I mentioned in my previous answer, and I found the features it produces are not good enough: the accuracy wasn't as high as reported in the paper. I will try your repo; thank you again.

habakan commented 2 years ago

Thank you for sharing your status!
I'll also re-run the experiments to try to reproduce the paper's numbers, though it sounds difficult from what you've said.
The repo I proposed uses pytorch-i3d, so I think its features are not completely identical to the official ones.

shenjiyuan123 commented 2 years ago

Hi! I am also working on the feature extraction process and am looking forward to your results. If you get any results with this repo, I hope you will share them with us. Thank you very much!

rickywrq commented 2 years ago

Hi, @Youthfeng123 @habakan @shenjiyuan123. I have also been exploring I3D lately and find that the code from the repos you shared (code1 and code2) works for me.

@Youthfeng123 For the shape (2048, n), my assumption is that when you input 21 video frames (as mentioned here) with a frame size of (224, 224), i.e., an input of size (-1, 3, 21, 224, 224), you get a shape of (-1, 1024, 2, 1, 1) as the output of the last AvgPool3d.
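
You can sanity-check this with the pytorch-i3d implementation mentioned above (untrained weights are enough for a shape check):

```python
# Shape check for the assumption above, using the InceptionI3d class from
# the pytorch-i3d repo (pytorch_i3d.py).
import torch
from pytorch_i3d import InceptionI3d

i3d = InceptionI3d(400, in_channels=3)
clip = torch.randn(1, 3, 21, 224, 224)  # 21 RGB frames of size 224x224
with torch.no_grad():
    feat = i3d.extract_features(clip)   # output of the last AvgPool3d
print(feat.shape)                       # expected: torch.Size([1, 1024, 2, 1, 1])
```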

Youthfeng123 commented 2 years ago

Thanks for your explanation of the tensor's shape; it helps a lot! May I ask which optical flow extractor you used? I used this repo to extract optical flow from the videos, but since I would like to try more methods, would you please share yours? Thanks a lot! @littlesi789 @shenjiyuan123 @habakan

rickywrq commented 2 years ago

Hi, @Youthfeng123. In the I3D paper, the authors used TV-L1 to extract optical flow. You can find implementations from others.

Please be cautious with my explanation above. I could not find whether the authors used RGB, optical flow, or both in the MS-TCN paper (please correct me if they did mention it). In the original I3D paper, the two streams are averaged at the final prediction stage, and the predictions are arrays of length 400. So the explanation above for 2048 is only my assumption, and unfortunately I cannot find a way right now to reproduce their extracted features.
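
If 2048 really does come from the two streams, the simplest reading is a per-frame concatenation of 1024-d RGB and 1024-d flow features; a sketch of that assumption (file names are hypothetical):

```python
# Sketch of the 2048-d assumption (not confirmed by the authors): concatenate
# 1024-d RGB and 1024-d flow I3D features per frame.
import numpy as np

rgb = np.load("video_rgb.npy")    # hypothetical file; shape (n, 1024)
flow = np.load("video_flow.npy")  # hypothetical file; shape (n, 1024)
feat = np.concatenate([rgb, flow], axis=1).T  # -> (2048, n) layout used by MS-TCN
```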

habakan commented 2 years ago

@littlesi789 @Youthfeng123 Thanks for sharing!
This comment from the author shows that the features were extracted with this repo, and that repo seems to use the TV-L1 optical flow implemented in OpenCV (the relevant code is here).
Considering the above, I think MS-TCN does use optical flow features (though I could not find this stated in the paper).
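
For anyone who wants to try TV-L1, a minimal sketch with OpenCV (requires opencv-contrib-python; the video path is made up):

```python
# Minimal TV-L1 optical flow sketch with OpenCV's contrib implementation.
import cv2
import numpy as np

cap = cv2.VideoCapture("my_video.mp4")  # hypothetical path
tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = tvl1.calc(prev_gray, gray, None)  # (H, W, 2) float32 flow field
    flows.append(np.clip(flow, -20, 20))     # I3D-style clipping to [-20, 20]
    prev_gray = gray
cap.release()
```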

XuanHien304 commented 2 years ago

Can I ask whether any of you has successfully extracted video features with shape (2048, T)? I cannot install OpenCV 2.4.13 for the Kinetics feature extraction repo.

XuanHien304 commented 2 years ago

@littlesi789 @shenjiyuan123 @bqdeng Do you mind answering me :'<

KarolyneFarfan commented 7 months ago

Hi guys, I am currently facing the same problem. I would like to know if anyone has made this repo work: https://github.com/ahsaniqbal/Kinetics-FeatureExtractor/tree/master. I would be really grateful for any advice.

XuanHien304 commented 7 months ago

Hi @KarolyneFarfan, I used another repo to extract features; you can see my project at https://github.com/XuanHien304/E2E-Action-Segmentation