yabufarha / ms-tcn


Feature extraction and Python3 support #19

Closed kkirtac closed 4 years ago

kkirtac commented 4 years ago

Hi @yabufarha ,

thanks for the good work.

Could you please provide more information on feature extraction? I understand that you've used https://github.com/ahsaniqbal/Kinetics-FeatureExtractor. Is it using information from future frames, i.e., to compute the features of frame i, is it something like (i-10, ..., i-1, i, i+1, ..., i+10) for a 21-frame temporal window, or is it only looking back?

A second question: since Python 2 is deprecated as of Jan 2020, I wasn't able to get the feature extraction repo working. Could you provide a recipe for switching to Python3, for both your repo and the feature extraction repo? I know that may not be easy, but any comments or suggestions would be useful.

Best,

yabufarha commented 4 years ago

Hi @kkirtac ,

Happy that you are interested in our work.

To switch to Python3 for this repository, you only need to modify line 49 in batch_gen.py.

For Python 2.7: `length_of_sequences = map(len, batch_target)`
For Python 3.7: `length_of_sequences = list(map(len, batch_target))`
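
To illustrate why the `list()` wrapper matters (the sample data below is made up):

```python
# In Python 3, map() returns a one-shot iterator, so code that reads the
# lengths more than once (e.g. max() and then padding) would see an
# exhausted iterator; list() restores the Python 2 behaviour.
batch_target = [[0, 1, 1], [2, 2], [0, 1, 2, 3]]  # hypothetical label sequences

length_of_sequences = list(map(len, batch_target))

print(max(length_of_sequences))  # 4
print(length_of_sequences)       # [3, 2, 4] -- still usable after max()
```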

Regarding the feature extraction, we used a window of size 21 around each frame, i.e., the window for frame i contains the frames in the range [i-10, i+10], as you mentioned. As for switching that repository to Python3, I'm sorry, but I don't have any suggestions, as I do not maintain it.
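
For concreteness, a minimal sketch of that windowing (how the video boundaries are handled is an assumption on my part, it's not specified here):

```python
# 21-frame window [i-10, i+10] around frame i; clamping indices at the
# video boundaries is an assumption, not something stated in this thread.
def window_indices(i, num_frames, half=10):
    return [min(max(t, 0), num_frames - 1) for t in range(i - half, i + half + 1)]

print(window_indices(0, 100))   # starts with repeated 0s (clamped at the start)
print(window_indices(50, 100))  # [40, 41, ..., 60]
```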

I hope this helps.

Best, Yazan

kkirtac commented 4 years ago

Thank you. Maybe I will need to find a PyTorch repo for I3D feature extraction. Could you suggest one if you know of any?

To clarify my understanding:

1) Was the feature extractor you used pretrained on Kinetics or on ImageNet?

2) Does the feature extractor network yield per-frame feature vectors for the range [i-10, i+10], i.e., all 21 frames in one pass, or does it provide a feature vector only for the i-th frame? And is this behavior the same for both RGB and flow inputs?

3) Maybe somewhat related to the 2nd question, but what is your temporal stride? The answer can change depending on whether you receive features for all 21 frames in one pass or only for the middle frame. I guess the network should yield features for all 21 frames, no?

Thanks.

yabufarha commented 4 years ago

The I3D is pretrained on Kinetics, and we get one feature vector for each frame, which means the stride is one. In other words, for each frame i, we pass the window [i-10, i+10] to the I3D and average-pool the volume at the penultimate layer. The result is a 1024-dimensional vector for the RGB stream, and we do the same thing with the flow stream. The final feature vector for each frame is obtained by concatenating the vectors from both the RGB and flow streams, which results in a 2048-dimensional vector per frame. I think there are many other good repositories for extracting I3D features; you just need to search for I3D. However, I didn't really use any particular one in my work.
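
Sketched in PyTorch, that per-frame recipe looks roughly like the following; `i3d_rgb` and `i3d_flow` are placeholder backbones assumed to return the mixed_5c volume, not the exact models used here:

```python
import torch

def extract_frame_feature(i3d_rgb, i3d_flow, rgb_clip, flow_clip):
    """rgb_clip / flow_clip: (1, C, 21, H, W) tensors holding the
    21-frame window [i-10, i+10] around frame i."""
    with torch.no_grad():
        rgb_vol = i3d_rgb(rgb_clip)     # assumed mixed_5c: (1, 1024, T', H', W')
        flow_vol = i3d_flow(flow_clip)  # assumed mixed_5c: (1, 1024, T', H', W')
    # Average-pool the penultimate volume over time and space -> (1, 1024)
    rgb_vec = rgb_vol.mean(dim=[2, 3, 4])
    flow_vec = flow_vol.mean(dim=[2, 3, 4])
    # Concatenate both streams -> (1, 2048) feature vector for frame i
    return torch.cat([rgb_vec, flow_vec], dim=1)
```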

kkirtac commented 4 years ago

Thank you. Is this the line performing the pooling you described: https://github.com/ahsaniqbal/Kinetics-FeatureExtractor/blob/master/extractor_lazy.py#L127 ?

I am not sure whether other repositories perform the pooling in the same way, or just return per-frame features for all frames in the given sequence. Anyway, I can apply the same additional pooling myself if I've understood it correctly.

yabufarha commented 4 years ago

Yes, lines 127-134 describe the operations applied to mixed_5c to get the feature vector.

kkirtac commented 4 years ago

One final question @yabufarha ,

do you use per-frame or per-sequence (21-frame) class labels as the target during training?

Thank you.

yabufarha commented 4 years ago

You mean to train MS-TCN? Yes, the datasets provide frame-wise annotations. Kindly note that the features are extracted offline using I3D pretrained on Kinetics without fine-tuning.
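
Schematically, one training sample then pairs the precomputed features with frame-wise labels; all shapes and the class count below are illustrative, not taken from the datasets:

```python
import numpy as np

T = 1500                                   # frames in one video (example value)
features = np.random.randn(2048, T)        # precomputed I3D features (RGB + flow)
labels = np.random.randint(0, 48, size=T)  # one action label per frame
assert features.shape[1] == labels.shape[0]
```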

kkirtac commented 4 years ago

> You mean to train MS-TCN? Yes, the datasets provide frame-wise annotations. Kindly note that the features are extracted offline using I3D pretrained on Kinetics without fine-tuning.

Yes, for MS-TCN, thank you.

Do you apply a spatial transform such as RandomResizedCrop(224) or RandomCrop(224) to the 21-frame sequence before extracting features? Or do you just resize it to 224?

yabufarha commented 4 years ago

We use a center crop of size 224x224.
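
In torchvision terms, that might look like the following sketch; only the 224x224 center crop is confirmed above, the short-side Resize step is an assumption (a common default):

```python
from torchvision import transforms

# Applied identically to every frame in the 21-frame window.
frame_transform = transforms.Compose([
    transforms.Resize(256),      # assumed short-side resize, not stated above
    transforms.CenterCrop(224),  # the confirmed 224x224 center crop
    transforms.ToTensor(),
])
```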