yabufarha / ms-tcn

Download Dataset #12

Closed yuanzhedong closed 4 years ago

yuanzhedong commented 5 years ago

Hi, thank you for the great work! I'm having trouble downloading the data folder on MEGA; it seems I need to pay for it. I'm wondering if you did any preprocessing? Will it still work if I download the dataset from the official website? Thanks!

yabufarha commented 5 years ago

Hi, you should be able to download the data folder, but you need to install the free MEGA desktop app first and use it to download the data.

yuanzhedong commented 4 years ago

Hi @yabufarha, thank you for your reply! Do you have the code to run the I3D feature extraction given a video file? I'm trying to apply it to my own dataset but got stuck extracting the features. Thank you!

yabufarha commented 4 years ago

Hi, We used the following repository to extract I3D features: https://github.com/ahsaniqbal/Kinetics-FeatureExtractor

yuanzhedong commented 4 years ago

@yabufarha I tried that feature extractor and it works, thank you! Another question: do you use both RGB features and optical flow features? Given a video of length T, I can get (T/16, 1024) RGB I3D features and (T/16, 1024) optical flow features. Did you concatenate those two into a (T/16, 2048) tensor as the input to MS-TCN?

yabufarha commented 4 years ago

We used both RGB and optical flow. With the default settings, for a video of length T the extractor generates an output array of size (T, 2048) in which the RGB features and the flow features are concatenated. In the MS-TCN code, there is a parameter where you set the dimension of the input features. The input to MS-TCN should be of shape (bz, features_dim, T).
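
As a rough illustrative sketch of that shaping step (assuming the RGB and flow features are saved as per-video NumPy arrays; the file names here are just placeholders):

```python
import numpy as np

# Hypothetical per-video feature files; the actual names/layout depend on your extractor output.
rgb_feat = np.load("video1_rgb.npy")    # shape (T, 1024)
flow_feat = np.load("video1_flow.npy")  # shape (T, 1024)

# Concatenate the two modalities along the feature dimension -> (T, 2048)
features = np.concatenate([rgb_feat, flow_feat], axis=1)

# MS-TCN expects (batch_size, features_dim, T), so transpose and add a batch axis.
model_input = features.T[np.newaxis, ...]   # shape (1, 2048, T)
print(model_input.shape)
```

With this layout, features_dim would be set to 2048 in the MS-TCN configuration.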

yuanzhedong commented 4 years ago

For the I3D extractor, how do you get an embedding for each frame? It seems I have to feed 16 frames into the I3D extractor to get one data point, e.g. the input shape is (1, 16, 224, 224, 3) and the output shape is (1, 1, 1, 1024).

yabufarha commented 4 years ago

For each frame, we pass a video segment centered at that frame to the I3D model. The code in the referenced repository already does that; you only need to provide the list of videos.
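
Roughly speaking, the idea is something like the sketch below (padding at the video boundaries by repeating the edge frames is just one possible choice, not necessarily what the repository does):

```python
import numpy as np

def per_frame_segments(frames, window=21):
    """Yield, for each frame t, a window of frames centered at t.

    `frames` has shape (T, H, W, 3). The window is padded at the video
    boundaries by repeating the first/last frame. A window of 21 frames
    matches the default mentioned below; this is only an illustrative sketch.
    """
    T = len(frames)
    half = window // 2
    for t in range(T):
        idx = np.clip(np.arange(t - half, t + half + 1), 0, T - 1)
        yield frames[idx]  # shape (window, H, W, 3), one segment per frame
```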

yuanzhedong commented 4 years ago

Got it, thank you so much for the help!

yuanzhedong commented 4 years ago

Hi @yabufarha, if you have time, could you help me locate where the video segments are generated in that reference repo? To me it looks like it just passes all of the video frames through and saves the embeddings. Here's the code that gets all frames for one video: https://github.com/ahsaniqbal/Kinetics-FeatureExtractor/blob/4c50003a1684517106d8f66afbfd588ebae28241/extractor.py#L28 And here's the code that passes the whole video into I3D: https://github.com/ahsaniqbal/Kinetics-FeatureExtractor/blob/4c50003a1684517106d8f66afbfd588ebae28241/extractor.py#L134

yuanzhedong commented 4 years ago

Also, how long is the video segment you use to extract the per-frame embedding?

yabufarha commented 4 years ago

Actually, we used the default setting of 21 frames. Regarding the segment generation code, unfortunately I didn't look into the details, but if you just want to extract features, this should not be relevant.

yuanzhedong commented 4 years ago

I see, thanks! The reason I'm asking is that we want to extract features using cv2.calcOpticalFlowFarneback from OpenCV to compute the optical flow; basically, we want to rewrite the feature extractor in Python. The feature extractor code you shared is very helpful, but we are struggling with how to feed the data into the I3D model. For example, if we feed an RGB input with batch_size = 1 and 21 frames, the input shape is (1, 21, 224, 224, 3), but the output shape we get is (2, 1, 1, 1024). That means for each frame in the original video we get a 2 x 1024 RGB feature instead of 1 x 1024. It turns out the maximum temporal window size we can use is 16: if we feed (1, 16, 224, 224, 3), we get a (1, 1, 1, 1024) feature as expected. Not sure which part we're missing, but your responses have been really helpful, thank you so much!
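
For reference, this is roughly how we compute the Farneback flow (the video path and the parameters are just placeholders, and the I3D flow stream may additionally expect the flow values to be clipped/rescaled):

```python
import cv2

# Illustrative sketch: dense Farneback flow between consecutive frames.
cap = cv2.VideoCapture("video.mp4")  # placeholder path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Parameter values below are the common OpenCV example defaults, not tuned.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flows.append(flow)  # (H, W, 2): per-pixel x and y displacement
    prev_gray = gray
cap.release()
```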

yabufarha commented 4 years ago

You are right about the output dimension. Nevertheless, the code follows the I3D paper and applies average pooling over the temporal dimension of the output. That's how you always get a 1x1024 feature vector for each modality, even for a larger temporal window: by averaging over the temporal dimension.
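
In other words, something like this (a tiny sketch, with a random array standing in for the I3D output):

```python
import numpy as np

# Suppose the I3D head returns features of shape (temporal, 1, 1, 1024) for one window,
# e.g. (2, 1, 1, 1024) for a 21-frame input as described above.
out = np.random.randn(2, 1, 1, 1024)  # stand-in for the model output

# Average over the temporal dimension to always get a single 1024-d vector per modality.
feat = out.mean(axis=0).reshape(1024)
print(feat.shape)  # (1024,)
```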

yuanzhedong commented 4 years ago

Average pooling over the temporal dim makes a lot of sense, thank you!!