cmhungsteve closed this issue 5 years ago
After chatting with the author in this thread (https://github.com/yabufarha/ms-tcn/issues/12), I think the input of the model is (T, 2048),
where T is the length of the video and the 2048 dimensions are 1024 RGB features plus 1024 optical flow features from I3D. I also downloaded the feature data listed in the README and can verify that.
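To make the shape concrete, here is a minimal sketch of how two 1024-dimensional streams could be combined into a (T, 2048) input. The random arrays are placeholders, not real I3D features. The value of T, the variable names, and the ordering (RGB first, then flow) are all assumptions for illustration; the layout of the actual feature files may differ.

```python
import numpy as np

T = 300  # hypothetical number of time steps in one video
rng = np.random.default_rng(0)

# Placeholder arrays standing in for per-step I3D outputs (not real features).
rgb_feats = rng.standard_normal((T, 1024)).astype(np.float32)
flow_feats = rng.standard_normal((T, 1024)).astype(np.float32)

# Concatenate along the feature axis: one 2048-d vector per time step.
# Putting RGB before flow is an assumption, not something confirmed above.
features = np.concatenate([rgb_feats, flow_feats], axis=1)

print(features.shape)
```

If you have the downloaded features, printing the shape of one file with `np.load` is a quick way to check which axis is T and which is 2048.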
Yes, we used both RGB and optical flow.
Got it. Thank you.
In the paper, you mention that you extracted features from RGB frames using I3D. Did you include any other modalities (e.g. optical flow, MHI) in your features?
I am a little confused because most of the methods you compare against use RGB + MHI (Motion History Image). It would be really impressive if you beat them using RGB only.