hw-liang opened this issue 3 years ago
In the network architecture section, they mention that the decoder is the same across backbones:
" In video generation, four deconvolutional layers are stacked and followed by C3D blocks. To generate a video which is r times as slow as the input video, we set the 4-th deconvolutional layer with a stride of r × 2 × 2, where the reconstructing rate r is determined through ablation study."
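To make the quoted stride concrete, here is a minimal sketch of the transposed-convolution output-shape arithmetic. With a stride of r × 2 × 2 on the 4th deconvolutional layer, the temporal dimension grows roughly r-fold, which is what produces a video r times as slow. The kernel sizes and feature-map shape below are illustrative assumptions, not values from the paper.

```python
def deconv3d_out_shape(in_shape, kernel, stride, padding=(0, 0, 0),
                       output_padding=(0, 0, 0)):
    """Standard transposed-conv shape rule: out = (in - 1)*s - 2*p + k + op,
    applied independently to the (T, H, W) dimensions of a 3D feature map."""
    return tuple(
        (i - 1) * s - 2 * p + k + op
        for i, k, s, p, op in zip(in_shape, kernel, stride, padding, output_padding)
    )

r = 2  # reconstructing rate (the paper picks r via ablation)

# Assumed input feature map: 16 frames at 14x14 spatial resolution.
# Choosing kernel == stride gives exact r x 2 x 2 upsampling with no overlap.
out = deconv3d_out_shape(
    in_shape=(16, 14, 14),
    kernel=(r, 2, 2),
    stride=(r, 2, 2),
)
print(out)  # temporal length doubled (r=2), spatial dims doubled
```

With r = 2 this yields (32, 28, 28): twice as many frames, so the reconstructed clip plays twice as slow at the original frame rate.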
I haven't tried to reproduce the exact results myself; I just wanted to clear up your doubt. Perhaps the original authors could shed more light on how to reproduce them.
Oh, I missed this part. Thank you very much! They provided the pretrained checkpoint, and I tried to reproduce the UCF classification result with the C3D model using the original code and hyperparameter settings, but there is a 5% gap.
The decoder structure for C3D, R3D, and R(2+1)D is the same.
Based on your repo, it seems that you are using the same decoder structure for all of the backbones (C3D, R3D, R(2+1)D). But in your paper, it seems you used different decoder structures based on C3D, R2D, and R21D blocks.
We cannot reproduce the results reported in the paper based on the current code. Could you also provide your decoder implementations for R3D and R(2+1)D?