wilson1yan / VideoGPT


Number of Layers for Prior Model #11

Closed · xinbowu2 closed this 3 years ago

xinbowu2 commented 3 years ago

Will there be a significant performance drop with an 8-layer or 4-layer attention prior model on the BAIR pushing dataset? I also noticed that the paper uses a 16-layer prior model, but the default setting in the codebase is 8 layers.

wilson1yan commented 3 years ago

There are some ablations in the paper (Table 4) detailing FVD on the BAIR dataset. Going down to 8 layers gives similar FVD to 16 layers, but higher bits/dim; going too low (e.g. 4 or 2 layers) starts to degrade sample quality.
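
For a rough sense of why a shallower prior is cheaper, here is a toy sketch (not VideoGPT's actual prior; the class name `TinyPrior` and all sizes below are made-up placeholders) of a causal transformer over discrete latent codes with a configurable depth. Parameter count, and with it activation memory, grows roughly linearly in the number of layers:

```python
# Toy sketch only -- not the VideoGPT codebase's prior. Illustrates how the
# number of attention layers scales the size of an autoregressive prior over
# discrete VQ codes. Requires PyTorch >= 1.9 (for batch_first=True).
import torch
import torch.nn as nn

class TinyPrior(nn.Module):  # hypothetical name, for illustration
    def __init__(self, vocab_size=1024, dim=512, heads=8, n_layers=8, seq_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, dim))
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, idx):
        T = idx.size(1)
        # Additive causal mask so each latent code attends only to earlier codes.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1).to(idx.device)
        x = self.tok_emb(idx) + self.pos_emb[:, :T]
        return self.head(self.blocks(x, mask=mask))

# Parameters (and activation memory) grow roughly linearly with depth.
for n in (16, 8, 4, 2):
    model = TinyPrior(n_layers=n)
    print(f"{n:2d} layers: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M params")
```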

xinbowu2 commented 3 years ago

Thank you! In addition, is the performance sensitive to the batch size? If we want to reduce the batch size to something like 16, should we adjust the learning rate accordingly? I found that the prior model still consumes too much memory even with sparse attention, so I am wondering if there is a smaller prior design with only a small performance loss.
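
(To make the learning-rate question concrete, the adjustment I have in mind is the usual linear scaling rule; the numbers below are placeholders rather than the repo's actual defaults.)

```python
# Rough sketch of the linear learning-rate scaling rule; the base values are
# placeholders (check the training script for the real defaults).
base_batch_size = 32   # assumed original batch size
base_lr = 3e-4         # assumed original learning rate
new_batch_size = 16

# Scale the learning rate proportionally to the batch size.
new_lr = base_lr * new_batch_size / base_batch_size
print(f"batch size {new_batch_size} -> lr {new_lr:.1e}")   # 1.5e-04
```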

wilson1yan commented 3 years ago

I think a batch size of 16 should still be fine, though I haven't tried it, so I'm not completely sure. If you are training on GPUs with tensor cores, you can also try mixed precision training: `--amp_level O1 --precision 16`, which should reduce GPU memory usage by a good amount (~40%).
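
For reference, those flags are standard PyTorch Lightning Trainer arguments from the Lightning 1.x era (`amp_level` is only honored with the Apex backend, and both arguments were removed in later Lightning releases). A minimal sketch of the equivalent Trainer configuration, assuming Apex is installed:

```python
# Minimal sketch (not the repo's training script): how --precision 16 and
# --amp_level O1 map onto a PyTorch Lightning 1.x Trainer. Assumes NVIDIA Apex
# is installed; with the "native" AMP backend, amp_level is ignored.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=1,              # a GPU with tensor cores (V100/A100/RTX) benefits most
    precision=16,        # fp16 mixed precision; cuts activation memory substantially
    amp_backend="apex",  # needed for amp_level to take effect
    amp_level="O1",      # Apex O1: fp16 compute with fp32 master weights
)
# trainer.fit(model, datamodule)  # model / data come from the usual training setup
```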

xinbowu2 commented 3 years ago

Thank you! I will try it.