# MakeLongVideo - Pytorch

Implementation of long video generation based on a diffusion model.

Sample generations (videos not shown here) for the following prompts:

- "Ironman is surfing"
- "a car is racing"
- "a cat eating food of a bowl, in von Gogh style"
- "a giraffe underneath the microwave"
- "a glass bead falling into water with huge splash"
- "a video of Earth rotating in space"
- "A teddy bear running in New York City"
- "A stunning aerial drone footage time lapse of El Capitan in Yosemite National Park at sunset"

## Change Logs

- [07/23/2023] LAION400M did not help much, so I collected another 100M video/text pairs in addition to the 2M WebVid dataset. Part of them are watermark-free. After 2~3 months of training, the results seem decent. I will release a watermark-free checkpoint soon. Training a video generation model on two RTX 3090 GPUs is really a pain.

## Setup

### Requirements

```shell
python3 -m pip install -r requirements.txt
```

## Training

### Prepare Stable Diffusion v1-4

Download the pretrained weights from Hugging Face and put them in the `checkpoints` directory, which is configured in `configs/makelongvideo.yaml`.

### Download webvid dataset

Download the WebVid dataset into the directory `data/webvid` using the https://github.com/m-bain/webvid repo. Then prepare the dataset with:

```shell
python3 genvideocap.py
```

### Download LAION400M dataset

Download LAION400M into the directory `data/laion400m`.

### Train

First, train at resolution 128x128:

```shell
accelerate launch --config_file ./configs/multigpu.yaml train.py --config configs/makelongvideo.yaml
```

Then finetune at resolution 256x256. Modify the last line of `configs/makelongvideo256x256.yaml` according to your local epoch checkpoint:

```shell
accelerate launch --config_file ./configs/multigpu.yaml train.py --config configs/makelongvideo256x256.yaml
```

## Inference

Pretrained weights: https://huggingface.co/xiexiecn/MakeLongVideo

```shell
# unwrap checkpoint first
TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch train.py --config configs/makelongvideo.yaml --unwrap ./outputs/makelongvideo/checkpoint-5200
```

Run inference directly:

```shell
python3 infer.py --width 256 --height 256 --prompt "a panda is surfing"
```

Run inference using latents initialized from a sample video:

```shell
python3 infer.py --width 256 --height 256 --prompt "a panda is surfing" --sample_video_path your_sample_video
```

Run inference with a sample frame rate of 6 (the actual frame rate is 24/6 == 4):

```shell
python3 infer.py --width 256 --height 256 --prompt "a panda is surfing" --speed 6
```

## Todo

- [x] generate 24 frames video of 256x256
- [x] add fps control
- [x] release pretrained checkpoint
- [ ] remove watermark
- [ ] improve resolution to 512x512
- [ ] 1~2 minute video generation
- [ ] make story video

## References

* Make-A-Video: https://github.com/lucidrains/make-a-video-pytorch
* Tune-A-Video: https://github.com/showlab/Tune-A-Video
* diffusers: https://github.com/huggingface/diffusers

## Citations

```bibtex
@misc{Singer2022,
    author = {Uriel Singer},
    url = {https://makeavideo.studio/Make-A-Video.pdf}
}
```

```bibtex
@article{wu2022tuneavideo,
    title = {Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
    author = {Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
    journal = {arXiv preprint arXiv:2212.11565},
    year = {2022},
    note = {under review}
}
```
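
A note on the `--speed` option from the Inference section: the example there states that a sample frame rate of 6 gives an actual frame rate of 24/6 == 4. The sketch below only reproduces that arithmetic. It assumes a 24 fps base rate (taken from that example), and the helper names `effective_fps` and `clip_seconds` are hypothetical. They are not part of `infer.py`, and this is not how the repository necessarily computes it internally.

```python
def effective_fps(speed: int, base_fps: int = 24) -> float:
    """Frame rate left after keeping one frame out of every `speed` frames.

    `base_fps=24` is an assumption taken from the README example (24/6 == 4).
    """
    if speed <= 0:
        raise ValueError("speed must be a positive integer")
    return base_fps / speed


def clip_seconds(num_frames: int, fps: float) -> float:
    """Time span covered by `num_frames` frames spaced at `fps` frames per second."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


if __name__ == "__main__":
    fps = effective_fps(6)
    # A 24-frame clip sampled at this rate spans num_frames / fps seconds.
    print(fps, clip_seconds(24, fps))
```

Under these assumptions, a larger `--speed` value means fewer sampled frames per second, so a fixed 24-frame clip spans a longer stretch of time.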