MakeLongVideo - Pytorch
Implementation of long video generation based on diffusion model.
"Ironman is surfing" |
"a car is racing" |
"a cat eating food of a bowl, in von Gogh style" |
"a giraffe underneath the microwave" |
![](https://github.com/xuduo35/MakeLongVideo/raw/main/samples/Ironman is surfi-7P269S.gif) |
![](https://github.com/xuduo35/MakeLongVideo/raw/main/samples/a car is racing-tm9rwR-4x.gif) |
![](https://github.com/xuduo35/MakeLongVideo/raw/main/samples/a cat eating foo-IR47a4.gif) |
![](https://github.com/xuduo35/MakeLongVideo/raw/main/samples/a giraffe undern-cAmGAc.gif) |
"a glass bead falling into water with huge splash" |
"a video of Earth rotating in space" |
"A teddy bear running in New York City" |
"A stunning aerial drone footage time lapse of El Capitan in Yosemite National Park at sunset" |
![](https://github.com/xuduo35/MakeLongVideo/raw/main/samples/a glass bead fal-Uxxg0y.gif) |
![](https://github.com/xuduo35/MakeLongVideo/raw/main/samples/a video of Earth-DzP1ma.gif) |
![](https://github.com/xuduo35/MakeLongVideo/raw/main/samples/A teddy bear run-A17vOA.gif) |
![](https://github.com/xuduo35/MakeLongVideo/raw/main/samples/A stunning aeria-WdIUoM.gif) |
## Change Logs
- [07/23/2023] LAION400M did not help too much, so I collected another 100m video/text pairs except 2M webvid dataset. Part of them are watermark free. After 2~3 months training, result seems not bad. I will release watermark free checkpoint soon. Training on RTX3090 2GPUs for video generation task is really a pain.
## Setup
### Requirements
```shell
python3 -m pip install -r requirements.txt
```
## Training
### Prepare Stable Diffusion v1-4 pretrained weights
download from huggingface and put it in directory 'checkpoints' which is configured in configs/makelongvideo.yaml
### Download webvid dataset
download webvid dataset into directory 'data/webvid' using https://github.com/m-bain/webvid repo. Then prepare dataset using command
```shell
python3 genvideocap.py
```
### Download LAION400M dataset
download laion400m into directory 'data/laion400m'
### Train
first train using resolution 128x128
```shell
accelerate launch --config_file ./configs/multigpu.yaml train.py --config configs/makelongvideo.yaml
```
then finetune in resolution 256x256, modify last line of configs/makelongvideo256x256.yaml according to your local epoch checkpoint
```shell
accelerate launch --config_file ./configs/multigpu.yaml train.py --config configs/makelongvideo256x256.yaml
```
## Inference
Pretrained weights: https://huggingface.co/xiexiecn/MakeLongVideo
```shell
# unwrap checkpoint first
TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch train.py --config configs/makelongvideo.yaml --unwrap ./outputs/makelongvideo/checkpoint-5200
```
inference directly
```shell
python3 infer.py --width 256 --height 256 --prompt "a panda is surfing"
```
inference using latents initialized by sample video
```shell
python3 infer.py --width 256 --height 256 --prompt "a panda is surfing" --sample_video_path your_sample_video
```
inference by sample frame rate 6 (actual frame rate is 24/6==4)
```shell
python3 infer.py --width 256 --height 256 --prompt "a panda is surfing" --speed 6
```
## Todo
- [x] generate 24 frames video of 256x256
- [x] add fps control
- [x] release pretrained checkpoint
- [ ] remove watermark
- [ ] improve resolution to 512x512
- [ ] 1~2minutes video generation
- [ ] make story video
## References
* Make-A-Video: https://github.com/lucidrains/make-a-video-pytorch
* Tune-A-Video: https://github.com/showlab/Tune-A-Video
* diffusers: https://github.com/huggingface/diffusers
## Citations
```bibtex
@misc{Singer2022,
author = {Uriel Singer},
url = {https://makeavideo.studio/Make-A-Video.pdf}
}
```
```
@article{wu2022tuneavideo,
title = {Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
author = {Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
journal={arXiv preprint arXiv:2212.11565},
year = {2022},
note = {under review}
}
```