yuanxion / Text2Video-Zero

Text-to-Image Diffusion Models are Zero-Shot Video Generators

[Basic] Research for different solutions for Text to Video and compare the advantages and disadvantages #3

Open yuanxion opened 1 year ago

yuanxion commented 1 year ago

Read papers to get more (innovation) ideas. Share learning of Diffusion model/ControlNet.

Idea: make the generated videos smoother.

cold-blue commented 1 year ago

Hi, all. I wonder whether it is possible to compute a 3D consistency metric between different frames. This kind of metric may be something we need to propose ourselves, in some special form.
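As a starting point (not the 3D metric proposed above, just a simple baseline sketch), frame-to-frame consistency can be approximated by the mean cosine similarity between embeddings of consecutive frames. The embeddings are assumed to come from some image encoder, e.g. CLIP's; that choice is an assumption, the metric itself only needs per-frame feature vectors:

```python
import numpy as np

def temporal_consistency(frame_emb):
    """Mean cosine similarity between consecutive frame embeddings.

    frame_emb: (T, D) array of per-frame features (assumed to come from
    an image encoder such as CLIP's; any frame feature would work here).
    Higher values suggest smoother, more consistent video.
    """
    # L2-normalize each frame embedding, then take dot products of neighbors.
    e = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(e[:-1] * e[1:], axis=1)))
```

This only captures pairwise temporal smoothness, not true 3D consistency, so it would serve as a baseline to compare a proposed 3D-aware metric against.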

aixiaodewugege commented 1 year ago

Maybe one way is to use these frames with NeRF to generate a 3D scene, and then analyze whether the 3D scene is good.

yuanxion commented 1 year ago

3D reconstruction tasks are typically evaluated using reference-based metrics like Chamfer Distance (https://zhuanlan.zhihu.com/p/629086843):

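For reference, the symmetric Chamfer Distance between two point clouds can be sketched in a few lines of NumPy (a minimal sketch for small clouds; real 3D benchmarks use optimized nearest-neighbor implementations):

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer Distance between point clouds p (N, 3) and q (M, 3).

    For each point in one cloud, find the squared distance to its nearest
    neighbor in the other cloud; average both directions and sum them.
    """
    # Pairwise squared distances via broadcasting, shape (N, M).
    d = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Identical clouds give 0; the metric grows as the clouds diverge, which is why it is used as a reference-based measure for 3D reconstruction quality.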

DreamFusion evaluates CLIP R-Precision: https://arxiv.org/pdf/2209.14988.pdf.

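A minimal sketch of the R-Precision computation (at R = 1), assuming image and caption embeddings have already been extracted with a CLIP model; the embedding step itself is omitted here:

```python
import numpy as np

def clip_r_precision(img_emb, txt_emb):
    """Fraction of images whose paired caption (same row index) ranks
    first by cosine similarity among all captions.

    img_emb: (N, D) image embeddings; txt_emb: (N, D) caption embeddings.
    Both are assumed to come from the same CLIP model.
    """
    # Normalize so the dot product equals cosine similarity.
    a = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    b = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = a @ b.T  # (N, N): similarity of every image to every caption
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(sims))))
```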
cold-blue commented 1 year ago

I have a concern: maybe we shouldn't choose Zero as the base model for our second stage. Zero's advantage is that it needs no training. If we break that constraint by fine-tuning it to enhance motion coherence, the innovation may be limited, because several models already address motion coherence by fine-tuning a text-to-image model on unlabeled video datasets. If we do want to tackle the coherence problem via 3D latent constraints, we should perhaps pick models that are designed to be fine-tuned as our base or comparison models. For the acceleration task in our first stage, however, Zero is a good choice.

cold-blue commented 1 year ago

https://arxiv.org/pdf/2303.11328.pdf

cold-blue commented 1 year ago

https://arxiv.org/abs/2303.14184

aixiaodewugege commented 1 year ago

We can also generate 3D motion if we focus only on human video.

Paper: Generating Diverse and Natural 3D Human Motions from Text

yuanxion commented 1 year ago

> I have a consideration: maybe we shouldn't choose Zero to be the base model for our second stage. The advantage of Zero is that it doesn't need any training. If we break this constraint to enhance motion coherence by finetuning it in some way, it may be meaningless in terms of innovation because there have been some models solving the motion coherence problem by finetuning a text-to-image model with unlabeled video datasets. If we indeed want to solve the coherence problem by 3D latent constraints, maybe we should choose some models which are designed to be finetuned to be our base model or comparison ones. However, for the accelerating task in our first stage, Zero is a good choice.

Yes, agreed. We now have powerful GPUs on the server, so it is fine to switch to and train whichever model we choose.

cold-blue commented 1 year ago

https://huggingface.co/blog/text-to-video This page offers several model demos, including Zero, that we can try for free.

cold-blue commented 1 year ago

I have tried several T2V demos on Hugging Face, and this one seems the best so far: https://huggingface.co/spaces/NeuralInternet/Text-to-Video_Playground
