yuanxion / Text2Video-Zero

Text-to-Image Diffusion Models are Zero-Shot Video Generators

Basic knowledge sharing of T2V PoC #7

Open yuanxion opened 1 year ago

yuanxion commented 1 year ago

The T2V PoC is planned to enable inference of generative models such as Stable Diffusion on CPU/GPU and training on Habana Gaudi/DG2, as well as to improve generated-video quality: more realistic frames and better coherency between frames.

yuanxion commented 1 year ago

Sharing of this PoC:

yuanxion commented 1 year ago

Text2Video-Zero paper: https://arxiv.org/pdf/2303.13439.pdf
[AIGC AI Video Generation Series, Article 1] Text2Video-Zero: https://zhuanlan.zhihu.com/p/626777733

Advantages: uses an existing text-to-image synthesis model (Stable Diffusion) without any training or optimization.
Development of text-to-video approaches: Template-based -> GAN (attention) -> Transformer (multi-modal) -> Diffusion


Methods:

  1. Stable Diffusion + zero-shot (no further training or optimization)
  2. Motion Dynamics in Latent Codes (see the sketch after this list)
  3. Reprogramming Cross-Frame Attention (also sketched below)
  4. Background smoothing (optional) with salient object detection
  5. Conditional and specialized Text-to-Video with ControlNet
  6. Broad applicability to diffusion-based video generation and editing (e.g., Video Instruct-Pix2Pix).
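
The two zero-shot mechanisms in items 2 and 3 are easy to sketch. Below is a minimal, illustrative PyTorch sketch (the function names, delta default, and tensor layouts are my own assumptions, not the paper's code): `motion_dynamics` warps one initial latent into per-frame latents with a growing translation delta_k = lam * (k - 1) * delta, and `cross_frame_attention` makes every frame's queries attend to the first frame's keys/values to anchor appearance across frames.

```python
import torch
import torch.nn.functional as F


def motion_dynamics(latent: torch.Tensor, num_frames: int,
                    delta: tuple[float, float] = (0.02, 0.02),
                    lam: float = 1.0) -> torch.Tensor:
    """Warp one initial latent (1, C, H, W) into num_frames latents by
    translating it with delta_k = lam * (k - 1) * delta (global motion)."""
    frames = [latent]
    for k in range(2, num_frames + 1):
        dx = lam * (k - 1) * delta[0]
        dy = lam * (k - 1) * delta[1]
        # Affine grid expressing a pure translation; normalized coordinates
        # span [-1, 1], so a shift of dx * width is 2 * dx in grid space.
        theta = torch.tensor([[[1.0, 0.0, -2 * dx],
                               [0.0, 1.0, -2 * dy]]],
                             dtype=latent.dtype, device=latent.device)
        grid = F.affine_grid(theta, latent.shape, align_corners=False)
        frames.append(F.grid_sample(latent, grid, padding_mode="reflection",
                                    align_corners=False))
    return torch.cat(frames, dim=0)  # (num_frames, C, H, W)


def cross_frame_attention(q: torch.Tensor, k: torch.Tensor,
                          v: torch.Tensor) -> torch.Tensor:
    """Queries per frame (F, N, D) attend to the FIRST frame's keys and
    values, which keeps object appearance consistent across frames."""
    k1 = k[:1].expand_as(k)  # broadcast frame 1's keys to all frames
    v1 = v[:1].expand_as(v)
    attn = torch.softmax(q @ k1.transpose(-2, -1) / q.shape[-1] ** 0.5,
                         dim=-1)
    return attn @ v1
```

In the paper these warped latents are then denoised by the unchanged Stable Diffusion UNet, with its self-attention layers swapped for the cross-frame variant above.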

Thinking: could we generate a 3D foreground object to improve the image quality?

yuanxion commented 1 year ago

ControlNet

The ControlNet paper reports that training the Canny Edge Detector model used a corpus of 3 million edge-image-caption pairs and 600 GPU hours on an A100 80G. The Human Pose (human skeleton) model used 80,000 pose-image-caption pairs and 400 GPU hours on an A100 80G. T2I-Adapter training, by contrast, finished in only 2 days on 4 Tesla 32G V100s, covering 3 conditions: sketch (150,000 images), semantic segmentation map (160,000), and keypose (150,000).

Differences between the two: ControlNet's currently released pretrained models are more complete and usable, and support more kinds of condition detectors (9 major categories). T2I-Adapter is "designed and implemented more simply and flexibly from an engineering standpoint, and is easier to integrate and extend." In addition, T2I-Adapter supports guidance from more than one condition model, e.g., using a sketch and a segmentation map simultaneously as input conditions, or applying sketch guidance inside a masked region (i.e., inpainting).
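
For reference, ControlNet-conditioned generation is available in the diffusers library. A minimal sketch of Canny-edge conditioning (the model IDs `lllyasviel/sd-controlnet-canny` and `runwayml/stable-diffusion-v1-5` are public checkpoints used here for illustration; the prompt and file names are placeholders):

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Turn a reference image into a Canny edge map to use as the condition.
image = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)
edges = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 1ch -> 3ch

# Load the Canny ControlNet and attach it to Stable Diffusion v1.5.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The edge map steers the layout; the text prompt controls appearance.
result = pipe("a robot dancing in the street", image=edges).images[0]
result.save("controlled.png")
```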

yuanxion commented 1 year ago

Stable Diffusion
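
As context for the CPU/GPU inference goal mentioned above, a minimal text-to-image inference sketch with diffusers' `StableDiffusionPipeline` (the model ID, prompt, and step count are illustrative choices, not project settings):

```python
import torch
from diffusers import StableDiffusionPipeline

# Run in fp16 on GPU when available, otherwise fp32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
).to(device)

# One denoising run: text prompt -> latent diffusion -> decoded image.
image = pipe("an astronaut riding a horse",
             num_inference_steps=30).images[0]
image.save("sd_output.png")
```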

yuanxion commented 1 year ago

Long Video generation

https://github.com/wuyongyi/awesome-video-generate