yerfor / Real3DPortrait

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis; ICLR 2024 Spotlight; Official code
MIT License
912 stars 102 forks

How to fine-tune the pre-trained model on my dataset? #67

Open felixshing opened 3 months ago

felixshing commented 3 months ago

Hello, how can I load the pre-trained model and fine-tune it on my self-collected dataset?

yerfor commented 3 months ago

Hi, you can use the init_from_ckpt option.

felixshing commented 3 months ago

> Hi, you can use the init_from_ckpt option.

Thanks!

I have another question regarding the pre-trained models you provided. Specifically, you included "audio2secc_vae" and "secc2plane_torso_orig". However, in your training guidelines for audio, it is recommended to first train "audio_lm3d_syncnet" and then "audio2motion". Similarly, for motion, the guideline suggests first training "Img-to-Plane" followed by "Motion-to-Video", which includes "secc2plane_head" and "secc2plane_torso".

I am a bit confused about their relationships. Is "audio2secc_vae" equivalent to "audio2motion", and is "secc2plane_torso_orig" equivalent to "secc2plane_torso"?

For audio training, should I:

1) Train "audio_lm3d_syncnet" myself, and then
2) When training "audio2motion", provide the checkpoints from both my trained "audio_lm3d_syncnet" and the provided "audio2secc_vae"?

Or, do I not have to train "audio_lm3d_syncnet" at all and just provide "audio2secc_vae" for fine-tuning?

Similarly, for Motion-to-Video training, should I:

1) Train "Img-to-Plane" myself
2) Train "secc2plane_head" myself, based on the trained "Img-to-Plane"
3) When training "secc2plane_torso", provide the checkpoints from both my trained "secc2plane_head" and the provided "secc2plane_torso_orig"?

But it seems we can only set one checkpoint for "init_from_ckpt"?

Additionally, does "secc2plane_head" imply inferring only the head area without the torso?

Thank you so much for your help!

yerfor commented 2 months ago

1. Yes, "audio2secc_vae" is equivalent to "audio2motion", and "secc2plane_torso_orig" is equivalent to "secc2plane_torso".
2. "For audio training, should I ..." ==> Yes, you need to train a syncnet.
3. You can skip the image-to-plane pre-training and go through init_from_ckpt => secc2plane_head => secc2plane_torso.
4. Does "secc2plane_head" imply inferring only the head area without the torso? ==> Yes.

felixshing commented 2 months ago

Thank you so much for your response! I am still a bit confused about this step:

> 3. You can skip the image-to-plane pre-training, and go through the init_from_ckpt => secc2plane_head => secc2plane_torso.

Where can we get the pre-trained model for image-to-plane? It appears that currently, we only have the pre-trained models for "audio2motion" and "secc2plane_torso".

Additionally, I noticed that during evaluation, the human figure changes each time instead of using the one I provided. Where is this part of the setup, and how can we modify it to use my provided human figure?

Thank you for your time!

yerfor commented 2 months ago

You can use the provided pre-trained secc2plane_torso to initialize your own secc2plane_head model; just set strict=False.

To use your own human figure, please modify the code in validation_steps.
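
The strict=False suggestion above can be sketched as a small pre-filter over the checkpoint. This is a minimal illustration, not the repository's loading code: the helper name `split_compatible_weights` is hypothetical, and it assumes both state dicts are flat name-to-tensor mappings whose values expose a `.shape` attribute (as PyTorch tensors do). In PyTorch, `load_state_dict(..., strict=False)` tolerates missing and unexpected keys but still raises on a shape mismatch for a shared key, so dropping mismatched entries first can make a torso-to-head initialization more predictable.

```python
from typing import Any, Dict, List, Mapping, Tuple


def split_compatible_weights(
    model_state: Mapping[str, Any],
    ckpt_state: Mapping[str, Any],
) -> Tuple[Dict[str, Any], List[str], List[str]]:
    """Keep only checkpoint entries whose name and shape match the model.

    Hypothetical helper: values only need a `.shape` attribute,
    so torch tensors work without importing torch here.
    """
    compatible: Dict[str, Any] = {}
    skipped: List[str] = []
    for name, value in ckpt_state.items():
        if name in model_state and tuple(model_state[name].shape) == tuple(value.shape):
            compatible[name] = value
        else:
            # Either absent from the model or present with a different shape.
            skipped.append(name)
    # Parameters the checkpoint cannot fill keep their fresh initialization.
    missing = [name for name in model_state if name not in compatible]
    return compatible, missing, skipped
```

With a PyTorch module, the result would typically be used as `model.load_state_dict(compatible, strict=False)`, and printing `missing` and `skipped` shows which head parameters start from scratch and which torso weights were dropped.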

felixshing commented 2 months ago

> You can use the provided pre-trained secc2plane_torso to initialize your own secc2plane_head model; just set strict=False.
>
> To use your own human figure, please modify the code in validation_steps.

Thank you for your reply!

I have modified the training logic. However, when I tried to train the secc2plane_head model on my 4090 GPU, I ran into an out-of-memory (OOM) error. Is there any way to reduce the GPU memory requirement during training? I tried reducing "num_workers", but it did not help.

yerfor commented 2 months ago

You can reduce the batch_size, or you can try amp=True.
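
The two options above can be sketched as a config adjustment. This is only an illustration under stated assumptions: the keys `batch_size`, `amp`, and `num_workers` are the names used in this thread, while the helper `with_lower_memory` and the starting values are hypothetical and are not the repository's defaults. As a side note, `num_workers` controls data-loading processes on the CPU side, which is probably why lowering it did not change GPU memory use.

```python
from typing import Any, Dict


def with_lower_memory(hparams: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the config with a halved batch size and AMP enabled."""
    updated = dict(hparams)  # leave the caller's config untouched
    # Halve the batch size, but never go below one sample per step.
    updated["batch_size"] = max(1, int(hparams["batch_size"]) // 2)
    # Mixed precision stores many activations in 16-bit, which usually lowers memory.
    updated["amp"] = True
    return updated


# Hypothetical starting values, not the repository defaults.
base = {"batch_size": 4, "amp": False, "num_workers": 8}
print(with_lower_memory(base))
```

If training still runs out of memory, applying the helper again halves the batch size once more, down to a floor of 1.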

moliq1 commented 1 month ago

@yerfor Hi, thank you so much for your wonderful work. Could you also release a publicly available syncnet checkpoint, so that we can fine-tune on our own datasets more easily?