yuangan / EAT_code

Official code for ICCV 2023 paper: "Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation".

How to edit with text and interpolate states as described in the paper? #23

Closed G-force78 closed 8 months ago

yuangan commented 8 months ago

We interpolate between the corresponding emotion prompts of the two emotions (see attached image). To edit videos with text, you need to fine-tune EAT with an additional CLIP loss (see attached image).
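The interpolation step can be sketched as a simple linear blend between the two learned emotion prompt embeddings. This is a minimal illustration, not code from the EAT repository; the function name and the flat-list representation of a prompt are assumptions.

```python
def interpolate_emotion_prompts(prompt_a, prompt_b, alpha):
    """Linearly blend two emotion prompt embeddings.

    Hypothetical helper: prompts are flat lists of floats standing in
    for the learned prompt tensors; alpha=0 returns prompt_a,
    alpha=1 returns prompt_b.
    """
    assert len(prompt_a) == len(prompt_b)
    return [(1.0 - alpha) * a + alpha * b
            for a, b in zip(prompt_a, prompt_b)]

# Sweeping alpha from 0 to 1 over the video morphs one emotion
# into the other (toy 3-dim embeddings for illustration only).
happy = [1.0, 0.0, 0.5]
angry = [0.0, 1.0, -0.5]
mid = interpolate_emotion_prompts(happy, angry, 0.5)  # -> [0.5, 0.5, 0.0]
```

In practice the blended prompt would be fed to the model in place of a single emotion's prompt, one interpolated prompt per frame or per clip.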

JamesLong199 commented 6 months ago

Hi,

Could you kindly provide more details on the setting for model fine-tuning with CLIP and the zero-shot text-guided expression editing procedure?

For model fine-tuning with CLIP, my understanding is that the same losses used in emotion adaptation are applied together with the CLIP loss, and that fine-tuning is performed on MEAD, where each training video is paired with fixed text prompts for its emotion category (see the attached screenshot).
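Under that reading, the extra CLIP term would measure agreement between the CLIP image embedding of a rendered frame and the CLIP text embedding of the emotion prompt sentence, added to the existing losses with some weight. A minimal sketch of that objective (the weighting `lambda_clip` and function names are assumptions, not from the EAT codebase, and real CLIP features would come from a CLIP encoder):

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_loss(image_feat, text_feat):
    # 1 - cos(image, text): small when the rendered frame's CLIP
    # embedding matches the emotion text prompt's CLIP embedding.
    return 1.0 - cosine_similarity(image_feat, text_feat)

def total_loss(adaptation_loss, image_feat, text_feat, lambda_clip=0.1):
    # Hypothetical combination: the original emotion-adaptation
    # losses plus a weighted CLIP alignment term.
    return adaptation_loss + lambda_clip * clip_loss(image_feat, text_feat)
```

Identical embeddings give a CLIP term of 0, so `total_loss` reduces to the original adaptation losses when the frame already matches the text.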

For the zero-shot text-guided expression editing, I was wondering how the CLIP text feature is incorporated into the existing model structure (e.g., is it mapped to the latent code z or to the emotion prompt?).

Thank you in advance for your time and help!

[attached screenshot: fixed text prompts per emotion category]