Details about CLIP fine-tuning and zero-shot text-guided editing

JamesLong199 commented 2 months ago

Hi,

Could you kindly provide more details on the setting for model fine-tuning with CLIP and the zero-shot text-guided expression editing procedure?

For model fine-tuning with CLIP, my understanding is that the same losses used in emotion adaptation are applied in addition to the CLIP loss, and that fine-tuning is performed on MEAD, where each training video is paired with fixed text prompts of the corresponding emotion category (attached in the screenshot).

For the zero-shot text-guided expression editing, I was wondering how the CLIP text feature is incorporated into the existing model structure (e.g. a mapping from the CLIP feature to the latent code z or to the emotion prompt?).

Thank you in advance for your time and help!

_Originally posted by @JamesLong199 in https://github.com/yuangan/EAT_code/issues/23#issuecomment-2110126521_

yuangan commented 2 months ago

Hi, thank you for your attention.

As an application of our proposed modules, we achieve this in a direct way: we optimize the latent code z with the CLIP loss. Given an emotion label and a video, we fine-tune the EAT components, including the mapping network, EDN, and EAM, to edit the expression in the video according to the input text.

The training of text-guided mapper in StyleCLIP may help you understand this process.
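For readers following along, here is a toy sketch of what "optimize the latent code z with the CLIP loss" can look like in its simplest form. It is not the EAT implementation. The loss is one minus the cosine similarity between an image feature and a text feature, as in CLIP-guided editing. Everything else is an assumption made so the example runs without any model: `toy_image_feature` is a hypothetical linear stand-in for "generator plus CLIP image encoder", the feature sizes are arbitrary, and the gradient is taken by finite differences instead of backpropagation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_style_loss(image_feature, text_feature):
    """CLIP-style guidance loss: 1 - cosine similarity of the two features."""
    return 1.0 - cosine_similarity(image_feature, text_feature)


def toy_image_feature(z):
    """Hypothetical stand-in for generator + image encoder (assumption).

    Maps a 2-D latent code to a 3-D feature with a fixed linear map.
    """
    weights = [[1.0, 0.0], [0.5, 1.0], [0.0, -1.0]]
    return [sum(w * v for w, v in zip(row, z)) for row in weights]


def latent_loss(z, text_feature):
    """Loss of a latent code against a fixed text feature."""
    return clip_style_loss(toy_image_feature(z), text_feature)


def optimize_latent(z, text_feature, steps=500, lr=0.1, eps=1e-5):
    """Gradient descent on z, using central finite differences for the gradient."""
    z = list(z)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            z_hi = list(z)
            z_lo = list(z)
            z_hi[i] += eps
            z_lo[i] -= eps
            diff = latent_loss(z_hi, text_feature) - latent_loss(z_lo, text_feature)
            grad.append(diff / (2.0 * eps))
        z = [v - lr * g for v, g in zip(z, grad)]
    return z


# Toy "text feature": the feature of a known target latent, so a perfect
# match exists. In a real setup this would come from a text encoder.
target_text_feature = toy_image_feature([2.0, -1.0])
start_z = [1.0, 1.0]
edited_z = optimize_latent(start_z, target_text_feature)
print("loss before:", latent_loss(start_z, target_text_feature))
print("loss after: ", latent_loss(edited_z, target_text_feature))
```

In a real pipeline the finite-difference loop would be replaced by automatic differentiation, and the trainable quantity could be the parameters of a mapping network rather than z itself, which is the setup the StyleCLIP mapper describes.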

If you have any other questions, please let us know.

JamesLong199 commented 2 months ago

Thank you for your swift response and concise explanation. In addition, the description in the StyleCLIP paper is really helpful :)

JamesLong199 commented 2 months ago

Hi,

Would it be possible to upload a script for CLIP fine-tuning? Thank you in advance for your time and help.

yuangan commented 2 months ago

Thank you for your consistent attention.

The answer is yes. We are planning to release a script for zero-shot video editing this week. I think this is an interesting phenomenon that deserves further research.

I am cleaning up this part of the code. You can try it soon.

G-force78 commented 2 months ago

Good to see you continue to work on this. In my experience, your implementation has been the best for replicating expressions.

yuangan commented 2 months ago

I really appreciate your feedback. @G-force78

I've uploaded the zero-shot editing code, and you can find more details here.

It has been a long journey for me to develop and release this work. I hope you find the emotional talking-head generation as interesting as I do. 😜

G-force78 commented 2 months ago

Hi, I'm not sure what this error refers to:

Traceback (most recent call last):
  File "/content/EAT_code/prompt_st_dp_eam3d_mapper_full.py", line 162, in <module>
    train(text, config, generator, None, kp_detector, audio2kptransformer, mapper, sidetuning, opt.checkpoint, log_dir, dataset, opt.device_ids)
  File "/content/EAT_code/train_transformer.py", line 474, in train_batch_prompt_mapper3
    generator_full = GeneratorFullModelBatchDeepPromptSTEAM3DNewStyle3(text, kp_detector, audio2kptransformer, mapper, sidetuning, generator, discriminator, train_params, estimate_jacobian=config['model_params']['common_params']['estimate_jacobian'])
NameError: name 'GeneratorFullModelBatchDeepPromptSTEAM3DNewStyle3' is not defined. Did you mean: 'GeneratorFullModelBatchDeepPromptSTEAM3D'?

yuangan commented 2 months ago

Hi, you need to download the latest version of our code. I've added these functions to "train_transformer.py" and the related files.

G-force78 commented 2 months ago

I did update the files. Maybe I've missed something, but here is the line I see: https://github.com/yuangan/EAT_code/blob/622d5460d8308177e71edc5ee40ed0422a54ca82/train_transformer.py#L262 (GeneratorFullModelBatchDeepPromptSTEAM3D)

yuangan commented 2 months ago

Hi, I can find both GeneratorFullModelBatchDeepPromptSTEAM3D and GeneratorFullModelBatchDeepPromptSTEAM3DNewStyle3 in modules/model_transformer.py. Could you check whether both are present in your copy? You may need to run git pull to get the latest version.
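One way to run that check without importing the project (and its dependencies) is to parse the file and list what it defines. The sketch below uses only Python's standard `ast` module. The helper names `top_level_definitions` and `missing_definitions` are made up for this example, and the inline `sample_source` string is a placeholder standing in for the real file contents.

```python
import ast


def top_level_definitions(source):
    """Return the names of classes and functions defined at module top level."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
    }


def missing_definitions(source, required):
    """Return the required names that the source does not define, sorted."""
    return sorted(set(required) - top_level_definitions(source))


# Placeholder source that imitates an outdated checkout: only the older
# class is defined, so the newer one is reported as missing.
sample_source = (
    "class GeneratorFullModelBatchDeepPromptSTEAM3D:\n"
    "    pass\n"
)
required_names = [
    "GeneratorFullModelBatchDeepPromptSTEAM3D",
    "GeneratorFullModelBatchDeepPromptSTEAM3DNewStyle3",
]
print(missing_definitions(sample_source, required_names))
```

To check a real checkout, replace `sample_source` with the text of the file, for example `open("modules/model_transformer.py").read()`. An empty list means both names are defined there. The NameError would then point to the import in train_transformer.py rather than to a missing definition.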

JamesLong199 commented 2 months ago

Thank you so much for your awesome project. I really appreciate you taking the time to release this 💯.

Details about CLIP fine-tuning and zero-shot text-guided editing #30