Open JamesLong199 opened 2 months ago
Hi, thank you for your attention.
As an application of our proposed modules, we achieve this in a direct way: we optimize the latent code z with the CLIP loss. Given an emotion label and a video, we fine-tune the EAT components, including the mapping network, EDN, and EAM, to edit the expression in the video according to the input text.
The training of text-guided mapper in StyleCLIP may help you understand this process.
If you have any other questions, please let us know.
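The reply above describes optimizing a latent code z against a CLIP loss. The following is only a toy sketch of that loop, not the EAT implementation: the loss is the cosine distance between an image embedding and a text embedding, and z is updated by gradient descent. `TOY_WEIGHTS` and `toy_image_embedding` are hypothetical stand-ins for the generator plus CLIP image encoder, and the finite-difference gradient replaces the autograd that a real PyTorch setup would use.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_style_loss(image_emb, text_emb):
    """Cosine distance between an image embedding and a text embedding."""
    return 1.0 - cosine_similarity(image_emb, text_emb)

# Hypothetical stand-in for "render from z, then encode with CLIP":
# a fixed linear map from a 2-D latent to a 3-D embedding.
TOY_WEIGHTS = [[1.0, 0.5], [-0.5, 1.0], [0.3, 0.2]]

def toy_image_embedding(z):
    return [sum(w * zi for w, zi in zip(row, z)) for row in TOY_WEIGHTS]

def latent_loss(z, text_emb):
    return clip_style_loss(toy_image_embedding(z), text_emb)

def optimize_latent(z, text_emb, steps=300, lr=0.2, eps=1e-5):
    """Gradient descent on z using central finite differences (assumption:
    a real setup would backpropagate through the generator and CLIP)."""
    z = list(z)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            z_plus = z[:]
            z_plus[i] += eps
            z_minus = z[:]
            z_minus[i] -= eps
            diff = latent_loss(z_plus, text_emb) - latent_loss(z_minus, text_emb)
            grad.append(diff / (2 * eps))
        z = [zi - lr * g for zi, g in zip(z, grad)]
    return z

if __name__ == "__main__":
    text_emb = [1.0, 0.0, 0.5]   # made-up "text feature" for illustration
    z0 = [0.1, -1.0]             # made-up initial latent code
    z_opt = optimize_latent(z0, text_emb)
    print("loss before:", latent_loss(z0, text_emb))
    print("loss after: ", latent_loss(z_opt, text_emb))
```

In the setting described in the reply, the trainable parameters would be those of the mapping network, EDN, and EAM rather than a raw z vector, but the objective has the same shape: pull the embedding of the generated frames toward the embedding of the text.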
Thank you for your swift response and concise explanation. In addition, the description in the StyleCLIP paper is really helpful :)
Hi,
Would it be possible to upload a script for CLIP fine-tuning? Thank you in advance for your time and help.
Thank you for your consistent attention.
The answer is yes. We are considering releasing a script for zero-shot video editing this week. This is an interesting phenomenon, and I think it deserves more research.
I am working on cleaning up the code for this part. You can try it soon.
Good to see you continue to work on this. In my experience, your implementation has been the best for replicating expressions.
I really appreciate your feedback. @G-force78
I've uploaded the zero-shot editing code, and you can find more details here.
It has been a long journey for me to develop and release this work. I hope you find the emotional talking-head generation as interesting as I do. 😜
Hi, I'm not sure what this refers to?
Traceback (most recent call last):
  File "/content/EAT_code/prompt_st_dp_eam3d_mapper_full.py", line 162, in <module>
    train(text, config, generator, None, kp_detector, audio2kptransformer, mapper, sidetuning, opt.checkpoint, log_dir, dataset, opt.device_ids)
  File "/content/EAT_code/train_transformer.py", line 474, in train_batch_prompt_mapper3
    generator_full = GeneratorFullModelBatchDeepPromptSTEAM3DNewStyle3(text, kp_detector, audio2kptransformer, mapper, sidetuning, generator, discriminator, train_params, estimate_jacobian=config['model_params']['common_params']['estimate_jacobian'])
NameError: name 'GeneratorFullModelBatchDeepPromptSTEAM3DNewStyle3' is not defined. Did you mean: 'GeneratorFullModelBatchDeepPromptSTEAM3D'?
Hi, you need to download the latest version of our code. I've added these functions to "train_transformer.py" and the related files.
I did update the files. Maybe I've missed something, but here is what I have: https://github.com/yuangan/EAT_code/blob/622d5460d8308177e71edc5ee40ed0422a54ca82/train_transformer.py#L262 (GeneratorFullModelBatchDeepPromptSTEAM3D)
Hi, I can find both GeneratorFullModelBatchDeepPromptSTEAM3D and GeneratorFullModelBatchDeepPromptSTEAM3DNewStyle3 in modules/model_transformer.py. Could you check for these in your code? You may need to run git pull to get the latest version.
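One quick way to confirm whether a local checkout already defines both names is to import the module and look them up. This is a minimal sketch, assuming it is run from the repository root so that `modules.model_transformer` is importable; `missing_names` is a hypothetical helper, not part of EAT_code.

```python
import importlib

def missing_names(module_name, names):
    """Return the entries of `names` that the imported module does not define."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]

if __name__ == "__main__":
    required = [
        "GeneratorFullModelBatchDeepPromptSTEAM3D",
        "GeneratorFullModelBatchDeepPromptSTEAM3DNewStyle3",
    ]
    try:
        # An empty list means the checkout defines both names.
        print(missing_names("modules.model_transformer", required))
    except ImportError as exc:
        # Raised when run outside the repository or when a dependency is absent.
        print(f"could not import module: {exc}")
```

If the second name is reported as missing, the checkout predates the zero-shot editing upload and needs to be updated.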
Thank you so much for your awesome project and I really appreciate you taking time to release this 💯 .
Hi,
Could you kindly provide more details on the setting for model fine-tuning with CLIP and the zero-shot text-guided expression editing procedure?
For model fine-tuning with CLIP, my understanding is that: the same losses in emotion adaptation are used in addition to CLIP loss; the fine-tuning is performed on MEAD, where each training video is paired with fixed text prompts of the corresponding emotion category (attached in the screenshot).
For the zero-shot text-guided expression editing, I was wondering how the CLIP text feature is incorporated into the existing model structure (e.g. via a mapping from the CLIP feature to the latent code z, or to the emotion prompt?).
Thank you in advance for your time and help!
_Originally posted by @JamesLong199 in https://github.com/yuangan/EAT_code/issues/23#issuecomment-2110126521_