yuangan / EAT_code

Official code for ICCV 2023 paper: "Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation".

Inquiry Regarding Preprocessing VOX2 and MEAD Dataset for Training #37

Open Calmepro777 opened 5 months ago

Calmepro777 commented 5 months ago

Thanks for the authors' wonderful work.

As I am attempting to train the model myself and reproduce the results, I would be grateful if the authors could provide more detailed instructions or code scripts for preprocessing the two datasets involved.

It seems that the Preprocessing section in the README file only covers inference.

Thanks in advance

yuangan commented 5 months ago

Thank you for your attention.

You can download the preprocessed MEAD data from Yandex or Baidu.

As for Vox2, you can find some details in this issue. In short, we filtered the Vox2 data down to 213,400 videos; you can find the list in our processed deepfeature32. The training data can also be preprocessed with our preprocessing code, but you should reorganize the results according to their function, for example:

vox
|----voxs_images
      |----id00530_9EtkaLUCdWM_00026
      |----...
|----voxs_latent
      |----id00530_9EtkaLUCdWM_00026.npy
      |----...
|----voxs_wavs
      |----id00530_9EtkaLUCdWM_00026.wav
      |----...
|----deepfeature32
      |----id00530_9EtkaLUCdWM_00026.npy
      |----...
|----bboxs
      |----id00530_9EtkaLUCdWM_00026.npy
      |----...
|----poseimg
      |----id00530_9EtkaLUCdWM_00026.npy.gz
      |----...

They can be extracted with our preprocess code here. Due to upgrades in the Python environment, there may be some differences in the extracted files. If you find something missing or wrong, please let us know.
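If it helps, here is a minimal sanity-check sketch (my own, not part of the repo) that walks the layout above and reports clips with missing files. The folder names follow the tree exactly; the root path is an assumption.

from pathlib import Path

# Sanity-check sketch: for every clip folder in voxs_images, verify that the
# matching per-clip files exist in the other folders from the layout above.
root = Path("vox")  # assumed dataset root

for clip_dir in sorted((root / "voxs_images").iterdir()):
    clip = clip_dir.name  # e.g. id00530_9EtkaLUCdWM_00026
    expected = [
        root / "voxs_latent" / f"{clip}.npy",
        root / "voxs_wavs" / f"{clip}.wav",
        root / "deepfeature32" / f"{clip}.npy",
        root / "bboxs" / f"{clip}.npy",
        root / "poseimg" / f"{clip}.npy.gz",
    ]
    missing = [str(p) for p in expected if not p.exists()]
    if missing:
        print(clip, "is missing:", missing)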

Calmepro777 commented 5 months ago

Thanks for the clarification.

I am following your guidance to process the Vox2 dataset.

However, the preprocessed MEAD dataset I downloaded via the link you provided appears to contain only images sampled from the videos. I wonder if this is sufficient for training.

yuangan commented 5 months ago

Hi,

the videos are ultimately processed into images; we train EAT on the images in the provided data.

However, the provided MEAD data was preprocessed by ffmpeg without -crf 10, so its quality may be lower than that of data preprocessed with the current preprocess code. If you want higher-quality training data, you can preprocess MEAD from the original MEAD videos.
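For reference, a re-encoding step with -crf 10 might look like the sketch below (my own illustration via Python's subprocess; the file names are placeholders and the exact flags in the repo's preprocess code may differ):

import subprocess

# Re-encode a MEAD clip at high quality before frame extraction.
# A lower CRF means higher quality; 10 is close to visually lossless.
subprocess.run([
    "ffmpeg", "-y",
    "-i", "input_mead_clip.mp4",     # placeholder input video
    "-crf", "10",
    "output_mead_clip_crf10.mp4",    # placeholder output video
], check=True)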

Calmepro777 commented 5 months ago

In addition, I noticed that even when the person in the head-pose source video has minimal head movement, the person in the generated video appears to zoom in, zoom out, and shake.

I would appreciate any guidance that could help improve this.

Thanks in advance

https://github.com/yuangan/EAT_code/assets/92498535/4b50c78a-7ea5-4df2-bcb3-0e9e72068a5d

https://github.com/yuangan/EAT_code/assets/92498535/c6df93d0-a8b3-49e0-889f-56447fc309e6

yuangan commented 5 months ago

Thank you for your attention.

This is a good question. In my experience, the driven results are better when the source image and the driving video have similar face shapes and poses. You can use relative driving poses by modifying the pose of the source image. Here is a function for reference.

I hope this makes your results better. If not, trying more driving poses may also help.
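For illustration, relative driving poses can be sketched as below (my own example, assuming poses are stored as per-frame yaw/pitch/roll arrays; the actual function linked above may differ):

import numpy as np

def relative_driving_poses(src_pose, drv_poses):
    src_pose = np.asarray(src_pose)    # (3,) yaw/pitch/roll of the source image
    drv_poses = np.asarray(drv_poses)  # (T, 3) per-frame poses of the driving video
    # Keep the source pose as the baseline and add only the driving video's
    # per-frame pose changes, which removes the global offset between the
    # two faces and can reduce the zooming/shaking artifacts described above.
    return src_pose[None, :] + (drv_poses - drv_poses[0])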

Calmepro777 commented 4 months ago

Thank you so much for your detailed and clear explanation.

I have decided to do the emotional adaptation training with the MEAD dataset you processed, and I have some questions.

  1. Is it true that the emotional adaptation training does not require the Vox2 dataset?
  2. I noticed that the deepfeature32 released with the processed MEAD dataset appears to come from the vox dataset, and hence I encountered the following error:
    Original Traceback (most recent call last):
      File "/home/qw/anaconda3/envs/eat/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
        data = fetcher.fetch(index)
      File "/home/qw/anaconda3/envs/eat/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/qw/anaconda3/envs/eat/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/qw/proj/BH/EAT/frames_dataset_transformer25.py", line 2519, in __getitem__
        return self.dataset[idx % self.dataset.__len__()]
      File "/home/qw/proj/BH/EAT/frames_dataset_transformer25.py", line 1005, in __getitem__
        return self.getitem_neu(idx)
      File "/home/qw/proj/BH/EAT/frames_dataset_transformer25.py", line 1137, in getitem_neu
        deeps = np.load(deep_path)
      File "/home/qw/anaconda3/envs/eat/lib/python3.7/site-packages/numpy/lib/npyio.py", line 417, in load
        fid = stack.enter_context(open(os_fspath(file), "rb"))
    FileNotFoundError: [Errno 2] No such file or directory: '/data/mead//deepfeature32/W011_con_3_014.npy'

Any comments/guidelines would be appreciated.

yuangan commented 4 months ago

  1. Yes, we do not use Vox2 data in the emotional adaptation fine-tuning stage.
  2. The deepfeature32 contains audio features extracted by the DeepSpeech code. Every dataset should have its deepfeature32 folder. Have you checked the folders in mead.tar.gz?
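As a quick check (a sketch of mine, not repo code; the path comes from the traceback above), you can load one feature file and inspect it:

import numpy as np

# Load the DeepSpeech audio features for one clip; the exact array shape
# depends on the extractor, so just print it to confirm the file is valid.
feats = np.load("/data/mead/deepfeature32/W011_con_3_014.npy")
print(feats.shape, feats.dtype)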
Calmepro777 commented 4 months ago

Thanks for your reply.

I think I figured out the problem.

The processed MEAD dataset I previously downloaded from Yandex was, for some reason, corrupted and contained only the images sampled from the videos.

I downloaded the processed MEAD dataset from Baidu Cloud again, which contains all the files required for emotional adaptation fine-tuning.

Again, thanks for the wonderful work.