yuangan / EAT_code

Official code for ICCV 2023 paper: "Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation".

Inquiry Regarding Preprocessing VOX2 and MEAD Dataset for Training #37


Calmepro777 commented 3 days ago

Thanks to the authors for this wonderful work.

As I am attempting to train the model myself and reproduce the results, I would be grateful if the authors could provide more detailed instructions or code scripts for preprocessing the two datasets involved.

It seems that the Preprocessing section in the README file only covers preprocessing for inference.

Thanks in advance.

yuangan commented 3 days ago

Thank you for your attention.

You can download the preprocessed MEAD data from Yandex or Baidu.

As for Vox2, you can find some details in this issue. In short, we filtered the Vox2 data down to 213400 videos; you can find the list in our processed deepfeature32. The training data can also be preprocessed with our preprocessing code, but you should reorganize the outputs by function, for example:

vox
|----voxs_images
      |----id00530_9EtkaLUCdWM_00026
      |----...
|----voxs_latent
      |----id00530_9EtkaLUCdWM_00026.npy
      |----...
|----voxs_wavs
      |----id00530_9EtkaLUCdWM_00026.wav
      |----...
|----deepfeature32
      |----id00530_9EtkaLUCdWM_00026.npy
      |----...
|----bboxs
      |----id00530_9EtkaLUCdWM_00026.npy
      |----...
|----poseimg
      |----id00530_9EtkaLUCdWM_00026.npy.gz
      |----...

These can all be extracted with our preprocessing code here. Due to upgrades of the Python environment, there may be some differences in the extracted files. If you find something missing or wrong, please let us know.
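As a sanity check after reorganizing, a small script like the following can confirm that every clip ID is present in each subfolder. This is a minimal sketch based only on the layout shown above; the root path, folder names, and file suffixes are assumptions:

```python
from pathlib import Path

# Expected subfolders and the per-clip file suffix each one uses,
# following the directory layout in the thread above
# ("" means the clip is itself a directory of frames).
EXPECTED = {
    "voxs_images": "",
    "voxs_latent": ".npy",
    "voxs_wavs": ".wav",
    "deepfeature32": ".npy",
    "bboxs": ".npy",
    "poseimg": ".npy.gz",
}

def missing_entries(root):
    """Return {clip_id: [subfolders missing that clip]} for a vox tree."""
    root = Path(root)
    # Use the clip IDs found under voxs_images as the reference set.
    clip_ids = [p.name for p in (root / "voxs_images").iterdir() if p.is_dir()]
    missing = {}
    for clip in clip_ids:
        absent = [
            sub for sub, suffix in EXPECTED.items()
            if not (root / sub / (clip + suffix)).exists()
        ]
        if absent:
            missing[clip] = absent
    return missing
```

Running something like `missing_entries("vox")` before training can catch clips for which one of the preprocessing steps silently failed.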

Calmepro777 commented 18 hours ago

Thanks for the clarification.

I am following your guidance to process the Vox2 dataset.

However, the preprocessed MEAD dataset I downloaded via the link you provided appears to contain only images sampled from the videos. I wonder whether this is sufficient for training.

yuangan commented 16 hours ago

Hi,

The videos are eventually processed into images. We train EAT with the images in the provided data.

However, the provided MEAD data was preprocessed with ffmpeg without -crf 10, so its quality may be lower than data preprocessed with the current preprocessing code. If you want higher-quality training data, you can preprocess MEAD yourself from the original MEAD videos.
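For reference, the two-step flow implied above (re-encode at a high-quality CRF, then dump frames) might be scripted as below. This is a sketch, not the repo's actual preprocessing code; the flags beyond -crf 10 and the frame-naming pattern are assumptions:

```python
import subprocess

def reencode_cmd(src, dst, crf=10):
    """Build an ffmpeg command re-encoding a video at the given CRF.

    Lower CRF means higher quality; the thread notes the shared MEAD
    data was produced *without* -crf 10, hence its lower quality.
    """
    return ["ffmpeg", "-y", "-i", src, "-crf", str(crf), dst]

def extract_frames_cmd(src, out_pattern):
    """Build an ffmpeg command dumping every frame as an image."""
    return ["ffmpeg", "-y", "-i", src, "-q:v", "1", out_pattern]

# Example usage (not executed here), for one hypothetical MEAD clip:
# subprocess.run(reencode_cmd("M003_angry_001.mp4", "tmp.mp4"), check=True)
# subprocess.run(extract_frames_cmd("tmp.mp4", "frames/%06d.jpg"), check=True)
```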