yuangan / EAT_code

Official code for ICCV 2023 paper: "Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation".

How to generate files for custom audio? #34

Open subharya83 opened 3 months ago

subharya83 commented 3 months ago

According to the instructions provided here:

Note 2: To test with a custom audio, you need to replace the video_name/video_name.wav and deepspeech feature video_name/deepfeature32/video_name.npy. The output length will depend on the shortest length of the audio and driven poses. Refer to here for more details.

I have copied a custom audio file with a 16 kHz sampling rate into video_processed/00014, like the following:

├── 00014.wav
├── deepfeature32
├── latent_evp_25
└── poseimg

From the above, how do I get here?

video_processed/00014
├── 00014.wav
├── deepfeature32
│   └── 00014.npy
├── latent_evp_25
│   └── 00014.npy
└── poseimg
    └── 00014.npy.gz

yuangan commented 3 months ago

Hi, if you only have an audio file, you need to follow these steps:

  1. Extract deepfeature32 with the code here.
  2. Organize the files: replace the pose-related files latent_evp_25/00014.npy and poseimg/00014.npy.gz with those from our preprocessed obama video (5 min) in ./demo/video_processed/obama/ (see the sketch after this list).
  3. Test as the readme shows.
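
For step 2, something like this should work (a rough sketch, not a script from this repo; it assumes the obama files follow the same video_name/folder/video_name.* naming pattern shown above):

# Rough sketch of step 2: reuse obama's pose-related files for a custom audio.
import shutil
from pathlib import Path

VIDEO_NAME = "00014"                                  # custom audio name
root = Path("./demo/video_processed") / VIDEO_NAME
obama = Path("./demo/video_processed/obama")          # preprocessed 5-min obama demo

# deepfeature32/00014.npy should already exist after step 1.
assert (root / "deepfeature32" / f"{VIDEO_NAME}.npy").exists()

# Copy obama's pose-related files under the custom name.
(root / "latent_evp_25").mkdir(parents=True, exist_ok=True)
(root / "poseimg").mkdir(parents=True, exist_ok=True)
shutil.copy(obama / "latent_evp_25" / "obama.npy",
            root / "latent_evp_25" / f"{VIDEO_NAME}.npy")
shutil.copy(obama / "poseimg" / "obama.npy.gz",
            root / "poseimg" / f"{VIDEO_NAME}.npy.gz")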

If you have a video, you need to preprocess your custom video first with our preprocessing code, and then test it as with the other preprocessed videos.

If you have any questions, feel free to contact us.

subharya83 commented 3 months ago

Thank you for getting back to me so quickly. And, thanks for the incredible effort.

subharya83 commented 3 months ago

I followed the instructions you gave me. After running the deepspeech feature extraction code, this is what my directory structure looks like:

tree  -f ./demo/video_processed/00014/
./demo/video_processed/00014
├── ./demo/video_processed/00014/00014.wav
├── ./demo/video_processed/00014/deepfeature32
│   └── ./demo/video_processed/00014/deepfeature32/00014.npy
├── ./demo/video_processed/00014/latent_evp_25
│   └── ./demo/video_processed/00014/latent_evp_25/00014.npy
└── ./demo/video_processed/00014/poseimg
    └── ./demo/video_processed/00014/poseimg/00014.npy.gz

However, the code still fails here:

EAT_code/demo.py", line 171, in prepare_test_data
    audio_frames = torch.stack(audio_frames, dim=0)
RuntimeError: stack expects a non-empty TensorList

Looks like the audio_frames list of tensors is empty. This is the output of ffprobe on my custom audio:

 ffprobe -hide_banner ./demo/video_processed/00014/00014.wav 
Input #0, wav, from './demo/video_processed/00014/00014.wav':
  Metadata:
    encoder         : Lavf58.76.100
  Duration: 00:00:06.32, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s

And this is the shape of the extracted deepspeech feature .npy file:

>>> d=np.load('./demo/video_processed/00014/deepfeature32/00014.npy')
>>> print(d.shape, d.min(), d.max())
(158, 16, 29) -45.72098159790039 22.231658935546875
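
For what it's worth, the feature length also looks consistent with the audio duration, assuming one 16x29 deepspeech window per video frame at 25 fps (my assumption):

import numpy as np

# Quick consistency check: feature windows vs. expected frame count at 25 fps.
d = np.load('./demo/video_processed/00014/deepfeature32/00014.npy')
duration_s = 6.32                          # from ffprobe above
print(d.shape[0], round(duration_s * 25))  # 158 vs 158, so the feature file looks fine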

Any thoughts on what I might be doing wrong?

yuangan commented 3 months ago

Hi, have you checked the value of num_frames here?

I forgot about the processed ground-truth (gt) frames in my earlier instructions. You may need to copy the cropped images from the preprocessed obama files.
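
As a rough check (the folder name "imgs" below is a guess, not confirmed against the repo): if num_frames is derived from the processed ground-truth frames and none exist for 00014, the loop that builds audio_frames never runs, which would explain the empty TensorList.

# Rough check: count the processed ground-truth frames for the custom name.
# The "imgs" folder name is an assumption; use whatever folder holds the
# cropped frames in the preprocessed obama demo.
from pathlib import Path

root = Path('./demo/video_processed/00014')
frames_dir = root / 'imgs'
n = len(list(frames_dir.glob('*'))) if frames_dir.exists() else 0
print('processed gt frames:', n)
# If this prints 0, copy (or symlink) the cropped images from
# ./demo/video_processed/obama/ under the 00014 name.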