subharya83 opened 3 months ago
Hi, if you have only one audio, you need to replace latent_evp_25/00014.npy and poseimg/00014.npy.gz with the corresponding files from our preprocessed obama video (5 min) in ./demo/video_processed/obama/.
If you have one video, you need to preprocess your custom video first according to our preprocessing code, and then test it as with the other preprocessed videos.
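For concreteness, here is a minimal sketch of that replacement step (the file names inside the obama folders are an assumption, based on the video_name convention used elsewhere):

import shutil
from pathlib import Path

# Hedged sketch: borrow the driven-pose files from the preprocessed obama
# example and rename them for a custom clip named 00014.
src = Path('./demo/video_processed/obama')
dst = Path('./demo/video_processed/00014')

for sub, fname in [('latent_evp_25', 'obama.npy'), ('poseimg', 'obama.npy.gz')]:
    (dst / sub).mkdir(parents=True, exist_ok=True)
    shutil.copy(src / sub / fname, dst / sub / fname.replace('obama', '00014'))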
If you have any questions, feel free to contact us.
Thank you for getting back to me so quickly, and thanks for the incredible effort.
I followed the instructions you gave me. After running the deepspeech feature extraction code, this is what my directory structure looks like:
tree -f ./demo/video_processed/00014/
./demo/video_processed/00014
├── ./demo/video_processed/00014/00014.wav
├── ./demo/video_processed/00014/deepfeature32
│   └── ./demo/video_processed/00014/deepfeature32/00014.npy
├── ./demo/video_processed/00014/latent_evp_25
│   └── ./demo/video_processed/00014/latent_evp_25/00014.npy
└── ./demo/video_processed/00014/poseimg
    └── ./demo/video_processed/00014/poseimg/00014.npy.gz
However, the code still fails here:
EAT_code/demo.py", line 171, in prepare_test_data
audio_frames = torch.stack(audio_frames, dim=0)
RuntimeError: stack expects a non-empty TensorList
Looks like the audio_frames list of tensors is empty.
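For what it's worth, torch.stack raises exactly this message whenever it is handed an empty sequence, so the error says nothing about the tensors themselves. A minimal reproduction:

import torch

# Reproduces the exact error from demo.py line 171: stacking an empty list.
try:
    torch.stack([], dim=0)
except RuntimeError as err:
    print(err)  # stack expects a non-empty TensorList

So the real question is why prepare_test_data collected zero frames. This is the output of ffprobe on my custom audio: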
ffprobe -hide_banner ./demo/video_processed/00014/00014.wav
Input #0, wav, from './demo/video_processed/00014/00014.wav':
  Metadata:
    encoder         : Lavf58.76.100
  Duration: 00:00:06.32, bitrate: 256 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s
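To rule out a header problem, the same properties can be cross-checked from Python with just the standard library (the expected format, 16 kHz mono 16-bit PCM, is taken from the instructions and the ffprobe output above):

import wave

# Quick header check of the custom wav; values should mirror ffprobe.
with wave.open('./demo/video_processed/00014/00014.wav', 'rb') as w:
    # expected: 16000 Hz, 1 channel, 2-byte samples, ~101120 frames (6.32 s)
    print(w.getframerate(), w.getnchannels(), w.getsampwidth(), w.getnframes())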
And this is the structure of the extracted deepspeech feature .npy file:
>>> d=np.load('./demo/video_processed/00014/deepfeature32/00014.npy')
>>> print(d.shape, d.min(), d.max())
(158, 16, 29) -45.72098159790039 22.231658935546875
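That leading dimension also lines up with the clip length at 25 fps, the rate implied by the latent_evp_25 naming, which suggests the feature extraction itself worked:

>>> 6.32 * 25  # ffprobe duration times 25 video frames per second
158.0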
Any thoughts on what I might be doing wrong?
According to the instructions provided here:
Note 2: To test with a custom audio, you need to replace the video_name/video_name.wav and deepspeech feature video_name/deepfeature32/video_name.npy. The output length will depend on the shortest length of the audio and driven poses. Refer to here for more details.
I have copied in a custom audio file with a 16 kHz sampling rate, like the following:
video_processed/00014
├── 00014.wav
├── deepfeature32
├── latent_evp_25
└── poseimg
From the above, how do I get here?
video_processed/00014
├── 00014.wav
├── deepfeature32
│   └── 00014.npy
├── latent_evp_25
│   └── 00014.npy
└── poseimg
    └── 00014.npy.gz
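Related to the note above about the output length following the shorter of the audio and the driven poses, here is a small sanity check I would run once the tree is populated. A hedged sketch: it assumes the first axis of each array is the frame axis, and allow_pickle is only needed if the latent/pose files store object arrays.

import gzip
import numpy as np

base = './demo/video_processed/00014'

# Frame counts across the three inputs; the shortest bounds the output length.
deep = np.load(f'{base}/deepfeature32/00014.npy')
latent = np.load(f'{base}/latent_evp_25/00014.npy', allow_pickle=True)
with gzip.open(f'{base}/poseimg/00014.npy.gz') as f:
    poseimg = np.load(f, allow_pickle=True)

print(len(deep), len(latent), len(poseimg))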