I have never tried feature-wise normalization before, so I'm not sure whether it is better or not. From my experience, I think normalization over the entire training set is needed, no matter how large your dataset is. This is because the normalization gives you a statistically better-behaved dataset (each dimension has a relatively consistent scale/deviation) compared to the raw data. Extremely large or small values will decrease the accuracy or even ruin the training process. Feature-wise normalization, to my understanding, might lose some absolute-value information for each sample. But anyway, you can try feature-wise normalization, or both, and see how they work.
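To be concrete, the training-set-wide normalization I have in mind is roughly the following minimal sketch. It assumes the 65-dim features for all training samples are stacked into one NumPy array; the file name is just a placeholder.

```python
import numpy as np

# all training samples stacked together, shape (total_frames, 65)
# "train_features.npy" is just a placeholder path
train_feats = np.load("train_features.npy")

# statistics computed once over the entire training set
mean = train_feats.mean(axis=0)
std = train_feats.std(axis=0) + 1e-8   # avoid division by zero

def normalize(feats):
    """Apply the fixed training-set statistics to any feature matrix."""
    return (feats - mean) / std

# the same mean/std are reused for validation and test features
train_feats_norm = normalize(train_feats)
```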
Thank you for your reply. I tried the same normalization method as in your code and it works pretty well. By the way, what other preprocessing steps do you use in the time domain? Do you make all the data have the same volume by RMS Normalization or Automatic Gain Control? And how do you deal with noise? Do you use any denoising module?
I didn't do any preprocessing in the time domain. I think volume control and a denoising module would definitely help improve the model, but since the BIWI dataset is relatively clean, I didn't do it in my implementation.
But what if I want to build a real-time speech animation system? When users run it, there will definitely be background noise. Could you recommend any denoising module that can do real-time denoising in this scenario?
I'm not very familiar with real-time denoising modules in Python, but you might want to check https://people.xiph.org/~jm/demo/rnnoise/. I used to do denoising offline with Audacity.
Oh, thank you for your reply. Actually, I am currently building a system that feeds audio segments (each with a duration of 10 ms) into the system and generates the speech animation accordingly. For the demo, we want to play the audio file and the speech animation at the same time. For each audio segment, I tried to fork a child process to play it while generating the speech animation in the parent process, but I ran into some problems with this approach. May I ask how you play the audio and the animation at the same time so that they are perfectly synced?
Most of the time, the generated animation should be synced to the input audio. We first set the FPS of the video, e.g. 25, then generate 25 frames of animation per second of audio, and finally play them together.
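As a toy example of the bookkeeping (not the exact code from this repo), the frame count simply follows from the audio duration and the chosen FPS:

```python
import numpy as np

fps = 25
sample_rate = 16000                       # assumed sample rate
audio = np.zeros(4 * sample_rate)         # dummy 4-second waveform for illustration

duration = len(audio) / sample_rate       # seconds of audio
num_frames = int(round(duration * fps))   # 4.0 s * 25 FPS -> 100 animation frames

# frame i corresponds to audio time i / fps, so rendering the frames at 25 FPS
# alongside the original audio keeps the two streams in sync
print(num_frames)
```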
Does your system play the audio file while generating the animation at the same time? Or do you first feed the audio into the system to generate the animation, and then put the animation and the audio together?
The latter - we generate the animation for the entire audio clip and then put them back together. The algorithm could run in real time, but we haven't made a real-time implementation yet.
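For reference, one common way to put the frames and audio back together offline - not necessarily what the authors do - is to mux them with ffmpeg; the paths and the frame-naming pattern below are placeholders:

```python
import subprocess

# assumes the animation was rendered as frames/frame_0001.png, frame_0002.png, ...
subprocess.run([
    "ffmpeg",
    "-framerate", "25",               # must match the FPS used during generation
    "-i", "frames/frame_%04d.png",    # rendered animation frames
    "-i", "input_audio.wav",          # the original audio track
    "-c:v", "libx264",
    "-pix_fmt", "yuv420p",
    "-shortest",                      # stop at the shorter of the two streams
    "output.mp4",
], check=True)
```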
Got it. Thank you for your reply.
Sorry, I still have an academic question. I have read a bunch of recently published speech animation papers. Most of them are phoneme-based. But I learned from a friend that phonemes are fading away in the field of speech recognition, and people currently tend to use WaveNet-style features instead. I was wondering why the speech animation field is still sticking to this technique. Is it because of the working mode of real-world production pipelines, or some other reason?
Yes, as you said, phoneme-based or transcript-based methods are more production-friendly and can be easily merged into existing animation/film pipelines. But for academia, wave-based features are more informative to some extent and can be used for more challenging tasks (especially with neural networks).
Apologies for dropping into this issue a few months late, but I am attempting a real-time implementation of this and was curious how you'd suggest going about it. From my understanding, python_speech_features does not support real-time microphone input, so the likely approach would be to split the continuous audio stream into chunks and process them in parallel, in a similar fashion to what @fantasy-fish suggested.
Would that be a feasible approach or would either of you suggest a better approach?
Thank you!
@lightnarcissus thanks for your interest. You're right - this is only an offline implementation. If you want a real-time one, I think you'll have to accept a delay of a few milliseconds. You can apply this approach after fetching a small chunk of audio and then produce the animation.
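In case a concrete sketch helps, the chunked approach could look roughly like the following. It assumes sounddevice for microphone capture (python_speech_features only consumes NumPy arrays, so any capture library would do), and the chunk length and the model call are placeholders. With default parameters, mfcc/logfbank/ssc give 13 + 26 + 26 = 65 dimensions per frame, which should match the 65-dim feature input mentioned in this thread.

```python
import numpy as np
import sounddevice as sd
from python_speech_features import mfcc, logfbank, ssc

SAMPLE_RATE = 16000
CHUNK_SECONDS = 0.2          # smaller chunks mean lower latency; tune as needed

def extract_features(chunk, rate=SAMPLE_RATE):
    """Concatenate MFCC + log filterbank + SSC features for one audio chunk."""
    return np.hstack([
        mfcc(chunk, rate),
        logfbank(chunk, rate),
        ssc(chunk, rate),
    ])

def audio_callback(indata, frames, time, status):
    chunk = indata[:, 0]                         # mono channel
    feats = extract_features(chunk)
    # a trained model would consume `feats` here and emit animation frames
    print("chunk feature matrix:", feats.shape)

with sd.InputStream(channels=1,
                    samplerate=SAMPLE_RATE,
                    blocksize=int(CHUNK_SECONDS * SAMPLE_RATE),
                    callback=audio_callback):
    sd.sleep(5000)                               # capture for 5 seconds
```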
Hi, I am doing a similar project on speech animation. I am currently using the architecture and audio features from this paper for the phoneme prediction part, but I am a bit confused about the feature normalization approach in the code. For the 65-dim audio feature input, your code seems to normalize it with the mean and std of the whole training set. However, that may not be a good approach in my case, because my training set is not big enough and may have a different distribution from the test set. After a bit of googling, I tried feature-wise normalization - in other words, normalizing the MFB, MFC and SSC features by each sample's own mean and std. I don't know whether it works, or do you have any suggestions for the feature normalization process?
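Concretely, by feature-wise (per-sample) normalization I mean something like the sketch below; `feats` stands for one utterance's 65-dim feature matrix, and the names are placeholders:

```python
import numpy as np

def per_sample_normalize(feats, eps=1e-8):
    """Normalize one utterance's features by its own mean and std.

    feats: (num_frames, 65) array of MFB + MFC + SSC features for a single sample.
    """
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + eps
    return (feats - mean) / std

# dummy sample, just to show the call
sample = np.random.randn(200, 65)
print(per_sample_normalize(sample).mean(axis=0))   # approximately 0 per dimension
```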