Closed G874713346 closed 3 years ago
Thank you for pointing out this important issue! You are right, and I think the last time I modified this part I didn't test it thoroughly...
I'll fix this ASAP!
As for a uniform distribution of seg_len during training, I haven't implemented this yet. The function embed_utterance is only used at test time. You can take a look at the __getitem__ function in ge2e_dataset.py (lines 53-55) and sample the length there.
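For illustration, sampling the length inside __getitem__ could look roughly like the sketch below. This is not the repository's code: the helper name sample_segment and the 10 ms frame shift (which maps [240ms, 1600ms] to 24-160 frames, consistent with seg_len=160) are assumptions.

```python
import random

FRAME_SHIFT_MS = 10  # assumption: one feature frame every 10 ms


def sample_segment(frames, min_ms=240, max_ms=1600, rng=random):
    """Return a random crop whose length is uniform in [min_ms, max_ms].

    `frames` is any sequence indexed by time (a list of frames, or a
    tensor of shape [n_frames, n_mels]). Hypothetical helper.
    """
    min_len = min_ms // FRAME_SHIFT_MS
    max_len = max_ms // FRAME_SHIFT_MS
    # Never ask for more frames than the utterance actually has.
    upper = min(max_len, len(frames))
    lower = min(min_len, upper)
    seg_len = rng.randint(lower, upper)  # inclusive on both ends
    start = rng.randint(0, len(frames) - seg_len)
    return frames[start:start + seg_len]
```

Note that drawing the length per utterance produces segments of different lengths. If the training loop stacks segments into one tensor, the length would probably have to be drawn once per batch instead (for example in a collate function).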
The unfolding problem has been fixed.
Thank you for your answer. I have another question: since the seg_len condition for the sliding window was wrong, did the dvector.pt model in the example actually use the sliding window? Or did it extract a single d-vector directly, without averaging d-vectors over sliding windows?
I'm not quite sure what dvector.pt in the example refers to. If you mean the released jit-compiled dvector-step250000.pt, then yes, it has been recompiled and now uses the sliding window to extract audio segments.
How can I draw the window size from a uniform distribution within [240ms, 1600ms] during training?
In your source code dvector.py, there are two problems. The first is the condition if utterance.size(1) <= self.seg_len:, which should compare against dimension 0 instead. Dimension 1 is 40, which is always smaller than seg_len=160, so the sliding-window unfold branch that follows can never be reached. The second is that the output shape of unfold is [batch_size, 40, seg_len], while the input shape of AttentivePooledLSTMDvector should be [batch_size, seg_len, 40], that is, size(-1) must be 40.
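To make both points concrete, here is a minimal sketch of what I would expect the sliding-window step to look like. It is not the actual code from dvector.py: the function name sliding_segments and the hop_len parameter are made up for illustration, and the utterance is assumed to have shape [n_frames, 40].

```python
import torch


def sliding_segments(utterance, seg_len=160, hop_len=80):
    """Split a [n_frames, n_mels] utterance into [n_windows, seg_len, n_mels].

    Hypothetical helper; hop_len is an assumed parameter name.
    """
    # Compare along the time axis (dimension 0), not the feature axis.
    if utterance.size(0) <= seg_len:
        return utterance.unsqueeze(0)  # a single, shorter segment
    # unfold appends the window axis last: [n_windows, n_mels, seg_len]
    windows = utterance.unfold(0, seg_len, hop_len)
    # Swap the last two axes so that size(-1) is n_mels again.
    return windows.transpose(1, 2)
```

With a [400, 40] input and these defaults, the result has shape [4, 160, 40], and each row of the first axis is one window that can be passed to the model before averaging the resulting d-vectors.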
As for the uniformly distributed seg_len, can I simply sample a new seg_len from the uniform distribution each time an utterance is loaded?
I hope you can give me an answer, thank you!