window size -› seg_len

G874713346 commented 3 years ago

How can I draw the window size from a uniform distribution over [240 ms, 1600 ms] during training?

I have two questions about dvector.py in your source code. First, the condition if utterance.size(1) <= self.seg_len: should compare against dimension 0 instead. Dimension 1 is 40, which is always smaller than seg_len=160, so the sliding-window unfold below it is never reached. Second, the output shape of unfold is [batch_size, 40, seg_len], while the input to AttentivePooledLSTMDvector should have shape [batch_size, seg_len, 40], that is, size(-1) must be 40.
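To make the two points concrete, here is a minimal sketch of what I would expect, not the actual code in dvector.py. It assumes a single utterance tensor of shape (n_frames, 40); the function name sliding_segments and the hop length of 80 frames are my own placeholders.

```python
import torch


def sliding_segments(
    utterance: torch.Tensor, seg_len: int = 160, hop_len: int = 80
) -> torch.Tensor:
    """Cut a (n_frames, n_mels) utterance into (n_segments, seg_len, n_mels)."""
    n_frames = utterance.size(0)  # time is dimension 0, not dimension 1
    if n_frames <= seg_len:
        # Too short to slide over: treat the whole utterance as one segment.
        return utterance.unsqueeze(0)
    # Tensor.unfold appends the window dimension last,
    # giving (n_segments, n_mels, seg_len).
    segments = utterance.unfold(0, seg_len, hop_len)
    # Swap the last two dimensions so that size(-1) is n_mels again.
    return segments.transpose(1, 2)


mel = torch.randn(400, 40)  # hypothetical 400-frame, 40-bin mel spectrogram
segments = sliding_segments(mel)
print(segments.shape)
```

With these assumed values the windows start at frames 0, 80, 160 and 240, so the last dimension stays at 40 as the model expects.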

As for the uniformly distributed seg_len, can I simply sample seg_len from the uniform distribution while iterating over each utterance?

I hope you can give me an answer, thank you!

yistLin commented 3 years ago

Thank you for pointing out this important issue! You are right, and I think the last time I modified this part I didn't test it thoroughly...

I'll fix this ASAP!

As for a uniform distribution of seg_len during training, I haven't implemented this yet. The function embed_utterance is only used at test time. You can take a look at the __getitem__ function in ge2e_dataset.py (lines 53-55) and sample the length there.
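A rough sketch of what that sampling could look like, not code from the repository. It assumes a 10 ms frame shift, so that [240 ms, 1600 ms] maps to 24 to 160 frames; the helper name crop_random_segment and its parameters are hypothetical.

```python
import random


def crop_random_segment(frames, min_len=24, max_len=160, rng=random):
    """Return a random crop whose length is uniform in [min_len, max_len].

    `frames` can be any sliceable sequence indexed by time along its first
    axis, e.g. a list of frames or a (n_frames, n_mels) tensor.
    The 24/160 defaults assume a 10 ms frame shift.
    """
    n_frames = len(frames)
    # Clamp the range so short utterances are returned as they are.
    upper = min(max_len, n_frames)
    lower = min(min_len, upper)
    seg_len = rng.randint(lower, upper)  # both bounds are inclusive
    start = rng.randint(0, n_frames - seg_len)
    return frames[start:start + seg_len]


# Usage with a dummy 300-frame utterance represented by frame indices.
segment = crop_random_segment(list(range(300)), rng=random.Random(0))
print(len(segment))
```

Note that if the cropped utterances are stacked into one batch tensor, they all need the same length, so in that case the length would have to be sampled once per batch (for example in a collate function) rather than once per utterance.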

yistLin commented 3 years ago

The unfolding problem has been fixed.

G874713346 commented 3 years ago

Thank you for your answer. I have another question. Since the seg_len condition for the sliding window was wrong, did the model dvector.pt in the example use the sliding window? Or did it extract a single d-vector directly instead of averaging d-vectors over sliding windows?

yistLin commented 3 years ago

I'm not quite sure what dvector.pt in the example refers to. If you mean the released JIT-compiled dvector-step250000.pt, then yes, it has been recompiled and now uses the sliding window to extract audio segments.

yistLin / dvector

window size -› seg_len #6