Closed by kareemamrr 4 years ago
A segment does not have to cover the entire window. As long as more than 50% of the window falls into the segment, you can count that window as part of the segment.
But this is just the practice that worked well for our own experiment setup. It's not necessarily the best practice for your problem.
The authors of the paper state that, after extracting all embeddings, they aggregate them by segment, where a segment is at most 400ms long after VAD processing. Also, a single embedding represents 240ms of the original signal, and consecutive windows overlap by 120ms. So two full embeddings would represent 360ms of the original signal; what about the remaining 40ms? In this issue one of the authors states that a segment has about 4 windows, but I couldn't understand how that is achieved.
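To make the arithmetic concrete, here is a minimal sketch of the "more than 50% of the window falls into the segment" rule. It is not the authors' code. It assumes the 240ms windows are laid out on a fixed 120ms grid over the whole utterance (not restarted at each segment boundary), and the function name `windows_in_segment` and its parameters are made up for illustration.

```python
def windows_in_segment(seg_start_ms, seg_end_ms, win_ms=240, step_ms=120, min_frac=0.5):
    """Return start times (ms) of grid windows assigned to a segment.

    Assumption: windows start at 0, step_ms, 2*step_ms, ... over the whole
    utterance. A window is kept when MORE than min_frac of its length lies
    inside [seg_start_ms, seg_end_ms).
    """
    # Smallest grid position whose window still reaches past the segment start.
    first = ((seg_start_ms - win_ms) // step_ms + 1) * step_ms
    first = max(first, 0)  # no windows before the start of the utterance

    kept = []
    for w in range(first, seg_end_ms, step_ms):
        # Length of the intersection between the window and the segment.
        overlap = min(w + win_ms, seg_end_ms) - max(w, seg_start_ms)
        if overlap > min_frac * win_ms:
            kept.append(w)
    return kept


# A 400ms segment aligned with the window grid picks up 3 windows:
print(windows_in_segment(0, 400))    # [0, 120, 240]
# The same length shifted by 100ms picks up 4 windows:
print(windows_in_segment(100, 500))  # [0, 120, 240, 360]
```

Under these assumptions a 400ms segment gets 3 or 4 windows depending on how it lines up with the grid, because partially covered windows at the edges are also counted. That would be consistent with "about 4 windows", and it means the leftover 40ms is simply covered by a window that extends past the segment boundary.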