I have been trying to implement the paper "Deep clustering: Discriminative embeddings for segmentation and separation", but I am unable to create batches because each audio file has a different number of frames. I came across one sentence in the experimental setup section: "To ensure the local coherency, the mixture speech was segmented with the length of 100 frames". My understanding is that the authors divide each sample into 100-frame chunks and use each chunk as an input. Is that how the authors handle variable-length input to the LSTM?
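To make my understanding concrete, here is roughly what I have in mind: a minimal sketch, assuming the input is an STFT magnitude array of shape `(num_frames, num_bins)`. The `segment_frames` helper is my own, and the paper does not say how a trailing remainder shorter than 100 frames is handled, so I simply drop it here (padding it would be another option):

```python
import numpy as np

def segment_frames(spectrogram, segment_len=100):
    """Split a (num_frames, num_bins) spectrogram into fixed-length chunks.

    Assumption (mine, not stated in the paper): any trailing remainder
    shorter than segment_len is dropped; padding would also be plausible.
    """
    num_frames, num_bins = spectrogram.shape
    num_segments = num_frames // segment_len
    segments = [
        spectrogram[i * segment_len:(i + 1) * segment_len]
        for i in range(num_segments)
    ]
    if not segments:
        return np.empty((0, segment_len, num_bins))
    # Stack into (num_segments, segment_len, num_bins), ready for batching.
    return np.stack(segments)

# Example: a 407-frame mixture with 129 frequency bins yields 4 chunks of 100 frames.
mixture = np.random.randn(407, 129)
batchable = segment_frames(mixture)
print(batchable.shape)  # (4, 100, 129)
```

With every chunk fixed at 100 frames, chunks from different files could then be stacked freely into mini-batches, which is what I assume makes the LSTM training work despite the variable file lengths. Is that interpretation correct?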