Closed — komolebi closed this issue 5 years ago
Hi,
We pad or truncate so that every sample has the same size (left-padding with zeros, or truncating the sequence to a reasonable length). If you like, you can also process the sequences yourself with the SDK we link in the README.md, for example without any truncating or padding, but that makes training hard to scale. Hence, for convenience, just use the streams we already processed directly :)
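For reference, here is a minimal sketch of the pad-or-truncate idea described above. This is not the SDK's actual code: the function name, feature dimensions, and the choice of which end to truncate are illustrative, assuming each modality is a numpy array of shape (time, feature).

```python
import numpy as np

def pad_or_truncate(seq, max_len):
    """Force a (time, feature) array to exactly max_len time steps.

    Sequences longer than max_len are truncated; shorter ones are
    left-padded with zeros (mirroring the 'left padding zeroes'
    mentioned above). Truncation side here is an illustrative choice.
    """
    seq = np.asarray(seq, dtype=np.float32)
    if seq.shape[0] >= max_len:
        return seq[-max_len:]  # keep the last max_len steps
    pad = np.zeros((max_len - seq.shape[0], seq.shape[1]), dtype=np.float32)
    return np.concatenate([pad, seq], axis=0)  # zeros in front, data at the end

# Illustrative usage with the fixed lengths the question below refers to
# for CMU-MOSI (375 acoustic steps, 500 visual steps); the feature
# dimensions below are hypothetical placeholders.
audio = np.random.randn(1200, 74)
vision = np.random.randn(300, 47)
print(pad_or_truncate(audio, 375).shape)   # (375, 74)
print(pad_or_truncate(vision, 500).shape)  # (500, 47)
```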
I saw in the paper: 'we keep the original audio and visual features as extracted, without any word-segmented alignment or manual subsampling. As a result, the lengths of each modality vary significantly, where audio and vision sequences may contain up to > 1,000 time steps.' But the time steps for the training data are all the same. For example, the acoustic and visual features of CMU-MOSI are all 375 and 500, respectively. How does that work?