hykilpikonna opened this issue 1 year ago
Here is the absolute error between the original and the segmented calculations for a 4-minute audio, plotted on a graph. Oddly, the error is not confined to the overlapping areas; it bleeds into the entire rest of each segment.
The model seems to process the entire audio at once, which leads to high VRAM usage for long audio. When I tried to compute MERT embeddings for a 9:58 audio on an A100 80GB GPU, it attempted to allocate 90GB of VRAM.
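That allocation size is consistent with self-attention memory growing quadratically with sequence length. A rough back-of-the-envelope sketch follows; the frame rate, head count, and fp32 storage are illustrative guesses rather than MERT's confirmed internals, so only the quadratic scaling is the point:

```python
def attention_matrix_gib(seconds, frames_per_second=75, num_heads=12,
                         bytes_per_element=4):
    """Rough size of ONE layer's attention score matrix in GiB.

    frames_per_second=75 assumes a wav2vec2-style feature encoder;
    num_heads=12 and fp32 storage are likewise illustrative assumptions.
    """
    t = seconds * frames_per_second          # sequence length in frames
    return t * t * num_heads * bytes_per_element / 2**30

# 9:58 of audio -> a single layer's attention scores alone land
# in the same ballpark as the reported ~90GB allocation
print(round(attention_matrix_gib(9 * 60 + 58), 1))
```

Doubling the audio length quadruples this, which is why chunking is the usual workaround.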
Is it possible to split the audio first, process each segment separately, and obtain the same results? I tried splitting the audio into 60s windows with the code below. Even though I managed to reshape the segmented embeddings to match the original, they show a large mean squared error compared with passing the entire audio in at once.
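For what it's worth, the bleed across the whole segment is what you would expect from any layer with global context: self-attention mixes every frame with every other frame, so cutting the audio changes every output, not just the frames near the cut. A toy numpy sketch (a single softmax-attention step on a scalar sequence, not MERT itself) shows the effect:

```python
import numpy as np

def toy_global_model(x):
    # Softmax self-attention over the whole sequence: each output
    # is a weighted mix of ALL inputs, like a transformer layer.
    scores = x[:, None] * x[None, :]
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(120)

full = toy_global_model(x)                                        # whole input at once
chunked = np.concatenate([toy_global_model(c) for c in np.split(x, 2)])

err = np.abs(full - chunked)
# err is nonzero across essentially the whole sequence,
# not only near the chunk boundary at index 60
```

If that is what is happening here, chunked embeddings can be made close (e.g. with generous overlap and cross-fading), but with full global attention no finite overlap reproduces the single-pass result exactly.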