Hi @weiyinwei, I appreciate your work. I am trying to collect the MovieLens dataset, and I have a question:
For the acoustic modality, I use VGGish to extract deep acoustic features. However, the features extracted from each audio clip have shape (C, 128). How does your paper reduce them to shape (1, 128)?