yizhilll / MERT

Official implementation of the paper "Acoustic Music Understanding Model with Large-Scale Self-supervised Training".
Apache License 2.0
301 stars 18 forks

When jointly mapping features from different modalities into the same semantic space, which hidden layer is most appropriate to select? #11

Closed tanggang1997 closed 6 months ago

tanggang1997 commented 10 months ago

I have extracted pre-trained image features using CLIP, but I don't know which layer of MERT's features is more suitable for the

yizhilll commented 6 months ago

Hi, I think the layer selection really depends on the task you are working on. If you are doing CLIP-style training, I would suggest using the last layer's output as a general-purpose default.
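
For illustration, here is a minimal, framework-agnostic sketch of what "use the last layer" could look like once the per-layer features have been extracted. It is not taken from the MERT codebase: the helper names (`select_layer`, `mean_pool`, `cosine`) and the toy numbers are hypothetical, plain Python lists stand in for tensors, and it assumes the audio and image vectors have already been projected to the same dimension (a real CLIP-style setup would learn projection heads for that).

```python
import math

def select_layer(hidden_states, layer=-1):
    """Pick one layer from a list of per-layer [time][dim] feature matrices.

    layer=-1 selects the last layer, the general-purpose default suggested above.
    """
    return hidden_states[layer]

def mean_pool(frames):
    """Average a [time][dim] matrix over time to get one clip-level vector."""
    n_frames = len(frames)
    return [sum(column) / n_frames for column in zip(*frames)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-in for extracted audio features: 3 layers, 2 time frames, 2 dims.
# (Hypothetical values, only to make the sketch runnable.)
hidden_states = [
    [[1.0, 0.0], [1.0, 0.0]],  # layer 0
    [[0.0, 1.0], [0.0, 1.0]],  # layer 1
    [[2.0, 2.0], [4.0, 4.0]],  # last layer
]

# Last-layer features, pooled over time into a single audio embedding.
audio_vec = mean_pool(select_layer(hidden_states))

# Hypothetical image embedding, assumed to live in the same space already.
image_vec = [1.0, 1.0]

similarity = cosine(audio_vec, image_vec)
```

If another layer turns out to suit a specific task better, only the `layer` argument needs to change, so it is cheap to compare a few layers on a validation set before committing to one.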