Open kkontras opened 2 months ago
A small update here: I used a learning rate of 5e-6, a cosine annealing scheduler, and a batch size of 8, and I kept the best model based on validation accuracy. These are the only differences I spotted. Meanwhile, I get similar results with the CME model using the default training settings (test ~88.6).
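To be concrete about the schedule and model-selection logic I mean, here is a minimal plain-Python sketch (framework-agnostic; the function names are mine, not the repo's):

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max=5e-6, lr_min=0.0):
    """Cosine annealing: decay the learning rate from lr_max to lr_min
    over total_steps (same shape as PyTorch's CosineAnnealingLR)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

def select_best(epoch_val_accs):
    """Return (epoch_index, accuracy) of the checkpoint with the
    highest validation accuracy -- the 'keep the best model' rule."""
    best_epoch = max(range(len(epoch_val_accs)), key=lambda i: epoch_val_accs[i])
    return best_epoch, epoch_val_accs[best_epoch]
```

So the learning rate starts at 5e-6 and decays to 0 over training, and the reported test number comes from the epoch with the best validation accuracy, not the last epoch.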
Hi,
Thank you for your contribution and the clear code in this repo. I wanted to ask about the unimodal performance. I am training only the text encoder on its own on MOSEI, and I already get somewhat higher accuracy than the multimodal model (val 89.6 and test 88.6). I can see in the paper that those numbers are significantly lower. Am I missing something?
For the record, I trained Rob_d2v_cme_context, keeping only A_output or T_output respectively for each unimodal case. If needed, I can share the exact model.
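By "keeping A_output and T_output" I mean selecting a single modality's output from the multimodal forward pass, roughly like this (the dict shape is my assumption for illustration; only the key names come from the model):

```python
def unimodal_output(model_outputs, modality="T"):
    """Pick one modality's output from a multimodal forward pass.
    `model_outputs` is assumed to map names like 'T_output' (text) and
    'A_output' (audio) to their logits; the dict layout is hypothetical."""
    return model_outputs[f"{modality}_output"]
```

For the text-only run I train and evaluate using only the `T_output` branch, and analogously `A_output` for audio.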