yaohungt / Multimodal-Transformer

[ACL'19] [PyTorch] Multimodal Transformer

OOM on CMU-MOSI with 1080Ti and the reproduction problem #8

Closed hugddygff closed 4 years ago

hugddygff commented 4 years ago

Hi, I followed the supplementary materials and set the batch size to 128, the number of attention layers to 4, and the number of heads to 10, but I get an out-of-memory (OOM) error.

Furthermore, I tried some other hyperparameters, but the results are all bad; for example, the MAE is always larger than 0.9.

So are the results of the model unstable?

Thanks.

yaohungt commented 4 years ago
  1. I never faced the OOM issue. Perhaps you can lower the batch size.

  2. Which dataset are you referring to? These datasets are all small datasets, but they are the best we can find for multimodal human language. MOSEI is a relatively large one, and you can try that first. But still, it is a small dataset. Since they are small, we have to deal with the overfitting problem carefully. Have you tried a different set of dropout rates?

When we tested on these datasets, we did find that performance varies a lot across trials. An MAE around 0.9 is what we got in the first couple of days, without any parameter tuning.
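As a concrete illustration of this kind of tuning, here is a minimal sketch of a sweep over attention dropout rates and seeds. `train_and_eval` is a hypothetical stand-in for the repo's training routine (it returns dummy MAE values so the sketch runs end to end); the dropout values and seeds are arbitrary examples, not recommendations.

```python
import itertools

import numpy as np

DROPOUTS = [0.1, 0.2, 0.3]   # example attention dropout rates to try
SEEDS = [1111, 1112, 1113]   # example seeds; small datasets vary a lot per seed

def train_and_eval(attn_dropout: float, seed: int) -> float:
    """Hypothetical stand-in for the repo's training loop.

    Returns a dummy test MAE so the sketch runs end to end; replace the
    body with a real call into the repo's training code.
    """
    rng = np.random.default_rng(seed)
    return 0.9 + 0.05 * rng.standard_normal()

results = {
    (d, s): train_and_eval(d, s) for d, s in itertools.product(DROPOUTS, SEEDS)
}

# Because single trials are noisy on these small datasets, compare
# configurations by mean and standard deviation across seeds.
for d in DROPOUTS:
    maes = [results[(d, s)] for s in SEEDS]
    print(f"attn_dropout={d}: MAE {np.mean(maes):.3f} +/- {np.std(maes):.3f}")
```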

jerrybai1995 commented 4 years ago

Hi @hugddygff, thanks for your interest in our work. To add to @yaohungt's comment:

It's not exactly a problem with the model; empirically, we found the results on CMU-MOSI quite unstable in general (in comparison, CMU-MOSEI is a lot better), probably because the dataset is very small. It's generally useful to run a few times with different seeds. But an MAE always larger than 0.9 seems weird (we reported 0.887). Try different dropout rates and convolutional kernel sizes.

I would also suggest starting with a [V, A -> L] network (not the full multimodal one) if memory becomes an issue for the unaligned version. Or, you can always reduce the batch size :-)
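If shrinking the batch hurts optimization, gradient accumulation is a standard PyTorch workaround (not mentioned in the thread, but it fits here): emulate a large effective batch with small micro-batches that fit in 1080Ti memory. Below is a minimal self-contained sketch; the toy model and random tensors are placeholders for the real network and data loader.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model and data loader.
model = nn.Linear(300, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.L1Loss()  # L1 loss, matching the MAE metric reported for MOSI

accum_steps = 4  # e.g., 4 micro-batches of 32 ~= one effective batch of 128
micro_batches = [(torch.randn(32, 300), torch.randn(32, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    # Scale the loss so accumulated gradients average over the effective batch.
    loss = criterion(model(x), y) / accum_steps
    loss.backward()  # gradients accumulate across backward() calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # one parameter update per effective batch
        optimizer.zero_grad()  # reset for the next accumulation window
```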

Let us know if this helps.

hugddygff commented 4 years ago

@jerrybai1995 Thanks for the quick reply; I will give it a try later.

shamanez commented 4 years ago

@jerrybai1995 did you manage to reproduce the reported accuracy? I am facing the same issue.

jerrybai1995 commented 4 years ago

Which task were you having problems with? The last time I ran it was a few months ago, but I was able to reproduce results at the same level as reported in the paper.

shamanez commented 4 years ago

I am having a problem with the CMU-MOSI dataset.

jerrybai1995 commented 4 years ago

CMU-MOSI should be pretty easy to reproduce with the hyperparameters we provided in the paper; I was able to do it. However, as I said above, the results will be quite unstable (due to the dataset), and I would suggest going for multiple runs. What MAE are you getting?

shamanez commented 4 years ago

@jerrybai1995 You are perfectly right! I played around with my dropout rates and different seeds. The results actually depend a lot on the seed (yes, due to the small size of the dataset). Anyway, I managed to get the MAE below 0.9. Thanks a lot for your advice; it was really helpful and saved me a lot of time. THANK YOU.
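For anyone else chasing this seed sensitivity, here is a minimal seed-fixing sketch covering the usual PyTorch/NumPy randomness sources (a common recipe, not the repo's own code; cuDNN determinism trades some speed for repeatability):

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix the common sources of randomness for (more) repeatable runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True  # repeatable conv algorithms
    torch.backends.cudnn.benchmark = False     # disable nondeterministic autotuning

set_seed(1111)
```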

mdswyz commented 2 years ago

@shamanez Can you provide the seeds you used for training on MOSI and MOSEI? I also find it difficult to reproduce the results in the paper. Thanks.

shamanez commented 2 years ago

Can you share your results so I can take a look?

mdswyz commented 2 years ago

@shamanez Thanks for getting back to me. Here are the best results I have obtained so far and their hyperparameters; they are still far from the results in the paper.

| Hyperparameter | Unaligned CMU-MOSEI | Aligned CMU-MOSEI | Unaligned CMU-MOSI | Aligned CMU-MOSI |
|---|---|---|---|---|
| Batch Size | 16 | 24 | 64 | 64 |
| Initial Learning Rate | 0.001 | 0.001 | 0.001 | 0.001 |
| Optimizer | Adam | Adam | Adam | Adam |
| Transformer Hidden Unit Size | 40 | 30 | 40 | 40 |
| Crossmodal Blocks | 5 | 5 | 4 | 4 |
| Crossmodal Attention Heads | 8 | 5 | 10 | 10 |
| Temporal Convolution Kernel Size (L/V/A) | 1/3/3 | 1/1/1 | 1/3/3 | 1/3/3 |
| Textual Embedding Dropout | 0.3 | 0.25 | 0.2 | 0.2 |
| Crossmodal Attention Block Dropout | 0.1 | 0.1 | 0.2 | 0.2 |
| Output Dropout | 0.1 | 0 | 0.1 | 0.1 |
| Gradient Clip | 1 | 0.8 | 0.8 | 0.8 |
| Epochs | 40 | 40 | 100 | 100 |
| Seed | 1111 | 1111 | 1111 | 1111 |

| Result | acc7 | acc2 | F1 | MAE | Corr |
|---|---|---|---|---|---|
| Unaligned CMU-MOSEI (mine) | 46.8 | 81.9 | 81.8 | 0.664 | 0.681 |
| Unaligned CMU-MOSEI (paper) | 50.7 | 81.6 | 81.6 | 0.591 | 0.694 |
| Aligned CMU-MOSEI (mine) | 50.0 | 80.9 | 81.1 | 0.611 | 0.673 |
| Aligned CMU-MOSEI (paper) | 51.8 | 82.5 | 82.3 | 0.580 | 0.703 |
| Unaligned CMU-MOSI (mine) | 36.4 | 80.9 | 81.0 | 0.961 | 0.678 |
| Unaligned CMU-MOSI (paper) | 39.1 | 81.1 | 81.0 | 0.889 | 0.686 |
| Aligned CMU-MOSI (mine) | 36.2 | 80.2 | 80.1 | 0.941 | 0.693 |
| Aligned CMU-MOSI (paper) | 40.0 | 83.0 | 82.8 | 0.871 | 0.698 |
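For reference when reading the tables above, acc7, acc2, and MAE for CMU-MOSI/MOSEI sentiment regression are conventionally computed along these lines. This is a sketch of the common convention, not the repo's exact evaluation code; the binary accuracy shown uses the negative-vs-positive convention on non-zero (non-neutral) labels.

```python
import numpy as np

def evaluate(preds: np.ndarray, truths: np.ndarray) -> dict:
    """Sketch of the common CMU-MOSI/MOSEI regression metrics."""
    mae = float(np.mean(np.abs(preds - truths)))
    # acc7: 7-class accuracy after clipping to [-3, 3] and rounding.
    p7, t7 = np.clip(preds, -3.0, 3.0), np.clip(truths, -3.0, 3.0)
    acc7 = float(np.mean(np.round(p7) == np.round(t7)))
    # acc2: binary accuracy restricted to non-zero (non-neutral) labels.
    nz = truths != 0
    acc2 = float(np.mean((preds[nz] > 0) == (truths[nz] > 0)))
    return {"acc7": acc7, "acc2": acc2, "MAE": mae}

print(evaluate(np.array([1.2, -0.4, 2.8]), np.array([1.0, -1.0, 3.0])))
```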