[Paper] accepted at NAACL 2021:
Multimodal End-to-End Sparse Model for Emotion Recognition, by *[Wenliang Dai](https://wenliangdai.github.io/)*, Samuel Cahyawijaya, Zihan Liu, Pascale Fung.
Existing works on multimodal affective computing tasks, such as emotion recognition, generally adopt a two-phase pipeline, first extracting feature representations for each single modality with hand-crafted algorithms and then performing end-to-end learning with the extracted features. However, the extracted features are fixed and cannot be further fine-tuned on different target tasks, and manually finding feature extraction algorithms does not generalize or scale well to different tasks, which can lead to sub-optimal performance. In this paper, we develop a fully end-to-end model that connects the two phases and optimizes them jointly. In addition, we restructure the current datasets to enable the fully end-to-end training. Furthermore, to reduce the computational overhead brought by the end-to-end model, we introduce a sparse cross-modal attention mechanism for the feature extraction. Experimental results show that our fully end-to-end model significantly surpasses the current state-of-the-art models based on the two-phase pipeline. Moreover, by adding the sparse cross-modal attention, our model can maintain performance with around half the computation in the feature extraction part.
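For intuition, here is a minimal PyTorch sketch of one way threshold-based cross-modal sparsification can look: a pooled text representation scores the spatial positions of a visual (or acoustic) CNN feature map, and positions below a threshold are zeroed so later layers can skip them. This is an illustrative approximation, not the exact mechanism from the paper; the tensor shapes, normalization, and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def sparse_cross_modal_mask(visual_feats, text_query, threshold=0.8):
    """Illustrative sketch (not the paper's exact mechanism): keep only the
    spatial positions of a CNN feature map that receive enough cross-modal
    attention from a pooled text representation.

    visual_feats: (B, C, H, W) feature map from a CNN block
    text_query:   (B, C) pooled text representation
    threshold:    keep positions whose max-normalized attention >= threshold
    """
    B, C, H, W = visual_feats.shape
    keys = visual_feats.flatten(2).transpose(1, 2)            # (B, H*W, C)
    scores = torch.einsum('bnc,bc->bn', keys, text_query)     # (B, H*W)
    attn = F.softmax(scores / C ** 0.5, dim=-1)
    # Normalize by the per-sample max so the threshold is scale-free,
    # then zero out low-attention positions; sparse layers can skip them.
    mask = (attn / attn.max(dim=-1, keepdim=True).values >= threshold).float()
    return visual_feats * mask.view(B, 1, H, W)
```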
If your work is inspired by our paper or code, please cite it. Thanks!
@inproceedings{dai-etal-2021-multimodal,
    title = "Multimodal End-to-End Sparse Model for Emotion Recognition",
    author = "Dai, Wenliang and Cahyawijaya, Samuel and Liu, Zihan and Fung, Pascale",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.417",
    doi = "10.18653/v1/2021.naacl-main.417",
    pages = "5305--5316",
}
You can also check out our blog here 😊.
As mentioned in our paper, one of our contributions is that we reorganized two datasets (IEMOCAP and CMU-MOSEI) to enable training from the raw data. To the best of our knowledge, prior to our work, papers using these two datasets relied on pre-extracted features, and we did not find a way to map those features back to the raw data. Therefore, we did a heavy reorganization of these datasets (refer to Section 3 of the paper for more details).
The raw data can be downloaded from CMU-MOSEI (~120GB) and IEMOCAP (~16.5GB). However, for IEMOCAP, you need to request permission from the original authors first; then we can give you the passcode to download it.
We provide two Python scripts in the ./preprocessing folder as examples of how to process the raw data. Alternatively, you can download our processed data and use it for training directly, as shown in the section below.
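As a rough illustration of what such preprocessing involves, the sketch below samples one video frame every fixed number of milliseconds (mirroring the --img-interval training option). The function name and logic are illustrative assumptions, not a copy of the actual scripts; consult ./preprocessing for the full pipeline.

```python
# Illustrative sketch only (not the actual ./preprocessing scripts):
# sample one frame every `interval_ms` milliseconds from a video file.
import cv2

def sample_frames(video_path, interval_ms=500):
    cap = cv2.VideoCapture(video_path)
    frames, t = [], 0.0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t)   # jump to timestamp t (in ms)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        t += interval_ms
    cap.release()
    return frames
```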
To run our code directly, you can download the processed data from here (88.6G). Unzip it, and the tree structure of the data directory looks like this:
./data
- IEMOCAP_HCF_FEATURES
- IEMOCAP_RAW_PROCESSED
- IEMOCAP_SPLIT
- MOSEI_RAW_PROCESSED
- MOSEI_HCF_FEATURES
- MOSEI_SPLIT
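Below are example training commands for, in order, the fully end-to-end model (--model=mme2e), its sparse variant (--model=mme2e_sparse), and the two baselines that use hand-crafted features (--model=lf_rnn and --model=lf_transformer):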
python main.py -lr=5e-5 -ep=40 -mod=tav -bs=8 --img-interval=500 --early-stop=6 --loss=bce --cuda=3 --model=mme2e --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-lr-factor=10 --text-model-size=base --text-max-len=100
python main.py -lr=5e-5 -ep=40 -mod=tav -bs=2 --img-interval=500 --early-stop=6 --loss=bce --cuda=3 --model=mme2e_sparse --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-lr-factor=10 -st=0.8 --text-model-size=base --text-max-len=100
python main.py -lr=5e-4 -ep=60 -mod=tav -bs=32 --early-stop=8 --loss=bce --cuda=1 --model=lf_rnn --num-emotions=6 --hand-crafted --clip=2
python main.py -lr=5e-4 -ep=60 -mod=tav -bs=32 --early-stop=8 --loss=bce --cuda=0 --model=lf_transformer --num-emotions=6 --hand-crafted --clip=2
usage: main.py [-h] -bs BATCH_SIZE -lr LEARNING_RATE [-wd WEIGHT_DECAY] -ep
EPOCHS [-es EARLY_STOP] [-cu CUDA] [-cl CLIP] [-sc] [-se SEED]
[--loss LOSS] [--optim OPTIM] [--text-lr-factor TEXT_LR_FACTOR]
[-mo MODEL] [--text-model-size TEXT_MODEL_SIZE]
[--fusion FUSION] [--feature-dim FEATURE_DIM]
[-st SPARSE_THRESHOLD] [-hfcs HFC_SIZES [HFC_SIZES ...]]
[--trans-dim TRANS_DIM] [--trans-nlayers TRANS_NLAYERS]
[--trans-nheads TRANS_NHEADS] [-aft AUDIO_FEATURE_TYPE]
[--num-emotions NUM_EMOTIONS] [--img-interval IMG_INTERVAL]
[--hand-crafted] [--text-max-len TEXT_MAX_LEN]
[--datapath DATAPATH] [--dataset DATASET] [-mod MODALITIES]
[--valid] [--test] [--ckpt CKPT] [--ckpt-mod CKPT_MOD]
[-dr DROPOUT] [-nl NUM_LAYERS] [-hs HIDDEN_SIZE] [-bi] [--gru]
Multimodal End-to-End Sparse Model for Emotion Recognition
optional arguments:
-h, --help show this help message and exit
-bs BATCH_SIZE, --batch-size BATCH_SIZE
Batch size
-lr LEARNING_RATE, --learning-rate LEARNING_RATE
Learning rate
-wd WEIGHT_DECAY, --weight-decay WEIGHT_DECAY
Weight decay
-ep EPOCHS, --epochs EPOCHS
Number of epochs
-es EARLY_STOP, --early-stop EARLY_STOP
Early stop
-cu CUDA, --cuda CUDA
Cuda device number
-cl CLIP, --clip CLIP
Clip value for gradients
-sc, --scheduler Use a scheduler for the optimizer
-se SEED, --seed SEED
Random seed
--loss LOSS loss function
--optim OPTIM optimizer function: adam/sgd
--text-lr-factor TEXT_LR_FACTOR
Factor for the learning rate of the text model
-mo MODEL, --model MODEL
Which model
--text-model-size TEXT_MODEL_SIZE
Size of the pre-trained text model
--fusion FUSION How to fuse modalities
--feature-dim FEATURE_DIM
Dimension of features output by each modality model
-st SPARSE_THRESHOLD, --sparse-threshold SPARSE_THRESHOLD
Threshold of sparse CNN layers
-hfcs HFC_SIZES [HFC_SIZES ...], --hfc-sizes HFC_SIZES [HFC_SIZES ...]
Hand crafted feature sizes
--trans-dim TRANS_DIM
Dimension of the transformer after CNN
--trans-nlayers TRANS_NLAYERS
Number of layers of the transformer after CNN
--trans-nheads TRANS_NHEADS
Number of heads of the transformer after CNN
-aft AUDIO_FEATURE_TYPE, --audio-feature-type AUDIO_FEATURE_TYPE
Hand crafted audio feature types
--num-emotions NUM_EMOTIONS
Number of emotions in data
--img-interval IMG_INTERVAL
Interval to sample image frames
--hand-crafted Use hand crafted features
--text-max-len TEXT_MAX_LEN
Max length of text after tokenization
--datapath DATAPATH Path of data
--dataset DATASET Use which dataset
-mod MODALITIES, --modalities MODALITIES
what modalities to use
--valid Only run validation
--test Only run test
--ckpt CKPT Path of checkpoint
--ckpt-mod CKPT_MOD Load which modality of the checkpoint
-dr DROPOUT, --dropout DROPOUT
dropout
-nl NUM_LAYERS, --num-layers NUM_LAYERS
num of layers of LSTM
-hs HIDDEN_SIZE, --hidden-size HIDDEN_SIZE
hidden vector size of LSTM
-bi, --bidirectional Use Bi-LSTM
--gru Use GRU rather than LSTM
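As a usage example, a trained checkpoint can be evaluated on the test set only by reusing a training command and adding the --test and --ckpt flags (the checkpoint path below is a placeholder):
python main.py -lr=5e-5 -ep=1 -bs=8 -mod=tav --model=mme2e --num-emotions=6 --text-max-len=100 --test --ckpt=./checkpoints/your_checkpoint.pt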