
Multimodal End-to-End Sparse Model for Emotion Recognition

License: CC BY 4.0

Paper accepted at NAACL 2021:

Multimodal End-to-End Sparse Model for Emotion Recognition, by *[Wenliang Dai](https://wenliangdai.github.io/)*, Samuel Cahyawijaya, Zihan Liu, Pascale Fung.

Paper Abstract

Existing works on multimodal affective computing tasks, such as emotion recognition, generally adopt a two-phase pipeline, first extracting feature representations for each single modality with hand-crafted algorithms and then performing end-to-end learning with the extracted features. However, the extracted features are fixed and cannot be further fine-tuned on different target tasks, and manually finding feature extraction algorithms does not generalize or scale well to different tasks, which can lead to sub-optimal performance. In this paper, we develop a fully end-to-end model that connects the two phases and optimizes them jointly. In addition, we restructure the current datasets to enable the fully end-to-end training. Furthermore, to reduce the computational overhead brought by the end-to-end model, we introduce a sparse cross-modal attention mechanism for the feature extraction. Experimental results show that our fully end-to-end model significantly surpasses the current state-of-the-art models based on the two-phase pipeline. Moreover, by adding the sparse cross-modal attention, our model can maintain performance with around half the computation in the feature extraction part.

If your work is inspired by our paper or code, please cite it. Thanks!

@inproceedings{dai-etal-2021-multimodal,
    title = "Multimodal End-to-End Sparse Model for Emotion Recognition",
    author = "Dai, Wenliang  and
      Cahyawijaya, Samuel  and
      Liu, Zihan  and
      Fung, Pascale",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.417",
    doi = "10.18653/v1/2021.naacl-main.417",
    pages = "5305--5316",
    abstract = "Existing works in multimodal affective computing tasks, such as emotion recognition and personality recognition, generally adopt a two-phase pipeline by first extracting feature representations for each single modality with hand crafted algorithms, and then performing end-to-end learning with extracted features. However, the extracted features are fixed and cannot be further fine-tuned on different target tasks, and manually finding feature extracting algorithms does not generalize or scale well to different tasks, which can lead to sub-optimal performance. In this paper, we develop a fully end-to-end model that connects the two phases and optimizes them jointly. In addition, we restructure the current datasets to enable the fully end-to-end training. Furthermore, to reduce the computational overhead brought by the end-to-end model, we introduce a sparse cross-modal attention mechanism for the feature extraction. Experimental results show that our fully end-to-end model significantly surpasses the current state-of-the-art models based on the two-phase pipeline. Moreover, by adding the sparse cross-modal attention, our model can maintain the performance with around half less computation in the feature extraction part of the model.",
}

You can also check our blog here 😊.

Dataset

As mentioned in our paper, one of our contributions is reorganizing two datasets (IEMOCAP and CMU-MOSEI) to enable training from the raw data. To the best of our knowledge, prior to our work, papers using these two datasets were based on pre-extracted features, and we did not find a way to map those features back to the raw data. Therefore, we heavily reorganized these datasets (refer to Section 3 of the paper for more details).

The raw data can be downloaded from CMU-MOSEI (~120GB) and IEMOCAP (~16.5GB). However, for IEMOCAP, you first need to request permission from the original authors; then we can give you the passcode to download it.

We provide two Python scripts in the ./preprocessing folder as examples of how to process the raw data. Alternatively, you can download our processed data for training directly, as shown in the section below.
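For reference, here is a minimal, hypothetical sketch of one such preprocessing step: sampling video frames at a fixed interval (related to the --img-interval option used in the training commands). The function name and the millisecond interpretation of the interval are assumptions for illustration; the actual scripts in ./preprocessing may differ.

```python
# Hypothetical sketch only; the real logic lives in ./preprocessing.
# Assumes the sampling interval is interpreted in milliseconds.
import cv2

def sample_frames(video_path, interval_ms=500):
    """Return BGR frames sampled every `interval_ms` milliseconds from a video."""
    cap = cv2.VideoCapture(video_path)
    frames, t = [], 0.0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t)  # seek to timestamp t (in ms)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        t += interval_ms
    cap.release()
    return frames
```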

Preparation

Dataset

To run our code directly, you can download the processed data from here (88.6G). Unzip it; the tree structure of the data directory looks like this:

./data
- IEMOCAP_HCF_FEATURES
- IEMOCAP_RAW_PROCESSED
- IEMOCAP_SPLIT
- MOSEI_RAW_PROCESSED
- MOSEI_HCF_FEATURES
- MOSEI_SPLIT
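The --datapath and --dataset options described in the CLI section below control which data location and dataset are used during training and evaluation (presumably pointing --datapath at this unzipped ./data folder).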

Environment

Command examples for running

Train the MME2E

python main.py -lr=5e-5 -ep=40 -mod=tav -bs=8 --img-interval=500 --early-stop=6 --loss=bce --cuda=3 --model=mme2e --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-lr-factor=10 --text-model-size=base --text-max-len=100
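Here, -mod=tav presumably selects the textual, acoustic, and visual modalities, and --img-interval controls how frequently image frames are sampled from the raw videos; see the CLI reference below for all options.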

Train the sparse MME2E

python main.py -lr=5e-5 -ep=40 -mod=tav -bs=2 --img-interval=500 --early-stop=6 --loss=bce --cuda=3 --model=mme2e_sparse --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-lr-factor=10 -st=0.8 --text-model-size=base --text-max-len=100
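The -st flag sets the sparsity threshold (see -st SPARSE_THRESHOLD in the CLI, described as the threshold of the sparse CNN layers). As a rough illustration of threshold-based sparsification, below is a minimal sketch of cross-modal attention in which low-weight entries are zeroed out. This is illustrative only and not necessarily how the repository's sparse layers are implemented; the function name, shapes, and thresholding rule are assumptions.

```python
# Illustrative sketch of thresholded ("sparse") cross-modal attention.
# All names and the exact thresholding rule are assumptions for illustration;
# see the model code in this repo for the actual sparse mechanism.
import torch
import torch.nn.functional as F

def thresholded_cross_modal_attention(queries, keys, values, threshold=0.8):
    """queries: (B, Lq, D) from one modality; keys/values: (B, Lk, D) from another.
    Attention weights below `threshold` times the per-row maximum are zeroed,
    so low-relevance cross-modal positions contribute nothing."""
    d = queries.size(-1)
    scores = torch.matmul(queries, keys.transpose(-2, -1)) / d ** 0.5  # (B, Lq, Lk)
    weights = F.softmax(scores, dim=-1)
    mask = weights >= threshold * weights.max(dim=-1, keepdim=True).values
    sparse_w = weights * mask
    sparse_w = sparse_w / sparse_w.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return torch.matmul(sparse_w, values)  # (B, Lq, D)
```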

Baselines

LF_RNN

python main.py -lr=5e-4 -ep=60 -mod=tav -bs=32 --early-stop=8 --loss=bce --cuda=1 --model=lf_rnn --num-emotions=6 --hand-crafted --clip=2

LF_TRANSFORMER

python main.py -lr=5e-4 -ep=60 -mod=tav -bs=32 --early-stop=8 --loss=bce --cuda=0 --model=lf_transformer --num-emotions=6 --hand-crafted --clip=2

CLI

usage: main.py [-h] -bs BATCH_SIZE -lr LEARNING_RATE [-wd WEIGHT_DECAY] -ep
               EPOCHS [-es EARLY_STOP] [-cu CUDA] [-cl CLIP] [-sc] [-se SEED]
               [--loss LOSS] [--optim OPTIM] [--text-lr-factor TEXT_LR_FACTOR]
               [-mo MODEL] [--text-model-size TEXT_MODEL_SIZE]
               [--fusion FUSION] [--feature-dim FEATURE_DIM]
               [-st SPARSE_THRESHOLD] [-hfcs HFC_SIZES [HFC_SIZES ...]]
               [--trans-dim TRANS_DIM] [--trans-nlayers TRANS_NLAYERS]
               [--trans-nheads TRANS_NHEADS] [-aft AUDIO_FEATURE_TYPE]
               [--num-emotions NUM_EMOTIONS] [--img-interval IMG_INTERVAL]
               [--hand-crafted] [--text-max-len TEXT_MAX_LEN]
               [--datapath DATAPATH] [--dataset DATASET] [-mod MODALITIES]
               [--valid] [--test] [--ckpt CKPT] [--ckpt-mod CKPT_MOD]
               [-dr DROPOUT] [-nl NUM_LAYERS] [-hs HIDDEN_SIZE] [-bi] [--gru]

Multimodal End-to-End Sparse Model for Emotion Recognition

optional arguments:
  -h, --help            show this help message and exit
  -bs BATCH_SIZE, --batch-size BATCH_SIZE
                        Batch size
  -lr LEARNING_RATE, --learning-rate LEARNING_RATE
                        Learning rate
  -wd WEIGHT_DECAY, --weight-decay WEIGHT_DECAY
                        Weight decay
  -ep EPOCHS, --epochs EPOCHS
                        Number of epochs
  -es EARLY_STOP, --early-stop EARLY_STOP
                        Early stop
  -cu CUDA, --cuda CUDA
                        CUDA device number
  -cl CLIP, --clip CLIP
                        Gradient clipping value
  -sc, --scheduler      Use a learning rate scheduler with the optimizer
  -se SEED, --seed SEED
                        Random seed
  --loss LOSS           Loss function
  --optim OPTIM         Optimizer to use: adam / sgd
  --text-lr-factor TEXT_LR_FACTOR
                        Factor to scale the learning rate of the text model
  -mo MODEL, --model MODEL
                        Which model to use
  --text-model-size TEXT_MODEL_SIZE
                        Size of the pre-trained text model
  --fusion FUSION       How to fuse modalities
  --feature-dim FEATURE_DIM
                        Dimension of features output by each modality model
  -st SPARSE_THRESHOLD, --sparse-threshold SPARSE_THRESHOLD
                        Threshold of sparse CNN layers
  -hfcs HFC_SIZES [HFC_SIZES ...], --hfc-sizes HFC_SIZES [HFC_SIZES ...]
                        Hand crafted feature sizes
  --trans-dim TRANS_DIM
                        Dimension of the transformer after CNN
  --trans-nlayers TRANS_NLAYERS
                        Number of layers of the transformer after CNN
  --trans-nheads TRANS_NHEADS
                        Number of heads of the transformer after CNN
  -aft AUDIO_FEATURE_TYPE, --audio-feature-type AUDIO_FEATURE_TYPE
                        Hand crafted audio feature types
  --num-emotions NUM_EMOTIONS
                        Number of emotions in data
  --img-interval IMG_INTERVAL
                        Interval to sample image frames
  --hand-crafted        Use hand crafted features
  --text-max-len TEXT_MAX_LEN
                        Max length of text after tokenization
  --datapath DATAPATH   Path of the data
  --dataset DATASET     Which dataset to use
  -mod MODALITIES, --modalities MODALITIES
                        Which modalities to use
  --valid               Only run validation
  --test                Only run test
  --ckpt CKPT           Path of checkpoint
  --ckpt-mod CKPT_MOD   Load which modality of the checkpoint
  -dr DROPOUT, --dropout DROPOUT
                        Dropout rate
  -nl NUM_LAYERS, --num-layers NUM_LAYERS
                        Number of layers of the LSTM
  -hs HIDDEN_SIZE, --hidden-size HIDDEN_SIZE
                        Hidden vector size of the LSTM
  -bi, --bidirectional  Use Bi-LSTM
  --gru                 Use GRU rather than LSTM
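To evaluate a trained checkpoint rather than train, the --test (or --valid) and --ckpt flags above can be combined with the usual options. A hypothetical invocation is shown below; PATH_TO_CHECKPOINT is a placeholder, and the other flags should match how the checkpoint was trained.

python main.py -lr=5e-5 -ep=1 -mod=tav -bs=8 --img-interval=500 --model=mme2e --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-model-size=base --text-max-len=100 --test --ckpt=PATH_TO_CHECKPOINT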