yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

RuntimeError: CUDA out of memory #28

Closed: attardi closed this issue 4 years ago

attardi commented 4 years ago

I am testing the dev branch, using transformers 2.10.0, somewhat successfully.

However it runs out of CUDA memory on the UD_English-EWT treebank:

Epoch 8 / 50000:
Traceback (most recent call last):
  File "run.py", line 61, in <module>
    cmd(args)
  File "/homenfs/tempGPU/iwpt2020/parser/parser/cmds/train.py", line 81, in __call__
    loss, train_metric = self.train(train.loader)
  File "/homenfs/tempGPU/iwpt2020/parser/parser/cmds/cmd.py", line 91, in train
    s_arc, s_rel = self.model(words, feats)
  File "/homenfs/tempGPU/iwpt2020/.env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/homenfs/tempGPU/iwpt2020/parser/parser/model.py", line 92, in forward
    feat_embed = self.feat_embed(feats)
  File "/homenfs/tempGPU/iwpt2020/.env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/homenfs/tempGPU/iwpt2020/parser/parser/modules/bert.py", line 59, in forward
    embed = embed.masked_scatter_(mask.unsqueeze(-1), bert[bert_mask])
RuntimeError: CUDA out of memory. Tried to allocate 2.02 GiB (GPU 0; 14.76 GiB total capacity; 6.04 GiB already allocated; 1.06 GiB free; 12.90 GiB reserved in total by PyTorch)

I thought it was because the treebank is large, but there are larger treebanks on which it works:

wc -l UD_English-EWT/en_ewt-ud-train.conllu
  242778 UD_English-EWT/en_ewt-ud-train.conllu
wc -l UD_Italian-ISDT/it_isdt-ud-train.conllu
  333822 UD_Italian-ISDT/it_isdt-ud-train.conllu

I tried increasing the number of buckets to 48, but that did not help. It does work, however, after decreasing the batch_size to 500.

The problem also occurs with transformers 2.1.1 on UD_English-EWT.

yzhangcs commented 4 years ago

Is it OK on master?

attardi commented 4 years ago

Same problem with the master branch on English.

With transformers 2.1.1:

Epoch 1 / 50000:
train: Loss: 3.5856 UAS: 45.99% LAS: 21.43%
dev:   Loss: 3.5484 UAS: 49.89% LAS: 23.37%
...
    result = self.forward(*input, **kwargs)
  File "/homenfs/tempGPU/iwpt2020/.env/lib64/python3.6/site-packages/transformers/modeling_bert.py", line 212, in forward
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
RuntimeError: CUDA out of memory. Tried to allocate 2.10 GiB (GPU 0; 14.76 GiB total capacity; 5.76 GiB already allocated; 1.64 GiB free; 12.33 GiB reserved in total by PyTorch)

With transformers 2.10.0, it happens an epoch later:

Epoch 2 / 50000:
train: Loss: 2.9198 UAS: 48.00% LAS: 30.77%
dev:   Loss: 2.9559 UAS: 50.39% LAS: 33.26%
Traceback (most recent call last):
...
    result = self.forward(*input, **kwargs)
  File "/homenfs/tempGPU/iwpt2020/.env/lib64/python3.6/site-packages/transformers/modeling_bert.py", line 314, in forward
    hidden_states = self.LayerNorm(hidden_states + input_tensor)
  File "/homenfs/tempGPU/iwpt2020/.env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/homenfs/tempGPU/iwpt2020/.env/lib64/python3.6/site-packages/transformers/modeling_bert.py", line 235, in forward
    return outputs
RuntimeError: CUDA out of memory. Tried to allocate 2.23 GiB (GPU 0; 14.76 GiB total capacity; 6.34 GiB already allocated; 1.59 GiB free; 12.38 GiB reserved in total by PyTorch)

yzhangcs commented 4 years ago

Could you print the model and the fields? This may be due to too many labels.
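
For a rough sense of how the label count affects memory: the relation scores form a tensor whose size grows linearly with the number of labels. A hedged back-of-the-envelope sketch, with made-up batch and sentence sizes:

    import torch

    # illustrative numbers only: relation scores are [batch, n_rels, seq_len, seq_len]
    batch, n_rels, seq_len = 200, 50, 60
    rel_scores = torch.empty(batch, n_rels, seq_len, seq_len)
    print(rel_scores.numel() * 4 / 2**20, 'MiB')  # ~137 MiB in fp32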

attardi commented 4 years ago

----------------+--------------------------
Param           | Value
----------------+--------------------------
bert_model      | bert-base-cased
n_embed         | 100
n_char_embed    | 50
n_feat_embed    | 100
n_bert_layers   | 4
embed_dropout   | 0.33
n_lstm_hidden   | 400
n_lstm_layers   | 3
lstm_dropout    | 0.33
n_mlp_arc       | 500
n_mlp_rel       | 100
mlp_dropout     | 0.33
lr              | 0.002
mu              | 0.9
nu              | 0.9
epsilon         | 1e-12
clip            | 5.0
decay           | 0.75
decay_steps     | 5000
batch_size      | 5000
epochs          | 50000
patience        | 100
min_freq        | 2
fix_len         | 20
mode            | train
punct           | False
ftrain          | ../train-dev/UD_English-EWT/en_ewt-ud-train.conllu
fdev            | ../train-dev/UD_English-EWT/en_ewt-ud-dev.conllu
ftest           | ../test-turkunlp/en2.conllu
fembed          |
unk             | [unk]
conf            | config.ini
file            | exp/en_ewt-bert
preprocess      | True
device          | cuda
seed            | 1
threads         | 16
tree            | False
proj            | False
feat            | bert
buckets         | 32
fields          | exp/en_ewt-bert/fields
model           | exp/en_ewt-bert/model
n_words         | 8867
n_feats         | 28996
n_rels          | 50
pad_index       | 0
unk_index       | 1
bos_index       | 2
feat_pad_index  | 0
----------------+--------------------------

yzhangcs commented 4 years ago

Sorry for being vague, I mean:

    print(self.model)
    print(self.fields)

attardi commented 4 years ago

Model(
  (word_embed): Embedding(8867, 100)
  (feat_embed): BertEmbedding(n_layers=4, n_out=100, pad_index=0)
  (embed_dropout): IndependentDropout(p=0.33)
  (lstm): BiLSTM(200, 400, num_layers=3, dropout=0.33)
  (lstm_dropout): SharedDropout(p=0.33, batch_first=True)
  (mlp_arc_d): MLP(n_in=800, n_out=500, dropout=0.33)
  (mlp_arc_h): MLP(n_in=800, n_out=500, dropout=0.33)
  (mlp_rel_d): MLP(n_in=800, n_out=100, dropout=0.33)
  (mlp_rel_h): MLP(n_in=800, n_out=100, dropout=0.33)
  (arc_attn): Biaffine(n_in=500, n_out=1, bias_x=True)
  (rel_attn): Biaffine(n_in=100, n_out=50, bias_x=True, bias_y=True)
  (criterion): CrossEntropyLoss()
)

CoNLL(ID=None, FORM=((words): Field(pad=<pad>, unk=<unk>, bos=<bos>, lower=True), (bert): SubwordField(pad=[PAD], unk=[UNK], bos=[CLS])), LEMMA=None, CPOS=None, POS=None, FEATS=None, HEAD=(arcs): Field(bos=<bos>, use_vocab=False), DEPREL=(rels): Field(bos=<bos>), PHEAD=None, PDEPREL=None)

attardi commented 4 years ago

BTW, why do you distinguish:

        if args.feat in ('char', 'bert'):
            self.fields = CoNLL(FORM=(self.WORD, self.FEAT),
                                HEAD=self.ARC, DEPREL=self.REL)
        else:
            self.fields = CoNLL(FORM=self.WORD, CPOS=self.FEAT,
                                HEAD=self.ARC, DEPREL=self.REL)

and then do:

        if args.feat in ('char', 'bert'):
            self.WORD, self.FEAT = self.fields.FORM
        else:
            self.WORD, self.FEAT = self.fields.FORM, self.fields.CPOS

You might just use:

    self.fields = CoNLL(FORM=(self.WORD, self.FEAT),
                         HEAD=self.ARC, DEPREL=self.REL)

and then

        self.WORD, self.FEAT = self.fields.FORM
yzhangcs commented 4 years ago

Since POS tags and words are in different columns, I have to make a distinction.

attardi commented 4 years ago

I see. Dozat uses both word and POS features. You replace POS with BERT. Have you tried keeping POS?

yzhangcs commented 4 years ago

Yeah, --feat=tag makes the model behave as described in the paper.

attardi commented 4 years ago

I mean using both POS and BERT.

yzhangcs commented 4 years ago

Yes, I have tried that, but on top of the strong BERT baseline, POS tags have very little effect. Nevertheless, my implementation (especially IndependentDropout) can support the combination of three or more features with minor modifications.
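
A minimal sketch of that kind of combination, following the idea rather than the repository's exact IndependentDropout module (the class name, shapes and 100-dimensional sizes below are illustrative): each feature's embedding is zeroed out independently per token, the surviving ones are rescaled, and the features are then concatenated before the BiLSTM.

    import torch
    import torch.nn as nn

    class IndependentDropoutSketch(nn.Module):
        """Zero each feature's vector independently per token, rescaling the survivors."""
        def __init__(self, p=0.33):
            super().__init__()
            self.p = p

        def forward(self, *tensors):
            if not self.training:
                return tensors
            # one Bernoulli mask per feature, shared across the embedding dimension
            masks = [x.new_empty(x.shape[:2]).bernoulli_(1 - self.p) for x in tensors]
            total = sum(masks)
            scale = len(tensors) / total.clamp(min=1)
            return [x * (m * scale).unsqueeze(-1) for x, m in zip(tensors, masks)]

    # hypothetical shapes: [batch, seq_len, embedding size]
    word_embed = torch.randn(2, 10, 100)
    bert_embed = torch.randn(2, 10, 100)
    tag_embed  = torch.randn(2, 10, 100)
    word_embed, bert_embed, tag_embed = IndependentDropoutSketch()(word_embed, bert_embed, tag_embed)
    x = torch.cat((word_embed, bert_embed, tag_embed), dim=-1)  # input to the BiLSTM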

attardi commented 4 years ago

I agree that BERT is a much richer representation, but it might be worth a try. I also wonder why you use word embeddings in addition to BERT. Isn't BERT enough?

yzhangcs commented 4 years ago

Since BERT provides character/subword-level features, intuitively, word embeddings can be a useful supplement. But I have not verified this.

attardi commented 4 years ago

You can make BERT into word features by averaging the wordpieces.

yzhangcs commented 4 years ago

You can make BERT into word features by averaging the wordpieces.

That's exactly what the current code implements.
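
For context, a minimal sketch of that pooling for a single sentence, not the repository's batched implementation; the function name and shapes are illustrative:

    import torch

    def average_wordpieces(piece_vectors, pieces_per_word):
        # piece_vectors: [n_pieces, hidden] BERT outputs for one sentence
        # pieces_per_word: list of ints summing to n_pieces
        chunks = piece_vectors.split(pieces_per_word)   # one chunk per word
        return torch.stack([chunk.mean(0) for chunk in chunks])

    # toy example: 3 words tokenized into 1, 2 and 3 wordpieces
    word_vectors = average_wordpieces(torch.randn(6, 768), [1, 2, 3])
    print(word_vectors.shape)  # torch.Size([3, 768])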

attardi commented 4 years ago

Then why do you need both? You might use only the subword embeddings, when they are present, as a surrogate for POS, and save memory.

yzhangcs commented 4 years ago

To some extent, for API consistency :joy:.

attardi commented 4 years ago

I am looking for ways to save memory, since I still have that problem with many treebanks. I had been able to train all the UD treebanks with the master branch. Something is causing an increase in memory.

yzhangcs commented 4 years ago

Could you send me the data to reproduce this error?

attardi commented 4 years ago

You can download the data from here: http://ufal.mff.cuni.cz/~zeman/soubory/iwpt2020-train-dev.tgz

attardi commented 4 years ago

I am using bert_model = 'bert-base-multilingual-cased' for languages other than English.

yzhangcs commented 4 years ago

Hi, the issue is caused by some weird sentence annotations like

# sent_id = email-enronsent15_01-0049
# text = >----------------------------------------------------------------------------| | | >----------------------------------------------------------------------------|
1   >----------------------------------------------------------------------------|  >----------------------------------------------------------------------------|  SYM NFP _   4   punct   4:punct _
2   |   |   SYM NFP _   4   punct   4:punct _
3   |   |   SYM NFP _   4   punct   4:punct _
4   >----------------------------------------------------------------------------|  >----------------------------------------------------------------------------|  SYM NFP _   0   root    0:root  _

# newdoc id = newsgroup-groups.google.com_hiddennook_04d8cc994875d454_ENG_20050207_012300
# sent_id = newsgroup-groups.google.com_hiddennook_04d8cc994875d454_ENG_20050207_012300-0001
# text = [http://www.newsday.com/news/opinion/ny-vpnasa054135614feb05,0,5979821.story?coll=ny-editorials-headlines]
1   [   [   PUNCT   -LRB-   _   2   punct   2:punct SpaceAfter=No
2   http://www.newsday.com/news/opinion/ny-vpnasa054135614feb05,0,5979821.story?coll=ny-editorials-headlines    http://www.newsday.com/news/opinion/ny-vpnasa054135614feb05,0,5979821.story?coll=ny-editorials-headlines    X   ADD _   0   root    0:root  SpaceAfter=No
3   ]   ]   PUNCT   -RRB-   _   2   punct   2:punct _

which lead to very long wordpiece sequences. You can fix this issue by limiting the number of wordpieces per token:

                # fix_len (20 in the config above) caps the number of wordpieces kept per token
                self.FEAT = SubwordField('bert',
                                         pad=tokenizer.pad_token,
                                         unk=tokenizer.unk_token,
                                         bos=tokenizer.cls_token,
                                         fix_len=args.fix_len,
                                         tokenize=tokenizer.tokenize)
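
To see the effect, a small illustration (the tokenizer call and the sample token are just for demonstration): with a fixed fix_len, each token keeps at most fix_len wordpieces, so a single pathological token can no longer blow up the BERT input length.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
    fix_len = 20

    token = '>' + '-' * 78 + '|'          # similar to the EWT token above
    pieces = tokenizer.tokenize(token)
    print(len(pieces))                    # one token explodes into many wordpieces
    print(len(pieces[:fix_len]))          # capped at fix_len
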
attardi commented 4 years ago

Great, it works on English! I will test it on all treebanks.