yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/

drop in performance #34

Closed: attardi closed this issue 4 years ago

attardi commented 4 years ago

I notice a significant drop in performance in the release branch with respect to the dev branch, using the same configuration with n_lstm_layers=2 and all BERT data (n_feat_embed=0). Here is an example on the UD Italian corpus.

Dev version:

----------------+--------------------------
Param           |           Value          
----------------+--------------------------
bert_model      | dbmdz/bert-base-italian-xxl-cased
n_embed         |            100           
n_char_embed    |            50            
n_feat_embed    |             0            
n_bert_layers   |             0            
embed_dropout   |           0.33           
n_lstm_hidden   |            400           
n_lstm_layers   |             2            
lstm_dropout    |           0.33           
mix_dropout     |            0.1           
n_mlp_arc       |            500           
n_mlp_rel       |            100           
mlp_dropout     |           0.33           
lr              |           0.002          
mu              |            0.9           
nu              |            0.9           
epsilon         |           1e-12          
clip            |            5.0           
decay           |           0.75           
decay_steps     |           5000           
batch_size      |           5000           
epochs          |           1000           
patience        |            20            
min_freq        |             2            
fix_len         |            20            
mode            |           train          
punct           |           False          
ftrain          | ../train-dev/UD_Italian-ISDT/it_isdt-ud-train.conllu
fdev            | ../train-dev/UD_Italian-ISDT/it_isdt-ud-dev.conllu
ftest           | ../test-turkunlp/it2.conllu
fembed          |                          
lower           |           False          
unk             |           [unk]          
max_sent_length |            512           
conf            |      config-bert.ini     
file            |       exp/it-bert/       
preprocess      |           True           
device          |           cuda           
seed            |             1            
threads         |            16            
tree            |           False          
proj            |           False          
feat            |           bert           
buckets         |            32            
fields          |    exp/it-bert/fields    
model           |     exp/it-bert/model    
n_words         |           13498          
n_feats         |           31102          
n_rels          |            45            
pad_index       |             0            
unk_index       |             1            
bos_index       |             2            
feat_pad_index  |             0            
----------------+--------------------------

....
Epoch 91 / 1000:
train: Loss: 0.2317 UAS: 95.97% LAS: 93.20%
dev:   Loss: 0.3567 UAS: 95.94% LAS: 93.90%
test:  Loss: 0.4502 UAS: 95.14% LAS: 92.84%

0:01:20.747564s elapsed (saved)

Release version:

----------------+--------------------------
Param           |           Value          
----------------+--------------------------
delete          | {'', '.', ':', 'S1', '?', '``', 'TOP', '!', '-NONE-', ',', "''"}
equal           |      {'ADVP': 'PRT'}     
bert            | dbmdz/bert-base-italian-xxl-cased
n_embed         |            100           
n_char_embed    |            50            
n_feat_embed    |             0            
n_bert_layers   |             0            
embed_dropout   |           0.33           
n_lstm_hidden   |            400           
n_lstm_layers   |             2            
lstm_dropout    |           0.33           
mix_dropout     |            0.1           
n_mlp_span      |            500           
n_mlp_arc       |            500           
n_mlp_label     |            100           
n_mlp_sib       |            100           
n_mlp_rel       |            100           
mlp_dropout     |           0.33           
lr              |           0.002          
mu              |            0.9           
nu              |            0.9           
epsilon         |           1e-12          
clip            |            5.0           
decay           |           0.75           
decay_steps     |           5000           
batch_size      |           5000           
epochs          |           1000           
patience        |            20            
min_freq        |             2            
fix_len         |            20            
mode            |           train          
path            |     exp/it-bert/model    
conf            |      config-bert.ini     
device          |           cuda           
seed            |             1            
threads         |            16            
buckets         |            32            
tree            |           False          
proj            |           False          
feat            |           bert           
build           |           False          
punct           |           False          
max_len         |           None           
train           | ../train-dev/UD_Italian-ISDT/it_isdt-ud-train.conllu
dev             | ../train-dev/UD_Italian-ISDT/it_isdt-ud-dev.conllu
test            | ../test-turkunlp/it2.conllu
embed           |                          
unk             |           [unk]          
----------------+--------------------------

2020-06-15 07:52:02 INFO train: 13121 sentences,  68 batches, 32 buckets
2020-06-15 07:52:02 INFO dev:     564 sentences,  32 batches, 32 buckets
2020-06-15 07:52:02 INFO test:    489 sentences,  32 batches, 32 buckets

2020-06-15 07:52:02 INFO BiaffineParserModel(
  (word_embed): Embedding(12876, 100)
  (feat_embed): BertEmbedding(n_layers=12, n_out=768, pad_index=0)
  (embed_dropout): IndependentDropout(p=0.33)
  (lstm): BiLSTM(868, 400, num_layers=2, dropout=0.33)
  (lstm_dropout): SharedDropout(p=0.33, batch_first=True)
  (mlp_arc_d): MLP(n_in=800, n_out=500, dropout=0.33)
  (mlp_arc_h): MLP(n_in=800, n_out=500, dropout=0.33)
  (mlp_rel_d): MLP(n_in=800, n_out=100, dropout=0.33)
  (mlp_rel_h): MLP(n_in=800, n_out=100, dropout=0.33)
  (arc_attn): Biaffine(n_in=500, n_out=1, bias_x=True)
  (rel_attn): Biaffine(n_in=100, n_out=45, bias_x=True, bias_y=True)
  (criterion): CrossEntropyLoss()
)
...
2020-06-15 09:18:19 INFO Epoch 119 / 1000:
2020-06-15 09:18:59 INFO dev:   - loss: 0.6333 - UCM: 46.28% LCM: 29.96% UAS: 91.46% LAS: 87.32%
2020-06-15 09:19:01 INFO test:  - loss: 0.3640 - UCM: 55.42% LCM: 41.10% UAS: 93.57% LAS: 90.32%
2020-06-15 09:19:06 INFO 0:00:41.564757s elapsed (saved)

With n_bert_layers=4 and n_feat_embed=100, the dev and release branches perform similarly:

n_feat_embed    |            100           
n_bert_layers   |             4            
embed_dropout   |           0.33           
n_lstm_hidden   |            400           
n_lstm_layers   |             3            

With this configuration the release branch works better:

2020-06-13 20:15:39 INFO Epoch 84 / 1000:
2020-06-13 20:17:08 INFO dev:   - loss: 0.4315 - UCM: 61.70% LCM: 47.87% UAS: 95.80% LAS: 93.46%
2020-06-13 20:17:13 INFO test:  - loss: 0.3368 - UCM: 60.12% LCM: 47.65% UAS: 95.05% LAS: 92.48%
2020-06-13 20:17:18 INFO 0:01:34.633723s elapsed (saved)

which is similar to the dev branch:

n_feat_embed    |            100           
n_bert_layers   |             4            
embed_dropout   |           0.33           
n_lstm_hidden   |            400           
n_lstm_layers   |             3            

Epoch 112 / 1000:
train: Loss: 0.2213 UAS: 96.16% LAS: 93.16%
dev:   Loss: 0.3626 UAS: 95.90% LAS: 93.51%
test:  Loss: 0.4056 UAS: 95.13% LAS: 92.63%

0:01:07.875049s elapsed (saved)

What could be the reason? You suggested that using all layers and all features from BERT would have been beneficial, and indeed it was in the dev branch.
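For context, with n_bert_layers=0 the configuration above uses every BERT layer, and the mix_dropout parameter suggests the layers are combined by a learned scalar mixture. Below is a minimal sketch of an ELMo-style scalar mix, assuming this is roughly how BertEmbedding pools the layers; the class and parameter names are illustrative, not necessarily the repo's exact implementation.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learned weighted average of the hidden states from several BERT layers (sketch)."""

    def __init__(self, n_layers, dropout=0.1):
        super().__init__()
        # One learnable weight per layer, plus a global scaling factor.
        self.weights = nn.Parameter(torch.zeros(n_layers))
        self.gamma = nn.Parameter(torch.ones(1))
        # Dropout on the normalized weights, roughly what mix_dropout would control.
        self.dropout = nn.Dropout(dropout)

    def forward(self, layers):
        # layers: list of [batch, seq_len, hidden] tensors, one per BERT layer.
        norm_weights = self.dropout(torch.softmax(self.weights, dim=-1))
        return self.gamma * sum(w * h for w, h in zip(norm_weights, layers))
```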

yzhangcs commented 4 years ago

Does it behave similarly on PTB?

attardi commented 4 years ago

I only use UD. On UD_English_EWT the release branch achieves:

2020-06-14 20:48:00 INFO Epoch 240 / 1000:
2020-06-14 20:48:54 INFO dev:   - loss: 0.7806 - UCM: 56.04% LCM: 45.90% UAS: 87.41% LAS: 82.86%
2020-06-14 20:49:06 INFO test:  - loss: 0.8076 - UCM: 50.16% LCM: 38.12% UAS: 89.39% LAS: 85.30%
2020-06-14 20:49:07 INFO 0:01:06.209307s elapsed (saved)

with this configuration:

bert_model      |      bert-base-cased     
n_embed         |            100           
n_char_embed    |            50            
n_feat_embed    |            100           
n_bert_layers   |             4            

while the dev branch achieves:

Epoch 177 / 1000:
train: Loss: 0.2244 UAS: 95.55% LAS: 92.70%
dev:   Loss: 0.6326 UAS: 93.16% LAS: 90.24%
test:  Loss: 1.1072 UAS: 91.84% LAS: 89.23%
0:01:00.618501s elapsed (saved)

although with a different model and configuration:

bert_model      | TurkuNLP/wikibert-base-en-cased
n_embed         |            100           
n_char_embed    |            50            
n_feat_embed    |             0            
n_bert_layers   |             0            
embed_dropout   |           0.33           
n_lstm_hidden   |            400           
n_lstm_layers   |             2            

attardi commented 4 years ago

I am running the release branch with the same configuration as the dev branch, but it doesn't look promising.

bert            | TurkuNLP/wikibert-base-en-cased
n_embed         |            100           
n_char_embed    |            50            
n_feat_embed    |             0            
n_bert_layers   |             0  
embed_dropout   |           0.33           
n_lstm_hidden   |            400           
n_lstm_layers   |             2                      

At epoch 26, the release branch gets:

2020-06-15 13:58:19 INFO Epoch 26 / 1000:
2020-06-15 13:58:54 INFO dev:   - loss: 1.0003 - UCM: 46.15% LCM: 36.16% UAS: 82.75% LAS: 76.09%
2020-06-15 13:59:01 INFO test:  - loss: 1.0627 - UCM: 36.27% LCM: 25.12% UAS: 84.05% LAS: 77.21%
2020-06-15 13:59:03 INFO 0:00:42.405768s elapsed (saved)

while the dev branch was already at:

Epoch 26 / 1000:
train: Loss: 0.5479 UAS: 89.65% LAS: 84.42%
dev:   Loss: 0.5415 UAS: 91.82% LAS: 88.41%
test:  Loss: 0.8552 UAS: 91.26% LAS: 88.26%
yzhangcs commented 4 years ago

Sorry, the release branch is still in development and some bugs may lurk in the code. I will do some checks on PTB later.

attardi commented 4 years ago

I did an experiment with the dev branch using the new electra-base-discriminator model and achieved an improvement on UD_English_EWT. From bert-base-cased:

Epoch 177 / 1000:
train: Loss: 0.2244 UAS: 95.55% LAS: 92.70%
dev:   Loss: 0.6326 UAS: 93.16% LAS: 90.24%
test:  Loss: 1.1072 UAS: 91.84% LAS: 89.23%

to electra-base-discriminator:

Epoch 128 / 1000:
train: Loss: 0.2552 UAS: 95.10% LAS: 91.90%
dev:   Loss: 0.4805 UAS: 94.49% LAS: 91.90%
test:  Loss: 1.1904 UAS: 91.73% LAS: 89.20%

yzhangcs commented 4 years ago

I also noticed a drop in performance on PTB. Something may have been erroneously modified.

yzhangcs commented 4 years ago

Hi @attardi, the bug has been fixed. It was because I hadn't figured out the usage of from_config. To load the model weights, we should use from_pretrained instead of from_config.
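For reference, this is a common pitfall in the Hugging Face transformers API: from_config only builds the architecture with randomly initialized weights, while from_pretrained also loads the pretrained checkpoint. A minimal illustration (model name taken from the config above):

```python
from transformers import AutoConfig, AutoModel

name = "dbmdz/bert-base-italian-xxl-cased"

# from_config only builds the architecture: all weights are randomly initialized.
config = AutoConfig.from_pretrained(name)
randomly_initialized = AutoModel.from_config(config)

# from_pretrained builds the architecture *and* loads the pretrained weights.
pretrained = AutoModel.from_pretrained(name)
```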

attardi commented 4 years ago

I also tested the XLNet model on the dev branch, as you suggested. It is less accurate:

Epoch 147 / 1000:
train: Loss: 0.2343 UAS: 95.27% LAS: 92.30%
dev:   Loss: 0.5309 UAS: 93.50% LAS: 90.83%
test:  Loss: 1.0762 UAS: 91.94% LAS: 89.35%

yzhangcs commented 4 years ago

What does less accurate mean, @attardi? Is there any exception on the dev branch?