yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

Unable to train on custom conllu data #36

Closed: steysie closed this 4 years ago

steysie commented 4 years ago

Hi,

I am trying to train a biaffine dependency parser on the UD_Russian-SynTagRus corpus. For some reason, the training script fails without any warnings or errors. Could you please help me figure out what could be going wrong? I'm trying to run training in Google Colab.

Here's the script I'm using:

!python -m supar.cmds.biaffine_dependency train -b -d 0 \
    -p exp/ptb.biaffine.dependency.char/model \
    -f char \
    --embed ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec \
    --train ./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-train.conllu \
    --dev ./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-dev.conllu \
    --test ./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-test.conllu

The output is:

2020-07-29 16:20:53.655924: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-07-29 16:21:01 INFO 
----------------+--------------------------
Param           |           Value          
----------------+--------------------------
tree            |           False          
proj            |           False          
mode            |           train          
path            | exp/ptb.biaffine.dependency.char/model
device          |             0            
seed            |             1            
threads         |            16            
batch_size      |           5000           
feat            |           char           
build           |           True           
punct           |           False          
max_len         |           None           
buckets         |            32            
train           | ./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-train.conllu
dev             | ./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-dev.conllu
test            | ./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-test.conllu
embed           | ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec
unk             |            unk           
bert            |      bert-base-cased     
----------------+--------------------------

2020-07-29 16:21:01 INFO Build the fields
^C

The training fails with no errors, so it's hard to see what exactly is wrong.

Plus, when trying to use BERT embeddings (bert-base-multilingual-cased), there was an error saying that the bos_token was not set.

The same failure happens when running supar.cmds.crf_dependency.

yzhangcs commented 4 years ago

Note that the size of your embeddings is 300; did you change n_embed to 300? You can specify -c config.ini to see the default configs in more detail.

steysie commented 4 years ago

Thanks, I didn't realize that I needed to specify -c config.ini. Nevertheless, the training failure is still there: first a progress bar appears, quickly goes up to 100% and disappears, and soon after that the script stops running again.

yzhangcs commented 4 years ago

The disappearance of the bar is an expected behaviour. There might be something wrong with the training data. Could you give me some examples?

yzhangcs commented 4 years ago

BTW, does the logger print the line with dataset information: Dataset(...)? @steysie

yzhangcs commented 4 years ago

The following cmd works for me:

$ python -m supar.cmds.biaffine_dependency train -b -d 0  \
    -p <path>  \
    -f char  \
    --punct  \
    --train <train>  \
    --dev <dev> \
    --test <test>  \
    --embed <embed>  \
    --unk <unk>  \
    --n-embed 300  \
    --bert bert-base-multilingual-cased

Please pull the update first.

steysie commented 4 years ago

@yzhangcs For some reason, the training is still interrupted :( The Dataset(...) info is not shown when running the script.

I load the data and run the script from this Colab notebook. Could the problem be with Colab? Or is it the data? I have used the same dataset for training morphological taggers, though, and there were never any problems with it.

yzhangcs commented 4 years ago

Have you tried running the parser on your local machine? Monitoring the preprocessing step, I found that it consumes a great deal of memory. I guess an OOM error occurs when running on Colab.
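
A quick way to check how much memory the machine actually has (illustrative snippet, not part of supar; psutil is assumed to be installed, as it usually is on Colab):

# Illustrative check: print total and available RAM,
# to see whether the preprocessing step could plausibly exhaust it.
import psutil

mem = psutil.virtual_memory()
print(f"total RAM: {mem.total / 2**30:.1f} GiB, available: {mem.available / 2**30:.1f} GiB")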

steysie commented 4 years ago

@yzhangcs Same result:

2020-08-03 08:09:40 INFO 
----------------+--------------------------
Param           |           Value          
----------------+--------------------------
tree            |           False          
proj            |           False          
mode            |           train          
path            | exp/ptb.biaffine.dependency.char/model
device          |             0            
seed            |             1            
threads         |            16            
batch_size      |           5000           
feat            |           char           
build           |           True           
punct           |           True           
max_len         |           None           
buckets         |            32            
train           | corpus/_UD/UD_Russian-Taiga/ru_taiga-ud-train.conllu
dev             | corpus/_UD/UD_Russian-Taiga/ru_taiga-ud-dev.conllu
test            | corpus/_UD/UD_Russian-Taiga/ru_taiga-ud-test.conllu
embed           |    glove.42B.300d.txt    
unk             |            unk           
n_embed         |            300           
bert            | bert-base-multilingual-cased
----------------+--------------------------

2020-08-03 08:09:40 INFO Build the fields
Killed   

I tried to run the script on a significantly smaller dataset, and the training is still killed.

I'm starting to think that there may be problems with the data, like cycles in dependency trees. Could that overload the memory?
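
A minimal sketch of such a check, in case it is useful (illustrative, not part of supar; it assumes standard 10-column CoNLL-U with HEAD in column 7 and skips multiword-token and empty-node lines):

# Sketch: scan a CoNLL-U file for sentences whose HEAD values form cycles
# or that do not have exactly one root (HEAD == 0).
import sys

def validate(sent, start):
    """sent is a list of (id, head) pairs; start is roughly the first line number."""
    heads = dict(sent)
    problems = []
    roots = [tok for tok, head in sent if head == 0]
    if len(roots) != 1:
        problems.append((start, f'{len(roots)} roots'))
    for tok, head in sent:
        seen = {tok}
        while head != 0:
            if head in seen:
                problems.append((start, f'cycle reachable from token {tok}'))
                break
            seen.add(head)
            head = heads.get(head, 0)
    return problems

def check_conllu(path):
    problems = []
    with open(path, encoding='utf-8') as f:
        sent, start = [], 1
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                if sent:
                    problems += validate(sent, start)
                sent, start = [], lineno + 1
            elif not line.startswith('#'):
                cols = line.split('\t')
                if '-' not in cols[0] and '.' not in cols[0]:  # skip ranges/empty nodes
                    sent.append((int(cols[0]), int(cols[6])))
        if sent:
            problems += validate(sent, start)
    return problems

if __name__ == '__main__':
    for pos, msg in check_conllu(sys.argv[1]):
        print(f'sentence starting near line {pos}: {msg}')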

steysie commented 4 years ago

Also, is it OK to feed CoNLL-U files, rather than CoNLL-X, as input?

yzhangcs commented 4 years ago

Sorry, I can't reproduce your problem. I ran the following command on Russian CoNLL18 (ud2.2) data in conllu format, and it works:

python -m supar.cmds.biaffine_dependency train -b -d 0 \
  -p ./model \
  -f bert \
  --punct \
  --train data/conll18/ru/train.auto.conllu \
  --dev data/conll18/ru/dev.gold.conllu \
  --test data/conll18/ru/test.gold.conllu \
  --embed data/fasttext/cc.ru.300.vec \
  --unk '' \
  --n-embed 300 \
  --bert bert-base-multilingual-cased

Here are some example outputs:

2020-08-03 19:20:12 INFO 
----------------+--------------------------
Param           |           Value          
----------------+--------------------------
tree            |           False          
proj            |           False          
mode            |           train          
path            |          ./model         
device          |             0            
seed            |             1            
threads         |            16            
batch_size      |           5000           
feat            |           bert           
build           |           True           
punct           |           True           
max_len         |           None           
buckets         |            32            
train           | data/conll18/ru/train.auto.conllu
dev             | data/conll18/ru/dev.gold.conllu
test            | data/conll18/ru/test.gold.conllu
embed           | data/fasttext/cc.ru.300.vec
unk             |                          
n_embed         |            300           
bert            | bert-base-multilingual-cased
----------------+--------------------------

2020-08-03 19:20:12 INFO Build the fields
2020-08-03 19:20:15 ERROR Using bos_token, but it is not set yet.
2020-08-03 19:33:39 INFO Load the data                                              
2020-08-03 19:34:56 INFO                                                             
train: Dataset(n_sentences=48814, n_batches=201, n_buckets=32)
dev:   Dataset(n_sentences=6584, n_batches=41, n_buckets=32)
test:  Dataset(n_sentences=6491, n_batches=38, n_buckets=32)

2020-08-03 19:34:56 INFO BiaffineDependencyModel(
  (word_embed): Embedding(45626, 300)
  (feat_embed): BertEmbedding(bert-base-multilingual-cased, n_layers=4, n_out=100, pad_index=0)
  (embed_dropout): IndependentDropout(p=0.33)
  (lstm): BiLSTM(400, 400, num_layers=3, dropout=0.33)
  (lstm_dropout): SharedDropout(p=0.33, batch_first=True)
  (mlp_arc_d): MLP(n_in=800, n_out=500, dropout=0.33)
  (mlp_arc_h): MLP(n_in=800, n_out=500, dropout=0.33)
  (mlp_rel_d): MLP(n_in=800, n_out=100, dropout=0.33)
  (mlp_rel_h): MLP(n_in=800, n_out=100, dropout=0.33)
  (arc_attn): Biaffine(n_in=500, n_out=1, bias_x=True)
  (rel_attn): Biaffine(n_in=100, n_out=41, bias_x=True, bias_y=True)
  (criterion): CrossEntropyLoss()
  (pretrained): Embedding(1675340, 300)
)

2020-08-03 19:34:56 INFO Epoch 1 / 5000:
100%|####################################| 201/201 01:20<00:00,  2.50it/s, lr: 1.9770e-03 - loss: 2.6031 - UCM:  9.38% LCM:  4.26% UAS: 51.66% LAS: 36.51%
2020-08-03 19:36:24 INFO dev:   - loss: 1.4177 - UCM: 18.88% LCM: 10.30% UAS: 75.79% LAS: 67.42%
2020-08-03 19:36:31 INFO test:  - loss: 1.2628 - UCM: 17.33% LCM:  9.23% UAS: 76.52% LAS: 68.35%

Note that the first line in your embedding file should be <word> <embedding>.

steysie commented 4 years ago

Could you please tell me how much RAM and GPU memory you have? I upgraded the Colab RAM from 12 to 25 GB and the GPU from 11 to 16 GB, and now, after the ERROR Using bos_token... message, it doesn't fail right away, but only after 3-4 minutes.

I wonder what the recommended memory capacities are.

yzhangcs commented 4 years ago

RAM: 263GB GPU: 32GB

steysie commented 4 years ago

I see. Then low RAM is the problem. Have you by any chance tried to run the model with much less RAM? Maybe there are some parameter combinations that would work?

yzhangcs commented 4 years ago

A way to get around this is to shrink the word embeddings and keep only the words that appear in the train/dev/test sets.
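
A minimal sketch of that filtering (illustrative, not part of supar; the file names are placeholders, and it assumes a fastText-style .vec file whose first line is an "<n_words> <dim>" header, which is dropped so that every remaining line is "<word> <embedding>"):

# Sketch: keep only the embedding rows whose word occurs in the CoNLL-U files.
def conllu_vocab(paths):
    vocab = set()
    for path in paths:
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith('#'):
                    form = line.split('\t')[1]          # FORM column
                    vocab.update({form, form.lower()})  # lowercased too, in case the parser lowercases words
    return vocab

def filter_embeddings(src, dst, vocab):
    with open(src, encoding='utf-8') as fin, open(dst, 'w', encoding='utf-8') as fout:
        next(fin)  # drop the fastText "<n_words> <dim>" header line
        for line in fin:
            if line.split(' ', 1)[0] in vocab:
                fout.write(line)

# Placeholder paths for illustration
vocab = conllu_vocab(['ru_syntagrus-ud-train.conllu',
                      'ru_syntagrus-ud-dev.conllu',
                      'ru_syntagrus-ud-test.conllu'])
filter_embeddings('ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec',
                  'ru_filtered.vec', vocab)

If you pass --unk, the vector for that token presumably also needs to survive the filtering (or be appended manually).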

steysie commented 4 years ago

@yzhangcs Thank you for the idea! It works now with filtered embeddings, if the first line (with the vocabulary size and embedding dimension) is skipped.

Thank you for your help!