Closed steysie closed 4 years ago
Notice that the size of embeddings is 300, did you modify the n_embed
to 300?
You can specify -c config.ini;
see that file for more details on the default configs.
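As a hedged illustration only, such a config file might override the embedding size like this (the section and key names below are guesses based on this thread, not copied from the repo):

```ini
[Data]
n_embed = 300
```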
Yu Zhang Soochow University
From: Anastasia Nikiforova notifications@github.com Sent: Thursday, July 30, 2020 6:49:18 PM To: yzhangcs/parser parser@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [yzhangcs/parser] Unable to train on custom conllu data (#36)
Hi,
I am trying to train a biaffine dependency parser on UD_Russian-SynTagRus corpus. For some reason, training script fails without any warnings or errors. Could you please help me on what could go wrong? I'm trying to run training in Google Colab.
Here's the script I'm using:
!python -m supar.cmds.biaffine_dependency train -b -d 0 \
-p exp/ptb.biaffine.dependency.char/model \
-f char \
--embed ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec \
--train ./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-train.conllu \
--dev ./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-dev.conllu \
--test ./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-test.conllu
The output is:
2020-07-29 16:20:53.655924: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-07-29 16:21:01 INFO
----------------+--------------------------
Param | Value
----------------+--------------------------
tree | False
proj | False
mode | train
path | exp/ptb.biaffine.dependency.char/model
device | 0
seed | 1
threads | 16
batch_size | 5000
feat | char
build | True
punct | False
max_len | None
buckets | 32
train | ./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-train.conllu
dev | ./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-dev.conllu
test | ./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-test.conllu
embed | ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec
unk | unk
bert | bert-base-cased
----------------+--------------------------
2020-07-29 16:21:01 INFO Build the fields ^C
The training fails with no errors, so it's hard to see what exactly is wrong.
Plus, when trying to use BERT embeddings (bert-base-multilingual-cased), there was an error that the bos_token was not set.
Same failure happens when running supar.cmds.crf_dependency.
Thanks, I didn't realize that I needed to specify -c config.ini.
Nevertheless, the training still fails: first a progress bar quickly goes up to 100% and disappears, and soon after the script stops running again.
The disappearance of the bar is an expected behaviour. There might be something wrong with the training data. Could you give me some examples?
BTW, does the logger print the line with dataset information, i.e. Dataset(...)? @steysie
The following cmd works for me:
$ python -m supar.cmds.biaffine_dependency train -b -d 0 \
-p <path> \
-f char \
--punct \
--train <train> \
--dev <dev> \
--test <test> \
--embed <embed> \
--unk <unk> \
--n-embed 300 \
--bert bert-base-multilingual-cased
Please pull the update first.
@yzhangcs For some reason, the training still is interrupted :(
The Dataset(...) info is not shown when running the script.
I load the data and run the script from this Colab notebook. Could the problem be in Colab? Or is it the data? Although, I have used the same dataset for training morphological taggers and there were never any problems with it.
Have you tried to run the parser on your local machine? By monitoring the preprocessing step, I found that it consumes a great deal of memory. I guess an OOM error occurred when running on Colab.
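As a rough way to see where the memory goes, one can log peak RSS around the preprocessing call. This is a minimal Unix-only sketch using only the standard library; the allocation below is just a stand-in for the real preprocessing step, and none of this is supar code:

```python
import resource
import sys

def peak_rss_mib() -> float:
    """Peak resident set size of this process, in MiB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        rss //= 1024
    return rss / 1024

# Example: watch the peak grow across a large allocation,
# e.g. around the call that builds the fields/vocabulary.
before = peak_rss_mib()
buf = bytearray(100 * 1024 * 1024)  # stand-in for the real preprocessing step
after = peak_rss_mib()
print(f"peak RSS: {before:.1f} MiB -> {after:.1f} MiB")
```

If the printed peak approaches the machine's RAM, the kernel's OOM killer terminating the process with a bare "Killed" is the likely explanation.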
@yzhangcs Same result:
2020-08-03 08:09:40 INFO
----------------+--------------------------
Param | Value
----------------+--------------------------
tree | False
proj | False
mode | train
path | exp/ptb.biaffine.dependency.char/model
device | 0
seed | 1
threads | 16
batch_size | 5000
feat | char
build | True
punct | True
max_len | None
buckets | 32
train | corpus/_UD/UD_Russian-Taiga/ru_taiga-ud-train.conllu
dev | corpus/_UD/UD_Russian-Taiga/ru_taiga-ud-dev.conllu
test | corpus/_UD/UD_Russian-Taiga/ru_taiga-ud-test.conllu
embed | glove.42B.300d.txt
unk | unk
n_embed | 300
bert | bert-base-multilingual-cased
----------------+--------------------------
2020-08-03 08:09:40 INFO Build the fields
Killed
I tried to run the script on a significantly smaller dataset - the training is still killed.
I'm starting to think that there may be problems with the data, like cycles in dependency trees. Could that overload the memory?
Also, is it ok to feed CoNLL-U, not CoNLL-X, files as input?
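One quick way to rule out cycles in the trees is a small standalone check over the HEAD column. This is a sketch, not part of supar; it skips comment lines, multiword-token ranges, and empty nodes:

```python
def sentence_has_cycle(heads):
    """True if following HEAD pointers from some token never reaches the root (0)."""
    for tok in heads:
        seen, node = set(), tok
        while node != 0:
            if node in seen:      # revisited a node: cycle
                return True
            seen.add(node)
            node = heads.get(node, 0)
    return False

def find_bad_sentences(conllu_text):
    """Return 1-based indices of sentences whose HEAD column contains a cycle."""
    bad = []
    for i, block in enumerate(conllu_text.strip().split("\n\n"), start=1):
        heads = {}
        for line in block.splitlines():
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
            if "-" in cols[0] or "." in cols[0]:
                continue
            heads[int(cols[0])] = int(cols[6])
        if sentence_has_cycle(heads):
            bad.append(i)
    return bad
```

Running this over the training file and inspecting any reported sentence indices would confirm or rule out malformed trees.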
Sorry, I can't reproduce your problem. I ran the following command on Russian CoNLL18 (ud2.2) data in conllu format, and it works:
python -m supar.cmds.biaffine_dependency train -b -d 0 \
-p ./model \
-f bert \
--punct \
--train data/conll18/ru/train.auto.conllu \
--dev data/conll18/ru/dev.gold.conllu \
--test data/conll18/ru/test.gold.conllu \
--embed data/fasttext/cc.ru.300.vec \
--unk '' \
--n-embed 300 \
--bert bert-base-multilingual-cased
Here are some example outputs:
2020-08-03 19:20:12 INFO
----------------+--------------------------
Param | Value
----------------+--------------------------
tree | False
proj | False
mode | train
path | ./model
device | 0
seed | 1
threads | 16
batch_size | 5000
feat | bert
build | True
punct | True
max_len | None
buckets | 32
train | data/conll18/ru/train.auto.conllu
dev | data/conll18/ru/dev.gold.conllu
test | data/conll18/ru/test.gold.conllu
embed | data/fasttext/cc.ru.300.vec
unk |
n_embed | 300
bert | bert-base-multilingual-cased
----------------+--------------------------
2020-08-03 19:20:12 INFO Build the fields
2020-08-03 19:20:15 ERROR Using bos_token, but it is not set yet.
2020-08-03 19:33:39 INFO Load the data
2020-08-03 19:34:56 INFO
train: Dataset(n_sentences=48814, n_batches=201, n_buckets=32)
dev: Dataset(n_sentences=6584, n_batches=41, n_buckets=32)
test: Dataset(n_sentences=6491, n_batches=38, n_buckets=32)
2020-08-03 19:34:56 INFO BiaffineDependencyModel(
(word_embed): Embedding(45626, 300)
(feat_embed): BertEmbedding(bert-base-multilingual-cased, n_layers=4, n_out=100, pad_index=0)
(embed_dropout): IndependentDropout(p=0.33)
(lstm): BiLSTM(400, 400, num_layers=3, dropout=0.33)
(lstm_dropout): SharedDropout(p=0.33, batch_first=True)
(mlp_arc_d): MLP(n_in=800, n_out=500, dropout=0.33)
(mlp_arc_h): MLP(n_in=800, n_out=500, dropout=0.33)
(mlp_rel_d): MLP(n_in=800, n_out=100, dropout=0.33)
(mlp_rel_h): MLP(n_in=800, n_out=100, dropout=0.33)
(arc_attn): Biaffine(n_in=500, n_out=1, bias_x=True)
(rel_attn): Biaffine(n_in=100, n_out=41, bias_x=True, bias_y=True)
(criterion): CrossEntropyLoss()
(pretrained): Embedding(1675340, 300)
)
2020-08-03 19:34:56 INFO Epoch 1 / 5000:
100%|####################################| 201/201 01:20<00:00, 2.50it/s, lr: 1.9770e-03 - loss: 2.6031 - UCM: 9.38% LCM: 4.26% UAS: 51.66% LAS: 36.51%
2020-08-03 19:36:24 INFO dev: - loss: 1.4177 - UCM: 18.88% LCM: 10.30% UAS: 75.79% LAS: 67.42%
2020-08-03 19:36:31 INFO test: - loss: 1.2628 - UCM: 17.33% LCM: 9.23% UAS: 76.52% LAS: 68.35%
Note that the first line in your embedding file should be <word> <embedding>.
Could you please tell me what your RAM and GPU memory are?
I upgraded Colab RAM from 12 to 25 GB and GPU from 11 to 16 GB, and now after the ERROR Using bos_token... message it doesn't fail right away, but only after 3-4 minutes.
I wonder what the recommended memory capacities are.
RAM: 263GB GPU: 32GB
I see. Then, low RAM is the problem. Have you by any chance tried to run the model with a much lower RAM? Maybe there are some parameter combinations that would work?
A way to get around this is to shrink the word embeddings, keeping only the words that appear in the train/dev/test sets.
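For illustration, shrinking the .vec file to the corpus vocabulary takes only a few lines of Python. This is a sketch: the helper names and the assumption of a fastText-style header line are mine, not supar's:

```python
def conllu_vocab(conllu_text):
    """Collect surface forms (FORM column) from the token lines of a CoNLL-U file."""
    vocab = set()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if cols[0].isdigit():           # plain token lines only
            vocab.add(cols[1])
    return vocab

def filter_embeddings(vec_lines, vocab):
    """Keep only embedding rows whose word is in `vocab`.

    Also drops a fastText-style "<n_words> <dim>" header line if present,
    so that every remaining line is "<word> <embedding>".
    """
    kept = []
    for i, line in enumerate(vec_lines):
        parts = line.rstrip("\n").split(" ")
        if i == 0 and len(parts) == 2:  # header line
            continue
        if parts[0] in vocab:
            kept.append(line)
    return kept

vec_lines = [
    "2 3\n",                   # fastText header: <n_words> <dim>
    "лает 0.1 0.2 0.3\n",
    "собака 0.4 0.5 0.6\n",
]
print(filter_embeddings(vec_lines, {"лает"}))  # header and OOV row dropped
```

In practice one would build the vocabulary from the union of all three .conllu splits and stream the full .vec file line by line rather than loading it into memory.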
@yzhangcs Thank you for the idea!
It works now with filtered embeddings, once the first line is changed to the <word> <embedding> format.
Thank you for your help!