wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. Paper: "What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models" (Findings of EMNLP 2020)
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

help :) #19

Closed. jwijffels closed this issue 3 years ago.

jwijffels commented 3 years ago

I don't know whether you are open to providing some help. I am getting the following error in a named entity recognition fine-tuning task that I am running on Google Colab. This is my config:

data:
  name: "getuigenissen-ner"
  input: "/content/getuigenissen"
  num_labels: 25

model:
  shortname: "bertje"
  name: "wietsedv/bert-base-dutch-cased"
  type: "bert"

train:
  max_epochs: 200

And this is the error I get when I start the fine-tuning with: python main.py data/getuigenissen-ner

/content/bertje/finetuning/v2
2020-12-10 13:47:09.458303: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
importing config from "configs/default.yaml"
importing config from "configs/data/getuigenissen-ner.yaml"
data:
  cache: cache/{}-{}
  cfgs: [data/udlassy-pos, data/lassysmall-pos, data/conll2002-ner, data/sonar-ner,
    data/udlassy-ner, data/110kdbrd, data/110kdbrd-2, data/twisty, data/twisty2, data/twisty3,
    data/twisty-merge-4, data/twisty4-merge-4]
  clip_start: false
  dev: true
  input: /content/getuigenissen
  logs: logs/{}-{}
  merge: null
  name: getuigenissen-ner
  num_labels: 25
  num_sents: 1
  output: output/{}-{}
  token_level: true
  verify: false
eval: {batch_size: 64}
force: false
model:
  cfgs: [models/bertje, models/multi, models/bertnl, models/robbert]
  checkpoint: -1
  device: cuda
  do_export: true
  do_train: true
  lower_case: false
  name: wietsedv/bert-base-dutch-cased
  shortname: bertje
  type: bert
optimizer: {adam_epsilon: 1.0e-08, learning_rate: 5.0e-05, max_grad_norm: 1.0, warmup_steps: 512,
  weight_decay: 0.05}
summary: {groups: false, method: accuracy, probs: false, type: dev}
train: {attention_dropout: 0.2, batch_size: 6, eval_steps: 0.25, gradient_accumulation_steps: 4,
  hidden_dropout: 0.3, logging_steps: 0.1, max_epochs: 200, max_grad_norm: 1.0, seed: 42323}
verbose: true
Loading tokenizer "wietsedv/bert-base-dutch-cased"
Downloading: 100% 241k/241k [00:00<00:00, 19.8MB/s]
 ➤ Loading data from train.tsv
   dataset has 26 labels
  0% 0/1402 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
100% 1402/1402 [00:05<00:00, 260.09it/s]
 ➤ Cached data in cache/getuigenissen-ner-bertje/train.tsv.pkl
Train data: 1402 examples, 26 labels: ['O', 'b-activiteit', 'b-bedrag', 'b-beroep', 'b-beschrijving', 'b-citaat', 'b-emotie', 'b-geo', 'b-leeftijd', 'b-misdrijf', 'b-object', 'b-persoon', 'b-tijd', 'i-activiteit', 'i-bedrag', 'i-beroep', 'i-beschrijving', 'i-citaat', 'i-emotie', 'i-geo', 'i-leeftijd', 'i-misdrijf', 'i-object', 'i-persoon', 'i-tijd', 'o']
 ➤ Loading data from dev.tsv
   dataset has 25 labels
100% 589/589 [00:02<00:00, 245.37it/s]
 ➤ Cached data in cache/getuigenissen-ner-bertje/dev.tsv.pkl
Dev data: 589 examples
Loading model "wietsedv/bert-base-dutch-cased"
Downloading: 100% 433/433 [00:00<00:00, 618kB/s]
Downloading: 100% 439M/439M [00:04<00:00, 88.5MB/s]
Some weights of the model checkpoint at wietsedv/bert-base-dutch-cased were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at wietsedv/bert-base-dutch-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Start training
Global step intervals: Logging=5 Eval=14
Starting at epoch 0
 > Start epoch 0/200
Batch:   0% 0/234 [00:00<?, ?it/s]THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=29 error=710 : device-side assert triggered
Batch:   0% 0/234 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 387, in <module>
    main()
  File "main.py", line 364, in main
    train(model, train_dataset, dev_dataset, state)
  File "main.py", line 165, in train
    loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:29
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [18,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [19,0,0] Assertion `t >= 0 && t < n_classes` failed.
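
For what it's worth, the assertion `t >= 0 && t < n_classes` suggests that some target label id passed to the loss falls outside [0, num_labels). A minimal sketch of that failure mode (plain PyTorch, not this repo's code):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 25)            # classifier head sized for 25 classes
targets = torch.tensor([0, 3, 25, 7])  # label id 25 is outside [0, 25)

# On CPU this raises an IndexError ("Target 25 is out of bounds"); on CUDA
# the same call dies with a device-side assert like the one in the log above.
loss = F.cross_entropy(logits, targets)

Because CUDA kernels run asynchronously, the traceback points at loss.backward() rather than at the loss computation itself; rerunning with CUDA_LAUNCH_BLOCKING=1, or with device: cpu in the config, surfaces the real line.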
wietsedv commented 3 years ago

Of course I'm willing to help. You should set num_labels to 26 instead of 25. If you look at the output, you can see that there are 26 labels:

 ➤ Loading data from train.tsv
   dataset has 26 labels
 ➤ Cached data in cache/getuigenissen-ner-bertje/train.tsv.pkl
Train data: 1402 examples, 26 labels: ['O', 'b-activiteit', 'b-bedrag', 'b-beroep', 'b-beschrijving', 'b-citaat', 'b-emotie', 'b-geo', 'b-leeftijd', 'b-misdrijf', 'b-object', 'b-persoon', 'b-tijd', 'i-activiteit', 'i-bedrag', 'i-beroep', 'i-beschrijving', 'i-citaat', 'i-emotie', 'i-geo', 'i-leeftijd', 'i-misdrijf', 'i-object', 'i-persoon', 'i-tijd', 'o']

EDIT: Actually, you should keep the number of labels at 25 and uppercase or lowercase (I do not remember which one) your o label. You currently have both an uppercase and a lowercase variant. One of them is added automatically, so you should match whichever one is added automatically.
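
If it helps, a one-off normalization script could look like the sketch below. It assumes the data files are tab-separated with the label in the last column (the exact layout and paths are assumptions based on the config above) and maps the stray lowercase o tag to uppercase O:

def fix_labels(path):
    # Read the file, uppercase any bare "o" label in the last column,
    # and write the result back in place. Blank lines (sentence
    # boundaries) are left untouched.
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    out = []
    for line in lines:
        if line.strip():
            cols = line.split("\t")
            if cols[-1] == "o":
                cols[-1] = "O"
            line = "\t".join(cols)
        out.append(line)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(out) + "\n")

for split in ("train", "dev"):
    fix_labels(f"/content/getuigenissen/{split}.tsv")

The opposite mapping (O to o) would work just as well, as long as the result matches the variant that is added automatically.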

jwijffels commented 3 years ago

Ah, so that's it. I could not figure out why this did not match the number of categories in my train.tsv file. I'll give uppercasing a try.

jwijffels commented 3 years ago

Thanks for the help. I've uppercased all the categories and training has now started. https://colab.research.google.com/drive/16zr_LJOfVqPquGV8Idk1y1XhjyFDewJV?usp=sharing Feel free to give feedback on whether this training run on Google Colab looks fine. I'll go ahead and close the issue. Thanks again for your time so far.