yuvalkirstain / s2e-coref

Cannot train using spanbert #2

Closed: 1nefootstep closed this issue 3 years ago

1nefootstep commented 3 years ago

Hello, I tried to train using spanbert-base with this configuration:

export OUTPUT_DIR=model_output
export CACHE_DIR=cache
export DATA_DIR=data

python run_coref.py \
        --output_dir=$OUTPUT_DIR \
        --cache_dir=$CACHE_DIR \
        --model_type=SpanBERT \
        --model_name_or_path=SpanBERT/spanbert-base-cased \
        --tokenizer_name=bert-base-cased \
        --config_name=SpanBERT/spanbert-base-cased  \
        --train_file=$DATA_DIR/train.english.jsonlines \
        --predict_file=$DATA_DIR/dev.english.jsonlines \
        --do_train \
        --do_eval \
        --num_train_epochs=129 \
        --logging_steps=500 \
        --save_steps=3000 \
        --eval_steps=1000 \
        --max_seq_length=384 \
        --train_file_cache=$DATA_DIR/train.english.384.pkl \
        --predict_file_cache=$DATA_DIR/dev.english.384.pkl \
        --gradient_accumulation_steps=1 \
        --normalise_loss \
        --max_total_seq_len=400 \
        --experiment_name="s2e-model" \
        --warmup_steps=5600 \
        --adam_epsilon=1e-6 \
        --amp \
        --head_learning_rate=3e-4 \
        --learning_rate=1e-5 \
        --adam_beta2=0.98 \
        --weight_decay=0.01 \
        --dropout_prob=0.3 \
        --save_if_best \
        --top_lambda=0.4  \
        --tensorboard_dir=$OUTPUT_DIR/tb \
        --overwrite_output_dir \
        --conll_path_for_eval=$DATA_DIR/dev.english.v4_gold_conll

and got this error: AttributeError: 'BertConfig' object has no attribute 'attention_window'

Then I changed the model instantiated in modeling.py from self.longformer = LongformerModel(config) to self.longformer = BertModel(config)

and got this error: RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

Any help would be greatly appreciated!

yuvalkirstain commented 3 years ago

Hey, perhaps try also renaming self.longformer to self.bert, and try to debug exactly where it crashes (hopefully it won't). If that doesn't do the trick, post the entire error traceback here and I'll try to help. In general, we don't recommend using our code with SpanBERT due to its limited sequence length.
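
For what it's worth, a minimal sketch of what those two edits might look like in modeling.py, assuming the encoder is created in __init__ and called in forward; the surrounding code is omitted and the class and argument names here are only illustrative, not necessarily the repo's exact ones:

# sketch only -- not the repo's full model class
from transformers import BertModel, BertPreTrainedModel

class S2E(BertPreTrainedModel):
    def __init__(self, config, args=None):
        super().__init__(config)
        # was: self.longformer = LongformerModel(config)
        self.bert = BertModel(config)
        # ... mention / antecedent scoring heads stay unchanged ...

    def forward(self, input_ids, attention_mask=None, **kwargs):
        # was: outputs = self.longformer(input_ids, attention_mask=attention_mask)
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        sequence_output = outputs[0]  # [batch, seq_len, hidden]
        # ... rest of the forward pass unchanged ...
        return sequence_output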

1nefootstep commented 3 years ago

Hello, thanks for replying! Unfortunately, it still ran into problems.

Traceback (most recent call last):
  File "run_coref.py", line 155, in <module>
    main()
  File "run_coref.py", line 122, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer, evaluator)
  File "/home/l/user/github/s2e-coref/training.py", line 138, in train
    return_all_outputs=False)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/l/user/github/s2e-coref/modeling.py", line 204, in forward
    outputs = self.bert(input_ids, attention_mask=attention_mask)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/transformers/modeling_bert.py", line 841, in forward
    return_dict=return_dict,
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/transformers/modeling_bert.py", line 482, in forward
    output_attentions,
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/transformers/modeling_bert.py", line 402, in forward
    output_attentions=output_attentions,
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/transformers/modeling_bert.py", line 339, in forward
    output_attentions,
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/transformers/modeling_bert.py", line 240, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/functional.py", line 1372, in linear
    output = input.matmul(weight.t())
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

When running with --no_cuda, the underlying error shows up instead:

Traceback (most recent call last):
  File "run_coref.py", line 155, in <module>
    main()
  File "run_coref.py", line 122, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer, evaluator)
  File "/home/l/user/github/s2e-coref/training.py", line 138, in train
    return_all_outputs=False)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/l/user/github/s2e-coref/modeling.py", line 204, in forward
    outputs = self.bert(input_ids, attention_mask=attention_mask)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/transformers/modeling_bert.py", line 831, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/transformers/modeling_bert.py", line 197, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/l/user/miniconda3/envs/s2e/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 49518 out of table with 28995 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418

yuvalkirstain commented 3 years ago

Hey, this issue does not seem to be caused by our algorithm. Also, note that the code is pretty tightly coupled with Longformer (if you would like to make it more generic, that would be cool :) ). So, unfortunately, I don't think I will be able to help here.
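
For anyone who wants to attempt that: a rough, untested sketch of decoupling the encoder with Hugging Face's AutoModel, so that the checkpoint's model_type decides whether a Longformer, BERT, or other encoder is loaded. The wrapper name and structure below are illustrative, not the repo's actual classes.

import torch
from transformers import AutoConfig, AutoModel

class GenericEncoder(torch.nn.Module):
    """Hypothetical wrapper: loads whichever encoder matches the checkpoint."""
    def __init__(self, model_name_or_path, cache_dir=None):
        super().__init__()
        config = AutoConfig.from_pretrained(model_name_or_path, cache_dir=cache_dir)
        self.encoder = AutoModel.from_pretrained(
            model_name_or_path, config=config, cache_dir=cache_dir
        )

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids, attention_mask=attention_mask)
        return outputs[0]  # last hidden states, fed to the s2e scoring heads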

CharlesLao commented 2 years ago

        for idx, (word, speaker) in enumerate(zip(words, speakers)):
            if last_speaker != speaker:
                speaker_prefix = [SPEAKER_START] + self.tokenizer.encode(" " + speaker,
                                                                         add_special_tokens=False) + [SPEAKER_END]
                last_speaker = speaker

Here, in data.py, the hard-coded ids for SPEAKER_START and SPEAKER_END are not in your vocab.txt, so the embedding lookup goes out of range. Just make the values of SPEAKER_START and SPEAKER_END smaller (i.e., within the BERT vocabulary). @1nefootstep
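
For illustration, a sketch of that fix (the token strings and variable names are illustrative, not the repo's exact constants): rather than reusing ids that only exist in the Longformer/RoBERTa vocabulary (e.g. 49518 in the traceback, versus a BERT vocabulary of roughly 29k entries), one can register the speaker markers as new special tokens and resize the embedding matrix, or simply pick ids that already exist in vocab.txt.

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("SpanBERT/spanbert-base-cased")

# Add the speaker markers as real tokens so their ids are inside the vocab.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[SPEAKER_START]", "[SPEAKER_END]"]}
)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding table to match

# Use these ids in place of the hard-coded constants referenced in data.py.
SPEAKER_START = tokenizer.convert_tokens_to_ids("[SPEAKER_START]")
SPEAKER_END = tokenizer.convert_tokens_to_ids("[SPEAKER_END]")
assert max(SPEAKER_START, SPEAKER_END) < model.config.vocab_size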