If you decrease the max_total_seq_length it should work fine with smaller GPUs; note that long examples will be excluded.
Yes, you should be able to change the eval batch size to speed up inference.
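A hedged sketch of what "long examples will be excluded" can mean in practice, assuming the sampler greedily packs documents into a batch until their combined token count reaches max_total_seq_len and simply skips any single document longer than that cap (the real sampler logic in the repo may differ):

# Illustrative sketch only -- not the repo's actual sampler code.
# Assumption: each example's tokenized length is given by len(example).
def bucket_batches(examples, max_total_seq_len):
    """Greedily pack examples into batches whose summed token count stays under the cap."""
    batch, batch_len = [], 0
    for ex in sorted(examples, key=len, reverse=True):
        if len(ex) > max_total_seq_len:
            # A single document already exceeds the cap, so it cannot fit in any batch:
            # this is the "long examples will be excluded" case.
            continue
        if batch_len + len(ex) > max_total_seq_len:
            yield batch
            batch, batch_len = [], 0
        batch.append(ex)
        batch_len += len(ex)
    if batch:
        yield batch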
I decreased max_total_seq_length from 5000 to 512 and it still crashes with no memory to allocate. It might be a combination of max_total_seq_length and the longformer type (large/base).
About the eval batch size: is that something configurable?
If not, I tried to change:
eval_dataloader = BucketBatchSampler(eval_dataset, max_total_seq_len=self.args.max_total_seq_len, batch_size_1=True)
to
eval_dataloader = BucketBatchSampler(eval_dataset, max_total_seq_len=self.args.max_total_seq_len, batch_size_1=False)
but now the iterator over eval_dataloader returns a different tuple size.
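Before changing the unpacking in the evaluation loop, it can help to print what each iteration actually yields and adapt the loop to that; a minimal probe along these lines (the import path for BucketBatchSampler, and eval_dataset/args coming from the existing eval script, are assumptions):

import torch
# NOTE: adjust this import to wherever BucketBatchSampler is defined in the repo.
from data import BucketBatchSampler

eval_dataloader = BucketBatchSampler(eval_dataset, max_total_seq_len=args.max_total_seq_len, batch_size_1=False)

for step, batch in enumerate(eval_dataloader):
    # Print how many elements the yielded tuple has and the tensor shapes,
    # then adjust the evaluation loop's unpacking to match.
    print(step, len(batch), [t.shape for t in batch if isinstance(t, torch.Tensor)])
    if step == 2:
        break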
I just ran the code using this command:
export OUTPUT_DIR=outputs
export CACHE_DIR=.cache
export MODEL_DIR=model
export DATA_DIR=data
export SPLIT_FOR_EVAL=dev
python run_coref.py \
--output_dir=$OUTPUT_DIR \
--cache_dir=$CACHE_DIR \
--model_type=longformer \
--model_name_or_path=$MODEL_DIR \
--tokenizer_name=allenai/longformer-large-4096 \
--config_name=allenai/longformer-large-4096 \
--train_file=$DATA_DIR/train.english.jsonlines \
--predict_file=$DATA_DIR/dev.english.jsonlines \
--do_eval \
--num_train_epochs=129 \
--logging_steps=500 \
--save_steps=3000 \
--eval_steps=1000 \
--max_seq_length=1024 \
--train_file_cache=$DATA_DIR/train.english.4096.pkl \
--predict_file_cache=$DATA_DIR/dev.english.4096.pkl \
--amp \
--normalise_loss \
--max_total_seq_len=1024 \
--experiment_name=eval_model \
--warmup_steps=5600 \
--adam_epsilon=1e-6 \
--head_learning_rate=3e-4 \
--learning_rate=1e-5 \
--adam_beta2=0.98 \
--weight_decay=0.01 \
--dropout_prob=0.3 \
--save_if_best \
--top_lambda=0.4 \
--tensorboard_dir=$OUTPUT_DIR/tb \
--conll_path_for_eval=$DATA_DIR/$SPLIT_FOR_EVAL.english.v4_gold_conll
and the maximum GPU memory utilization was about 3GB, and everything was OK. Make sure to change the max_seq_length (maximum per-example size) and max_total_seq_len (maximum total batch size) fields, clear the cache_dir, and use apex fp16 (the script should indicate amp training: True). Can you please try again?
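As a quick sanity check that a run actually stays in this low-memory regime, peak GPU usage can be logged with PyTorch and compared against the ~3GB figure above, for example:

import torch

# Report the peak GPU memory allocated so far (e.g. after evaluation or a few
# training steps), to compare against the ~3GB utilization mentioned above.
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"peak GPU memory allocated: {peak_gb:.2f} GB")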
Regarding the different tuple sizes: it seems that, unlike training (which supports batching), the current script does not support batching during evaluation. Given that there are only ~300 examples in the dev set, this did not bother us (the evaluation stage of the run whose parameters are shared above takes about 30 seconds).
@yuvalkirstain Yes, that works for me, but the above script only does eval. I asked about training.
Did you try adding the --do_train flag?
Of course. I am getting out of memory while training.
I was able to train with max_total_seq_len=128 and max_seq_length=128. I will test higher values.
@yuvalkirstain It is kind of a problem to publish the Colab since I can't publish the data. BTW, the Colab follows your exact README instructions.
@shon-otmazgin Hi, were you able to train a decent model? If you did, did you notice a really unstable loss or vanishing gradients?
Hello, I am trying to train the model on a GPU with 16GB of memory, and I am getting an out-of-memory error during training. I tried to change max_total_seq_length and max_total_seq_len to smaller values than in the README file, but it still crashes. Is it possible to train the model with less GPU memory than the 32GB reported in the paper?
My second question is about the evaluation part. I saw that the batch size for eval is 1. Why? Is it configurable? Can we increase it to speed up inference?
Thanks.