yuvalkirstain / s2e-coref

MIT License

Training and Evaluation with smaller GPU #1

Closed shon-otmazgin closed 3 years ago

shon-otmazgin commented 3 years ago

Hello, I am trying to train the model on a GPU with 16GB of memory, but I get an out-of-memory error during training. I tried setting max_seq_length and max_total_seq_len to smaller values than in the README, but it still crashes. Is it possible to train the model with less GPU memory than the 32GB reported in the paper?

My second question is about evaluation. I saw that the batch size for eval is 1. Why? Is it configurable? Can we increase it to speed up inference?

Thanks.

yuvalkirstain commented 3 years ago

If you decrease max_total_seq_len, it should work fine on smaller GPUs; note that long examples will be excluded.

Yes, you should be able to change the eval batch size to speed up inference.

shon-otmazgin commented 3 years ago

I decreased max_total_seq_len from 5000 to 512 and it still crashes with an out-of-memory error. It might be a combination of max_total_seq_len and the Longformer size (large vs. base).

About the eval batch size: is it configurable? If not, I tried changing

eval_dataloader = BucketBatchSampler(eval_dataset, max_total_seq_len=self.args.max_total_seq_len, batch_size_1=True)

to

eval_dataloader = BucketBatchSampler(eval_dataset, max_total_seq_len=self.args.max_total_seq_len, batch_size_1=False)

but now the iterator over eval_dataloader returns tuples of a different size.
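
For what it's worth, here is a minimal, self-contained sketch of the shape mismatch (the yield shapes are an assumption based on the behavior described above, not the repo's actual BucketBatchSampler internals):

    # Hypothetical illustration: with batch_size_1=True each iteration yields
    # one example's fields, while with batch_size_1=False a whole bucket comes
    # back at once, so a loop that unpacks a fixed-size tuple breaks.
    def fake_loader(examples, batch_size_1):
        if batch_size_1:
            for ex in examples:          # one (input_ids, clusters) pair per step
                yield ex
        else:
            yield tuple(zip(*examples))  # (all_input_ids, all_clusters) instead

    examples = [([1, 2, 3], ["c1"]), ([4, 5], ["c2"])]

    for input_ids, clusters in fake_loader(examples, batch_size_1=True):
        print(len(input_ids))            # works as the eval loop expects: 3, then 2

    batch = next(iter(fake_loader(examples, batch_size_1=False)))
    print(len(batch), len(batch[0]))     # 2 2 -- a different structure entirely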

yuvalkirstain commented 3 years ago

I just ran the code using this command:

export OUTPUT_DIR=outputs
export CACHE_DIR=.cache
export MODEL_DIR=model
export DATA_DIR=data
export SPLIT_FOR_EVAL=dev

python run_coref.py \
        --output_dir=$OUTPUT_DIR \
        --cache_dir=$CACHE_DIR \
        --model_type=longformer \
        --model_name_or_path=$MODEL_DIR \
        --tokenizer_name=allenai/longformer-large-4096 \
        --config_name=allenai/longformer-large-4096  \
        --train_file=$DATA_DIR/train.english.jsonlines \
        --predict_file=$DATA_DIR/dev.english.jsonlines \
        --do_eval \
        --num_train_epochs=129 \
        --logging_steps=500 \
        --save_steps=3000 \
        --eval_steps=1000 \
        --max_seq_length=1024 \
        --train_file_cache=$DATA_DIR/train.english.4096.pkl \
        --predict_file_cache=$DATA_DIR/dev.english.4096.pkl \
        --amp \
        --normalise_loss \
        --max_total_seq_len=1024 \
        --experiment_name=eval_model \
        --warmup_steps=5600 \
        --adam_epsilon=1e-6 \
        --head_learning_rate=3e-4 \
        --learning_rate=1e-5 \
        --adam_beta2=0.98 \
        --weight_decay=0.01 \
        --dropout_prob=0.3 \
        --save_if_best \
        --top_lambda=0.4  \
        --tensorboard_dir=$OUTPUT_DIR/tb \
        --conll_path_for_eval=$DATA_DIR/$SPLIT_FOR_EVAL.english.v4_gold_conll

and the maximum GPU memory utilization was about 3GB, and everything worked. Make sure to change the max_seq_length (maximum per-example size) and max_total_seq_len (maximum total batch size) fields, clear the cache_dir, and use apex fp16 (the script should indicate amp training: True). Can you please try again?

Regarding the different tuple sizes: it seems that, unlike training (which supports batching), the current script does not support batching during evaluation. Given that there are only ~300 examples in the dev set, this did not bother us (the evaluation stage of the run whose parameters are shared above takes about 30 seconds).
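
To make the two knobs concrete, here is a minimal, self-contained sketch of the bucketing idea: max_seq_length caps a single example, while max_total_seq_len caps a padded batch's token budget. A greedy length-sorted strategy is assumed here; this is an illustration, not the repo's actual BucketBatchSampler.

    def bucket_batches(example_lengths, max_seq_length, max_total_seq_len,
                       batch_size_1=False):
        """Yield lists of example indices whose padded size fits the budget."""
        # Examples over the per-example cap are dropped (hence "long examples
        # will be excluded" when the caps are lowered).
        kept = [(i, n) for i, n in enumerate(example_lengths) if n <= max_seq_length]
        kept.sort(key=lambda x: x[1])    # sort by length to limit padding waste
        batch = []
        for i, n in kept:
            if batch_size_1:             # evaluation mode: one example per batch
                yield [i]
                continue
            # Padded size of the batch would be n * (len(batch) + 1), since n
            # is the longest example so far in this sorted order.
            if batch and n * (len(batch) + 1) > max_total_seq_len:
                yield batch
                batch = []
            batch.append(i)
        if batch:
            yield batch

    lengths = [300, 900, 1200, 450, 5000, 700]
    for b in bucket_batches(lengths, max_seq_length=1024, max_total_seq_len=1024):
        print(b)   # [0, 3], then [5], then [1] -- 1200 and 5000 were excluded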

shon-otmazgin commented 3 years ago

@yuvalkirstain Yes, that works for me, but the script above only runs eval. I asked about training.

yuvalkirstain commented 3 years ago

Did you try adding the --do_train flag?
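
For concreteness, a sketch of the change (only the added flag differs; keep the remaining arguments from the eval command above):

python run_coref.py \
        --do_train \
        --do_eval \
        --output_dir=$OUTPUT_DIR \
        --max_seq_length=1024 \
        --max_total_seq_len=1024
        # ...plus all other flags from the eval command above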

shon-otmazgin commented 3 years ago

Of course. I am getting out of memory during training.

shon-otmazgin commented 3 years ago

I was able to train with max_total_seq_len=128 and max_seq_length=128. I will test higher values. @yuvalkirstain It is kind of a problem to publish the Colab since I can't publish the data. BTW, the Colab follows your exact README instructions.
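
For reference, the settings that fit the 16GB GPU here look roughly like this (a sketch; per the earlier comment, documents longer than the cap are excluded at these limits):

python run_coref.py \
        --do_train \
        --do_eval \
        --max_seq_length=128 \
        --max_total_seq_len=128 \
        --amp
        # ...remaining flags as in the eval command above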

Twim17 commented 9 months ago

@shon-otmazgin Hi, were you able to train a decent model? If you did, did you notice a really unstable loss or vanishing gradients?