zjunlp / OntoProtein

[ICLR 2022] OntoProtein: Protein Pretraining With Gene Ontology Embedding

Backward Error during Pre-training #2

Closed · chang-github-00 closed this issue 2 years ago

chang-github-00 commented 2 years ago

Dear authors: Thank you for your inspiring work! When we try to follow your work and run the following script for pre-training:

python -m torch.distributed.launch --nproc_per_node=4 run_pretrain.py \
  --output_dir 'output' \
  --do_train \
  --in_memory $IN_MEMORY \
  --max_protein_seq_length $MAX_PROTEIN_SEQ_LENGTH \
  --model_protein_seq_data true \
  --model_protein_go_data true \
  --model_go_go_data true \
  --use_desc $USE_DESC \
  --max_text_seq_length $MAX_TEXT_SEQ_LENGTH \
  --dataloader_protein_go_num_workers $PROTEIN_GO_NUM_WORKERS \
  --dataloader_go_go_num_workers $GO_GO_NUM_WORKERS \
  --dataloader_protein_seq_num_workers $PROTEIN_SEQ_NUM_WORKERS \
  --num_protein_go_neg_sample $NUM_PROTEIN_GO_NEG_SAMPLE \
  --num_go_go_neg_sample $NUM_GO_GO_NEG_SAMPLE \
  --negative_sampling_fn $NEGTIVE_SAMPLING_FN \
  --protein_go_sample_head $PROTEIN_GO_SAMPLE_HEAD \
  --protein_go_sample_tail $PROTEIN_GO_SAMPLE_TAIL \
  --go_go_sample_head $GO_GO_SAMPLE_HEAD \
  --go_go_sample_tail $GO_GO_SAMPLE_TAIL \
  --protein_model_file_name $PROTEIN_MODEL_PATH \
  --text_model_file_name $TEXT_MODEL_PATH \
  --go_encoder_cls $GO_ENCODER_CLS \
  --protein_encoder_cls $PROTEIN_ENCODER_CLS \
  --ke_embedding_size $KE_EMBEDDING_SIZE \
  --double_entity_embedding_size $DOUBLE_ENTITY_EMBEDDING_SIZE \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BATCH_SIZE \
  --weight_decay $WEIGHT_DECAY \
  --optimize_memory $OPTIMIZE_MEMORY \
  --gradient_accumulation_steps $ACCUMULATION_STEPS \
  --lr_scheduler_type $SCHEDULER_TYPE \
  --mlm_lambda $MLM_LAMBDA \
  --lm_learning_rate $MLM_LEARNING_RATE \
  --lm_warmup_ratio $LM_WARMUP_RATIO \
  --ke_warmup_ratio $KE_WARMUP_RATIO \
  --ke_lambda $KE_LAMBDA \
  --ke_learning_rate $KE_LEARNING_RATE \
  --ke_max_score $KE_MAX_SCORE \
  --ke_score_fn $KE_SCORE_FN \
  --seed 2021 \
  --fp16 \
  --dataloader_pin_memory

we get the following error:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes, or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters being used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases by default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 486 with name protein_lm.bert.encoder.layer.29.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
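
For context, the workaround the message points to would look roughly like the sketch below. This is a minimal illustration of _set_static_graph() under a standard DDP setup, not OntoProtein code; wrap_for_ddp and the LOCAL_RANK handling are assumptions about the launch environment.

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_ddp(model: torch.nn.Module) -> DDP:
    # Assumes the launcher exports LOCAL_RANK (torchrun and
    # torch.distributed.launch --use_env both do).
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    # Declare the autograd graph static across iterations, so parameters that
    # participate in several forward/backward passes per step (e.g. one encoder
    # shared by the MLM and KE objectives) are not marked "ready" twice.
    ddp_model._set_static_graph()
    return ddp_model

On PyTorch 1.11+, passing static_graph=True to the DDP constructor is the public equivalent; either way this only applies if the graph really is identical across iterations.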

This error seems to appear only when performing DDP training. We have also tried DeepSpeed, but an error occurs while loading the optimizer at line 168 in trainer.py.

We wonder how to solve this. Thank you very much!

Alexzhuan commented 2 years ago

Hi, thanks for your interest in our work!

Our training framework is based on DeepSpeed, so some errors may occur when directly running the pre-training script with torch.distributed.launch.
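
For reference, launching with the DeepSpeed launcher would look roughly like the command below. This is a sketch rather than the exact invocation: ds_config.json is a placeholder name, and --deepspeed assumes the script exposes the standard Hugging Face Trainer argument; the run scripts in the repository have the exact flags.

deepspeed --num_gpus=4 run_pretrain.py \
  --deepspeed ds_config.json \
  ...(remaining arguments as in the script above)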

Could you share the detailed error information from DeepSpeed?

We will check and release a new version of our code. Thank you!