zjunlp / OntoProtein

[ICLR 2022] OntoProtein: Protein Pretraining With Gene Ontology Embedding

Trainer problem for pre-training #20

Closed: nihaowxh closed this issue 1 year ago

nihaowxh commented 1 year ago

Hello, researchers. Thanks for your research. I have some questions about the pre-training phase.

  1. Does the pre-training code only support step-based training? When I replace the parameter "--max_steps" with "--num_train_epochs", I get an exception, so I am not sure whether trainer.py supports training by epochs (see the sketch after the log excerpt below for how I would expect epochs to map onto steps).
  2. If only step-based training is supported, I get the per-step results below. Is the loss at each step meaningful? And another question: why does the "mlm" loss keep oscillating after a certain number of steps? Could you give me any advice about this situation? An excerpt of the log:

     {'mlm': 1.3232421875, 'protein_go_ke': 0.66796875, 'go_go_ke': 1.9619140625, 'global_step': 180, 'learning_rate': [9.965326633165831e-06, 9.965326633165831e-06, 1.9930653266331662e-05]}
     {'mlm': 0.77783203125, 'protein_go_ke': 0.66650390625, 'go_go_ke': 1.857421875, 'global_step': 181, 'learning_rate': [9.964824120603016e-06, 9.964824120603016e-06, 1.9929648241206033e-05]}
     {'mlm': 0.7373046875, 'protein_go_ke': 0.64111328125, 'go_go_ke': 1.984375, 'global_step': 182, 'learning_rate': [9.964321608040202e-06, 9.964321608040202e-06, 1.9928643216080404e-05]}
     {'mlm': 0.447509765625, 'protein_go_ke': 2.140625, 'go_go_ke': 2.029296875, 'global_step': 183, 'learning_rate': [9.963819095477387e-06, 9.963819095477387e-06, 1.9927638190954775e-05]}
     {'mlm': 1.3056640625, 'protein_go_ke': 0.64990234375, 'go_go_ke': 1.91015625, 'global_step': 184, 'learning_rate': [9.963316582914575e-06, 9.963316582914575e-06, 1.992663316582915e-05]}
     {'mlm': 2.1015625, 'protein_go_ke': 0.6806640625, 'go_go_ke': 1.8505859375, 'global_step': 185, 'learning_rate': [9.96281407035176e-06, 9.96281407035176e-06, 1.992562814070352e-05]}
     {'mlm': 1.146484375, 'protein_go_ke': 0.6494140625, 'go_go_ke': 1.9150390625, 'global_step': 186, 'learning_rate': [9.962311557788946e-06, 9.962311557788946e-06, 1.992462311557789e-05]}
     {'mlm': 1.3505859375, 'protein_go_ke': 0.666015625, 'go_go_ke': 1.8994140625, 'global_step': 187, 'learning_rate': [9.961809045226131e-06, 9.961809045226131e-06, 1.9923618090452263e-05]}
     {'mlm': 1.359375, 'protein_go_ke': 2.775390625, 'go_go_ke': 1.8330078125, 'global_step': 188, 'learning_rate': [9.961306532663317e-06, 9.961306532663317e-06, 1.9922613065326634e-05]}
     {'mlm': 1.0927734375, 'protein_go_ke': 0.65087890625, 'go_go_ke': 1.8271484375, 'global_step': 189, 'learning_rate': [9.960804020100502e-06, 9.960804020100502e-06, 1.9921608040201005e-05]}
     [2022-10-13 09:32:21,562] [INFO] [logging.py:68:log_dist] [Rank 0] step=190, skipped=11, lr=[9.96030150753769e-06, 9.96030150753769e-06, 1.992060301507538e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
     [2022-10-13 09:32:21,992] [INFO] [timer.py:157:stop] 0/190, SamplesPerSec=4.138257351966589
     {'mlm': 1.7353515625, 'protein_go_ke': 0.6669921875, 'go_go_ke': 1.7998046875, 'global_step': 190, 'learning_rate': [9.96030150753769e-06, 9.96030150753769e-06, 1.992060301507538e-05]}
     {'mlm': 1.2763671875, 'protein_go_ke': 0.6787109375, 'go_go_ke': 1.8154296875, 'global_step': 191, 'learning_rate': [9.959798994974875e-06, 9.959798994974875e-06, 1.991959798994975e-05]}
     {'mlm': 0.80712890625, 'protein_go_ke': 0.6708984375, 'go_go_ke': 1.876953125, 'global_step': 192, 'learning_rate': [9.95929648241206e-06, 9.95929648241206e-06, 1.991859296482412e-05]}
     {'mlm': 0.59716796875, 'protein_go_ke': 0.6787109375, 'go_go_ke': 1.7919921875, 'global_step': 193, 'learning_rate': [9.958793969849248e-06, 9.958793969849248e-06, 1.9917587939698496e-05]}
     {'mlm': 0.7734375, 'protein_go_ke': 0.6611328125, 'go_go_ke': 1.90625, 'global_step': 194, 'learning_rate': [9.958291457286433e-06, 9.958291457286433e-06, 1.9916582914572867e-05]}
     {'mlm': 0.77587890625, 'protein_go_ke': 0.6865234375, 'go_go_ke': 1.76171875, 'global_step': 195, 'learning_rate': [9.957788944723619e-06, 9.957788944723619e-06, 1.9915577889447238e-05]}
     {'mlm': 0.89404296875, 'protein_go_ke': 0.6533203125, 'go_go_ke': 1.91015625, 'global_step': 196, 'learning_rate': [9.957286432160806e-06, 9.957286432160806e-06, 1.9914572864321612e-05]}
     {'mlm': 1.1416015625, 'protein_go_ke': 0.654296875, 'go_go_ke': 1.78125, 'global_step': 197, 'learning_rate': [9.956783919597992e-06, 9.956783919597992e-06, 1.9913567839195983e-05]}
     {'mlm': 1.0224609375, 'protein_go_ke': 0.66162109375, 'go_go_ke': 1.7841796875, 'global_step': 198, 'learning_rate': [9.956281407035177e-06, 9.956281407035177e-06, 1.9912562814070354e-05]}
     {'mlm': 0.56005859375, 'protein_go_ke': 0.65966796875, 'go_go_ke': 1.806640625, 'global_step': 199, 'learning_rate': [9.955778894472363e-06, 9.955778894472363e-06, 1.9911557788944725e-05]}
     [2022-10-13 09:33:02,238] [INFO] [logging.py:68:log_dist] [Rank 0] step=200, skipped=11, lr=[9.955276381909548e-06, 9.955276381909548e-06, 1.9910552763819096e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
     [2022-10-13 09:33:02,671] [INFO] [timer.py:157:stop] 0/200, SamplesPerSec=4.141814535981229
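
For reference, this is roughly how I would expect a target epoch count to map onto an equivalent "--max_steps" value. The numbers below are made-up placeholders, not the actual OntoProtein dataset or batch sizes:

```python
import math

# Placeholder values, only to illustrate the epoch-to-step conversion;
# they are NOT the real OntoProtein dataset or batch sizes.
num_pretraining_samples = 1_000_000
per_device_batch_size = 4
num_gpus = 4
gradient_accumulation_steps = 8
target_epochs = 1

# Number of optimizer steps that would cover the dataset once.
effective_batch_size = per_device_batch_size * num_gpus * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_pretraining_samples / effective_batch_size)

print(f"--max_steps {steps_per_epoch * target_epochs}")
```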

Best regards, Xinghao

Alexzhuan commented 1 year ago

Hi,

  1. Because the KG objective and MLM are pre-trained synchronously, and the corresponding datasets contain different numbers of samples, it is hard to keep them consistent when training is controlled by num_train_epochs.

  2. About the oscillation of the MLM loss, we think there are three reasons. First, the printed loss comes from only one GPU (the main process). Second, because the MLM task is difficult, it is hard to converge to an optimal solution. Third, our pre-training includes the KG module, which may slightly affect the language-modeling capacity. In addition, our protein corpus is different from the original protein corpus. So we think the seemingly oscillating values are reasonable.
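
If you want to see the trend rather than the per-step noise, you can smooth the printed values over a window of steps. Here is only a rough sketch on the mlm numbers from your excerpt, not something in the OntoProtein code:

```python
def moving_average(values, window=5):
    """Trailing moving average; the window size here is arbitrary."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# 'mlm' values for global_step 180-199, copied (rounded) from the log above;
# note they come from the main process only, not an average over all GPUs.
mlm = [1.3232, 0.7778, 0.7373, 0.4475, 1.3057, 2.1016, 1.1465, 1.3506, 1.3594, 1.0928,
       1.7354, 1.2764, 0.8071, 0.5972, 0.7734, 0.7759, 0.8940, 1.1416, 1.0225, 0.5601]

for step, avg in zip(range(180, 200), moving_average(mlm)):
    print(step, round(avg, 4))
```

Averaging the loss over all processes before logging would also reduce the noise, since the values you see are from the main process only.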

nihaowxh commented 1 year ago

OK, I get it. Thanks for your reply.