nihaowxh closed this issue 1 year ago.
Hi,
Because we pretrain on the KG objectives and MLM synchronously, and the two corpora contain different numbers of samples, it is hard to keep them consistent when training is controlled by num_epochs.
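If epoch-agnostic control is wanted, a common workaround is to budget training in optimizer steps and simply recycle the smaller dataset. Below is a minimal sketch of that pattern; the dataloaders, model, and joint loss are toy stand-ins, not code from this repository:

```python
import itertools
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real corpora: the MLM corpus is much larger than
# the KG triple set, so their epoch boundaries never line up.
mlm_loader = DataLoader(TensorDataset(torch.randn(1000, 8)), batch_size=4)
kg_loader = DataLoader(TensorDataset(torch.randn(100, 8)), batch_size=4)

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

max_steps = 500                        # budget in optimizer steps, not epochs
kg_iter = itertools.cycle(kg_loader)   # the smaller dataset is recycled
mlm_iter = iter(mlm_loader)

for step in range(max_steps):
    try:
        (mlm_batch,) = next(mlm_iter)
    except StopIteration:              # restart the MLM corpus mid-run
        mlm_iter = iter(mlm_loader)
        (mlm_batch,) = next(mlm_iter)
    (kg_batch,) = next(kg_iter)

    # Placeholder joint objective: in the real model this would be the MLM
    # loss plus the protein-GO / GO-GO KE losses seen in the log below.
    loss = model(mlm_batch).pow(2).mean() + model(kg_batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

With this pattern, the MLM corpus and the KG triples stay synchronized by step count regardless of how many "epochs" each dataset has seen.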
Regarding the oscillation of the MLM loss, we think there are three reasons. First, the printed loss comes from only one GPU (the main process). Second, the MLM task is difficult, so it is hard to converge to an optimal solution. Third, our pretraining includes the KG module, which may slightly affect the model's language-modeling capacity. In addition, our protein corpus differs from the original protein corpus. For these reasons, we think the seemingly oscillating values are reasonable.
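On the first point: the logged value reflects a single rank's mini-batch, so per-batch noise shows up directly. If a smoother curve is wanted, the scalar loss can be averaged across ranks before printing. A minimal sketch with torch.distributed (the helper name is ours, not from this repository):

```python
import torch
import torch.distributed as dist

def averaged_loss_for_logging(loss: torch.Tensor) -> float:
    """Average a scalar loss over all ranks so the printed value is not
    just the main process's mini-batch loss."""
    if dist.is_available() and dist.is_initialized():
        loss = loss.detach().clone()
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        loss /= dist.get_world_size()
    return loss.item()
```

This adds one all-reduce of a single scalar per logging step, which is negligible next to the gradient synchronization DeepSpeed already performs.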
OK, I get it. Thanks for your reply.
Hello researchers, thanks for your work. I have some questions about the pre-training phase. The MLM loss seems to oscillate during pretraining; here is an excerpt from my training log:
```
{'mlm': 1.3232421875, 'protein_go_ke': 0.66796875, 'go_go_ke': 1.9619140625, 'global_step': 180, 'learning_rate': [9.965326633165831e-06, 9.965326633165831e-06, 1.9930653266331662e-05]}
{'mlm': 0.77783203125, 'protein_go_ke': 0.66650390625, 'go_go_ke': 1.857421875, 'global_step': 181, 'learning_rate': [9.964824120603016e-06, 9.964824120603016e-06, 1.9929648241206033e-05]}
{'mlm': 0.7373046875, 'protein_go_ke': 0.64111328125, 'go_go_ke': 1.984375, 'global_step': 182, 'learning_rate': [9.964321608040202e-06, 9.964321608040202e-06, 1.9928643216080404e-05]}
{'mlm': 0.447509765625, 'protein_go_ke': 2.140625, 'go_go_ke': 2.029296875, 'global_step': 183, 'learning_rate': [9.963819095477387e-06, 9.963819095477387e-06, 1.9927638190954775e-05]}
{'mlm': 1.3056640625, 'protein_go_ke': 0.64990234375, 'go_go_ke': 1.91015625, 'global_step': 184, 'learning_rate': [9.963316582914575e-06, 9.963316582914575e-06, 1.992663316582915e-05]}
{'mlm': 2.1015625, 'protein_go_ke': 0.6806640625, 'go_go_ke': 1.8505859375, 'global_step': 185, 'learning_rate': [9.96281407035176e-06, 9.96281407035176e-06, 1.992562814070352e-05]}
{'mlm': 1.146484375, 'protein_go_ke': 0.6494140625, 'go_go_ke': 1.9150390625, 'global_step': 186, 'learning_rate': [9.962311557788946e-06, 9.962311557788946e-06, 1.992462311557789e-05]}
{'mlm': 1.3505859375, 'protein_go_ke': 0.666015625, 'go_go_ke': 1.8994140625, 'global_step': 187, 'learning_rate': [9.961809045226131e-06, 9.961809045226131e-06, 1.9923618090452263e-05]}
{'mlm': 1.359375, 'protein_go_ke': 2.775390625, 'go_go_ke': 1.8330078125, 'global_step': 188, 'learning_rate': [9.961306532663317e-06, 9.961306532663317e-06, 1.9922613065326634e-05]}
{'mlm': 1.0927734375, 'protein_go_ke': 0.65087890625, 'go_go_ke': 1.8271484375, 'global_step': 189, 'learning_rate': [9.960804020100502e-06, 9.960804020100502e-06, 1.9921608040201005e-05]}
[2022-10-13 09:32:21,562] [INFO] [logging.py:68:log_dist] [Rank 0] step=190, skipped=11, lr=[9.96030150753769e-06, 9.96030150753769e-06, 1.992060301507538e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
[2022-10-13 09:32:21,992] [INFO] [timer.py:157:stop] 0/190, SamplesPerSec=4.138257351966589
{'mlm': 1.7353515625, 'protein_go_ke': 0.6669921875, 'go_go_ke': 1.7998046875, 'global_step': 190, 'learning_rate': [9.96030150753769e-06, 9.96030150753769e-06, 1.992060301507538e-05]}
{'mlm': 1.2763671875, 'protein_go_ke': 0.6787109375, 'go_go_ke': 1.8154296875, 'global_step': 191, 'learning_rate': [9.959798994974875e-06, 9.959798994974875e-06, 1.991959798994975e-05]}
{'mlm': 0.80712890625, 'protein_go_ke': 0.6708984375, 'go_go_ke': 1.876953125, 'global_step': 192, 'learning_rate': [9.95929648241206e-06, 9.95929648241206e-06, 1.991859296482412e-05]}
{'mlm': 0.59716796875, 'protein_go_ke': 0.6787109375, 'go_go_ke': 1.7919921875, 'global_step': 193, 'learning_rate': [9.958793969849248e-06, 9.958793969849248e-06, 1.9917587939698496e-05]}
{'mlm': 0.7734375, 'protein_go_ke': 0.6611328125, 'go_go_ke': 1.90625, 'global_step': 194, 'learning_rate': [9.958291457286433e-06, 9.958291457286433e-06, 1.9916582914572867e-05]}
{'mlm': 0.77587890625, 'protein_go_ke': 0.6865234375, 'go_go_ke': 1.76171875, 'global_step': 195, 'learning_rate': [9.957788944723619e-06, 9.957788944723619e-06, 1.9915577889447238e-05]}
{'mlm': 0.89404296875, 'protein_go_ke': 0.6533203125, 'go_go_ke': 1.91015625, 'global_step': 196, 'learning_rate': [9.957286432160806e-06, 9.957286432160806e-06, 1.9914572864321612e-05]}
{'mlm': 1.1416015625, 'protein_go_ke': 0.654296875, 'go_go_ke': 1.78125, 'global_step': 197, 'learning_rate': [9.956783919597992e-06, 9.956783919597992e-06, 1.9913567839195983e-05]}
{'mlm': 1.0224609375, 'protein_go_ke': 0.66162109375, 'go_go_ke': 1.7841796875, 'global_step': 198, 'learning_rate': [9.956281407035177e-06, 9.956281407035177e-06, 1.9912562814070354e-05]}
{'mlm': 0.56005859375, 'protein_go_ke': 0.65966796875, 'go_go_ke': 1.806640625, 'global_step': 199, 'learning_rate': [9.955778894472363e-06, 9.955778894472363e-06, 1.9911557788944725e-05]}
[2022-10-13 09:33:02,238] [INFO] [logging.py:68:log_dist] [Rank 0] step=200, skipped=11, lr=[9.955276381909548e-06, 9.955276381909548e-06, 1.9910552763819096e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
[2022-10-13 09:33:02,671] [INFO] [timer.py:157:stop] 0/200, SamplesPerSec=4.141814535981229
```
Best regards, Xinghao