Closed lpc-eol closed 1 year ago
Hi Leo, Thanks for trying this out. Perhaps this issue could be related to your TF installation? What TF and CUDA versions are you using? I'm able to successfully train with the above command, which is why I'm thinking that it might be worth checking the details of your setup. Does your loss usually go up before you get NaN? Here's what the initial losses look like for me:
Epoch 1/1000
32/32 [==============================] - 31s 967ms/step - loss: 0.5310
Epoch 2/1000
32/32 [==============================] - 25s 776ms/step - loss: 0.2052
Epoch 3/1000
32/32 [==============================] - 25s 769ms/step - loss: 0.1407
Epoch 4/1000
32/32 [==============================] - 24s 757ms/step - loss: 0.1249
Epoch 5/1000
32/32 [==============================] - 24s 764ms/step - loss: 0.1138
Epoch 6/1000
32/32 [==============================] - 26s 797ms/step - loss: 0.1034
Epoch 7/1000
32/32 [==============================] - 24s 757ms/step - loss: 0.0967
Epoch 8/1000
32/32 [==============================] - 24s 765ms/step - loss: 0.0885
Epoch 9/1000
32/32 [==============================] - 24s 752ms/step - loss: 0.0814
Epoch 10/1000
32/32 [==============================] - 178s 6s/step - loss: 0.0776 - val_loss: 0.0773
I'm also not totally sure what's the best way to debug this. As a sanity check, you could verify your TF installation by training some simple model from one of the TF tutorials. You could also try to make the learning rate very low (like 1e-6 or 1e-7), just to see what happens.
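The sanity check suggested above could look something like the following: train a tiny Keras model on synthetic data and confirm the losses stay finite. This is a minimal sketch (the model, data, and hyperparameters are arbitrary, not taken from this repository); if even this produces NaNs, the problem is likely the TF/CUDA installation rather than this project's code.

```python
import numpy as np
import tensorflow as tf

# Synthetic binary-classification data.
x = np.random.rand(256, 8).astype("float32")
y = (x.sum(axis=1) > 4.0).astype("float32")

# A tiny model; nothing here should be numerically fragile.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy")
history = model.fit(x, y, epochs=3, batch_size=32, verbose=0)

# On a healthy install, every epoch loss should be a finite number.
print(all(np.isfinite(history.history["loss"])))
```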
Thank you so much for your reply! I am using TF 2.7.0, and here are the details:
cuda-version 10.2 h4767cc1_2 conda-forge
cudatoolkit 10.2.89 hdec6ad0_12 conda-forge
cudnn 7.6.5.32 h01f27c4_1 conda-forge
tensorflow 2.7.0 cuda102py38h32e99bf_0 conda-forge
tensorflow-addons 0.13.0 pypi_0 pypi
tensorflow-base 2.7.0 cuda102py38h021f141_0 conda-forge
tensorflow-estimator 2.7.0 cuda102py38h4357c17_0 conda-forge
tensorflow-gpu 2.7.0 cuda102py38hf05f184_0 conda-forge
tensorflow-probability 0.11.1 pypi_0 pypi
I tried setting the learning rate to 1e-6, i.e. `-lr 1e-6` and `-dwd 1e-6`. Here is the result:
2023-08-17 00:55:33.072818: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:380] Filling up shuffle buffer (this may take a while): 725 of 734
2023-08-17 00:55:38.301424: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:380] Filling up shuffle buffer (this may take a while): 728 of 734
2023-08-17 00:55:48.925498: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:405] Shuffle buffer filled.
2023-08-17 01:02:28.225712: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 7605
32/32 [==============================] - 1884s 723ms/step - loss: 0.6924
Epoch 2/1000
32/32 [==============================] - 23s 738ms/step - loss: 0.6928
Epoch 3/1000
32/32 [==============================] - 24s 751ms/step - loss: 0.6920
Epoch 4/1000
32/32 [==============================] - 24s 742ms/step - loss: 0.6917
Epoch 5/1000
32/32 [==============================] - 24s 740ms/step - loss: 0.6919
Epoch 6/1000
32/32 [==============================] - 24s 744ms/step - loss: 0.6919
Epoch 7/1000
32/32 [==============================] - 24s 740ms/step - loss: 0.6919
Epoch 8/1000
32/32 [==============================] - 24s 744ms/step - loss: 0.6912
Epoch 9/1000
32/32 [==============================] - 24s 749ms/step - loss: 0.6916
Epoch 10/1000
32/32 [==============================] - 369s 12s/step - loss: 0.6903 - val_loss: 0.6903
Epoch 11/1000
32/32 [==============================] - ETA: 0s - loss: 0.6907
2023-08-17 01:12:38.409781: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
32/32 [==============================] - 35s 1s/step - loss: 0.6907
Epoch 12/1000
1/32 [..............................] - ETA: 15s - loss: 0.6922
<frozen importlib._bootstrap>:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
32/32 [==============================] - 24s 762ms/step - loss: 0.6897
Epoch 13/1000
32/32 [==============================] - 24s 760ms/step - loss: 0.6901
Epoch 14/1000
32/32 [==============================] - 24s 764ms/step - loss: 0.6899
Epoch 15/1000
32/32 [==============================] - 24s 762ms/step - loss: 0.6893
Epoch 16/1000
32/32 [==============================] - 24s 761ms/step - loss: 0.6881
Epoch 17/1000
32/32 [==============================] - 24s 760ms/step - loss: 0.6887
Epoch 18/1000
32/32 [==============================] - 24s 763ms/step - loss: 0.6882
Epoch 19/1000
32/32 [==============================] - 24s 766ms/step - loss: 0.6878
Epoch 20/1000
32/32 [==============================] - 50s 2s/step - loss: 0.6881 - val_loss: 0.6876
Epoch 21/1000
32/32 [==============================] - ETA: 0s - loss: 0.6876
INFO:root: *** Got new best validation metric (average_map_tight) of 0.002679356077683897
INFO:root: *** Done saving models and evaluation files.
32/32 [==============================] - 551s 18s/step - loss: 0.6816
Epoch 32/1000
32/32 [==============================] - 25s 790ms/step - loss: 0.6818
Epoch 33/1000
32/32 [==============================] - 25s 779ms/step - loss: 0.6813
Epoch 34/1000
32/32 [==============================] - 25s 773ms/step - loss: 0.6801
Epoch 35/1000
32/32 [==============================] - 25s 784ms/step - loss: 0.6794
Epoch 36/1000
32/32 [==============================] - 25s 778ms/step - loss: 0.6790
Epoch 37/1000
32/32 [==============================] - 25s 785ms/step - loss: 0.6786
Epoch 38/1000
32/32 [==============================] - 24s 770ms/step - loss: 0.6763
Epoch 39/1000
32/32 [==============================] - 24s 768ms/step - loss: 0.6755
Epoch 40/1000
32/32 [==============================] - 51s 2s/step - loss: 0.6735 - val_loss: 0.6683
Epoch 41/1000
32/32 [==============================] - ETA: 0s - loss: nan
INFO:root:Validation metric (average_map_tight): 0.002562228553085689
32/32 [==============================] - 811s 26s/step - loss: nan
Epoch 42/1000
32/32 [==============================] - 25s 799ms/step - loss: nan
Epoch 43/1000
32/32 [==============================] - 25s 794ms/step - loss: nan
Epoch 44/1000
32/32 [==============================] - 25s 787ms/step - loss: nan
Epoch 45/1000
32/32 [==============================] - 25s 789ms/step - loss: nan
Epoch 46/1000
32/32 [==============================] - 25s 784ms/step - loss: nan
Epoch 47/1000
32/32 [==============================] - 25s 792ms/step - loss: nan
Epoch 48/1000
32/32 [==============================] - 25s 776ms/step - loss: nan
Epoch 49/1000
32/32 [==============================] - 25s 775ms/step - loss: nan
Epoch 50/1000
32/32 [==============================] - 52s 2s/step - loss: nan - val_loss: nan
Epoch 51/1000
32/32 [==============================] - ETA: 0s - loss: nan
INFO:root:Validation metric (average_map_tight): 0.0
32/32 [==============================] - 781s 25s/step - loss: nan
Epoch 52/1000
32/32 [==============================] - 26s 814ms/step - loss: nan
Epoch 53/1000
32/32 [==============================] - 25s 783ms/step - loss: nan
I appreciate you taking the time to provide guidance on the environment settings. I will be sure to pay closer attention to properly configuring the environment going forward. Would you recommend using TensorFlow 2.3.0 instead of 2.7.0 for this project?
Thank you for sharing the details! I just tested this example with TF 2.7.0 and it works for me; however, my setup uses CUDA 11.2. The table at https://www.tensorflow.org/install/source#gpu suggests TF 2.7.0 may need a newer CUDA version than the one you have. We've also tested our code on TF 2.3.0, so you could try that instead, since the table indicates it is compatible with CUDA 10.1 (and hopefully also with 10.2, which is what you have now). You might want to run some sanity checks on your installation before changing anything, though.
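A small self-contained sketch of that compatibility check. The version pairs below are copied from the table at https://www.tensorflow.org/install/source#gpu (which remains the authoritative source); the `compatible` helper and its name are hypothetical, just for illustration.

```python
# Tested (TF version -> CUDA/cuDNN) build configurations, per the
# tensorflow.org "tested build configurations" GPU table.
TESTED_BUILDS = {
    "2.3.0": {"cuda": "10.1", "cudnn": "7.6"},
    "2.7.0": {"cuda": "11.2", "cudnn": "8.1"},
}

def compatible(tf_version: str, cuda_version: str) -> bool:
    """Return True if cuda_version matches the tested build for tf_version."""
    build = TESTED_BUILDS.get(tf_version)
    return build is not None and build["cuda"] == cuda_version

# TF 2.7.0 was tested against CUDA 11.2, not the CUDA 10.2 in the logs above.
print(compatible("2.7.0", "10.2"))  # False
print(compatible("2.7.0", "11.2"))  # True
```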
Thank you so much for your guidance. I updated the CUDA to 11.2, and the problem is fixed. The loss value looks good!
cuda-version 11.2 hb11dac2_2 conda-forge
cudatoolkit 11.2.2 hc23eb0c_12 conda-forge
cudnn 8.8.0.121 h0800d71_1 conda-forge
2023-08-18 11:14:20.794201: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:380] Filling up shuffle buffer (this may take a while): 729 of 734
2023-08-18 11:14:32.289327: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:405] Shuffle buffer filled.
2023-08-18 11:14:38.187317: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8800
2023-08-18 11:14:39.116424: I tensorflow/stream_executor/cuda/cuda_blas.cc:1774] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
32/32 [==============================] - 1554s 748ms/step - loss: 0.5153
Epoch 2/1000
32/32 [==============================] - 24s 741ms/step - loss: 0.1879
Epoch 3/1000
32/32 [==============================] - 24s 748ms/step - loss: 0.1343
Epoch 4/1000
32/32 [==============================] - 24s 741ms/step - loss: 0.1221
Epoch 5/1000
32/32 [==============================] - 24s 743ms/step - loss: 0.1153
Epoch 6/1000
32/32 [==============================] - 24s 740ms/step - loss: 0.1042
Epoch 7/1000
32/32 [==============================] - 24s 747ms/step - loss: 0.0982
Epoch 8/1000
32/32 [==============================] - 24s 750ms/step - loss: 0.0920
Epoch 9/1000
32/32 [==============================] - 24s 742ms/step - loss: 0.0835
Epoch 10/1000
32/32 [==============================] - ETA: 0s - loss: 0.0792
32/32 [==============================] - 512s 16s/step - loss: 0.0792 - val_loss: 0.0761
Thank you for the update. Glad it worked out!
Could you please guide me on debugging what is causing the NaN loss? I only modified `MEMORY_LIMIT_IN_MB = 13 * 1024` to leave some memory space for running the validation set, plus `-lr` and `-dwd` for debugging purposes. Before running the commands below, I followed the Feature pre-processing instructions and ran the commands for generating the baidu2.0 features. I am trying to reproduce results from your experiments (the Challenge Validated protocol in Experiments using Combination×2 features in SoccerNet-action-spotting-challenge-2022.md) and am getting loss = NaN after 1-4 epochs of training. Running the command above, I got loss = NaN in the second epoch (first-epoch loss = 0.685), and after setting `-lr` to `1e-5`, loss = NaN occurs after 4-5 epochs (over multiple attempts). Sample output:
Model summary (skipped)
I am more familiar with PyTorch and am trying my best to understand the code. Thank you for your assistance!