
INCREASE: Inductive Graph Representation Learning for Spatio-Temporal Kriging

train loss equal to 0 #2

Open suzy0223 opened 1 year ago

suzy0223 commented 1 year ago

Hi, I tried to run train.py on METR-LA. Because of the TensorFlow version, I used tf_upgrade_v2 to migrate model.py to TF 2.x. Specifically: (1) line 76, 'tf.nn.rnn_cell.GRUCell' to 'tf.compat.v1.nn.rnn_cell.GRUCell'; (2) lines 80 and 87, 'tf.layers.dense' to 'tf.compat.v1.layers.dense'.
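For anyone doing the same migration, here is a minimal sketch of those two compat-mode replacements; the unit counts and placeholder shape are made up for illustration:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # compat-mode graph APIs need this in TF 2.x

# model.py line 76: tf.nn.rnn_cell.GRUCell -> tf.compat.v1.nn.rnn_cell.GRUCell
cell = tf.compat.v1.nn.rnn_cell.GRUCell(num_units=64)  # num_units is illustrative

# model.py lines 80 and 87: tf.layers.dense -> tf.compat.v1.layers.dense
x = tf.compat.v1.placeholder(tf.float32, shape=[None, 32])  # shape is illustrative
h = tf.compat.v1.layers.dense(x, units=64, activation=tf.nn.relu)
y = tf.compat.v1.layers.dense(h, units=1)
```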

Then, when I ran train.py and test.py, I hit several issues: (1) At line 47 in test.py and line 58 in train.py, "x.value" raises an error because x is already an int; after changing "x.value for x in xxx" to "x for x in xxx", it works (see the sketch below). (2) After 4-5 epochs, the training and validation losses become 0 and the test result becomes NaN. I ran the code several times and the issue persists. Meanwhile, test.py itself runs normally and outputs results. Since the repo doesn't include metr-la.h5, I used the file downloaded from IGNNK.
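A self-contained reproduction of fix (1), with an arbitrary dummy tensor shape:

```python
import tensorflow as tf

t = tf.zeros([8, 12, 207])  # dummy tensor; the shape is illustrative
# TF 1.x idiom, fails under TF 2.x ("'int' object has no attribute 'value'"):
#   dims = [x.value for x in t.get_shape()]
dims = [x for x in t.get_shape()]  # TF 2.x: shape entries are already ints
# equivalently: dims = t.shape.as_list()
print(dims)  # [8, 12, 207]
```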

I am not sure what is causing these issues. I hope to hear some suggestions. Much appreciated.

suzy0223 commented 1 year ago

Also, add 'tf.compat.v1.disable_eager_execution()' at the beginning of def placeholder(h), since tf.compat.v1.placeholder only works with eager execution disabled.
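A sketch of where that call goes, assuming a placeholder(h) helper like the one in the repo (the placeholder names and shapes here are hypothetical):

```python
import tensorflow as tf

def placeholder(h):
    # tf.compat.v1.placeholder raises a RuntimeError while eager execution
    # is enabled, so it must be disabled before any placeholder is created.
    tf.compat.v1.disable_eager_execution()
    # The names and shapes below are hypothetical stand-ins; the real
    # placeholder(h) defines the model's actual inputs.
    X = tf.compat.v1.placeholder(tf.float32, shape=[None, h, None], name='X')
    label = tf.compat.v1.placeholder(tf.float32, shape=[None, h, None], name='label')
    return X, label
```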

wujiangzhu commented 1 year ago

I met the same problem: the loss becomes 0 after several epochs. Could you help me? Appreciated!

Aminsheykh98 commented 9 months ago

(Quoting suzy0223's original report above.)

I translated their code into PyTorch and encountered the same issue you mentioned. I think the problem is that they didn't normalize the inputs (presumably so that masking NaN values in the loss function would stay simple). However, the unnormalized inputs cause the gradients to explode after 4 or 5 epochs.
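In case it helps, a minimal sketch of that normalization fix, assuming the raw readings live in a NumPy array with NaN marking missing values (the array and its shape here are stand-ins, not the repo's actual loader):

```python
import numpy as np

# Hypothetical stand-in for the raw METR-LA readings, shaped (timesteps, sensors)
data = np.random.rand(34272, 207).astype(np.float32)
data[data < 0.05] = np.nan  # fake missing entries for the demo

# z-score using statistics computed over the observed entries only
mean = np.nanmean(data)
std = np.nanstd(data)
normalized = (data - mean) / std  # NaNs stay NaN, so the loss mask still applies

# at evaluation time, undo the transform before computing metrics:
# predictions = model_output * std + mean
```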

zhusuwen commented 5 months ago

When I lowered the learning rate to 0.0001, training worked fine, but the results were not as good as those in the paper.
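For anyone else trying this, a minimal sketch of the change, assuming a TF 1.x-style Adam optimizer (I have not checked the exact optimizer line in train.py):

```python
import tensorflow as tf

# Drop the learning rate to 1e-4 when building the training op.
# AdamOptimizer mirrors common TF 1.x training scripts; the actual
# optimizer used in train.py may differ.
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-4)
# train_op = optimizer.minimize(loss)  # `loss` comes from the model graph
```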

Reset-quick commented 4 months ago

(Quoting zhusuwen's comment above.)

Hello, may I ask whether you have solved this problem? Were you able to reproduce the results reported in the paper?