williamljb / HumanMultiView


Code getting stuck at 635 in trainer.py #9

Closed by NiranthS 3 years ago

NiranthS commented 3 years ago

I tried to run the training script. The terminal shows "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8694 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, compute capability: 7.5)" and then gets stuck at line 644 in trainer.py (found using pdb).

williamljb commented 3 years ago

Hi NiranthS,

I am not sure. Can you print debugging messages around line 644 to accurately locate this? Line 644 seems normal to me and shouldn't cause problems.

NiranthS commented 3 years ago

Sorry, it was line 635, the one with sess.run. Also, the problem was on my side: the datasets were not converted into tfrecords properly. It is running now.
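A sess.run call that never returns is consistent with an input pipeline that has nothing to read, so it can help to check the converted files before training. The sketch below is not part of this repository: `count_tfrecord_records` is a hypothetical helper, and it assumes uncompressed TFRecord files, where each record is stored as an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload. It only walks that framing and does not verify the CRCs.

```python
import os
import struct


def count_tfrecord_records(path):
    """Count records in an uncompressed TFRecord file (CRCs are not verified)."""
    size = os.path.getsize(path)
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 CRC of the length
            if not header:
                return count  # clean end of file
            if len(header) < 12:
                raise ValueError("%s: truncated header after %d records" % (path, count))
            (length,) = struct.unpack("<Q", header[:8])
            # Skip the payload and its trailing uint32 CRC without loading them.
            end = f.tell() + length + 4
            if end > size:
                raise ValueError("%s: truncated record after %d records" % (path, count))
            f.seek(end)
            count += 1
```

A file that reports zero records, or raises ValueError, points to a conversion problem rather than a problem in trainer.py.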

But the loss keeps increasing and eventually goes to NaN. Any idea what might be causing this?

williamljb commented 3 years ago

This might be because the learning rate is too large. Try reducing the learning rate to 1/10 of the original.
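To illustrate why the step size matters, here is a toy sketch that has nothing to do with the model in this repository: plain gradient descent on loss(x) = x*x. The function name and the two learning rates are made up for the illustration. With the larger rate every update overshoots the minimum and the loss grows; with one tenth of it the loss shrinks.

```python
import math


def run_gradient_descent(learning_rate, steps=50, x0=1.0):
    """Minimise loss(x) = x*x with plain gradient descent; return the loss history."""
    x = x0
    losses = []
    for _ in range(steps):
        loss = x * x
        losses.append(loss)
        if not math.isfinite(loss):
            break  # stop as soon as the loss is inf or NaN
        x -= learning_rate * 2.0 * x  # the gradient of x*x is 2x
    return losses


original = run_gradient_descent(learning_rate=1.1)   # each step multiplies x by -1.2
reduced = run_gradient_descent(learning_rate=0.11)   # each step multiplies x by 0.78
print("original lr: first %.3g, last %.3g" % (original[0], original[-1]))
print("reduced lr:  first %.3g, last %.3g" % (reduced[0], reduced[-1]))
```

In this toy the divergence comes only from the step size. A real training run can also reach NaN for other reasons, such as invalid values in the input data, which a smaller learning rate would not fix.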

NiranthS commented 3 years ago

Tried it, but changing the learning rate made no difference. I will try changing other parameters. Thank you.