No response in training process

dexter2406 commented 3 years ago

Hi, I found that the program doesn't respond when I start training. The console output is shown below, and no error is reported.

 np_resource = np.dtype([("resource", np.ubyte, 1)])
{'add_dispnet': True,
 'add_flownet': False,
 'add_posenet': True,
 'alpha_recon_image': 0.85,
 'batch_size': 4,
 'checkpoint_dir': 'models\\geonet_posenet\\results',
 'dataset_dir': 'data\\kitti\\formatted_data',
 'depth_test_split': 'eigen',
 'disp_smooth_weight': 0.5,
 'dispnet_encoder': 'resnet50',
...
 'output_dir': None,
 'pose_test_seq': 9,
 'rigid_warp_weight': 1.0,
 'save_ckpt_freq': 5000,
 'scale_normalize': False,
 'seq_length': 5}
2020-11-24 15:04:21.853792: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE
instructions, but these are available on your machine and could speed up CPU computations.
...
2020-11-24 15:04:21.933181: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA
instructions, but these are available on your machine and could speed up CPU computations.
Trainable variables:
depth_net/Conv/weights:0
depth_net/Conv/BatchNorm/beta:0
depth_net/Conv_1/weights:0
depth_net/Conv_1/BatchNorm/beta:0
depth_net/Conv_2/weights:0
...
pose_net/Conv_3/BatchNorm/beta:0
pose_net/Conv_4/weights:0
pose_net/Conv_4/BatchNorm/beta:0
pose_net/Conv_5/weights:0
pose_net/Conv_5/BatchNorm/beta:0
pose_net/Conv_6/weights:0
pose_net/Conv_6/BatchNorm/beta:0
pose_net/Conv_7/weights:0
pose_net/Conv_7/biases:0
parameter_count = 60047292

dexter2406 commented 3 years ago

I waited for about 20 minutes and noticed that the following files had been generated:

graph.pbtxt
events.out.tfevents.1606226671.DESKTOP-AVNMGK4

There is still no progress shown, though. Maybe that's because your code has no visualization for the training process? And what are these two files for?

Thanks for your time!

yzcjtr commented 3 years ago

Hi, can you confirm the library versions you are using? Judging from the output above, training hasn't started at all; otherwise, the loss value would be printed for each iteration.
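
If it helps, the versions can be collected in one go with a small script like the one below. This is only a sketch: `report_versions` is a hypothetical helper (not part of this repo), and it assumes each package exposes a `__version__` attribute, which is common but not guaranteed.

```python
import importlib
import sys


def report_versions(module_names):
    """Return a dict mapping each module name to its version string.

    Hypothetical helper for illustration. Modules that cannot be imported
    are reported as "not installed"; modules without a __version__
    attribute are reported as "unknown".
    """
    versions = {"python": sys.version.split()[0]}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Covers ImportError as well as failures raised while the
            # package initialises (e.g. binary incompatibilities).
            versions[name] = "not installed"
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Example usage (import names, not pip names: cv2 for opencv-python,
# PIL for pillow):
#
#   for name, version in report_versions(
#           ["tensorflow", "numpy", "scipy", "matplotlib", "cv2", "PIL"]).items():
#       print("{}=={}".format(name, version))
```

Pasting the printed lines into the issue gives everyone the same view of the environment.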

dexter2406 commented 3 years ago

Thanks for the reply. I'm using (mainly):

python=3.6.12
tensorflow==1.2.0
scipy==1.1.0
numpy==1.19.4
matplotlib==3.3.3
opencv-python==4.4.0
pillow==8.0.1

I know it's stated that this code has only been tested with python==2.7 and tf==1.1, but those are no longer supported, so I tried newer versions. I slightly modified the code according to the error reports, but then I ran into this hang and couldn't tell what went wrong.

yzcjtr commented 3 years ago

TF 1.2 should be alright, but I'm not sure whether Python 3 is okay for this repo. I would suggest adding some checkpoints in the code to locate where it gets stuck.
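
By "checkpoints" I mean simple progress markers rather than model checkpoints. A minimal sketch is below; `checkpoint` is a hypothetical helper (not something in this repo), and the labels in the usage comment are placeholders for whatever stages your modified training script goes through.

```python
import sys
import time

_START_TIME = time.time()


def checkpoint(label, stream=sys.stderr):
    """Write a timestamped progress marker and flush it immediately.

    Hypothetical helper for illustration. Flushing matters here: if the
    process hangs, buffered output may never reach the console, which
    would make the last visible marker misleading.
    """
    elapsed = time.time() - _START_TIME
    line = "[checkpoint +{:.1f}s] {}".format(elapsed, label)
    stream.write(line + "\n")
    stream.flush()
    return line


# Example placement (stage names are assumptions, adapt to your script):
#
#   checkpoint("graph built")
#   ... create the session ...
#   checkpoint("session created")
#   ... first training step ...
#   checkpoint("first step returned")
```

The last marker that appears before the hang tells you which stage to look into. Python 3's standard `faulthandler` module can also dump the stack of a stuck process, if the markers alone don't narrow it down.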

yzcjtr / GeoNet

No response in training process #69