witignite / Frustum-PointNet

tensorflow.python.framework.errors_impl.InvalidArgumentError #3

Closed: witignite closed this issue 4 years ago

witignite commented 4 years ago

Branch: develop-wit

Environment

Problem

When trying to evaluate a pre-trained model by running the command below (set --model_path to your pre-trained model):

python train/test.py --gpu 0 --num_point 1024 --model frustum_pointnets_v2 --model_path models_pretrained/author/log_v2/model.ckpt --output train/detection_results_v2 --data_path kitti/frustum_carpedcyc_val_rgb_detection.pickle --from_rgb_detection --idx_path kitti/image_sets/val.txt

it fails with tensorflow.python.framework.errors_impl.InvalidArgumentError:

25392

2019-10-23 01:58:40.203553: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-10-23 01:58:40.208387: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3407925000 Hz
2019-10-23 01:58:40.208686: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55578ca1daf0 executing computations on platform Host. Devices:
2019-10-23 01:58:40.208721: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-10-23 01:58:40.209506: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-10-23 01:58:40.226525: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-23 01:58:40.227059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:02:00.0
2019-10-23 01:58:40.227118: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.9.0
2019-10-23 01:58:40.227951: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.9.0
2019-10-23 01:58:40.228834: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.9.0
2019-10-23 01:58:40.229044: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.9.0
2019-10-23 01:58:40.229438: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.9.0'; dlerror: ~/.conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/../../../../libcusolver.so.9.0: undefined symbol: GOMP_critical_end; LD_LIBRARY_PATH: /usr/local/cuda/lib64:
2019-10-23 01:58:40.230460: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.9.0
2019-10-23 01:58:40.233024: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-23 01:58:40.233043: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-10-23 01:58:40.309071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-23 01:58:40.309107: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-10-23 01:58:40.309137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-10-23 01:58:40.310519: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-23 01:58:40.311096: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55578d43fd10 executing computations on platform CUDA. Devices:
2019-10-23 01:58:40.311122: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1070, Compute Capability 6.1

Traceback (most recent call last):
  File "~/.conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "~/.conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1339, in _run_fn
    self._extend_graph()
  File "~/.conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1374, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'FarthestPointSample' used by {{node layer1/FarthestPointSample}}with these attrs: [npoint=128]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
  device='GPU'

     [[layer1/FarthestPointSample]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/.conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "~/.conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "~/.conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "~/.conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "~/.conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'FarthestPointSample' used by node layer1/FarthestPointSample (defined at <string>:155) with these attrs: [npoint=128]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
  device='GPU'

     [[layer1/FarthestPointSample]]

Errors may have originated from an input operation.
Input Source operations connected to node layer1/FarthestPointSample:
 Slice (defined at ~/Frustum-PointNet/models/frustum_pointnets_v2.py:37)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train/test.py", line 355, in <module>
    test_from_rgb_detection(FLAGS.output+'.pickle', FLAGS.output)
  File "train/test.py", line 217, in test_from_rgb_detection
    sess, ops = get_session_and_ops(batch_size=batch_size, num_point=NUM_POINT)
  File "train/test.py", line 77, in get_session_and_ops
    saver.restore(sess, MODEL_PATH)
  File "~/.conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1322, in restore
    err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'FarthestPointSample' used by node layer1/FarthestPointSample (defined at <string>:155) with these attrs: [npoint=128]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
  device='GPU'

     [[layer1/FarthestPointSample]]

Errors may have originated from an input operation.
Input Source operations connected to node layer1/FarthestPointSample:
 Slice (defined at ~/Frustum-PointNet/models/frustum_pointnets_v2.py:37)

witignite commented 4 years ago

It turns out to be a GPU problem with the tensorflow-gpu library installed in my environment: the custom op only has a GPU kernel, but no GPU device was registered:

Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels: device='GPU'
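
For reference, a quick way to check which devices a given TensorFlow build actually registered (a minimal sketch assuming TF 1.x; device_lib is TensorFlow's device-listing helper):

from tensorflow.python.client import device_lib

# List the devices this TensorFlow build registered at startup.
# If 'GPU' is missing from the output, GPU-only custom ops such as
# FarthestPointSample cannot be placed and the checkpoint restore fails.
devices = device_lib.list_local_devices()
print([d.device_type for d in devices])  # the broken setup shows only CPU/XLA devices here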

After checking the TensorFlow documentation, I reinstalled with pip install tensorflow-gpu==1.12.0 and the test script seems to work. Now I will try running the training script.

witignite commented 4 years ago

Confirmed that it is working. However, please be aware that this code uses Python 3, so there could be some broken code. (The original work was tested with Python 2, although it is claimed that Python 3 should also work.)
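
For example, one common source of silent breakage when running code written for Python 2 under Python 3 is integer division (illustrative only, not a known bug in this repository):

# Python 2: `/` on two ints is floor division; Python 3: it is true division.
batches = 100 / 8    # 12 in Python 2, 12.5 in Python 3
batches = 100 // 8   # 12 in both; use `//` wherever the old behaviour is intended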

By running: sh scripts/command_train_v2.sh

pid: 4543
2019-10-23 06:55:27.236815: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-10-23 06:55:27.336185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-23 06:55:27.336665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:02:00.0
totalMemory: 7.92GiB freeMemory: 7.83GiB
2019-10-23 06:55:27.336694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-10-23 06:55:27.543854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-23 06:55:27.543898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-10-23 06:55:27.543907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-10-23 06:55:27.544109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7553 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:02:00.0, compute capability: 6.1)
**** EPOCH 000 ****
2019-10-23 06:55:30.274318
2019-10-23 06:55:45.848762: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.22GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:45.849772: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.03GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:45.890843: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.74GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:45.892336: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.73GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:45.895451: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.23GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:45.896786: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:46.068403: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.22GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:46.074569: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
 -- 010 / 6122 --
mean loss: 73.648424
segmentation accuracy: 0.673047
box IoU (ground/3D): 0.058166 / 0.019219
box estimation accuracy (IoU=0.7): 0.000000
 -- 020 / 6122 --
mean loss: 51.751960
segmentation accuracy: 0.718156
box IoU (ground/3D): 0.116223 / 0.066769
box estimation accuracy (IoU=0.7): 0.000000
 -- 030 / 6122 --
mean loss: 46.989214
segmentation accuracy: 0.753507
box IoU (ground/3D): 0.176942 / 0.098821
box estimation accuracy (IoU=0.7): 0.000000
 -- 040 / 6122 --
mean loss: 33.884403
segmentation accuracy: 0.780387
box IoU (ground/3D): 0.251711 / 0.157620
box estimation accuracy (IoU=0.7): 0.000000
 -- 050 / 6122 --
mean loss: 32.261572
segmentation accuracy: 0.773682
box IoU (ground/3D): 0.246830 / 0.183422
box estimation accuracy (IoU=0.7): 0.000000
 -- 060 / 6122 --
mean loss: 33.621707
segmentation accuracy: 0.766968
box IoU (ground/3D): 0.222743 / 0.158293
box estimation accuracy (IoU=0.7): 0.000000
 -- 070 / 6122 --
mean loss: 36.880329
segmentation accuracy: 0.795565
box IoU (ground/3D): 0.234813 / 0.182264
box estimation accuracy (IoU=0.7): 0.000000
 -- 080 / 6122 --
mean loss: 29.811518
segmentation accuracy: 0.780485
box IoU (ground/3D): 0.272311 / 0.215362
box estimation accuracy (IoU=0.7): 0.008333
 -- 090 / 6122 --
mean loss: 20.526078
segmentation accuracy: 0.790299
box IoU (ground/3D): 0.329929 / 0.269595
box estimation accuracy (IoU=0.7): 0.016667
 -- 100 / 6122 --
mean loss: 25.132062
segmentation accuracy: 0.821436
box IoU (ground/3D): 0.311218 / 0.254065
box estimation accuracy (IoU=0.7): 0.000000
...
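
The bfc_allocator warnings above are not failures; the allocator simply could not get as much memory as it asked for. If they ever become a problem, the usual knobs are the batch size and the session's GPU options. A minimal sketch of the latter, assuming a plain TF 1.x tf.Session (not necessarily how train.py builds its session):

import tensorflow as tf

# Let the GPU allocator grow on demand instead of reserving a large block up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory TensorFlow may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8
sess = tf.Session(config=config)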
giangnguyen2412 commented 4 years ago

Did you finish training? Have any qualitative results?

witignite commented 4 years ago

Not finished yet, but I think it is working fine. Here is the most recent epoch I have so far:

 -- 6120 / 6122 --
mean loss: 2.943508
segmentation accuracy: 0.952629
box IoU (ground/3D): 0.742650 / 0.681488
box estimation accuracy (IoU=0.7): 0.616667
2019-10-23 13:07:00.879426
---- EPOCH 015 EVALUATION ----
eval mean loss: 4.013138
eval segmentation accuracy: 0.916610
eval segmentation avg class acc: 0.918955
eval box IoU (ground/3D): 0.729562 / 0.673533
eval box estimation accuracy (IoU=0.7): 0.573711
**** EPOCH 016 ****
2019-10-23 13:09:10.807745
 -- 010 / 6122 --
mean loss: 1.875684
segmentation accuracy: 0.952360
box IoU (ground/3D): 0.768769 / 0.702892
box estimation accuracy (IoU=0.7): 0.591667
 -- 020 / 6122 --
mean loss: 2.155424
segmentation accuracy: 0.955501
box IoU (ground/3D): 0.746031 / 0.684788
box estimation accuracy (IoU=0.7): 0.541667