It turns out to be a GPU problem with the tensorflow-gpu library installed in my environment: no plain GPU device was registered, while the failing op only has a GPU kernel:
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels: device='GPU'
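For anyone hitting the same mismatch: a quick way to check whether TensorFlow can actually see a GPU (a minimal TF 1.x sketch, not from this repo):

```python
# List the devices TensorFlow registered and test GPU availability (TF 1.x).
import tensorflow as tf
from tensorflow.python.client import device_lib

print(device_lib.list_local_devices())  # should include a /device:GPU:0 entry
print(tf.test.is_gpu_available())       # True only if a usable CUDA GPU is found
```

If the list only shows CPU/XLA devices, the installed build cannot run GPU-pinned ops.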
After checking the TensorFlow documentation, I reinstalled with pip install tensorflow-gpu==1.12.0, and the test script seems to work. Now I will try running the training script.
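For reference, the steps were roughly as follows (removing the conda build first is an assumption, based on the environment having been set up with conda install tensorflow-gpu):

```sh
# Drop the conda build that lacked GPU kernels, then pin the pip wheel.
conda remove tensorflow-gpu
pip install tensorflow-gpu==1.12.0
```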
Confirmed that it is working. However, please be aware that this code uses Python 3, so there could be some broken code. (The original work was tested with Python 2, although it is claimed that Python 3 should also work.)
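The usual Python 2 to 3 pitfalls to grep for are integer division and changed dict semantics (illustrative examples only, not taken from this codebase):

```python
# Classic Python 2 -> 3 behavior changes that silently alter results:
print(7 / 2)           # 3 in Python 2, 3.5 in Python 3; use 7 // 2 for floor division
d = {'a': 1, 'b': 2}
keys = list(d.keys())  # keys() returns a view in Python 3, not a list
```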
By running: sh scripts/command_train_v2.sh
pid: 4543
2019-10-23 06:55:27.236815: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-10-23 06:55:27.336185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-23 06:55:27.336665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:02:00.0
totalMemory: 7.92GiB freeMemory: 7.83GiB
2019-10-23 06:55:27.336694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-10-23 06:55:27.543854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-23 06:55:27.543898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-10-23 06:55:27.543907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-10-23 06:55:27.544109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7553 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:02:00.0, compute capability: 6.1)
**** EPOCH 000 ****
2019-10-23 06:55:30.274318
2019-10-23 06:55:45.848762: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.22GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:45.849772: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.03GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:45.890843: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.74GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:45.892336: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.73GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:45.895451: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.23GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:45.896786: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:46.068403: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.22GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-10-23 06:55:46.074569: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
-- 010 / 6122 --
mean loss: 73.648424
segmentation accuracy: 0.673047
box IoU (ground/3D): 0.058166 / 0.019219
box estimation accuracy (IoU=0.7): 0.000000
-- 020 / 6122 --
mean loss: 51.751960
segmentation accuracy: 0.718156
box IoU (ground/3D): 0.116223 / 0.066769
box estimation accuracy (IoU=0.7): 0.000000
-- 030 / 6122 --
mean loss: 46.989214
segmentation accuracy: 0.753507
box IoU (ground/3D): 0.176942 / 0.098821
box estimation accuracy (IoU=0.7): 0.000000
-- 040 / 6122 --
mean loss: 33.884403
segmentation accuracy: 0.780387
box IoU (ground/3D): 0.251711 / 0.157620
box estimation accuracy (IoU=0.7): 0.000000
-- 050 / 6122 --
mean loss: 32.261572
segmentation accuracy: 0.773682
box IoU (ground/3D): 0.246830 / 0.183422
box estimation accuracy (IoU=0.7): 0.000000
-- 060 / 6122 --
mean loss: 33.621707
segmentation accuracy: 0.766968
box IoU (ground/3D): 0.222743 / 0.158293
box estimation accuracy (IoU=0.7): 0.000000
-- 070 / 6122 --
mean loss: 36.880329
segmentation accuracy: 0.795565
box IoU (ground/3D): 0.234813 / 0.182264
box estimation accuracy (IoU=0.7): 0.000000
-- 080 / 6122 --
mean loss: 29.811518
segmentation accuracy: 0.780485
box IoU (ground/3D): 0.272311 / 0.215362
box estimation accuracy (IoU=0.7): 0.008333
-- 090 / 6122 --
mean loss: 20.526078
segmentation accuracy: 0.790299
box IoU (ground/3D): 0.329929 / 0.269595
box estimation accuracy (IoU=0.7): 0.016667
-- 100 / 6122 --
mean loss: 25.132062
segmentation accuracy: 0.821436
box IoU (ground/3D): 0.311218 / 0.254065
box estimation accuracy (IoU=0.7): 0.000000
...
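Side note on the bfc_allocator warnings above: as the log itself says, they are warnings, not failures. If you want TensorFlow to allocate GPU memory on demand instead of reserving nearly all of it upfront, enable allow_growth when the session is created (a minimal TF 1.x sketch; exactly where the session is built in this repo's train script is an assumption):

```python
# Allocate GPU memory on demand instead of pre-allocating nearly all of it.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```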
Did you finish training? Have any qualitative results?
Not finished, but I think it is working fine. Here is the most recent epoch I have so far:
-- 6120 / 6122 --
mean loss: 2.943508
segmentation accuracy: 0.952629
box IoU (ground/3D): 0.742650 / 0.681488
box estimation accuracy (IoU=0.7): 0.616667
2019-10-23 13:07:00.879426
---- EPOCH 015 EVALUATION ----
eval mean loss: 4.013138
eval segmentation accuracy: 0.916610
eval segmentation avg class acc: 0.918955
eval box IoU (ground/3D): 0.729562 / 0.673533
eval box estimation accuracy (IoU=0.7): 0.573711
**** EPOCH 016 ****
2019-10-23 13:09:10.807745
-- 010 / 6122 --
mean loss: 1.875684
segmentation accuracy: 0.952360
box IoU (ground/3D): 0.768769 / 0.702892
box estimation accuracy (IoU=0.7): 0.591667
-- 020 / 6122 --
mean loss: 2.155424
segmentation accuracy: 0.955501
box IoU (ground/3D): 0.746031 / 0.684788
box estimation accuracy (IoU=0.7): 0.541667
Branch: develop-wit
Environment: conda install tensorflow-gpu; CUDA release 9.0, V9.0.176
Problem: When trying to evaluate a pre-trained model by running the evaluation script (set --model_path to your pre-trained model), it gives a tensorflow.python.framework.errors_impl.InvalidArgumentError (the registered-devices mismatch quoted at the top of this thread).
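For anyone landing here with the same error: an InvalidArgumentError like this usually means an op was pinned to the GPU but the installed build has no GPU kernel for it. Device placement logging makes this visible (a minimal TF 1.x sketch, not from the repo):

```python
# Print where each op is placed; with a build that lacks GPU kernels,
# GPU-pinned ops fail with InvalidArgumentError instead of being placed.
import tensorflow as tf

config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    x = tf.constant([1.0, 2.0, 3.0])
    print(sess.run(tf.reduce_sum(x)))
```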