Hi!
I am trying to run on the YCB-Video dataset after successfully executing demo.sh, but when I try to execute the lov_color_2d_train.sh script I get this error:
2019-05-13 15:22:42.290693: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at assign_op.h:112 : Resource exhausted: OOM when allocating tensor with shape[4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
It appears just after the models are loaded.
The script then keeps "running" and finally ends with this:
CancelledError (see above for traceback): Enqueue operation was cancelled
[[Node: fifo_queue_enqueue = QueueEnqueueV2[Tcomponents=[DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](fifo_queue, _arg_Placeholder_0_0, _arg_Placeholder_1_0_1, _arg_Placeholder_2_0_2, _arg_Placeholder_3_0_3, _arg_Placeholder_4_0_4, _arg_Placeholder_5_0_5, _arg_Placeholder_6_0_6, _arg_Placeholder_7_0_7, _arg_Placeholder_8_0_8, _arg_Placeholder_9_0_9)]]
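For scale, here is a quick back-of-the-envelope check of the allocation that fails in the OOM message above. It is only a sketch: it assumes "float" means TensorFlow's 32-bit DT_FLOAT and treats an 8 GB card as 8 GiB for a rough comparison.

```python
# Size of the tensor named in the OOM message: shape [4096], type float.
# Assumption: "float" is TensorFlow's DT_FLOAT, i.e. 32-bit (4 bytes).
num_elements = 4096
bytes_per_element = 4
tensor_bytes = num_elements * bytes_per_element

# Assumption: an 8 GB card is treated as 8 GiB for a rough comparison.
gpu_bytes = 8 * 1024**3

print(f"tensor size: {tensor_bytes} bytes ({tensor_bytes // 1024} KiB)")
print(f"fraction of GPU memory: {tensor_bytes / gpu_bytes:.2e}")
```

The failing allocation is only 16 KiB, which suggests the GPU memory was already claimed before this point, rather than this particular tensor being too large.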
I searched on GitHub, and a lot of people talk about configuring how TensorFlow reserves memory, like this:
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
The goal is to reserve and limit the memory used by TensorFlow.
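The suggestions in those threads generally follow the pattern below. This is a minimal sketch assuming the TensorFlow 1.x session API, not the exact lines from any particular thread. The 0.8 memory fraction is an arbitrary illustrative value, and the `tf1` alias is a hypothetical helper name so the same code also loads on newer releases.

```python
import tensorflow as tf

# tf.compat.v1 only exists in newer releases; fall back to the top-level
# module on old 1.x versions such as r1.8. ("tf1" is an illustrative alias.)
tf1 = getattr(tf.compat, "v1", tf)

# Keep soft placement, as in the existing Session call.
config = tf1.ConfigProto(allow_soft_placement=True)

# Allocate GPU memory on demand instead of reserving nearly all of it up front.
config.gpu_options.allow_growth = True

# Cap the share of GPU memory this process may use (0.8 is an assumed value).
config.gpu_options.per_process_gpu_memory_fraction = 0.8

print(config)
```

The resulting object would then be passed as `tf.Session(config=config)` in place of a `ConfigProto` built inline.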
So here is what I tried:
I added these lines to the train_net.py file (in main), and I also modified the train.py file, in the function train_net at line 538, like this:
And now I get this error:
2019-05-13 14:57:04.300897: E tensorflow/stream_executor/cuda/cuda_dnn.cc:455] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-05-13 14:57:04.300929: F tensorflow/core/kernels/conv_ops.cc:713] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)
./experiments/scripts/lov_color_2d_train.sh: line 23: 19835 Aborted (core dumped) ./tools/train_net.py --gpu 0 --network vgg16_convs --weights data/imagenet_models/vgg16.npy --imdb lov_train --cfg experiments/cfgs/lov_color_2d.yml --cad data/LOV/models.txt --pose data/LOV/poses.txt --iters 160000
This appears just after 'conv2_1 biases assigned'. The problem seems to be at the call of the function:
I also tried:
And I added this directory to my LD_LIBRARY_PATH:
My configuration: Ubuntu 18.04, CUDA 10.0, cuDNN 7.5.1, TensorFlow r1.8 (installed from source with Bazel 0.10.0)
Graphics card: Nvidia RTX 2070 8GB