yuxng / PoseCNN

A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
https://rse-lab.cs.washington.edu/projects/posecnn/
MIT License

cudnn handle: CUDNN_STATUS_INTERNAL_ERROR when training the YCB-Video dataset #96

Closed Nathan-81 closed 5 years ago

Nathan-81 commented 5 years ago

Hi! I am trying to train on the YCB-Video dataset after successfully running demo.sh, but when I execute the lov_color_2d_train.sh script I get this error:

2019-05-13 15:22:42.290693: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at assign_op.h:112 : Resource exhausted: OOM when allocating tensor with shape[4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

It appears right after the models are loaded.
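To put the OOM message in perspective, the size of the allocation that failed can be worked out from the shape and type reported in the log. The sketch below is plain arithmetic; the helper name `tensor_bytes` is made up for illustration, and it assumes that `float` in the log means 4-byte float32 elements.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Return the memory needed for a dense tensor of the given shape.

    Hypothetical helper; assumes 4 bytes per element (float32) by default.
    """
    # Multiply all dimensions together; an empty shape (a scalar) is 1 element.
    num_elements = reduce(mul, shape, 1)
    return num_elements * bytes_per_element


# shape[4096] and type float, as reported in the OOM message
size = tensor_bytes([4096])
print(size, "bytes =", size / 1024, "KiB")
```

The request that failed is only about 16 KiB, which suggests that the GPU memory was already almost fully reserved at that point, rather than this particular tensor being too large.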

It keeps "running" and eventually ends with this:

CancelledError (see above for traceback): Enqueue operation was cancelled
[[Node: fifo_queue_enqueue = QueueEnqueueV2[Tcomponents=[DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](fifo_queue, _arg_Placeholder_0_0, _arg_Placeholder_1_0_1, _arg_Placeholder_2_0_2, _arg_Placeholder_3_0_3, _arg_Placeholder_4_0_4, _arg_Placeholder_5_0_5, _arg_Placeholder_6_0_6, _arg_Placeholder_7_0_7, _arg_Placeholder_8_0_8, _arg_Placeholder_9_0_9)]]

I searched on GitHub, and a lot of people suggest limiting the memory that TensorFlow reserves, like this:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
sess = tf.Session(config=config)

This is meant to cap the amount of GPU memory that TensorFlow reserves.

What I tried:

I added these lines to the train_net.py file (in main), and I also modified the train_net function in train.py, at line 538, like this:

I uncommented this:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:

I commented out this:

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
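Note that swapping the two lines as described also drops `allow_soft_placement=True`. A minimal sketch of a config that keeps it while adding the memory options is given here for reference. It assumes the TensorFlow 1.x API (`tf.ConfigProto`, `gpu_options`); the helper name `make_session_config` is hypothetical and is not part of PoseCNN. Nothing in this thread establishes whether this avoids the cuDNN error.

```python
def make_session_config(memory_fraction=None, allow_growth=True):
    """Build a TF 1.x session config (hypothetical helper, not from PoseCNN)."""
    import tensorflow as tf  # imported lazily; assumes the TensorFlow 1.x API

    # Keep soft placement, as in the original line, so that ops without a
    # GPU kernel can fall back to the CPU.
    config = tf.ConfigProto(allow_soft_placement=True)

    # Allocate GPU memory on demand instead of reserving it all up front.
    config.gpu_options.allow_growth = allow_growth

    # Optionally cap the share of GPU memory this process may use.
    if memory_fraction is not None:
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction

    return config
```

It would be used in place of either variant above, for example `with tf.Session(config=make_session_config(memory_fraction=0.5)) as sess:`.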

And now I get this error:

2019-05-13 14:57:04.300897: E tensorflow/stream_executor/cuda/cuda_dnn.cc:455] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-05-13 14:57:04.300929: F tensorflow/core/kernels/conv_ops.cc:713] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)
./experiments/scripts/lov_color_2d_train.sh: line 23: 19835 Aborted (core dumped) ./tools/train_net.py --gpu 0 --network vgg16_convs --weights data/imagenet_models/vgg16.npy --imdb lov_train --cfg experiments/cfgs/lov_color_2d.yml --cad data/LOV/models.txt --pose data/LOV/poses.txt --iters 160000

This appears just after 'conv2_1 biases assigned'. The problem seems to be at this call:

loss_value, loss_cls_value, loss_vertex_value, loss_pose_value, lr, _ = sess.run([loss, loss_cls, loss_vertex, loss_pose, learning_rate, train_op])

I also tried :

sudo rm -rf .nv/

And I added this directory to my LD_LIBRARY_PATH:

/usr/local/cuda/extras/CUPTI/lib64
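One way to confirm that the directory is really visible to the training process is to inspect the environment from Python. This is only a sketch; the helper name `on_library_path` is hypothetical. It checks the variable but does not change it, since the dynamic loader reads LD_LIBRARY_PATH when the process starts.

```python
import os


def on_library_path(directory, env=None):
    """Return True if `directory` is listed in LD_LIBRARY_PATH."""
    env = os.environ if env is None else env
    entries = env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    # Normalise both sides so a trailing slash does not cause a false negative.
    wanted = os.path.normpath(directory)
    return any(os.path.normpath(entry) == wanted for entry in entries if entry)


print(on_library_path("/usr/local/cuda/extras/CUPTI/lib64"))
```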

My configuration: Ubuntu 18.04, CUDA 10.0, cuDNN 7.5.1, TensorFlow r1.8 (installed from source with Bazel 0.10.0)

Graphics card: NVIDIA RTX 2070, 8 GB

Nathan-81 commented 5 years ago

Does anyone know the minimum memory capacity of the GPU we need to do this training?