I think this is the way to resolve the problem, but I can't figure out exactly how to do it.
Here is a bit more info on how I temporarily resolved it. I believe these issues are all related to GPU memory allocation and have nothing to do with the errors actually being reported. There were other errors before this indicating some sort of memory allocation problem, but the program continued to progress, eventually giving the cuDNN errors that everyone is getting. The reason I believe it works sometimes is that if you use the GPU for other things besides TensorFlow, such as your primary display, the available memory fluctuates: sometimes you can allocate what you need and other times you can't.
From the API docs (https://www.tensorflow.org/versions/r0.12/how_tos/using_gpu/): "By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation."
I think this default allocation is broken in some way, which would explain the erratic behavior where some setups work and others fail.
I have resolved this issue by changing the default behavior of TF to allocate a minimal amount of memory and grow as needed, as detailed on that page:

```python
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
```
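If you want to check this in isolation before touching your training script, here is a minimal sketch (my own illustration, assuming the TF 1.x API used in this thread; the toy convolution is hypothetical, not anyone's actual model) that runs a tiny conv op under an `allow_growth` session, since the crash happens when the conv kernels try to initialize cuDNN:

```python
import numpy as np
import tensorflow as tf

# Tiny convolution graph: enough to force cuDNN handle creation on the GPU.
x = tf.placeholder(tf.float32, shape=[1, 8, 8, 1])
w = tf.constant(np.ones([3, 3, 1, 1], dtype=np.float32))
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # start small, grow allocation as needed

with tf.Session(config=config) as sess:
    out = sess.run(y, feed_dict={x: np.zeros([1, 8, 8, 1], dtype=np.float32)})
    print(out.shape)  # (1, 8, 8, 1) if the cuDNN handle was created successfully
```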
I have also tried the alternate way and was able to get it both to work and to fail by experimentally choosing a fraction; in my case the value that worked ended up being about 0.7.
```python
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
```
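As a rough sketch of how one might pick that fraction instead of guessing (this is my own illustration, not from the TF docs; it assumes `nvidia-smi` is on the PATH and only looks at the first GPU):

```python
import subprocess
import tensorflow as tf

# Query the driver for free/total memory on GPU 0 (values in MiB).
out = subprocess.check_output(
    ['nvidia-smi', '--query-gpu=memory.free,memory.total',
     '--format=csv,noheader,nounits'])
free_mb, total_mb = map(int, out.decode().splitlines()[0].split(','))

config = tf.ConfigProto()
# Stay a bit below what is currently free so other users of the GPU
# (e.g. the display) keep some headroom.
config.gpu_options.per_process_gpu_memory_fraction = 0.9 * free_mb / total_mb
session = tf.Session(config=config)
```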
Still no word from anyone on the TF team confirming this, but it is worth a shot to see if others can confirm similar behavior.
```
Joints shape: (16, 2)
1989
  0%|          | 0/16 [00:00<?, ?it/s]
2017-11-10 01:01:49.488958: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-11-10 01:01:49.489003: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-11-10 01:01:49.489015: F tensorflow/core/kernels/conv_ops.cc:667] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)
Aborted (core dumped)
(tf3) wonjinlee@alpha:~/deeppose$ Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 125, in worker
    put((job, i, result))
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 355, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 130, in worker
    put((job, i, (False, wrapped)))
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 355, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
```