zldrobit / onnx_tflite_yolov3

A conversion tool to convert YOLO v3 Darknet weights to a TF Lite model (YOLO v3 PyTorch > ONNX > TensorFlow > TF Lite), and to TensorRT (YOLO v3 PyTorch > ONNX > TensorRT).
GNU General Public License v3.0
69 stars 26 forks

cuDNN conflict between PyTorch and TensorFlow? #1

Closed Chase2816 closed 4 years ago

Chase2816 commented 4 years ago

After converting the .pt model to ONNX, the test passed. After converting the ONNX model to a .pb model, inference with tf_infer.py runs without errors, but tf_detect.py fails. The error is:

`File "D:\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call return fn(*args)
File "D:\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn target_list, run_metadata)
File "D:\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[{{node convolution}}]] [[815/_27]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[{{node convolution}}]]
0 successful operations. 0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:/ccpd_dataset/onnx_tflite_yolov3-master/tf_detect.py", line 213, in detect()
File "D:/ccpd_dataset/onnx_tflite_yolov3-master/tf_detect.py", line 117, in detect pred = sess.run("815:0", feed_dict={'input.1:0': img})
File "D:\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run run_metadata_ptr)
File "D:\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run feed_dict_tensor, options, run_metadata)
File "D:\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run run_metadata)
File "D:\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node convolution (defined at \Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]] [[815/_27]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node convolution (defined at \Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations. 0 derived errors ignored.

Original stack trace for 'convolution':
File "/ccpd_dataset/onnx_tflite_yolov3-master/tf_detect.py", line 213, in detect()
File "/ccpd_dataset/onnx_tflite_yolov3-master/tf_detect.py", line 43, in detect name="")
File "\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func return func(*args, **kwargs)
File "\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\framework\importer.py", line 405, in import_graph_def producer_op_list=producer_op_list)
File "\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\framework\importer.py", line 517, in _import_graph_def_internal _ProcessNewOps(graph)
File "\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\framework\importer.py", line 243, in _ProcessNewOps for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3561, in _add_new_tf_operations for c_op in c_api_util.new_tf_operations(self)
File "\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3561, in for c_op in c_api_util.new_tf_operations(self)
File "\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3451, in _create_op_from_tf_operation ret = Operation(c_op, self)
File "\Anaconda3\envs\py365\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in init self._traceback = tf_stack.extract_stack()`

I set up the environment according to requirements.txt. Have you run into this problem before? Any advice would be appreciated. Thanks!

zldrobit commented 4 years ago

This is probably caused by the cuDNN and CUDA versions. You can try the Docker image I built: docker pull zldrobit/onnx:10.0-cudnn7-devel. The conversion works correctly for me with this image.
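To illustrate the version-mismatch explanation, here is a minimal sketch that compares an installed CUDA/cuDNN pair against what a TensorFlow 1.x release expects. The table values are an assumption based on TensorFlow's published tested build configurations (CUDA 10.0 and cuDNN 7 for the 1.13 to 1.15 releases), and `check_gpu_libs` is a hypothetical helper, not part of this repository; the version strings passed in are made-up examples.

```python
# TensorFlow release prefix -> (CUDA version, cuDNN major version).
# Assumption: values follow TensorFlow's "tested build configurations" table;
# verify them there for the exact release you install.
EXPECTED_GPU_LIBS = {
    "1.13": ("10.0", 7),
    "1.14": ("10.0", 7),
    "1.15": ("10.0", 7),
}


def check_gpu_libs(tf_version, cuda_version, cudnn_version):
    """Return a list of human-readable mismatches (empty if none were found)."""
    key = ".".join(tf_version.split(".")[:2])
    if key not in EXPECTED_GPU_LIBS:
        return ["no entry for TensorFlow " + tf_version]
    want_cuda, want_cudnn_major = EXPECTED_GPU_LIBS[key]
    problems = []
    # Compare only major.minor of CUDA, so "10.0.130" still matches "10.0".
    if ".".join(cuda_version.split(".")[:2]) != want_cuda:
        problems.append("CUDA %s installed, %s expected" % (cuda_version, want_cuda))
    # Compare only the cuDNN major version.
    if int(cudnn_version.split(".")[0]) != want_cudnn_major:
        problems.append(
            "cuDNN %s installed, major version %d expected"
            % (cudnn_version, want_cudnn_major)
        )
    return problems


# Hypothetical installed versions, for illustration only.
print(check_gpu_libs("1.15.0", "10.0", "7.6.5"))
print(check_gpu_libs("1.15.0", "10.1", "7.6.5"))
```

The check is deliberately coarse: it only looks at the CUDA major.minor and the cuDNN major version, so a cuDNN minor version older than the one TensorFlow was built against would still need to be checked by hand.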

zldrobit commented 4 years ago

I googled it, and many people have run into this problem as well. You can refer to: https://github.com/tensorflow/tensorflow/issues/24828 https://github.com/tensorflow/tensorflow/issues/28326 https://stackoverflow.com/questions/53698035/failed-to-get-convolution-algorithm-this-is-probably-because-cudnn-failed-to-in

Alternatively, try the new environment configuration; the requirements.txt file has been updated.
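One workaround often suggested in threads like the ones linked above is to let TensorFlow allocate GPU memory on demand, so that cuDNN is less likely to fail to initialize when another library already holds GPU memory. The sketch below assumes the TensorFlow 1.x session API (`tf.compat.v1`); `make_growth_config` and `open_session` are hypothetical helper names, and whether this resolves the error in this particular setup is not confirmed in the thread.

```python
def make_growth_config(tf):
    """Build a TF1-style session config that allocates GPU memory on demand.

    `tf` is the already-imported tensorflow module; it is passed in so that
    this helper does not import TensorFlow by itself.
    """
    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True
    return config


def open_session(graph=None):
    """Hypothetical helper: open a session that uses the config above."""
    import tensorflow as tf  # lazy import; requires TensorFlow to be installed

    return tf.compat.v1.Session(graph=graph, config=make_growth_config(tf))
```

In a script such as tf_detect.py, the session would then be created with `open_session(graph)` instead of a plain session constructor; how the graph is loaded there is not shown in this thread, so that part is left out.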

Chase2816 commented 4 years ago

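Thanks, I'll try again. After changing the cuDNN version, I tried different PyTorch and TensorFlow versions. In the end it ran successfully, without errors, with tensorflow==1.15.0 and with pytorch==1.3.1 swapped for the CPU build. When I train on my own dataset with the default configuration (apart from the dataset parameters), this version of YOLOv3 detects worse than yunyang1994's YOLOv3: some boxes are missing and some extra boxes appear. Have you run into this?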
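The working setup described above keeps PyTorch off the GPU by installing its CPU build. A blunter way to test whether the GPU libraries are the cause, without reinstalling anything, is to hide all CUDA devices from the process, which forces every framework in it onto the CPU. This is a sketch under that assumption; `hide_gpus` is a hypothetical helper, and it must run before torch or tensorflow is imported.

```python
import os


def hide_gpus(env=None):
    """Mark all CUDA devices as invisible for processes that use `env`."""
    env = os.environ if env is None else env
    # "-1" is not a valid device index, so CUDA reports no usable devices.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


# Call this before importing torch or tensorflow; the variable is read
# when CUDA is first initialized and has no effect afterwards.
hide_gpus()
```

Unlike the CPU-only PyTorch install, this also moves TensorFlow to the CPU, so it is better suited to diagnosis than to a permanent fix.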

zldrobit commented 4 years ago

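The code in this repository is mainly meant to verify the Darknet weights > ONNX (PyTorch) > TensorFlow > TFLite conversion; I have not looked at training quality so far. If you want to train with PyTorch, you can refer to https://github.com/ultralytics/yolov3. I recommend training the weights with the original YOLOv3 GitHub repository first, then converting the resulting weights with this repository. That way accuracy should not be a problem.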