zju3dv / mvpose

Code for "Fast and Robust Multi-Person 3D Pose Estimation from Multiple Views" (CVPR 2019, T-PAMI 2021)
https://zju3dv.github.io/mvpose/
Apache License 2.0
515 stars, 79 forks

failed to enqueue forward pooling on stream: CUDNN_STATUS_EXECUTION_FAILED #87

Closed MolianWH closed 2 years ago

MolianWH commented 2 years ago

Env

Description

The project compiled successfully. When I evaluate on the Campus dataset, it fails with the errors cudnn PoolForward launch failed and failed to enqueue forward pooling on stream: CUDNN_STATUS_EXECUTION_FAILED. I have checked all existing issues and found none like this. I also googled the error and changed cuDNN from 7.6.5.32 to 7.0.0.5. My GPU has 10 GB of memory and the machine has 15 GB of RAM. Up to the point where the error occurs, GPU memory usage reaches about 2 GB and CPU memory usage about 6 GB, so I'm sure this is not an out-of-memory problem.

Errors

For easier reading, I have bolded the key information. The full details are below.

python ./src/m_utils/demo.py -d Campus
/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.metrics.base module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.metrics. Anything that cannot be imported from sklearn.metrics is now part of the private API.
  warnings.warn(message, FutureWarning)
2022-04-01 16:59:23.375152: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-04-01 16:59:23.431099: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-01 16:59:23.431203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: NVIDIA GeForce RTX 3080 major: 8 minor: 6 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
totalMemory: 9.78GiB freeMemory: 8.79GiB
2022-04-01 16:59:23.431216: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2022-04-01 17:08:18.636708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-04-01 17:08:18.636729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958] 0
2022-04-01 17:08:18.636734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: N
2022-04-01 17:08:18.636808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8331 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6)
2022-04-01 17:08:24.022598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2022-04-01 17:08:24.022628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-04-01 17:08:24.022635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958] 0
2022-04-01 17:08:24.022639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: N
2022-04-01 17:08:24.022683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8331 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6)
04-01 17:08:24 Generating testing graph on 1 GPUs ...
04-01 17:08:26 Initialized model weights from /home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/tf_cpn/log/model_dump/snapshot_350.ckpt ...
04-01 17:08:29 Current epoch is 350.
/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/CamStyle/reid/models/resnet.py:49: UserWarning: nn.init.kaiming_normal is now deprecated in favor of nn.init.kaiming_normal_.
  init.kaiming_normal(self.feat.weight, mode='fan_out')
/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/CamStyle/reid/models/resnet.py:50: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  init.constant(self.feat.bias, 0)
/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/CamStyle/reid/models/resnet.py:51: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  init.constant(self.featbn.weight, 1)
/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/CamStyle/reid/models/resnet.py:52: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  init.constant(self.featbn.bias, 0)
/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/CamStyle/reid/models/resnet.py:60: UserWarning: nn.init.normal is now deprecated in favor of nn.init.normal_.
  init.normal(self.classifier.weight, std=0.001)
/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/CamStyle/reid/models/resnet.py:61: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  init.constant(self.classifier.bias, 0)
=> Loaded checkpoint '/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/CamStyle/logs/market-ide-camstyle-re/checkpoint.pth.tar'
=> Start epoch 50
0%| | 0/79 [00:00<?, ?it/s]
2022-04-01 17:09:31.471629: E tensorflow/stream_executor/cuda/cuda_dnn.cc:3900] failed to enqueue forward pooling on stream: CUDNN_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(args)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: cudnn PoolForward launch failed
[[Node: light_resnet_v1_101/pool1/MaxPool = MaxPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="SAME", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[Node: add/_1107 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2576_add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/m_utils/demo.py", line 92, in
    pose_in_range = export ( test_model, test_loader, is_info_dicts=bool ( args.dumped_dir ), show=True )
  File "./src/m_utils/demo.py", line 44, in export
    show=show, plt_id=img_id )
  File "/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/src/models/estimate3d.py", line 40, in predict
    info_dict = self._infer_single2d ( imgs )
  File "/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/src/models/estimate3d.py", line 48, in _infer_single2d
    results = self.est2d.estimate_2d ( img, img_id )
  File "/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/estimator_2d.py", line 23, in estimate_2d
    bbox_result = self.bbox_detector.detect ( img, img_id )
  File "/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/light_head_rcnn/person_detector.py", line 60, in detect
    result_dict = self.inference ( self.func, self.inputs, data_dict )
  File "/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/light_head_rcnn/person_detector.py", line 152, in inference
    _, scores, pred_boxes, rois = val_func ( feed_dict=feed_dict )
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cudnn PoolForward launch failed
[[Node: light_resnet_v1_101/pool1/MaxPool = MaxPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="SAME", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[Node: add/_1107 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2576_add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'light_resnet_v1_101/pool1/MaxPool', defined at:
  File "./src/m_utils/demo.py", line 59, in
    test_model = MultiEstimator ( cfg=model_cfg )
  File "/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/src/models/estimate3d.py", line 34, in init
    self.est2d = Estimator_2d ( DEBUGGING=debug )
  File "/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/estimator_2d.py", line 19, in init
    self.bbox_detector = PersonDetector ( show_image=DEBUGGING )
  File "/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/light_head_rcnn/person_detector.py", line 50, in init
    self.func, self.inputs = self._load_model ( self.model_file )
  File "/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/light_head_rcnn/person_detector.py", line 122, in _load_model
    net.inference ( 'TEST', inputs )
  File "/home/dreamdeck/Documents/MJJ/code/PoseEstimation/mvpose/backend/light_head_rcnn/network_desp.py", line 109, in inference
    net, [3, 3], stride=2, padding='SAME', scope='pool1')
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
    return func(args, current_args)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 2404, in max_pool2d
    outputs = layer.apply(inputs)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 774, in apply
    return self.call(inputs, *args, *kwargs)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 329, in call
    outputs = super(Layer, self).call(inputs, args, kwargs)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 703, in call
    outputs = self.call(inputs, *args, kwargs)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/keras/layers/pooling.py", line 223, in call
    data_format=conv_utils.convert_data_format(self.data_format, 4))
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 2153, in max_pool
    name=name)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 4640, in max_pool
    data_format=data_format, name=name)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
    op_def=op_def)
  File "/home/dreamdeck/anaconda3/envs/mvpose/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1740, in init
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InternalError (see above for traceback): cudnn PoolForward launch failed
[[Node: light_resnet_v1_101/pool1/MaxPool = MaxPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="SAME", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[Node: add/_1107 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2576_add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

MolianWH commented 2 years ago

Solved. The cause was that the GPU is too new for this environment. I switched to a 1070 Ti GPU with the 440 driver, and it ran successfully.
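
For anyone hitting the same error, the check behind this resolution can be sketched roughly as follows. The log reports `compute capability: 8.6` for the RTX 3080, and CUDA toolkits older than 11.1 cannot target that architecture natively, whereas a 1070 Ti (compute capability 6.1) has been supported since CUDA 8.0. The helper names below are hypothetical, the table covers only a few architectures, and the CUDA version `(9, 0)` is an assumption, since the thread does not state which toolkit the environment used.

```python
import re

# Minimum CUDA toolkit release that can natively target each compute capability.
# Partial table for illustration only; it is not exhaustive.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 0): (8, 0),   # Pascal (e.g. P100)
    (6, 1): (8, 0),   # Pascal (e.g. GTX 1070 Ti)
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (e.g. RTX 3080)
}


def parse_compute_capability(log_text):
    """Extract (major, minor) from a TensorFlow device log line, or None."""
    match = re.search(r"compute capability: (\d+)\.(\d+)", log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def toolkit_supports(capability, cuda_version):
    """Return True if the given CUDA toolkit can natively target the GPU."""
    required = MIN_CUDA_FOR_CAPABILITY.get(capability)
    if required is None:
        raise KeyError(f"compute capability {capability} is not in the table")
    return cuda_version >= required


# Device line copied from the log in this issue.
log_line = (
    "physical GPU (device: 0, name: NVIDIA GeForce RTX 3080, "
    "pci bus id: 0000:01:00.0, compute capability: 8.6)"
)
capability = parse_compute_capability(log_line)

# ASSUMPTION: the environment's CUDA version is not given in the thread;
# (9, 0) is a placeholder for an older toolkit.
assumed_cuda = (9, 0)
print(capability, toolkit_supports(capability, assumed_cuda))
print((6, 1), toolkit_supports((6, 1), assumed_cuda))
```

If the check fails for your GPU, the options are either an older GPU (as done here) or a framework build compiled against a CUDA toolkit new enough for the card.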