world-modelz / dreamax

A scalable Dreamer implementation in JAX
MIT License

cuDNN-related warnings #1

Open andreaskoepf opened 2 years ago

andreaskoepf commented 2 years ago

On an RTX A6000 with driver version 470.103.01 and CUDA version 11.4, I get the following errors/warnings:

2022-04-08 08:11:14.361125: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:727] None of the algorithms provided by cuDNN heuristics worked; trying fallback algorithms.  Conv: (f16[1,5,5,128]{3,2,1,0}, u8[0]{0}) custom-call(f16[1,9,9,1024]{3,2,1,0}, f16[5,5,128,1024]{3,1,0,2}), window={size=5x5}, dim_labels=b01f_01oi->b01f, custom_call_target="__cudnn$convForward", backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}"
2022-04-08 08:17:10.165017: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:727] None of the algorithms provided by cuDNN heuristics worked; trying fallback algorithms.  Conv: (f16[1600,30,30,32]{3,2,1,0}, u8[0]{0}) custom-call(f16[1600,64,64,4]{3,2,1,0}, f16[6,6,4,32]{2,1,0,3}), window={size=6x6 stride=2x2}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convForward", backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}"
2022-04-08 08:17:10.241714: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 0: 3442 vs 3876
2022-04-08 08:17:10.241744: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 1: 3454 vs 3874
(...)
2022-04-08 08:17:10.241808: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 9: 3456 vs 3874
2022-04-08 08:17:10.241839: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:612] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
%custom-call.447 = (f16[4,4,4,32]{2,1,0,3}, u8[0]{0}) custom-call(f16[1600,64,64,4]{3,2,1,0} %pad.436, f16[1600,31,31,32]{3,2,1,0} %select.50292), window={size=4x4 stride=2x2}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convBackwardFilter", metadata={op_name="jit(_update)/jit(main)/conv_general_dilated[window_strides=(1, 1) padding=((0, 0), (0, 0)) lhs_dilation=(1, 1) rhs_dilation=(2, 2) dimension_numbers=ConvDimensionNumbers(lhs_spec=(3, 0, 1, 2), rhs_spec=(3, 0, 1, 2), out_spec=(2, 3, 0, 1)) feature_group_count=1 batch_group_count=1 lhs_shape=(1600, 64, 64, 3) rhs_shape=(1600, 31, 31, 32) precision=None preferred_element_type=None]" source_file="/usr/local/lib/python3.8/dist-packages/haiku/_src/conv.py" source_line=205}, backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" for eng20{k2=6,k3=0} vs eng33{k2=7,k5=3,k14=5,k0=41}
2022-04-08 08:17:10.241850: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:245] Device: NVIDIA RTX A6000
2022-04-08 08:17:10.241855: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:246] Platform: Compute Capability 8.6
2022-04-08 08:17:10.241859: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:247] Driver: 11040 (470.103.1)
2022-04-08 08:17:10.241864: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:248] Runtime: <undefined>
2022-04-08 08:17:10.241872: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:255] cudnn version: 8.2.4
2022-04-08 08:17:10.243720: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 0: 3442 vs 3864
2022-04-08 08:17:10.243734: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 1: 3454 vs 3864
(...)
2022-04-08 08:17:10.243800: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 9: 3456 vs 3862
2022-04-08 08:17:10.243833: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:612] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
(...)

According to https://github.com/google/jax/issues/8746, the warning "None of the algorithms provided by cuDNN heuristics worked" also appears when not enough GPU memory is available. However, I tested with XLA_PYTHON_CLIENT_MEM_FRACTION=0.75 and the warning was still printed, even though nvidia-smi showed plenty of free memory.
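For anyone who wants to repeat the memory test, here is a minimal sketch of setting the fraction from Python instead of the shell. The helper name `set_gpu_memory_fraction` is made up for illustration; the only real piece is the `XLA_PYTHON_CLIENT_MEM_FRACTION` environment variable, which as far as I know has to be set before JAX initializes its GPU backend.

```python
import os


def set_gpu_memory_fraction(fraction: float) -> None:
    """Hypothetical helper: request that XLA preallocate `fraction` of GPU memory."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # The value is read from the environment as a string.
    os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = str(fraction)


set_gpu_memory_fraction(0.75)

# Import jax only after this point so the backend sees the setting:
# import jax
```

Setting the variable in the shell before launching the training script should have the same effect.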

I currently suspect that both warnings are caused by CUDA driver or JAX issues and may be resolved by future releases. I want to track them here in case someone sees similar output or finds a solution.

dfm794 commented 2 years ago

I see the same error with a different card and driver version, fwiw:

2022-05-05 03:41:20.671403: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 0: 3876 vs 3448
2022-05-05 03:41:20.671433: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 1: 3874 vs 3444
2022-05-05 03:41:20.671439: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 2: 3874 vs 3448
2022-05-05 03:41:20.671444: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 3: 3876 vs 3448
2022-05-05 03:41:20.671449: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 4: 3874 vs 3444
2022-05-05 03:41:20.671453: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 5: 3874 vs 3446
2022-05-05 03:41:20.671458: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 6: 3876 vs 3442
2022-05-05 03:41:20.671463: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 7: 3876 vs 3448
2022-05-05 03:41:20.671467: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 8: 3874 vs 3442
2022-05-05 03:41:20.671472: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 9: 3876 vs 3454
2022-05-05 03:41:20.671491: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:612] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
%cudnn-conv-bw-filter.3 = (f16[4,4,4,32]{2,1,0,3}, u8[0]{0}) custom-call(f16[1600,64,64,4]{3,2,1,0} %pad.436, f16[1600,31,31,32]{3,2,1,0} %select.47277), window={size=4x4 stride=2x2}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convBackwardFilter", metadata={op_name="jit(_update)/jit(main)/conv_general_dilated[window_strides=(1, 1) padding=((0, 0), (0, 0)) lhs_dilation=(1, 1) rhs_dilation=(2, 2) dimension_numbers=ConvDimensionNumbers(lhs_spec=(3, 0, 1, 2), rhs_spec=(3, 0, 1, 2), out_spec=(2, 3, 0, 1)) feature_group_count=1 batch_group_count=1 lhs_shape=(1600, 64, 64, 3) rhs_shape=(1600, 31, 31, 32) precision=None preferred_element_type=None]" source_file="/usr/local/lib/python3.8/dist-packages/haiku/_src/conv.py" source_line=205}, backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" for eng54{k2=10,k6=1,k22=2,k12=119,k13=1,k14=0,k15=0,k17=120} vs eng20{k2=7,k3=0}
2022-05-05 03:41:20.671497: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:245] Device: Quadro RTX 6000
2022-05-05 03:41:20.671501: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:246] Platform: Compute Capability 7.5
2022-05-05 03:41:20.671505: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:247] Driver: 11060 (510.47.3)
2022-05-05 03:41:20.671509: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:248] Runtime:
2022-05-05 03:41:20.671514: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:255] cudnn version: 8.4.0
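
To make reports like the two above easier to compare, here is a rough sketch that pulls the "Difference at N: a vs b" pairs out of a pasted log and computes the largest relative gap. The function names are hypothetical, and the script makes no assumption about what the printed values represent; it only compares the two numbers on each line.

```python
import re

# Matches the message part of a buffer_comparator line, e.g. "Difference at 0: 3442 vs 3876".
DIFF_RE = re.compile(r"Difference at (\d+): (-?\d+(?:\.\d+)?) vs (-?\d+(?:\.\d+)?)")


def parse_differences(log_text):
    """Return (index, first, second) tuples for every 'Difference at' line."""
    return [(int(i), float(a), float(b)) for i, a, b in DIFF_RE.findall(log_text)]


def max_relative_difference(diffs):
    """Largest |a - b| / max(|a|, |b|) over all parsed pairs."""
    return max(abs(a - b) / max(abs(a), abs(b)) for _, a, b in diffs)


# Three lines from the first report in this thread, with the timestamp and path prefix trimmed.
sample = """
buffer_comparator.cc:730] Difference at 0: 3442 vs 3876
buffer_comparator.cc:730] Difference at 1: 3454 vs 3874
buffer_comparator.cc:730] Difference at 9: 3456 vs 3874
"""

diffs = parse_differences(sample)
print(len(diffs), "mismatches, max relative difference:", max_relative_difference(diffs))
```

On these sample lines the two algorithms disagree by roughly 11%, which is the same order of magnitude in both reports.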