Open andreaskoepf opened 2 years ago
I see the same error with a different card and driver version, FWIW:
2022-05-05 03:41:20.671403: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 0: 3876 vs 3448
2022-05-05 03:41:20.671433: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 1: 3874 vs 3444
2022-05-05 03:41:20.671439: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 2: 3874 vs 3448
2022-05-05 03:41:20.671444: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 3: 3876 vs 3448
2022-05-05 03:41:20.671449: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 4: 3874 vs 3444
2022-05-05 03:41:20.671453: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 5: 3874 vs 3446
2022-05-05 03:41:20.671458: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 6: 3876 vs 3442
2022-05-05 03:41:20.671463: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 7: 3876 vs 3448
2022-05-05 03:41:20.671467: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 8: 3874 vs 3442
2022-05-05 03:41:20.671472: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:730] Difference at 9: 3876 vs 3454
2022-05-05 03:41:20.671491: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:612] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
%cudnn-conv-bw-filter.3 = (f16[4,4,4,32]{2,1,0,3}, u8[0]{0}) custom-call(f16[1600,64,64,4]{3,2,1,0} %pad.436, f16[1600,31,31,32]{3,2,1,0} %select.47277), window={size=4x4 stride=2x2}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convBackwardFilter", metadata={op_name="jit(_update)/jit(main)/conv_general_dilated[window_strides=(1, 1) padding=((0, 0), (0, 0)) lhs_dilation=(1, 1) rhs_dilation=(2, 2) dimension_numbers=ConvDimensionNumbers(lhs_spec=(3, 0, 1, 2), rhs_spec=(3, 0, 1, 2), out_spec=(2, 3, 0, 1)) feature_group_count=1 batch_group_count=1 lhs_shape=(1600, 64, 64, 3) rhs_shape=(1600, 31, 31, 32) precision=None preferred_element_type=None]" source_file="/usr/local/lib/python3.8/dist-packages/haiku/_src/conv.py" source_line=205}, backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" for eng54{k2=10,k6=1,k22=2,k12=119,k13=1,k14=0,k15=0,k17=120} vs eng20{k2=7,k3=0}
2022-05-05 03:41:20.671497: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:245] Device: Quadro RTX 6000
2022-05-05 03:41:20.671501: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:246] Platform: Compute Capability 7.5
2022-05-05 03:41:20.671505: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:247] Driver: 11060 (510.47.3)
2022-05-05 03:41:20.671509: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:248] Runtime:
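To get a feel for how far apart the two convolution algorithms are, the `Difference at N: A vs B` lines can be pulled out of such a log and turned into relative differences. This is only a minimal sketch: `parse_differences` is a hypothetical helper, not part of JAX or XLA, and the regular expression assumes the exact line format shown in the log above.

```python
import re

# Matches the tail of a buffer_comparator line: "Difference at <i>: <a> vs <b>".
# Assumption: values are plain decimal numbers, as in the log above.
_NUM = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
DIFF_RE = re.compile(rf"Difference at (\d+): ({_NUM}) vs ({_NUM})")


def parse_differences(log_text):
    """Return a list of (index, a, b, relative_difference) tuples."""
    rows = []
    for match in DIFF_RE.finditer(log_text):
        index = int(match.group(1))
        a = float(match.group(2))
        b = float(match.group(3))
        scale = max(abs(a), abs(b))
        # Guard against division by zero when both values are 0.
        rel = abs(a - b) / scale if scale else 0.0
        rows.append((index, a, b, rel))
    return rows


if __name__ == "__main__":
    sample = (
        "buffer_comparator.cc:730] Difference at 0: 3876 vs 3448\n"
        "buffer_comparator.cc:730] Difference at 1: 3874 vs 3444\n"
    )
    for index, a, b, rel in parse_differences(sample):
        print(f"{index}: {a:g} vs {b:g} (relative difference {rel:.1%})")
```

For the first entry in the log above (3876 vs 3448) this gives a relative difference of roughly 11%.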
On an RTX A6000 with Driver Version: 470.103.01, CUDA Version: 11.4, I get the following errors/warnings:
According to https://github.com/google/jax/issues/8746, "None of the algorithms provided by cuDNN heuristics worked" also appears when not enough GPU memory is available, but I tested with `XLA_PYTHON_CLIENT_MEM_FRACTION=0.75` and the warning was still printed even though `nvidia-smi` indicated lots of free memory.

I currently suspect that both warnings are related to CUDA driver or JAX issues and might be solved automatically in the future by new releases. I want to track them here in case someone sees similar outputs or finds a solution.
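For anyone who wants to repeat the memory check, here is a minimal sketch of how the setting can be applied from Python and how free memory can be read back. Both helper names (`configure_jax_memory`, `gpu_memory_mib`) are hypothetical, and the sketch assumes `nvidia-smi` is on `PATH`; it returns an empty list if it is not. The environment variable should be set before JAX is imported so that it is in place when the GPU backend initializes.

```python
import os
import shutil
import subprocess


def configure_jax_memory(fraction=0.75):
    """Set XLA_PYTHON_CLIENT_MEM_FRACTION; call this before `import jax`."""
    os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = str(fraction)
    return os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]


def gpu_memory_mib():
    """Return [(used_mib, total_mib), ...] per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=memory.used,memory.total",
                "--format=csv,noheader,nounits",
            ],
            capture_output=True,
            text=True,
            check=True,
        ).stdout
    except (subprocess.CalledProcessError, OSError):
        return []
    result = []
    for line in out.strip().splitlines():
        try:
            used, total = (int(field.strip()) for field in line.split(","))
        except ValueError:
            continue  # skip lines such as "[N/A]" that cannot be parsed
        result.append((used, total))
    return result


if __name__ == "__main__":
    configure_jax_memory(0.75)
    # `import jax` would go here, after the environment variable is set.
    for gpu, (used, total) in enumerate(gpu_memory_mib()):
        print(f"GPU {gpu}: {used} MiB used of {total} MiB")
```

Note that once JAX has preallocated its share, `nvidia-smi` reports that memory as used, so the reading is most informative before the JAX process starts or from a second shell.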