wuxxin / aur-packages

Arch Linux AUR packages I maintain

python-torchvision-rocm 0.17.2-1 fails to build: Assertion err == hipSuccess failed #11

Closed haplo closed 6 months ago

haplo commented 6 months ago

I cannot upgrade python-torchvision-rocm from 0.17.1-1 to 0.17.2-1; the build fails because of an assertion.

Full log of yay -Sua forcing a clean build:

:: Searching AUR for updates...
-> Packages not in AUR: khotkeys  kpeoplevcard  python-lib-agent  python-onlykey-agent
-> Orphan (unmaintained) AUR Packages: kjs
-> Flagged Out Of Date AUR Packages: featherwallet-bin  ttf-indieflower  ttf-pacifico
:: 1 package to upgrade/install.
1  aur/python-torchvision-rocm  0.17.1-1 -> 0.17.2-1
==> Packages to exclude: (eg: "1 2 3", "1-3", "^4" or repo name)
-> Excluding packages may cause partial upgrades and break systems
==>
AUR Explicit (1): python-torchvision-rocm-0.17.2-1
:: PKGBUILD up to date, skipping download: python-torchvision-rocm
1 python-torchvision-rocm          (Installed) (Build Files Exist)
==> Packages to cleanBuild?
==> [N]one [A]ll [Ab]ort [I]nstalled [No]tInstalled or (1 2 3, 1-3, ^4)
==> A
:: Deleting (1/1): /home/fidel/.cache/yay/python-torchvision-rocm
HEAD is now at c54f87a upgpkg: python-torchvision-rocm 0.17.2-1
warning: could not open directory 'pkg/': Permission denied
Removing pkg/
Removing python-torchvision-rocm-0.17.1-1-x86_64.pkg.tar.zst
Removing python-torchvision-rocm-debug-0.17.1-1-x86_64.pkg.tar.zst
Removing src/
Removing torchvision-rocm-0.17.1-1-x86_64.pkg.tar.zst
Removing vision-0.17.1.tar.gz
Removing vision-0.17.2.tar.gz
1 python-torchvision-rocm          (Installed) (Build Files Exist)
==> Diffs to show?
==> [N]one [A]ll [Ab]ort [I]nstalled [No]tInstalled or (1 2 3, 1-3, ^4)
==> A
-> python-torchvision-rocm: No changes -- skipping

:: Proceed with install? [Y/n] y
==> Making package: python-torchvision-rocm 0.17.2-1 (Mon 29 Apr 2024 03:15:05 PM WEST)
==> Retrieving sources...
-> Downloading vision-0.17.2.tar.gz...
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dload  Upload   Total   Spent    Left  Speed
0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 12.4M  100 12.4M    0     0  9658k      0  0:00:01  0:00:01 --:--:-- 13.1M
-> Found pytorch-vision-8096.patch
-> Found pytorch-vision-8112.patch
-> Found torchvision-0_17_1-fix-build.patch
==> WARNING: Skipping verification of source file PGP signatures.
==> Validating source files with sha256sums...
vision-0.17.2.tar.gz ... Passed
pytorch-vision-8096.patch ... Skipped
pytorch-vision-8112.patch ... Skipped
torchvision-0_17_1-fix-build.patch ... Skipped
:: (1/1) Parsing SRCINFO: python-torchvision-rocm
==> Making package: python-torchvision-rocm 0.17.2-1 (Mon 29 Apr 2024 03:15:08 PM WEST)
==> Checking runtime dependencies...
==> Checking buildtime dependencies...
==> Retrieving sources...
-> Found vision-0.17.2.tar.gz
-> Found pytorch-vision-8096.patch
-> Found pytorch-vision-8112.patch
-> Found torchvision-0_17_1-fix-build.patch
==> Validating source files with sha256sums...
vision-0.17.2.tar.gz ... Passed
pytorch-vision-8096.patch ... Skipped
pytorch-vision-8112.patch ... Skipped
torchvision-0_17_1-fix-build.patch ... Skipped
==> Removing existing $srcdir/ directory...
==> Extracting sources...
-> Extracting vision-0.17.2.tar.gz with bsdtar
==> Starting prepare()...
patching file torchvision/csrc/io/decoder/stream.cpp
patching file setup.py
==> Sources are ready.
==> Making package: python-torchvision-rocm 0.17.2-1 (Mon 29 Apr 2024 03:15:11 PM WEST)
==> Checking runtime dependencies...
==> Checking buildtime dependencies...
==> WARNING: Using existing $srcdir/ tree
==> Starting build()...
building for PYTORCH_ROCM_ARCH=gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102
-- The C compiler identification is GNU 13.2.1
-- The CXX compiler identification is GNU 13.2.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Caffe2: Found gflags with new-style gflags target.
-- Caffe2: Found glog with new-style glog target.
-- Found ZLIB: /usr/lib/libz.so (found version "1.3.1")
-- Caffe2: Found protobuf with new-style protobuf targets.
-- Caffe2: Protobuf version 25.3.0
Building PyTorch for GPU arch: gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102
-- Found HIP: /opt/rocm (found suitable version "6.0.32831-", minimum required is "1.0")
HIP VERSION: 6.0.32831-
-- Caffe2: Header version is: 6.0.2

***** ROCm version from rocm_version.h ****

ROCM_VERSION_DEV: 6.0.2
ROCM_VERSION_DEV_MAJOR: 6
ROCM_VERSION_DEV_MINOR: 0
ROCM_VERSION_DEV_PATCH: 2
ROCM_VERSION_DEV_INT:   60002
HIP_VERSION_MAJOR: 6
HIP_VERSION_MINOR: 0
TORCH_HIP_VERSION: 600

***** Library versions from dpkg *****

***** Library versions from cmake find_package *****

-- Found Threads: TRUE
hip VERSION: 6.0.0
hsa-runtime64 VERSION: 1.12.0
amd_comgr VERSION: 2.6.0
rocrand VERSION: 3.0.0
hiprand VERSION: 2.10.16
rocblas VERSION: 4.0.0
hipblas VERSION: 2.0.0
hipblaslt VERSION: 0.6.0
miopen VERSION: 3.00.0
hipfft VERSION: 1.0.13
hipsparse VERSION: 3.0.0
rccl VERSION: 2.18.3
rocprim VERSION: 3.0.0
hipcub VERSION: 3.0.0
rocthrust VERSION: 3.0.0
hipsolver VERSION: 2.0.0
hipblaslt is NOT using custom data type
hipblaslt is NOT using custom compute type
hipblaslt provides getIndexFromAlgo
HIP is using new type enums
CMake Warning at /usr/lib/cmake/Torch/TorchConfig.cmake:22 (message):
static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
/usr/lib/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
CMakeLists.txt:24 (find_package)

-- Found Torch: /usr/lib/libtorch.so
-- Found PNG: /usr/lib/libpng.so (found version "1.6.43")
-- Found JPEG: /usr/lib/libjpeg.so (found version "80")
-- Configuring done (1.5s)
-- Generating done (0.0s)
-- Build files have been written to: /home/fidel/.cache/yay/python-torchvision-rocm/src/vision-0.17.2/build
[1/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/io/image/cpu/common_jpeg.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[2/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/vision.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
In file included from /home/fidel/.cache/yay/python-torchvision-rocm/src/vision-0.17.2/torchvision/csrc/vision.cpp:1:
/home/fidel/.cache/yay/python-torchvision-rocm/src/vision-0.17.2/torchvision/csrc/vision.h:10:40: warning: ‘_register_ops’ initialized and declared ‘extern’
10 | extern "C" VISION_INLINE_VARIABLE auto _register_ops = &cuda_version;
|                                        ^~~~~~~~~~~~~
[3/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/cpu/roi_pool_kernel.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[4/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/io/image/cpu/decode_png.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[5/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/io/image/cpu/encode_jpeg.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[6/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/io/image/image.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[7/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/io/image/cpu/decode_image.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[8/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/io/image/cpu/read_write_file.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[9/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/ps_roi_pool.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[10/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/nms.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[11/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/cpu/ps_roi_align_kernel.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[12/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/deform_conv2d.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[13/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/io/image/cuda/decode_jpeg_cuda.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[14/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/cpu/deform_conv2d_kernel.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[15/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/cpu/roi_align_kernel.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[16/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/io/image/cpu/decode_jpeg.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[17/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/roi_pool.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[18/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/io/image/cpu/encode_png.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[19/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/cpu/ps_roi_pool_kernel.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[20/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/ps_roi_align.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[21/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/roi_align.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[22/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/autograd/ps_roi_align_kernel.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[23/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/autograd/deform_conv2d_kernel.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[24/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/cpu/nms_kernel.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[25/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[26/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/autograd/ps_roi_pool_kernel.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[27/28] Building CXX object CMakeFiles/torchvision.dir/torchvision/csrc/ops/autograd/roi_align_kernel.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[28/28] Linking CXX shared library libtorchvision.so
python: /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.2/hipamd/src/hip_code_object.cpp:762: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.
/home/fidel/.cache/yay/python-torchvision-rocm/PKGBUILD: line 70: 134069 Aborted                 (core dumped) TORCHVISION_INCLUDE=${srcdir} TORCHVISION_LIBRARY=/usr/lib TORCHVISION_USE_NVJPEG=0 TORCHVISION_USE_VIDEO_CODEC=0 TORCHVISION_USE_FFMPEG=1 python setup.py build
==> ERROR: A failure occurred in build().
Aborting...
-> error making: python-torchvision-rocm-exit status 4
-> Failed to install the following packages. Manual intervention is required:
python-torchvision-rocm - exit status 4

Any ideas of what could be the issue or how to debug it?

Thanks for your support!
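
As a first debugging step, it can help to pull the assertion line out of the log and look at which source file raised it. The sketch below assumes the log uses the glibc `assert()` message layout seen above (`program: file:line: function: Assertion ... failed.`); the names `ASSERT_RE` and `find_failed_assertions` are made up for illustration and are not part of yay, makepkg, or ROCm.

```python
import re

# glibc prints a failed assert() as:
#   program: file:line: function: Assertion `expr' failed.
ASSERT_RE = re.compile(
    r"^(?P<program>[^:\s]+): (?P<file>[^:]+):(?P<line>\d+): "
    r"(?P<function>.+): Assertion `(?P<expr>.+)' failed\.$"
)


def find_failed_assertions(log_text):
    """Return one dict per glibc-style assertion failure found in the log."""
    hits = []
    for raw in log_text.splitlines():
        match = ASSERT_RE.match(raw.strip())
        if match:
            hit = match.groupdict()
            hit["line"] = int(hit["line"])  # line number inside the source file
            hits.append(hit)
    return hits
```

Run on the log above, this should report a single hit whose `file` sits under `hip-runtime-amd/clr-rocm-6.0.2`, which suggests the abort is raised inside the HIP runtime while `python` is running rather than by torchvision's own sources.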

wuxxin commented 6 months ago

Does your python-pytorch currently work? Could you paste the output of:

/opt/rocm/bin/rocminfo | grep -E "(Name|ID):"
export | grep -E \
  "(GPU_TARGETS|AMDGPU_TARGETS|PYTORCH_ROCM_ARCH|HSA_OVERRIDE_GFX_VERSION|ROCR_VISIBLE_DEVICES)"
python -c 'import torch.version as v; \
  print("torch: {}\nrocm: {}\n".format(v.git_version, v.hip))'
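
For anyone who would rather gather the environment part of this from Python, here is a minimal sketch. The variable names are copied from the grep pattern above; the helper name `rocm_env_overrides` is hypothetical. Unlike the grep, which matches substrings, this only reports exact variable names.

```python
import os

# Variable names copied from the grep pattern in the comment above.
ROCM_ENV_VARS = (
    "GPU_TARGETS",
    "AMDGPU_TARGETS",
    "PYTORCH_ROCM_ARCH",
    "HSA_OVERRIDE_GFX_VERSION",
    "ROCR_VISIBLE_DEVICES",
)


def rocm_env_overrides(environ=None):
    """Return the ROCm-related variables that are set, as a dict."""
    env = os.environ if environ is None else environ
    return {name: env[name] for name in ROCM_ENV_VARS if name in env}
```

Calling `rocm_env_overrides()` with no argument inspects the current process environment; an empty dict means none of these overrides are set.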

haplo commented 6 months ago

rocminfo works, but pytorch import fails:

>>> import torch
python: /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.2/hipamd/src/hip_code_object.cpp:762: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.
fish: Job 1, 'python' terminated by signal SIGABRT (Abort)

wuxxin commented 6 months ago

Ack. You need a working pytorch to compile torchvision, so I'm closing this as it's unrelated to torchvision. Try to get your pytorch running first, then build torchvision.
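
A quick way to check this before starting a long build is to try the import in a child process. A failed assertion aborts the interpreter with SIGABRT, so a `try`/`except` around `import torch` in the same process cannot catch it. The following is only a sketch: the helper name `run_probe` is hypothetical, and the signal handling assumes a POSIX system, where `subprocess` reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess
import sys


def run_probe(code, python=sys.executable):
    """Run `code` in a child interpreter and return (ok, detail)."""
    proc = subprocess.run([python, "-c", code], capture_output=True, text=True)
    if proc.returncode == 0:
        return True, "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the signal
        # that killed the child (SIGABRT for a failed assertion).
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            name = f"signal {-proc.returncode}"
        return False, f"killed by {name}"
    # Ordinary failure, e.g. ModuleNotFoundError: report the last stderr line.
    stderr_lines = proc.stderr.strip().splitlines()
    last = stderr_lines[-1] if stderr_lines else ""
    return False, f"exit status {proc.returncode}: {last}"
```

With this, `run_probe("import torch")` would be expected to come back as not ok on the setup described in this issue, and there is little point in building torchvision until it succeeds.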

haplo commented 6 months ago

Definitely, thank you for the help debugging!

Rubby2001 commented 6 months ago

Have you solved your issue with pytorch? I have run into exactly the same problem as you.

haplo commented 6 months ago

I haven't. This is a known issue tracked on the Arch Linux GitLab, and I'm waiting for an update there. As you can read in that issue, you can install python-pytorch-rocm-bin from the AUR as a workaround.