xsdk-project / xsdk-issues

A repository under which GitHub issues not related to a specific xSDK repo can be filed.

Hypre build failure with `xsdk+rocm` #227

Status: Open. balay opened this issue 11 months ago.

balay commented 11 months ago
balay@petsc-gpu-02:/scratch/balay/spack$ ./bin/spack spec xsdk+rocm amdgpu_target=gfx90a |grep hypre@
 -       ^hypre@2.30.0%gcc@11.4.0~caliper~complex~cuda~debug+fortran~gptune~int64~internal-superlu~magma~mixedint+mpi~openmp~rocm+shared+superlu-dist~sycl~umpire~unified-memory build_system=autotools arch=linux-ubuntu22.04-zen4

I thought this build succeeded last week [but I don't know for sure].

ref: ./bin/spack install -j64 xsdk+rocm amdgpu_target=gfx90a

Attached build log: spack-build-out.txt

balay commented 11 months ago

I tried the following change and still get errors:

diff --git a/var/spack/repos/builtin/packages/xsdk/package.py b/var/spack/repos/builtin/packages/xsdk/package.py
index b52d692b78..629f240a8f 100644
--- a/var/spack/repos/builtin/packages/xsdk/package.py
+++ b/var/spack/repos/builtin/packages/xsdk/package.py
@@ -109,8 +109,8 @@ class Xsdk(BundlePackage, CudaPackage, ROCmPackage):
     variant("hiop", default=True, description="Enable hiop build")
     variant("raja", default=(sys.platform != "darwin"), description="Enable raja for hiop, exago")

-    xsdk_depends_on("hypre@develop+superlu-dist+shared", when="@develop", cuda_var="cuda")
-    xsdk_depends_on("hypre@2.30.0+superlu-dist+shared", when="@1.0.0", cuda_var="cuda")
+    xsdk_depends_on("hypre@develop+superlu-dist+shared", when="@develop", cuda_var="cuda", rocm_var="rocm")
+    xsdk_depends_on("hypre@2.30.0+superlu-dist+shared", when="@1.0.0", cuda_var="cuda", rocm_var="rocm")
     xsdk_depends_on("hypre@2.26.0+superlu-dist+shared", when="@0.8.0", cuda_var="cuda")
     xsdk_depends_on("hypre@2.23.0+superlu-dist+shared", when="@0.7.0", cuda_var="cuda")
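The `cuda_var`/`rocm_var` arguments matter because, without `rocm_var`, the hypre dependency line presumably carries no ROCm variant, so `xsdk+rocm` still concretizes hypre with its default `~rocm`, as in the `spack spec` output in the first comment. The sketch below is a hypothetical simplification of what such a helper could do, not Spack's actual `xsdk_depends_on`; the name `expand_accel_dep` and its parameters are illustrative only.

```python
# Hypothetical sketch, NOT Spack's actual xsdk_depends_on: it only shows how a
# helper could expand one dependency line into (dependency, when) pairs.


def expand_accel_dep(dep, when, accel=None, var=None, arch_name=None, arch_values=()):
    """Return (dependency spec, when condition) pairs for one dependency."""
    if accel is None or var is None:
        # No accelerator variant passed: the dependency is added as written,
        # so its own default (e.g. hypre's ~rocm) ends up in the concretized spec.
        return [(dep, when)]
    # Parent built without the accelerator: require it off in the dependency too.
    pairs = [(f"{dep}~{var}", f"{when} ~{accel}")]
    # Parent built with the accelerator: require it on, with a matching GPU target.
    for arch in arch_values:
        pairs.append(
            (f"{dep}+{var} {arch_name}={arch}", f"{when} +{accel} {arch_name}={arch}")
        )
    return pairs


if __name__ == "__main__":
    for dep, cond in expand_accel_dep(
        "hypre@2.30.0+superlu-dist+shared",
        "@1.0.0",
        accel="rocm",
        var="rocm",
        arch_name="amdgpu_target",
        arch_values=["gfx90a"],
    ):
        print(f"depends_on({dep!r}, when={cond!r})")
```

With `arch_values=["gfx90a"]` this yields one `~rocm` pair and one `+rocm amdgpu_target=gfx90a` pair; omitting `accel`/`var` returns the line unchanged, which corresponds to the pre-patch hypre lines.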

Attached build log: spack-build-out.txt

victorapm commented 11 months ago

We need to merge PR https://github.com/hypre-space/hypre/pull/869 into hypre's master to make hypre +rocm +superlu-dist work

Related: https://github.com/xsdk-project/xsdk-issues/issues/225

victorapm commented 11 months ago

Oops, I now notice that the issue here was hypre ~rocm +superlu-dist.
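This diagnosis comes from reading the `+`/`~` flags in the `spack spec` line from the first comment. As a small, hypothetical helper (not part of Spack), the sketch below extracts those boolean variants from one spec line; it assumes the flags are attached to the `name@version%compiler` token, as in the output quoted above, and it ignores `key=value` variants.

```python
import re


def boolean_variants(spec_line):
    """Return {variant: True for '+', False for '~'} from one `spack spec` line.

    Assumes the flags are attached to the first token containing '@'
    (name@version%compiler+a~b...); key=value variants are ignored.
    """
    match = re.search(r"\S*@\S*", spec_line)
    if match is None:
        raise ValueError("no name@version token found")
    return {name: sign == "+" for sign, name in re.findall(r"([+~])([\w-]+)", match.group(0))}


line = (
    "-       ^hypre@2.30.0%gcc@11.4.0~caliper~complex~cuda~debug+fortran~gptune"
    "~int64~internal-superlu~magma~mixedint+mpi~openmp~rocm+shared+superlu-dist"
    "~sycl~umpire~unified-memory build_system=autotools arch=linux-ubuntu22.04-zen4"
)
variants = boolean_variants(line)
print(variants["rocm"], variants["superlu-dist"])  # False True
```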

I'm wondering why SuperLU_dist is defining GPU stuff while this build isn't supposed to use GPUs:

/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/openmpi-4.1.6-726wgocxnrdcpvhnyctkrwk5brwxpble/bin/mpicc -O2  -fPIC -DHAVE_CONFIG_H -I.. -I../distributed_ls/Euclid -I. -I./.. -I./../blas -I./../lapack -I./../multivector -I./../utilities -I./../krylov -I./../seq_mv -I./../parcsr_mv -I./../distributed_matrix -I./../matrix_matrix -I./../IJ_mv -I./../parcsr_block_mv -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include           -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/openmpi-4.1.6-726wgocxnrdcpvhnyctkrwk5brwxpble/include -c par_nodal_systems.c
In file included from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/gpu_api_utils.h:26,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/superlu_defs.h:104,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/superlu_ddefs.h:37,
                 from par_amg.c:18:
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/gpu_wrapper.h:151:20: error: unknown type name 'hipEvent_t'
  151 | #define gpuEvent_t hipEvent_t
      |                    ^~~~~~~~~~
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/util_dist.h:128:5: note: in expansion of macro 'gpuEvent_t'
  128 |     gpuEvent_t *GemmStart, *GemmEnd, *ScatterEnd;  /*GPU events to store gemm and scatter's begin and end*/
      |     ^~~~~~~~~~
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/gpu_wrapper.h:151:20: error: unknown type name 'hipEvent_t'
  151 | #define gpuEvent_t hipEvent_t
      |                    ^~~~~~~~~~
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/util_dist.h:129:5: note: in expansion of macro 'gpuEvent_t'
  129 |     gpuEvent_t *ePCIeH2D;
      |     ^~~~~~~~~~
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/gpu_wrapper.h:151:20: error: unknown type name 'hipEvent_t'
  151 | #define gpuEvent_t hipEvent_t
      |                    ^~~~~~~~~~
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/util_dist.h:130:5: note: in expansion of macro 'gpuEvent_t'
  130 |     gpuEvent_t *ePCIeD2H_Start;
      |     ^~~~~~~~~~
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/gpu_wrapper.h:151:20: error: unknown type name 'hipEvent_t'
  151 | #define gpuEvent_t hipEvent_t
      |                    ^~~~~~~~~~
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/util_dist.h:131:5: note: in expansion of macro 'gpuEvent_t'
  131 |     gpuEvent_t *ePCIeD2H_End;
      |     ^~~~~~~~~~

balay commented 11 months ago

> Related: https://github.com/xsdk-project/xsdk-issues/issues/225

Ah, sorry for creating a duplicate issue. We can close this one [if needed]

> I'm wondering why SuperLU_dist is defining GPU stuff while this build isn't supposed to use GPUs:

Yeah, ideally we should have both superlu-dist+rocm and hypre+rocm.

The current combination of superlu-dist+rocm with hypre~rocm is a carry-over from the prior xSDK release.
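To make the mismatch concrete: in the combination described above, hypre is built without ROCm but compiles against superlu-dist headers that were configured for HIP. A hypothetical check over a dependency graph (plain dicts here, not Spack spec objects) could flag each edge where the dependency is `+rocm` but its dependent is not. This encodes the assumption from this thread, not a general Spack rule, and `rocm_mismatches` is an illustrative name.

```python
def rocm_mismatches(variants, edges):
    """Return (dependent, dependency) edges where only the dependency is +rocm.

    variants: {package: {variant: bool}}; edges: (dependent, dependency) pairs.
    Both are hypothetical plain-Python inputs, not Spack spec objects.
    """
    return [
        (parent, child)
        for parent, child in edges
        if variants[child].get("rocm", False) and not variants[parent].get("rocm", False)
    ]


# The combination discussed above: superlu-dist+rocm underneath hypre~rocm.
variants = {"hypre": {"rocm": False, "superlu-dist": True}, "superlu-dist": {"rocm": True}}
print(rocm_mismatches(variants, [("hypre", "superlu-dist")]))  # [('hypre', 'superlu-dist')]
```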

victorapm commented 11 months ago

> Ideally we should have both superlu-dist+rocm and hypre+rocm

Ah ok, I see! We will have that with the hypre PR I mentioned

balay commented 11 months ago

Ok - adding 'hypre+rocm' to 'xsdk@1.0.0' now [so this is the mode that will get tested].

balay commented 11 months ago

> We need to merge PR https://github.com/hypre-space/hypre/pull/869 into hypre's master to make hypre +rocm +superlu-dist work

@victorapm, I tried building with the above change [i.e., using the dsuperlu branch instead of the master branch], but I still see build failures:

Attached build log: spack-build-out.txt

In file included from dsuperlu.c:12:
In file included from ./dsuperlu.h:11:
In file included from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/superlu_ddefs.h:37:
In file included from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/superlu_defs.h:104:
In file included from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/gpu_api_utils.h:26:
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/gpu_wrapper.h:108:10: fatal error: 'hipblas.h' file not found
#include "hipblas.h"
         ^~~~~~~~~~~

Hm, maybe it's an issue with the hypre [or superlu-dist?] spec in Spack with respect to ROCm dependencies.

cc: @xiaoyeli @liuyangzhuan

victorapm commented 11 months ago

Thanks for the feedback!

hypre links to rocBLAS in the ROCm build. It seems superlu_dist needs hipBLAS? Maybe we could add it as an additional LDFLAGS entry?
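The `hipblas.h` failure quoted above is a preprocessor error ("file not found"), which happens before anything is linked. Extra LDFLAGS alone would not make the header visible; an include path (`-I`, typically via CPPFLAGS) would, and LDFLAGS/LIBS would matter only if hypre also had to link hipBLAS directly. A rough, hypothetical triage helper that maps such messages to a flag category might look like this; the patterns are a heuristic over common GCC, Clang and GNU ld messages, not an exhaustive list.

```python
import re

# Hypothetical triage helper: a heuristic over common GCC/Clang/ld messages.
_MISSING_HEADER = re.compile(
    r"fatal error: ['\"]?[\w./-]+['\"]?:? (?:file not found|No such file or directory)"
)
_LINK_ERROR = re.compile(r"undefined reference to|cannot find -l|Undefined symbols")


def flag_kind_for(diagnostic):
    """Guess which kind of build flag a compiler or linker diagnostic points to."""
    if _MISSING_HEADER.search(diagnostic):
        # Preprocessor stage: the header's directory is missing from the -I path.
        return "include path (CPPFLAGS, -I)"
    if _LINK_ERROR.search(diagnostic):
        # Link stage: a library or its directory is missing.
        return "library (LDFLAGS -L, LIBS -l)"
    return "unknown"


print(flag_kind_for("gpu_wrapper.h:108:10: fatal error: 'hipblas.h' file not found"))
# include path (CPPFLAGS, -I)
```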

balay commented 11 months ago

This gets the hypre+rocm build going for me

diff --git a/var/spack/repos/builtin/packages/hypre/package.py b/var/spack/repos/builtin/packages/hypre/package.py
index ede99fafcc..5364a3bb73 100644
--- a/var/spack/repos/builtin/packages/hypre/package.py
+++ b/var/spack/repos/builtin/packages/hypre/package.py
@@ -24,7 +24,7 @@ class Hypre(AutotoolsPackage, CudaPackage, ROCmPackage):
     test_requires_compiler = True

     version("develop", branch="master")
-    version("2.30.0", branch="master")
+    version("2.30.0", branch="dsuperlu")
     version("2.29.0", sha256="98b72115407a0e24dbaac70eccae0da3465f8f999318b2c9241631133f42d511")
     version("2.28.0", sha256="2eea68740cdbc0b49a5e428f06ad7af861d1e169ce6a12d2cf0aa2fc28c4a2ae")
     version("2.27.0", sha256="507a3d036bb1ac21a55685ae417d769dd02009bde7e09785d0ae7446b4ae1f98")
@@ -108,6 +108,7 @@ def patch(self):  # fix sequential compilation in 'src/seq_mv'
     depends_on("rocthrust", when="+rocm")
     depends_on("rocrand", when="+rocm")
     depends_on("rocprim", when="+rocm")
+    depends_on("hipblas", when="+rocm")
     depends_on("umpire", when="+umpire")
     depends_on("caliper", when="+caliper")

@@ -258,7 +259,7 @@ def configure_args(self):
                 configure_args.append("--disable-cub")

         if "+rocm" in spec:
-            rocm_pkgs = ["rocsparse", "rocthrust", "rocprim", "rocrand"]
+            rocm_pkgs = ["rocsparse", "rocthrust", "rocprim", "rocrand", "hipblas"]
             rocm_inc = ""
             for pkg in rocm_pkgs:
                 if "^" + pkg in spec:
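The diff is cut off at the `if "^" + pkg in spec:` guard, so the body of that loop is not shown here. As a hypothetical, Spack-free illustration of what such a loop is for, the sketch below collects one `-I` flag per ROCm dependency that is present. A plain dict of install prefixes stands in for Spack's spec, and `rocm_include_flags` is an illustrative name. Adding `hipblas` to the list is presumably what puts the directory holding `hipblas.h` on hypre's include path.

```python
import posixpath

# Hypothetical, Spack-free stand-in: `prefixes` maps package name -> install
# prefix, playing the role of the concretized spec in hypre's package.py.
ROCM_PKGS = ("rocsparse", "rocthrust", "rocprim", "rocrand", "hipblas")


def rocm_include_flags(prefixes, pkgs=ROCM_PKGS):
    """Return a space-separated -I flag for each ROCm package that is present."""
    flags = []
    for pkg in pkgs:
        # Mirrors the `if "^" + pkg in spec:` guard: skip absent dependencies.
        if pkg in prefixes:
            flags.append("-I" + posixpath.join(prefixes[pkg], "include"))
    return " ".join(flags)


print(rocm_include_flags({"rocsparse": "/opt/rocsparse", "hipblas": "/opt/hipblas"}))
# -I/opt/rocsparse/include -I/opt/hipblas/include
```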

balay commented 11 months ago

It looks like this issue will affect all packages that use superlu-dist.

@xiaoyeli, can this dependency [on hipblas.h] be removed from the public include files? [assuming it's primarily required in the superlu-dist sources]
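To check this question against an installed superlu-dist, one rough approach is to scan its public include directory for `#include` lines that name HIP/ROCm or CUDA headers. The sketch below is a hypothetical helper (`gpu_includes` is not part of any tool); it matches header names textually and does not evaluate `#if`/`#ifdef` guards, so it reports every GPU include that could be reached.

```python
import os
import re
import tempfile

# Heuristic pattern for #include lines that name HIP/ROCm or CUDA headers.
# Textual only: it does not evaluate #if/#ifdef guards around the include.
GPU_INCLUDE = re.compile(
    r'^\s*#\s*include\s*[<"]((?:hip|roc|cuda|cublas|cusparse)[\w/.-]*)[>"]', re.M
)


def gpu_includes(include_dir):
    """Map each header under include_dir to the GPU headers it #includes."""
    found = {}
    for root, _dirs, files in os.walk(include_dir):
        for name in files:
            if not name.endswith((".h", ".hpp")):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                hits = GPU_INCLUDE.findall(fh.read())
            if hits:
                found[os.path.relpath(path, include_dir)] = hits
    return found


if __name__ == "__main__":
    # Throwaway include dir with a header that mirrors the
    # `#include "hipblas.h"` line from the build log above.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "gpu_wrapper.h"), "w") as fh:
            fh.write('#include "hipblas.h"\n#include <stdio.h>\n')
        with open(os.path.join(tmp, "util_dist.h"), "w") as fh:
            fh.write("#include <stdlib.h>\n")
        print(gpu_includes(tmp))  # {'gpu_wrapper.h': ['hipblas.h']}
```

Pointed at the `include` directory of a superlu-dist install, any header it reports is one that downstream packages such as hypre or PETSc would pull in.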

victorapm commented 11 months ago

Maybe we need to incorporate this into hypre's configure/CMakeLists for folks not building it via spack. The spack fix wouldn't be necessary then (although much appreciated!)

victorapm commented 11 months ago

> can this dependency [on hipblas.h] be avoided from public include files

This sounds great if possible :)

balay commented 11 months ago

@xiaoyeli I get the following with petsc [this warning breaks the build]

stderr:
In file included from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/gpu_wrapper.h:108,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/gpu_api_utils.h:26,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/superlu_defs.h:104,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/superlu_ddefs.h:37,
                 from /tmp/petsc-xuiyjmih/config.headers/conftest.cc:3:
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/hipblas-5.5.1-inaizutkrwbopn5i5pvghmqlmuhj2ahw/include/hipblas.h:16:2: warning: #warning "This file is deprecated. Use the header file from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/hipblas-5.5.1-inaizutkrwbopn5i5pvghmqlmuhj2ahw/include/hipblas/hipblas.h by using #include <hipblas/hipblas.h>" [-Wcpp]
   16 | #warning "This file is deprecated. Use the header file from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/hipblas-5.5.1-inaizutkrwbopn5i5pvghmqlmuhj2ahw/include/hipblas/hipblas.h by using #include <hipblas/hipblas.h>"
      |  ^~~~~~~
Source:
#include "confdefs.h"
#include "conffix.h"
#include <superlu_ddefs.h>

I guess this should go into a "new" issue.
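The warning comes from the top-level `hipblas.h` of hipBLAS 5.5.1, which asks callers to use `<hipblas/hipblas.h>`; since this warning breaks the PETSc build, a consumer needs to pick the new path when it exists. One hypothetical build-side approach is to probe the installed prefix and emit the include line accordingly (GCC and Clang also support `__has_include` for the same check inside a header). The helper name below is illustrative, and this sketch is not the fix referenced later in the thread.

```python
import os
import tempfile


def hipblas_include_directive(prefix):
    """Pick the hipBLAS #include line for an installed hipBLAS prefix.

    Prefers <hipblas/hipblas.h>, the location the deprecation warning above
    asks for, and falls back to the old top-level "hipblas.h".
    """
    if os.path.isfile(os.path.join(prefix, "include", "hipblas", "hipblas.h")):
        return "#include <hipblas/hipblas.h>"
    if os.path.isfile(os.path.join(prefix, "include", "hipblas.h")):
        return '#include "hipblas.h"'
    raise FileNotFoundError(f"no hipblas.h under {prefix}/include")


if __name__ == "__main__":
    # Fake prefix with both layouts, as in the hipblas-5.5.1 install above.
    with tempfile.TemporaryDirectory() as prefix:
        os.makedirs(os.path.join(prefix, "include", "hipblas"))
        for rel in ("hipblas.h", os.path.join("hipblas", "hipblas.h")):
            open(os.path.join(prefix, "include", rel), "w").close()
        print(hipblas_include_directive(prefix))  # #include <hipblas/hipblas.h>
```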

liuyangzhuan commented 11 months ago

> @xiaoyeli I get the following with petsc [this warning breaks the build]
> [...]
> I guess this should go into a "new" issue..

Fixed in https://github.com/xsdk-project/xsdk-issues/issues/236#event-10648868613

balay commented 11 months ago

> This gets the hypre+rocm build going for me

The updated change (for hypre+rocm with superlu-dist+rocm) is now at https://github.com/spack/spack/pull/40980.