xsdk-project / xsdk-issues

A repository under which GitHub issues not related to a specific xSDK repo can be filed.
trilinos failure with superlu-dist with xsdk+rocm #235

Open balay opened 9 months ago

balay commented 9 months ago

spack-build-out.txt


In file included from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/gpu_wrapper.h:106,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/gpu_api_utils.h:26,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/superlu_defs.h:104,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-r3bbhy5cr5g4b2xqq3slhqf7cy63ei4u/include/superlu_ddefs.h:37,
                 from /scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-t5tcua5g5rbji3nfnwty3l3rdjltrtga/spack-src/packages/amesos/src/Amesos_Superludist.cpp:38:
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/hip-5.5.1-36grozs3lkqmnph77fzw7tfbykoccwci/include/hip/hip_runtime_api.h:7337:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
 7337 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
      |  ^~~~~
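
The `#error` above encodes an exactly-one condition on two preprocessor macros. As a rough sketch, the same condition can be checked over a list of compiler flags before a build is started. `hip_platform_ok` is a hypothetical helper, not part of HIP or Spack, and it only understands the `-DNAME` and `-DNAME=value` forms.

```python
HIP_PLATFORM_MACROS = ("__HIP_PLATFORM_AMD__", "__HIP_PLATFORM_NVIDIA__")


def hip_platform_ok(flags):
    """Return True if exactly one HIP platform macro is defined via -D flags."""
    defined = set()
    for flag in flags:
        if flag.startswith("-D"):
            # Strip the "-D" prefix and any "=value" suffix to get the macro name.
            name = flag[2:].split("=", 1)[0]
            if name in HIP_PLATFORM_MACROS:
                defined.add(name)
    return len(defined) == 1
```

With no platform macro on the command line, as in the failing Amesos compile, the check returns False.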

cc: @cgcgcg @lucbv

balay commented 9 months ago

Here it's building trilinos~rocm

balay@petsc-gpu-02:/scratch/balay/spack$ ./bin/spack spec xsdk+rocm amdgpu_target=gfx90a |grep trilinos@
 -       ^trilinos@14.4.0%gcc@11.4.0~adelus~adios2+amesos+amesos2+anasazi+aztec~basker+belos+boost~chaco~complex~cuda~cuda_rdc~debug~dtk+epetra+epetraext~epetraextbtf~epetraextexperimental~epetraextgraphreorderings~exodus+explicit_template_instantiation~float+fortran~gtest+hdf5+hypre+ifpack+ifpack2~intrepid+intrepid2~ipo~isorropia+kokkos~mesquite~minitensor+ml+mpi+muelu~mumps+nox~openmp~panzer~phalanx~piro~python~rocm~rocm_rdc~rol~rythmos+sacado~scorec+shards+shared~shylu~stk~stokhos+stratimikos~strumpack~suite-sparse~superlu+superlu-dist~teko~tempus~test+thyra+tpetra~trilinoscouplings~wrapper~x11+zoltan+zoltan2 build_system=cmake build_type=Release cxxstd=17 generator=make gotype=int arch=linux-ubuntu22.04-zen4

cgcgcg commented 9 months ago

While this probably should be fixed, is there a reason for building both old (host-only) and new solver stacks for a HIP platform?

liuyangzhuan commented 8 months ago

@balay Can you try the latest commit https://github.com/xiaoyeli/superlu_dist/commit/0c9ea165c25da7cc432623b254eddfcfea7179e7 to see if this is fixed?

xiaoyeli commented 8 months ago

@balay Is this good now? If so, we can close it.

balay commented 8 months ago

I don't see the above error anymore - but trilinos build continues to fail.

 In file included from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/hip-5.6.1-olxkmdjitey5gszct57gyagmg4kg33xh/include/hip/amd_detail/amd_channel_descriptor.h:28,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/hip-5.6.1-olxkmdjitey5gszct57gyagmg4kg33xh/include/hip/channel_descriptor.h:32,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/hip-5.6.1-olxkmdjitey5gszct57gyagmg4kg33xh/include/hip/texture_types.h:38,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/hip-5.6.1-olxkmdjitey5gszct57gyagmg4kg33xh/include/hip/hip_runtime_api.h:489,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-gzzebo7ca4h6q7pjqnzv2elmjkfy66i6/include/gpu_wrapper.h:110,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-gzzebo7ca4h6q7pjqnzv2elmjkfy66i6/include/gpu_api_utils.h:26,
                 from /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/superlu-dist-8.2.0-gzzebo7ca4h6q7pjqnzv2elmjkfy66i6/include/superlu_defs.h:104,
                 from /scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-mt4rxovcrszkydmjlmmmzrahyjf6j42s/spack-src/packages/amesos2/src/Amesos2_Superludist_TypeMap.hpp:88,
                 from /scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-mt4rxovcrszkydmjlmmmzrahyjf6j42s/spack-src/packages/amesos2/src/Amesos2_Superludist_FunctionMap.hpp:63,
                 from /scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-mt4rxovcrszkydmjlmmmzrahyjf6j42s/spack-src/packages/amesos2/src/Amesos2_Superludist_decl.hpp:58,
                 from /scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-mt4rxovcrszkydmjlmmmzrahyjf6j42s/spack-src/packages/amesos2/src/Amesos2_Superludist.hpp:47,
                 from /scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-mt4rxovcrszkydmjlmmmzrahyjf6j42s/spack-src/packages/amesos2/src/Amesos2_Factory.hpp:108,
                 from /scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-mt4rxovcrszkydmjlmmmzrahyjf6j42s/spack-src/packages/amesos2/src/Amesos2_Factory.cpp:44:
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen4/gcc-11.4.0/hip-5.6.1-olxkmdjitey5gszct57gyagmg4kg33xh/include/hip/amd_detail/amd_hip_vector_types.h:144:5: error: template with C linkage
  144 |     template<typename T, unsigned int n> struct HIP_vector_base;
      |     ^~~~~~~~
In file included from /scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-mt4rxovcrszkydmjlmmmzrahyjf6j42s/spack-src/packages/amesos2/src/Amesos2_Superludist_FunctionMap.hpp:63,
                 from /scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-mt4rxovcrszkydmjlmmmzrahyjf6j42s/spack-src/packages/amesos2/src/Amesos2_Superludist_decl.hpp:58,
                 from /scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-mt4rxovcrszkydmjlmmmzrahyjf6j42s/spack-src/packages/amesos2/src/Amesos2_Superludist.hpp:47,
                 from /scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-mt4rxovcrszkydmjlmmmzrahyjf6j42s/spack-src/packages/amesos2/src/Amesos2_Factory.hpp:108,
                 from /scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-mt4rxovcrszkydmjlmmmzrahyjf6j42s/spack-src/packages/amesos2/src/Amesos2_Factory.cpp:44:
/scratch/balay/spack/spack-stage/spack-stage-trilinos-14.4.0-mt4rxovcrszkydmjlmmmzrahyjf6j42s/spack-src/packages/amesos2/src/Amesos2_Superludist_TypeMap.hpp:75:1: note: 'extern "C"' linkage started here
   75 | extern "C" {
      | ^~~~~~~~~~

spack-build-out.txt

xiaoyeli commented 8 months ago

@cgcgcg Can you or someone from Trilinos team take a look at this place? Why does it complain about "extern ..."?

https://github.com/trilinos/Trilinos/blob/c4f035ce9aab54e50654c9a400f2b4c041331670/packages/amesos2/src/Amesos2_Superludist_TypeMap.hpp#L75

cgcgcg commented 8 months ago

@srajama1 @ndellingwood Can you have a look at this Amesos2 issue?

balay commented 8 months ago

@xiaoyeli for this xsdk release - I'm continuing to disable cuda and rocm in superlu-dist and trilinos (same as the last xsdk release) - to avoid these issues.

And the builds are currently working in this mode.

https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines/1064707391

xiaoyeli commented 8 months ago

@balay Two questions: 1) Are you saying that this is a long-standing problem, and also appeared during the testing from last release a year ago? 2) Can you enable cuda and rocm for the other packages that call superlu?

cc: @srajama1 @ndellingwood

balay commented 8 months ago

> Are you saying that this is a long-standing problem, and also appeared during the testing from last release a year ago?

@xiaoyeli I'm not sure if the issues were the same - but we could not enable superlu-dist+cuda [and trilinos+cuda] in past release cycles either.

Here is one prior issue that was filed

https://github.com/xsdk-project/xsdk-issues/issues/162

> Can you enable cuda and rocm for the other packages that call superlu?

You mean with superlu+cuda, petsc+cuda - without trilinos? Will try. Right now the build uses: petsc+superlu-dist+cuda, superlu-dist~cuda trilinos~cuda

balay commented 8 months ago

ref: ./bin/spack install -j64 xsdk@1.0.0+rocm~trilinos amdgpu_target=gfx90a

I'm seeing failures with:

cc: @jthies @v-dobrev

balay commented 8 months ago

ref: ./bin/spack install -j24 xsdk@1.0.0%gcc@11.3.1~trilinos +cuda cuda_arch=70 ^cuda@11.7.1 ^openmpi

xsdk+cuda build with superlu-dist+cuda is successful when trilinos is disabled.

balay@xsdk:/data/balay/spack>./bin/spack find -v | grep superlu-dist
hypre@2.30.0~caliper~complex+cuda~debug+fortran~gptune~int64~internal-superlu~magma~mixedint+mpi~openmp~rocm+shared+superlu-dist~sycl~umpire~unified-memory build_system=autotools cuda_arch=70
mfem@4.6.0~amgx~conduit+cuda~debug+examples~exceptions~fms~ginkgo~gnutls~gslib~hiop~lapack~libceed~libunwind+metis+miniapps~mpfr+mpi~netcdf~occa~openmp+petsc~pumi~raja~rocm+shared~slepc+static~strumpack~suite-sparse+sundials+superlu-dist~threadsafe~umpire+zlib build_system=generic cuda_arch=70 patches=718f073 timer=auto
petsc@3.20.1~X~batch~cgns~complex+cuda~debug+double~exodusii~fftw+fortran~giflib+hdf5~hpddm~hwloc+hypre~int64~jpeg~knl~kokkos~libpng~libyaml~memkind+metis~mkl-pardiso~mmg~moab~mpfr+mpi~mumps~openmp~p4est~parmmg~ptscotch~random123~rocm~saws~scalapack+shared~strumpack~suite-sparse+superlu-dist~sycl~tetgen~trilinos~valgrind build_system=generic clanguage=C cuda_arch=70 memalign=none
sundials@6.6.2+ARKODE+CVODE+CVODES+IDA+IDAS+KINSOL+cuda+examples+examples-install~f2003~fcmix+generic-math+ginkgo+hypre~int64~ipo~klu~kokkos~kokkos-kernels~lapack+magma~monitoring+mpi~openmp+petsc~profiling~pthread~raja~rocm+shared+static+superlu-dist~superlu-mt~sycl~trilinos build_system=cmake build_type=Release cstd=99 cuda_arch=70 cxxstd=14 generator=make logging-level=0 logging-mpi=OFF precision=double
superlu-dist@8.2.0+cuda~int64~ipo~openmp+parmetis~rocm+shared build_system=cmake build_type=Release cuda_arch=70 generator=make

balay commented 8 months ago

Perhaps we should set trilinos~superlu-dist for the GPU builds... will check..

balay commented 8 months ago

wrt mfem and superlu-dist+rocm - I'm testing with:

diff --git a/var/spack/repos/builtin/packages/mfem/package.py b/var/spack/repos/builtin/packages/mfem/package.py
index f4821e63c2..75eeda7b1f 100644
--- a/var/spack/repos/builtin/packages/mfem/package.py
+++ b/var/spack/repos/builtin/packages/mfem/package.py
@@ -967,6 +967,9 @@ def find_optional_library(name, prefix):
             if "^rocthrust" in spec and not spec["hip"].external:
                 # petsc+rocm needs the rocthrust header path
                 hip_headers += spec["rocthrust"].headers
+            if "^hipblas" in spec and not spec["hip"].external:
+                # superlu-dist+rocm needs the hipblas header path
+                hip_headers += spec["hipblas"].headers
             if "%cce" in spec:
                 # We assume the proper Cray CCE module (cce) is loaded:
                 craylibs_path = env["CRAYLIBS_" + machine().upper()]
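
The pattern in this patch can be sketched outside Spack: when hip is not an external package, each optional ROCm dependency present in the build contributes its own header directory. The function and the dictionary layout below are hypothetical stand-ins for the `spec[...]` lookups, not mfem's or Spack's actual code, and the comment about separate prefixes is an assumption about why the `external` check is there.

```python
def hip_header_dirs(include_dirs, hip_external, optional=("rocthrust", "hipblas")):
    """Collect the include directories needed to compile HIP code.

    include_dirs maps a package name to its header directory (hypothetical layout).
    """
    dirs = [include_dirs["hip"]]
    if not hip_external:
        # Assumption: Spack-built ROCm packages live in separate prefixes,
        # so their headers are not found next to hip's own headers.
        dirs += [include_dirs[name] for name in optional if name in include_dirs]
    return dirs
```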

balay commented 8 months ago

> Perhaps we should set trilinos~superlu-dist for the GPU builds... will check..

This build is working, so I will update xsdk-1.0.0 with these changes.

https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines/1066199514

The mfem change for superlu-dist+rocm is at https://github.com/spack/spack/pull/40981

cgcgcg commented 8 months ago

@balay Just to confirm: The issue is building SuperLU_dist and Trilinos with its SuperLU_dist interface enabled on Cuda&HIP? But when you disable the interface in Trilinos then everything works? So presumably something is broken in our interface?

balay commented 8 months ago

> @balay Just to confirm: The issue is building SuperLU_dist and Trilinos with its SuperLU_dist interface enabled on Cuda&HIP? But when you disable the interface in Trilinos then everything works? So presumably something is broken in our interface?

Yes. A likely spack command to reproduce (I'm checking via xsdk, which enables many of the trilinos variants):

spack install trilinos+superlu-dist ^superlu-dist+cuda

This is irrespective of trilinos+cuda or trilinos~cuda

So right now I'm using [with xsdk]

spack install xsdk+cuda ^trilinos~superlu-dist~cuda ^superlu-dist+cuda

[and similar for rocm]
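
For reference, both workaround specs can be generated from one template. This is only a sketch: the helper name `xsdk_gpu_workaround_spec` is hypothetical, the rocm form is inferred by analogy with the cuda command above, and a real rocm install would also need an `amdgpu_target` setting as in the earlier commands.

```python
def xsdk_gpu_workaround_spec(gpu):
    """Build an xsdk spec that keeps superlu-dist on the GPU but
    turns off the SuperLU_DIST interface (and GPU support) in trilinos."""
    if gpu not in ("cuda", "rocm"):
        raise ValueError("gpu must be 'cuda' or 'rocm'")
    return f"xsdk+{gpu} ^trilinos~superlu-dist~{gpu} ^superlu-dist+{gpu}"
```

The resulting string is what would be passed to `spack install`.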

xiaoyeli commented 8 months ago

@cgcgcg If you go back to this thread, the complaint was at this line of the interface code:

https://github.com/trilinos/Trilinos/blob/c4f035ce9aab54e50654c9a400f2b4c041331670/packages/amesos2/src/Amesos2_Superludist_TypeMap.hpp#L75

It seems it is a long-standing problem, and is easy to fix. Can you ask someone on the Trilinos team to take a look and fix it?

srajama1 commented 8 months ago

@xiaoyeli We are looking at this.

Looks like SuperLU-Dist started using extern C within your headers so we don't have to do it. Can you tell us in which version this happened? We might have to support older versions on some systems, so we can check version numbers to decide whether to include the "extern C" or not.

srajama1 commented 8 months ago

Git blame says "extern C" came to SuperLU-Dist 6 years ago

https://github.com/xiaoyeli/superlu_dist/commit/949ea759034b3d45d02b5fe0e64b9bbbfefb8cb0

I hope we don't need to support versions older than that :)

xiaoyeli commented 8 months ago

@srajama1 You can remove the 'extern "C"' from your side. I am surprised that we didn't have it in versions older than 6 years, because without this protection the C++ compiler mangles the names of our C functions, and our code cannot be linked from a C++ program.

srajama1 commented 8 months ago

We will remove it. Thanks @xiaoyeli !

cgcgcg commented 7 months ago

@balay A fix for this is now on Trilinos develop. Is this covered by a build that we can check? Or do we need to wait for the next Trilinos release?

balay commented 7 months ago

I'm still seeing failures with trilinos-develop (listed in my build as 14.4.1) when enabling rocm or cuda.

Attaching logs.

spack-build-out.rocm.txt spack-build-out.cuda.txt

cgcgcg commented 7 months ago

Hm, the cuda build says

Explicitly disabled external packages/TPLs on input (by user or by default):  CUDA [...]

What options are set for the spack build?

balay commented 7 months ago

I reran the build - making sure trilinos+cuda is enabled.

./bin/spack spec xsdk@1.0.0%gcc@11.4.0+cuda cuda_arch=80 ^cuda@11.7.0
...
[+]      ^superlu-dist@8.2.1%gcc@11.4.0+cuda~int64~ipo~openmp+parmetis~rocm+shared build_system=cmake build_type=Release cuda_arch=80 generator=make arch=linux-ubuntu22.04-zen3
 -       ^trilinos@15.0.1%gcc@11.4.0~adelus~adios2+amesos+amesos2+anasazi+aztec~basker+belos+boost~chaco~complex+cuda~cuda_rdc~debug~dtk+epetra+epetraext~epetraextbtf~epetraextexperimental~epetraextgraphreorderings~exodus+explicit_template_instantiation~float+fortran~gtest+hdf5+hypre+ifpack+ifpack2~intrepid+intrepid2~ipo~isorropia+kokkos~mesquite~minitensor+ml+mpi+muelu~mumps+nox~openmp~panzer~phalanx~piro~python~rocm~rocm_rdc~rol~rythmos+sacado~scorec+shards+shared~shylu~stk~stokhos+stratimikos~strumpack~suite-sparse~superlu+superlu-dist~teko~tempus~test+thyra+tpetra~trilinoscouplings~uvm+wrapper~x11+zoltan+zoltan2 build_system=cmake build_type=Release cuda_arch=80 cxxstd=17 generator=make gotype=int arch=linux-ubuntu22.04-zen3
...

This is using latest develop with the following change [to use trilinos develop branch]

diff --git a/var/spack/repos/builtin/packages/trilinos/package.py b/var/spack/repos/builtin/packages/trilinos/package.py
index ef335a2728..4d33265a03 100644
--- a/var/spack/repos/builtin/packages/trilinos/package.py
+++ b/var/spack/repos/builtin/packages/trilinos/package.py
@@ -42,6 +42,7 @@ class Trilinos(CMakePackage, CudaPackage, ROCmPackage):

     version("master", branch="master")
     version("develop", branch="develop")
+    version("15.0.1", branch="develop")
     version("15.0.0", sha256="5651f1f967217a807f2c418a73b7e649532824dbf2742fa517951d6cc11518fb")
     version("14.4.0", sha256="8e7d881cf6677aa062f7bfea8baa1e52e8956aa575d6a4f90f2b6f032632d4c6")
     version("14.2.0", sha256="c96606e5cd7fc9d25b9dc20719cd388658520d7cbbd2b4de77a118440d1e0ccb")
diff --git a/var/spack/repos/builtin/packages/xsdk/package.py b/var/spack/repos/builtin/packages/xsdk/package.py
index 6b3ec2c126..2697fad1d1 100644
--- a/var/spack/repos/builtin/packages/xsdk/package.py
+++ b/var/spack/repos/builtin/packages/xsdk/package.py
@@ -150,7 +150,6 @@ class Xsdk(BundlePackage, CudaPackage, ROCmPackage):
     xsdk_depends_on("superlu-dist@8.1.2", when="@0.8.0")
     xsdk_depends_on("superlu-dist@7.1.1", when="@0.7.0")

-    xsdk_depends_on("trilinos +superlu-dist", when="@1.0.0: +trilinos ~cuda ~rocm")
     xsdk_depends_on(
         "trilinos@develop+hypre+hdf5~mumps+boost"
         + "~suite-sparse+tpetra+nox+ifpack2+zoltan+zoltan2+amesos2"
@@ -159,11 +158,12 @@ class Xsdk(BundlePackage, CudaPackage, ROCmPackage):
         when="@develop +trilinos",
     )
     xsdk_depends_on(
-        "trilinos@14.4.0+hypre+hdf5~mumps+boost"
+        "trilinos@15.0.1+hypre+hdf5~mumps+boost"
         + "~suite-sparse+tpetra+nox+ifpack2+zoltan+zoltan2+amesos2"
-        + "~exodus~dtk+intrepid2+shards+stratimikos gotype=int"
-        + " cxxstd=17",
+        + "~exodus~dtk+intrepid2+shards+stratimikos+superlu-dist gotype=int"
+        + " cxxstd=17 ",
         when="@1.0.0 +trilinos",
+        cuda_var="cuda", rocm_var="rocm",
     )
     xsdk_depends_on(
         "trilinos@13.4.1+hypre+superlu-dist+hdf5~mumps+boost"

using:

./bin/spack install -j32  xsdk@1.0.0%gcc@11.4.0+cuda cuda_arch=80 ^cuda@11.7.0

I get

-- Check for working CXX compiler: /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.4.0/mpich-4.1.2-si2g2ajcuunt2u6oqangmpn7rq4rbqa5/bin/mpic++ - broken
CMake Error at /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.4.0/cmake-3.27.9-quq3pv7mwictuuvg7m3dm2tdet3kkjor/share/cmake-3.27/Modules/CMakeTestCXXCompiler.cmake:60 (message):
  The C++ compiler

    "/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.4.0/mpich-4.1.2-si2g2ajcuunt2u6oqangmpn7rq4rbqa5/bin/mpic++"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: '/scratch/balay/spack/spack-stage/spack-stage-trilinos-15.0.1-ejmgblrzy72khdxawps72brvshfr4g4b/spack-build-ejmgblr/CMakeFiles/CMakeScratch/TryCompile-kC1Bjz'

    Run Build Command(s): /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.4.0/cmake-3.27.9-quq3pv7mwictuuvg7m3dm2tdet3kkjor/bin/cmake -E env VERBOSE=1 /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.4.0/gmake-4.4.1-5quzins5c2jqhwgkxwpndhyivnrfgxm2/bin/gmake -f Makefile cmTC_a8f67/fast
    /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.4.0/gmake-4.4.1-5quzins5c2jqhwgkxwpndhyivnrfgxm2/bin/gmake  -f CMakeFiles/cmTC_a8f67.dir/build.make CMakeFiles/cmTC_a8f67.dir/build
    gmake[1]: Entering directory '/scratch/balay/spack/spack-stage/spack-stage-trilinos-15.0.1-ejmgblrzy72khdxawps72brvshfr4g4b/spack-build-ejmgblr/CMakeFiles/CMakeScratch/TryCompile-kC1Bjz'
    Building CXX object CMakeFiles/cmTC_a8f67.dir/testCXXCompiler.cxx.o
    /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.4.0/mpich-4.1.2-si2g2ajcuunt2u6oqangmpn7rq4rbqa5/bin/mpic++    -o CMakeFiles/cmTC_a8f67.dir/testCXXCompiler.cxx.o -c /scratch/balay/spack/spack-stage/spack-stage-trilinos-15.0.1-ejmgblrzy72khdxawps72brvshfr4g4b/spack-build-ejmgblr/CMakeFiles/CMakeScratch/TryCompile-kC1Bjz/testCXXCompiler.cxx
    g++: error: unrecognized command-line option '--expt-extended-lambda'

spack-build-out.txt

cgcgcg commented 7 months ago

@balay Can you try with +wrapper as well? That should enable the nvcc_wrapper.

balay commented 7 months ago

@cgcgcg the above build has +wrapper [as listed in the spack spec output in the previous message]

And I see:

            if "+wrapper" in spec:
                flags.append("--expt-extended-lambda")

i.e., --expt-extended-lambda is added only for the +wrapper build
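
The CMake log above shows the problem with adding the flag based on the variant alone: `--expt-extended-lambda` is an nvcc option, and here it reached plain g++ through mpic++. A minimal sketch of a guard that passes nvcc-only options only when the compiler really is nvcc or nvcc_wrapper follows. It is plain Python, not Spack's flag_handler API, and the names `NVCC_ONLY_FLAGS` and `cxx_flags_for` are hypothetical.

```python
import os

# Hypothetical list of options that only nvcc (or nvcc_wrapper) understands.
NVCC_ONLY_FLAGS = ("--expt-extended-lambda",)


def cxx_flags_for(compiler_path, wanted_flags):
    """Return wanted_flags, dropping nvcc-only options unless the
    compiler executable is named nvcc or nvcc_wrapper."""
    name = os.path.basename(compiler_path)
    is_nvcc = name in ("nvcc", "nvcc_wrapper")
    return [f for f in wanted_flags if is_nvcc or f not in NVCC_ONLY_FLAGS]
```

Checking the executable name is a simplification; an MPI wrapper that itself forwards to nvcc_wrapper would need a different test.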

balay commented 7 months ago

With this additional change:

diff --git a/var/spack/repos/builtin/packages/trilinos/package.py b/var/spack/repos/builtin/packages/trilinos/package.py
index 4d33265a03..e2dd097d09 100644
--- a/var/spack/repos/builtin/packages/trilinos/package.py
+++ b/var/spack/repos/builtin/packages/trilinos/package.py
@@ -510,8 +510,6 @@ def flag_handler(self, name, flags):
             if "+stk%intel" in spec:
                 # Workaround for Intel compiler segfaults with STK and IPO
                 flags.append("-no-ipo")
-            if "+wrapper" in spec:
-                flags.append("--expt-extended-lambda")
         elif name == "ldflags":
             if spec.satisfies("%cce@:14"):
                 flags.append("-fuse-ld=gold")

the trilinos build is successful (with cuda, superlu-dist)

==> Installing trilinos-15.0.1-aouk7pqbmlfktkzd4ffxs3iimuajdyug [84/100]
==> No binary for trilinos-15.0.1-aouk7pqbmlfktkzd4ffxs3iimuajdyug found: installing from source
==> No patches needed for trilinos
==> trilinos: Executing phase: 'cmake'
==> trilinos: Executing phase: 'build'
==> trilinos: Executing phase: 'install'
==> trilinos: Successfully installed trilinos-15.0.1-aouk7pqbmlfktkzd4ffxs3iimuajdyug
  Stage: 59.09s.  Cmake: 46.24s.  Build: 21m 57.47s.  Install: 9.52s.  Post-install: 2.28s.  Total: 23m 55.63s
[+] /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.4.0/trilinos-15.0.1-aouk7pqbmlfktkzd4ffxs3iimuajdyug

Now there are failures in dtk, phist, sundials

build.log.txt

balay commented 7 months ago

trilinos+rocm [with superlu_dist] build also goes through with the attached changes.

./bin/spack spec xsdk@1.0.0+rocm amdgpu_target=gfx90a

[+]      ^superlu-dist@8.2.1%gcc@11.4.0~cuda~int64~ipo~openmp+parmetis+rocm+shared amdgpu_target=gfx90a build_system=cmake build_type=Release generator=make arch=linux-ubuntu22.04-zen4
[+]      ^trilinos@15.0.1%gcc@11.4.0~adelus~adios2+amesos+amesos2+anasazi+aztec~basker+belos+boost~chaco~complex~cuda~cuda_rdc~debug~dtk+epetra+epetraext~epetraextbtf~epetraextexperimental~epetraextgraphreorderings~exodus+explicit_template_instantiation~float+fortran~gtest+hdf5+hypre+ifpack+ifpack2~intrepid+intrepid2~ipo~isorropia+kokkos~mesquite~minitensor+ml+mpi+muelu~mumps+nox~openmp~panzer~phalanx~piro~python+rocm~rocm_rdc~rol~rythmos+sacado~scorec+shards+shared~shylu~stk~stokhos+stratimikos~strumpack~suite-sparse~superlu+superlu-dist~teko~tempus~test+thyra+tpetra~trilinoscouplings~wrapper~x11+zoltan+zoltan2 amdgpu_target=gfx90a build_system=cmake build_type=Release cxxstd=17 generator=make gotype=int  arch=linux-ubuntu22.04-zen4
./bin/spack install -j64 xsdk@1.0.0+rocm amdgpu_target=gfx90a

trilinos-cuda-rocm.patch.txt

And subsequent dtk, phist, sundials failures

build-rocm.log.txt

iyamazaki commented 3 months ago

@balay, we were wondering whether Trilinos PR 12524 has resolved this issue with Amesos2. Please let us know if we can help!