xsdk-project / xsdk-issues

A repository under which GitHub issues not related to a specific xSDK repo can be filed.

kokkos+wrapper (a dependency of "petsc+kokkos") breaks many packages [petsc, slepc, sundials, and perhaps others] #192

Closed balay closed 1 year ago

balay commented 2 years ago
    kokkos: remove +wrapper dependency - this likely breaks many packages

    [petsc, slepc have fixes - but now its sundials - and perhaps other dependent pkgs]

    See build log for details:
      /data/balay/spack/spack-stage/spack-stage-sundials-6.4.0-p3qquftw3tfbx2gjie6axd2vpgnilgfl/spack-build-out.txt

    ==> Warning: Skipping build of xsdk-0.8.0-ncr5kdsopw35sdvhgjgxfy3ouhhx7yws since sundials-6.4.0-p3qquftw3tfbx2gjie6axd2vpgnilgfl failed
    ==> Warning: Skipping build of dealii-9.4.0-u436c3jft2frh2mdrtkveaxz3djz36uj since sundials-6.4.0-p3qquftw3tfbx2gjie6axd2vpgnilgfl failed
    ==> Warning: Skipping build of mfem-4.5.0-2dxq65vwwk3nxwpyqgesbcs3adzkxqb5 since sundials-6.4.0-p3qquftw3tfbx2gjie6axd2vpgnilgfl failed
    ==> Warning: Skipping build of amrex-22.09-qaa4ktveapsroaaggrjhw27ilwitourc since sundials-6.4.0-p3qquftw3tfbx2gjie6axd2vpgnilgfl failed

For example, check https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3193169610

For now disabling this requirement (of kokkos+wrapper for gcc) in kokkos/package.py - this build goes through fine for me with gcc - so I don't know why this code exists in spack..

And I'm not sure if we can upstream this change

https://gitlab.com/xsdk-project/spack-xsdk/-/commit/b19632e72db0c33bcb83d95323f500a94b428e49

Note: PETSc already has a workaround for kokkos+wrapper, and I've added this fix for slepc. Similar fixes are likely needed for other pkgs (for kokkos+wrapper to work).

https://gitlab.com/xsdk-project/spack-xsdk/-/commit/9afe4953d25ceda7ef8a16e7198b820639bc246b

bangerth commented 2 years ago

@tamiko FYI

masterleinad commented 2 years ago

For now disabling this requirement (of kokkos+wrapper for gcc) in kokkos/package.py - this build goes through fine for me with gcc - so I don't know why this code exists in spack..

For Kokkos to use CUDA, all of it (and all downstream compilation units that use Kokkos headers) needs to be compiled with a CUDA-capable compiler, not just files with specific extensions. Thus, when nvcc is the CUDA compiler, Kokkos' nvcc_wrapper script must be used as the host compiler. Clang can compile CUDA code natively, so the wrapper is not necessary there.

masterleinad commented 2 years ago

On the other hand, Kokkos has a mechanism in place to direct all compilation to nvcc_wrapper internally so that the host compiler specified in CMake can be arbitrary. This also holds for all downstream code that uses Kokkos via CMake and target_link_libraries.

balay commented 2 years ago

For Kokkos to use CUDA, all of it (and all downstream compilation units that use Kokkos headers) needs to be compiled with a CUDA-capable compiler, not just files with specific extensions.

Right now this is basically compiling all .c sources with nvcc (via mpicc) - breaking builds. Check slepc errors at https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3193169610

>> 605    /nfs/apps/spacks/2022-02-10/opt/spack/linux-centos7-x86_64/gcc-7.3.
            0/cuda-11.6.0-tf6htqx3zi5j32km2bq6jdi44tzedbbb/include/cuda_bf16.hp
            p(373): error: calling a __device__ function("__float_as_uint") fro
            m a __host__ __device__ function("__internal_float2bfloat16") is no
            t allowed

etc..

    636    nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37'
             architectures are deprecated, and may be removed in a future relea
            se (Use -Wno-deprecated-gpu-targets to suppress warning).
     637    <command-line>: warning: "__CUDA_ARCH_LIST__" redefined
     638    <command-line>: note: this is the location of the previous definiti
            on
  >> 639    /nfs/apps/spacks/2022-02-10/opt/spack/linux-centos7-x86_64/gcc-7.3.
            0/cuda-11.6.0-tf6htqx3zi5j32km2bq6jdi44tzedbbb/include/cuda_bf16.hp
            p(373): error: calling a __device__ function("__float_as_uint") fro
            m a __host__ __device__ function("__internal_float2bfloat16") is no
            t allowed

And PETSc/kokkos code doesn't need these wrappers [they break PETSc build similarly]

balay commented 2 years ago

For now disabling this requirement (of kokkos+wrapper for gcc) in kokkos/package.py - this build goes through fine for me with gcc

Ah - I thought this worked for me [but I guess I must have some bug in my testing process]

https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3193963276

kokkos@3.6.00~aggressive_vectorization~compiler_warnings+cuda~cuda_constexpr+cuda_lambda~cuda_ldg_intrinsic~cuda_relocatable_device_code~cuda_uvm~debug~debug_bounds_check~debug_dualview_modify_check~deprecated_code~examples~explicit_instantiation~hpx~hpx_async_dispatch~hwloc~ipo~memkind~numactl~openmp~openmptarget~pic+profiling~profiling_load_print~pthread~qthread~rocm+serial+shared~sycl~tests~tuning~wrapper build_type=RelWithDebInfo cuda_arch=70 intel_gpu_arch=none std=14

i.e. kokkos~wrapper is used in this build. petsc/slepc build fine here, but sundials is failing.

    265    -- Finding PETSC using PETSC_DIR
     266    -- Recognized PETSC install with single library for all packages
     267    -- PETSC could not be used, maybe the install is broken.
  >> 268    CMake Error at /nfs/apps/spacks/2022-02-10/opt/spack/linux-centos7-
            x86_64/gcc-7.3.0/cmake-3.22.2-rdvpr5odvqzoanneyoq5u4qqufsnfof4/shar
            e/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (messa
            ge):
     269      PETSC could not be found.  (missing: PETSC_EXECUTABLE_RUNS) (foun
            d version
     270      "3.18.0")
     271    Call Stack (most recent call first):
     272      /nfs/apps/spacks/2022-02-10/opt/spack/linux-centos7-x86_64/gcc-7.
            3.0/cmake-3.22.2-rdvpr5odvqzoanneyoq5u4qqufsnfof4/share/cmake-3.22/
            Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MES
            SAGE)
     273      cmake/tpl/FindPETSC.cmake:738 (find_package_handle_standard_args)
     274      cmake/tpl/SundialsPETSC.cmake:52 (find_package)

spack-build-out(10).txt

perhaps @balos1 can take a look

masterleinad commented 2 years ago

Right now this is basically compiling all .c sources with nvcc (via mpicc) - breaking builds. Check slepc errors at gitlab.com/xsdk-project/spack-xsdk/-/jobs/3193169610

Would you have the compile line causing that error?

balay commented 2 years ago

Would you have the compile line causing that error?

Attaching logs with spack install -j1. I had previously noticed warnings after mpicc. Now they are after mpif90 [but the error is with .cu sources]. I'm totally confused now.

spack-build-env.txt spack-build-out.txt

using kokkos~wrapper - or resetting MPICH_CXX back to native compiler does get the build working though...

masterleinad commented 2 years ago

Hmm... it seems that Kokkos flags are not properly propagated. In particular, I don't see a flag for the GPU architecture for the failing compilation units.

balos1 commented 2 years ago

I have a fix for sundials that gets past the error above reported by @balay. However, I now get a different error if I do not disable the trilinos variant of sundials when petsc+kokkos. There seems to be a clash between the internal trilinos kokkos and the standalone.

balay commented 2 years ago

https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3198712537

xsdk~cuda builds fine. This includes petsc+kokkos and trilinos+kokkos.

So the failure comes up when +cuda is used.

BTW: Does xsdk+cuda~trilinos build go through? [good to know - but won't help with the primary issue].

In the last release cycle we tried to add in superlu_dist+cuda - and that triggered issues with many packages. That issue is likely still pending...

balay commented 2 years ago

BTW: Does xsdk+cuda~trilinos build go through? [good to know - but won't help with the primary issue].

Well the sundials build does go through without requiring additional fixes - but the dealii build fails

spack-build-out.txt

Ref: balay@xsdk:/data/balay/spack.x>nice ./bin/spack install -j24 xsdk@0.8.0+cuda~trilinos cuda_arch=70 ^cuda@11.6.0 ^openmpi |& tee spack-build.log

masterleinad commented 2 years ago

I have a fix for sundials that gets past the error above reported by @balay. However, I now get a different error if I do not disable the trilinos variant of sundials when petsc+kokkos. There seems to be a clash between the internal trilinos kokkos and the standalone.

I would not be surprised if using Trilinos and an external Kokkos at the same time is problematic. Trilinos can't use an external Kokkos yet and always bundles it. Thus, this case results in two competing Kokkos installations.

balos1 commented 2 years ago

Linking to kokkos in addition to petsc when petsc+kokkos fixed the second problem and it now builds fine. @balay I think that means we can close this now, yes?

balay commented 2 years ago

Hm - should we enable petsc+kokkos in xsdk and try again?

balos1 commented 2 years ago

Sure.

balay commented 2 years ago

(using the current kokkos mode of forcing kokkos+wrapper) the build now breaks with MFEM and DEALII

so can't really enable petsc+kokkos

dealii-spack-build-out.txt mfem-spack-build-out.txt

cc: @bangerth @v-dobrev

balos1 commented 2 years ago

Yeah, best to keep it unspecified for now. Likely, mfem and dealii would have to do the same thing we did in sundials: link to kokkos/kokkos-kernels directly.

balay commented 2 years ago

Also tried kokkos~wrapper: dealii builds now, mfem still breaks.

mfem-spack-build-out2.txt

v-dobrev commented 2 years ago

From the above log, it looks like petsc.so has unresolved symbols, e.g.:

/data/balay/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/petsc-3.18.1-eocg6m7e4s4p25pa5oykupapl4c2skiy/lib/libpetsc.so: undefined reference to `KokkosBlas::Impl::Nrm2<Kokkos::View<double, Kokkos::LayoutLeft, Kokkos::HostSpace, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 1, false, true>::nrm2(Kokkos::View<double, Kokkos::LayoutLeft, Kokkos::HostSpace, Kokkos::MemoryTraits<1u> > const&, Kokkos::View<double const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> > const&, bool const&)'

Why is this not resolved in petsc.so? Is it in a static library?

v-dobrev commented 2 years ago

In the other log file (mfem-spack-build-out.txt) there's a big mess with errors like this:

/nfs/apps/spacks/2022-02-10/opt/spack/linux-centos7-x86_64/gcc-7.3.0/gcc-9.2.0-llib7puyqxdfte5dd2mw33v7d6mjarrw/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/stddef.h(426): error: invalid redeclaration of type name "max_align_t"
(426): here

which are impossible to understand without the sequence of #include directives that lead to this error.

@balos1, what did you need to do in SUNDIALS to fix this kind of error? Of course, only if you had similar errors.

balay commented 2 years ago

Why is this not resolved in petsc.so? Is it in a static library?

balay@xsdk:/data/balay/spack>ldd /data/balay/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/petsc-3.18.1-eocg6m7e4s4p25pa5oykupapl4c2skiy/lib/libpetsc.so |grep kokkos
    libkokkoskernels.so => /data/balay/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/kokkos-kernels-3.7.00-4rxlnggjo5imxzvrbjqzr2xlvml667bz/lib64/libkokkoskernels.so (0x00007fc81d9d5000)
    libkokkoscontainers.so.3.7 => /data/balay/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/kokkos-3.7.00-dvfdqghzez7c4iievdvirjk4fjqiid2h/lib64/libkokkoscontainers.so.3.7 (0x00007fc81d7c0000)
    libkokkoscore.so.3.7 => /data/balay/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/kokkos-3.7.00-dvfdqghzez7c4iievdvirjk4fjqiid2h/lib64/libkokkoscore.so.3.7 (0x00007fc81d413000)

Yet the linker complains.

@balos1 added the following fix to sundials for this issue

https://gitlab.com/xsdk-project/spack-xsdk/-/commit/c524ebbe7a8cdff480a3aa72d4b8e89e730c00a1

In the other log file (mfem-spack-build-out.txt) there's a big mess with errors like this:

With kokkos+wrapper I get a similar mess (that I don't understand) with petsc and slepc. Here is the fix I use for slepc (similar for petsc) - basically undo what kokkos+wrapper does:

https://github.com/spack/spack/pull/33529

balos1 commented 2 years ago

Are there any Kokkos symbols in public petsc headers? If yes, then I suppose linking to petsc.so won't resolve those. @v-dobrev I have not yet gone down the rabbit hole to figure out where in petsc the errors were coming from (and I did not get the one about max_align_t).

balay commented 2 years ago

petsc exposes kokkos includes to users via petsc public includes - I think it's needed for definitions of basic datatypes from kokkos that get used with some petsc (public/api) functions.

cc: @jczhang07

jczhang07 commented 2 years ago

Yes, petsc has some public headers like petscvec_kokkos.hpp. When kokkos is enabled, they provide functions like getting a Kokkos::View from a petsc vector.

In the current petsc makefile system, Kokkos files are supposed to have the suffix *.kokkos.cxx. PETSc will compile them with a so-called Kokkos compiler. .c and .cxx files are compiled with regular C or C++ compilers.
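
The suffix rule described here can be sketched as a small classifier. This is an illustration only, not PETSc's makefile logic: the function name `pick_compiler` and the returned labels are hypothetical, and only the suffixes `.kokkos.cxx`, `.c`, and `.cxx` come from the comment above.

```python
def pick_compiler(filename: str) -> str:
    """Return a label for the compiler class a source file would go to.

    Hypothetical helper; the labels are illustrative, not PETSc variable names.
    """
    # Check the longer Kokkos suffix first: "x.kokkos.cxx" also ends in ".cxx".
    if filename.endswith(".kokkos.cxx"):
        return "kokkos"
    if filename.endswith(".cxx"):
        return "cxx"
    if filename.endswith(".c"):
        return "c"
    return "unknown"
```

The point of the ordering is that a plain `.c` file such as `ex19.c` never reaches the Kokkos compiler under this rule, which is the behaviour the +wrapper environment change overrides.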

balay commented 2 years ago

Hm, petscvec_kokkos.hpp is probably not getting included from sundials/mfem - just the basic includes (petscvec.h, petscsnes.h). With this usage, kokkos includes shouldn't get exposed to the user? [but linker complains...]

jczhang07 commented 2 years ago

With this usage, kokkos includes shouldn't get exposed to the user?

No, they should not.

but linker complains...

What do you mean? If petsc is configured with Kokkos, of course users should link the petsc library with kokkos libraries

balay commented 2 years ago

What do you mean? If petsc is configured with Kokkos, of course users should link the petsc library with kokkos libraries

Normally when -lpetsc is created by linking in external libraries - only '-lpetsc' is needed at application link time. But with kokkos [only when cuda is enabled?] - we are getting kokkos link errors.
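
A quick way to confirm that libpetsc.so records its Kokkos dependencies is to filter `ldd` output, as done by hand earlier in this thread. The sketch below assumes the usual `name => path (0xaddr)` line format; `kokkos_deps` is a hypothetical helper name. It only shows that the shared-library dependencies resolve, not that any particular symbol is defined in them.

```python
import re

# Matches lines such as "libfoo.so.3 => /path/libfoo.so.3 (0x...)".
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S.*?)(?:\s+\(0x[0-9a-fA-F]+\))?\s*$")


def kokkos_deps(ldd_output: str) -> dict:
    """Map each Kokkos-related soname in `ldd` output to its resolved path."""
    deps = {}
    for line in ldd_output.splitlines():
        m = _LDD_LINE.match(line)
        if m and "kokkos" in m.group(1):
            deps[m.group(1)] = m.group(2)
    return deps
```

An entry whose value is `not found` would point at a runtime search-path problem rather than a missing symbol.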

balay commented 2 years ago

Hm - I'm unable to reproduce this issue with a stand-alone build of petsc+kokkos+cuda, with a petsc example.

balay@petsc-gpu-01:/scratch/balay/petsc/src/snes/tutorials$ make ex19.o
mpicc -o ex19.o -c -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -g3 -O0  -I/scratch/balay/petsc/include -I/scratch/balay/petsc/arch-kk-cuda/include -I/usr/local/cuda/include    `pwd`/ex19.c
balay@petsc-gpu-01:/scratch/balay/petsc/src/snes/tutorials$ mpicc -o ex19 ex19.o -Wl,-rpath,/scratch/balay/petsc/arch-kk-cuda/lib -L/scratch/balay/petsc/arch-kk-cuda/lib -lpetsc -lm
balay@petsc-gpu-01:/scratch/balay/petsc/src/snes/tutorials$ /usr/local/cuda/bin/nvcc -o ex19 ex19.o -ccbin mpic++ -Xlinker=-rpath,/scratch/balay/petsc/arch-kk-cuda/lib -L/scratch/balay/petsc/arch-kk-cuda/lib -lpetsc -lm
balay@petsc-gpu-01:/scratch/balay/petsc/src/snes/tutorials$

balay commented 2 years ago

And I'm unable to reproduce this with the spack build of PETSc (using petsc example).

balay@petsc-gpu-01:/scratch/balay/petsc/src/snes/tutorials$ make PETSC_DIR=/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/petsc-3.18.1-ytnw5gw575kfb4zzl7ixmz5nbzm47ak6 ex19.o
/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/mpich-4.0.2-dtswwqtovrk5ogkporfb47wifyizzt74/bin/mpicc -o ex19.o -c -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -g -O  -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/petsc-3.18.1-ytnw5gw575kfb4zzl7ixmz5nbzm47ak6/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/hypre-2.26.0-ureawcg2si5afftqcxqkj7jgly37gwha/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/superlu-dist-8.1.2-3u2effyw5qflel5ducqvrh43sfqq5ivl/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/kokkos-kernels-3.7.00-4ht6yomef4pf7t3sivljcnblfqjvufmb/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/kokkos-3.7.00-iwdfnaxuphlns375qhkkwcmwrk6nst55/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/hdf5-1.12.2-77mvmjjkujmq6tpl4ec2mrzrb77ue7sn/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/parmetis-4.0.3-inj6jvej57u72pypma5a5zmd4usy4n4t/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/metis-5.1.0-64osc7x3dyrov4wejoayqrktqkdavwdt/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/zlib-1.2.13-a46gganu6rrg7kcrvfle4eext3lu4wt7/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/cuda-11.8.0-6bqbc3g2cfdxvhvi6pxoedbytj5yz2md/include    `pwd`/ex19.c
balay@petsc-gpu-01:/scratch/balay/petsc/src/snes/tutorials$ /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/mpich-4.0.2-dtswwqtovrk5ogkporfb47wifyizzt74/bin/mpicc -o ex19 ex19.o -Wl,-rpath,/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/petsc-3.18.1-ytnw5gw575kfb4zzl7ixmz5nbzm47ak6/lib -L/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/petsc-3.18.1-ytnw5gw575kfb4zzl7ixmz5nbzm47ak6/lib -lpetsc -lm
balay@petsc-gpu-01:/scratch/balay/petsc/src/snes/tutorials$ /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/cuda-11.8.0-6bqbc3g2cfdxvhvi6pxoedbytj5yz2md/bin/nvcc ex19.c -o ex19 -O3 -std=c++14 -x=cu --expt-extended-lambda -arch=sm_80 -ccbin /scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/mpich-4.0.2-dtswwqtovrk5ogkporfb47wifyizzt74/bin/mpic++ -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/petsc-3.18.1-ytnw5gw575kfb4zzl7ixmz5nbzm47ak6/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/hypre-2.26.0-ureawcg2si5afftqcxqkj7jgly37gwha/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/superlu-dist-8.1.2-3u2effyw5qflel5ducqvrh43sfqq5ivl/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/kokkos-kernels-3.7.00-4ht6yomef4pf7t3sivljcnblfqjvufmb/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/kokkos-3.7.00-iwdfnaxuphlns375qhkkwcmwrk6nst55/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/hdf5-1.12.2-77mvmjjkujmq6tpl4ec2mrzrb77ue7sn/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/parmetis-4.0.3-inj6jvej57u72pypma5a5zmd4usy4n4t/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/metis-5.1.0-64osc7x3dyrov4wejoayqrktqkdavwdt/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/zlib-1.2.13-a46gganu6rrg7kcrvfle4eext3lu4wt7/include -I/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/cuda-11.8.0-6bqbc3g2cfdxvhvi6pxoedbytj5yz2md/include -Xlinker=-rpath,/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/petsc-3.18.1-ytnw5gw575kfb4zzl7ixmz5nbzm47ak6/lib -L/scratch/balay/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.3.0/petsc-3.18.1-ytnw5gw575kfb4zzl7ixmz5nbzm47ak6/lib -lpetsc -lm
balay@petsc-gpu-01:/scratch/balay/petsc/src/snes/tutorials$ 

With this same build - mfem fails (inside spack build)

spack-build-out.txt

jczhang07 commented 2 years ago

Does the kokkos library contain the undefined reference in libpetsc.so?

balay commented 2 years ago

Does the kokkos library contain the undefined reference in libpetsc.so?

It should - the string is way too long to do nm to verify :(
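
The full mangled or demangled name does not need to be typed to check this: matching a few distinctive substrings of the demangled name is usually enough. The sketch below assumes GNU `nm` (its `-D`, `-C`, and `--defined-only` options); the helper names are hypothetical.

```python
import subprocess


def find_symbols(nm_output: str, *needles: str) -> list:
    """Return the nm output lines that contain every needle substring."""
    return [ln for ln in nm_output.splitlines() if all(n in ln for n in needles)]


def defined_dynamic_symbols(library: str) -> str:
    """Run `nm -D -C --defined-only` on a shared library and return its output."""
    result = subprocess.run(
        ["nm", "-D", "-C", "--defined-only", library],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

For example, `find_symbols(defined_dynamic_symbols(path_to_libkokkoskernels), "KokkosBlas::Impl::Nrm2", "::nrm2(")` lists the candidate definitions, where `path_to_libkokkoskernels` stands for the library path reported by `ldd`. An empty result would suggest that this particular template instantiation was never compiled into the library.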

balay commented 2 years ago

BTW: Noticed mfem was built as static [default] - tried switching to mfem+shared and got the same error.

And regular [non-cuda/kokkos] builds appear to fail - filed this at: https://github.com/xsdk-project/xsdk-issues/issues/199

v-dobrev commented 2 years ago

The unresolved symbols seem to be from the namespaces KokkosSparse::Impl and KokkosBlas::Impl. Which Kokkos library file(s) contain these symbols? I can try to add these manually in addition to -lpetsc.

balay commented 2 years ago

The unresolved symbols seem to be from the namespaces KokkosSparse::Impl and KokkosBlas::Impl. Which Kokkos library file(s) contain these symbols?

They should be in the kokkos-kernels libraries. Might need to query spack to get the correct library names [perhaps with the dependent kokkos libraries as well].

balay@petsc-gpu-01:/scratch/balay/petsc/arch-kk-cuda/lib$ ldd libpetsc.so |grep kokkos
        libkokkoskernels.so (0x00007f44a1ab5000)
        libkokkoscore.so.3.7 (0x00007f44a1810000)
        libkokkoscontainers.so.3.7 (0x00007f44a0ebe000)

v-dobrev commented 2 years ago

@balay, I'm trying to reproduce the issue on Lassen using https://github.com/spack/spack/pull/33603 with

./bin/spack install -j 128 --fresh mfem+cuda+petsc cuda_arch=70 ^petsc+cuda+kokkos

which leads to a kokkos~wrapper spec.

However, I get this error from the kokkos package:

==> Error: InstallError: Kokkos requires +wrapper when using +cudawithout clang
...
        264        if spec.satisfies("~wrapper+cuda") and not (
        265            spec.satisfies("%clang") or spec.satisfies("%cce")
        266        ):
  >>    267            raise InstallError("Kokkos requires +wrapper when using +cuda" "without clang")
        268
        269        options = [
        270            from_variant("CMAKE_POSITION_INDEPENDENT_CODE", "pic"),

My setup is using gcc@8.3.1 and external cuda@11.5.0. In the log you posted above (mfem-spack-build-out2.txt), you also seem to be using gcc -- how come your kokkos build worked?
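
For reference, the condition in the quoted check can be restated as a standalone predicate. This is a stand-in, not Spack code: `wrapper_error`, the `variants` set, and the plain compiler-name string are hypothetical simplifications of the `spec.satisfies(...)` calls shown above.

```python
def wrapper_error(variants: set, compiler: str):
    """Return the error message the quoted check would raise, or None.

    Standalone stand-in: `variants` holds enabled variant names and
    `compiler` is a compiler name; Spack's real Spec API is not used.
    """
    if "cuda" in variants and "wrapper" not in variants and compiler not in ("clang", "cce"):
        # Adjacent string literals are concatenated with no separator,
        # which is why the logged message reads "+cudawithout clang".
        return "Kokkos requires +wrapper when using +cuda" "without clang"
    return None
```

With `{"cuda"}` and `"gcc"` the predicate fires, which matches the gcc@8.3.1 setup described here. With `"clang"` or `"cce"`, or with `wrapper` enabled, it returns `None`.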

v-dobrev commented 2 years ago

I think I figured out the main issue with the kokkos+wrapper dependency:

  1. it redirects the MPI wrapper to call it instead of what Spack sets by default
  2. it behaves as nvcc, removing, changing, and adding arguments and ultimately calling nvcc.

The issue with that in the MFEM package (and probably many other packages) is that it compiles CUDA + MPI by calling nvcc with -ccbin set to the MPI wrapper, so we end up with a chain of calls like this: nvcc -> mpicxx -> nvcc_wrapper (adds arguments like -arch=sm_35) -> nvcc -> g++. This is clearly not what we want.

Since many packages compile CUDA + MPI the way MFEM does (by calling nvcc with -ccbin set to the MPI wrapper), to me the behavior of the kokkos-nvcc-wrapper package (which overwrites the compiler for the MPI wrapper to be nvcc_wrapper) seems unacceptable because every such package now has to undo at least some of the changes that kokkos-nvcc-wrapper does to the environment -- that is what Satish has already done for PETSc and SLEPc.
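
The call chain described in this comment can be written down as a toy model. It is a deliberate simplification for illustration: `compile_chain` and its parameters are hypothetical, and the model only records which program hands off to which, not the arguments each one adds or removes.

```python
def compile_chain(ccbin: str, mpi_cxx: str, wrapper_host: str) -> list:
    """List the programs invoked, in order, for one `nvcc -ccbin <ccbin>` call.

    ccbin:        host compiler passed to the outer nvcc (the MPI wrapper)
    mpi_cxx:      compiler the MPI wrapper forwards to
    wrapper_host: host compiler nvcc_wrapper hands to its inner nvcc
    """
    chain = ["nvcc", ccbin, mpi_cxx]
    if mpi_cxx == "nvcc_wrapper":
        # nvcc_wrapper rewrites the arguments and calls nvcc a second time.
        chain += ["nvcc", wrapper_host]
    return chain
```

With the MPI wrapper redirected, `compile_chain("mpicxx", "nvcc_wrapper", "g++")` gives the five-step chain from the comment. With the MPI wrapper pointing at the native compiler, the chain stops after three steps.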

balay commented 2 years ago

how come your kokkos build worked?

I comment out those 3 offending lines when testing the kokkos~wrapper use case.

The issue with that in the MFEM package (and probably many other packages) is that it compiles CUDA + MPI by calling nvcc with -ccbin set to the MPI wrapper, so we end up with a chain of calls like this: nvcc -> mpicxx -> nvcc_wrapper (adds arguments like -arch=sm_35) -> nvcc -> g++. This is clearly not what we want.

Yes - I think it doesn't belong in kokkos/package.py [i.e. it should not force all dependent pkgs to use this modified mpicxx - that breaks compiles]. Only pkgs that need it should do this switch. However @masterleinad disagrees...

Since many packages compile CUDA + MPI the way MFEM does (by calling nvcc with -ccbin set to the MPI wrapper), to me the behavior of the kokkos-nvcc-wrapper package (which overwrites the compiler for the MPI wrapper to be nvcc_wrapper) seems unacceptable because every such package now has to undo at least some of the changes that kokkos-nvcc-wrapper does to the environment -- that is what Satish has already done for PETSc and SLEPc.

yes - I undo this mpicxx switch in petsc/slepc for kokkos+wrapper - and that gets these builds working.
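
A minimal sketch of what "undoing the switch" amounts to, written against a plain dictionary rather than Spack's environment API. `MPICH_CXX` is the variable mentioned earlier in this thread; `OMPI_CXX` and `MPICXX_CXX` are assumptions about which other variables are affected, and `undo_wrapper_env` is a hypothetical name. The actual PETSc and SLEPc changes are in the linked commits and may differ.

```python
def undo_wrapper_env(env: dict, native_cxx: str,
                     names=("MPICH_CXX", "OMPI_CXX", "MPICXX_CXX")) -> dict:
    """Return a copy of `env` with MPI C++ overrides that point at nvcc_wrapper reset."""
    fixed = dict(env)
    for name in names:
        value = fixed.get(name, "")
        # Only touch variables that actually point at nvcc_wrapper.
        if value.endswith("nvcc_wrapper"):
            fixed[name] = native_cxx
    return fixed
```

Returning a copy keeps the original environment available for the few compilation units that do need nvcc_wrapper.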

masterleinad commented 2 years ago

Yes - I think it doesn't belong in kokkos/package.py [i.e. it should not force all dependent pkgs to use this modified mpicxx - that breaks compiles]. Only pkgs that need it should do this switch. However @masterleinad disagrees...

I don't necessarily disagree. I'm just pointing out that using Kokkos+CUDA with nvcc requires special care in the choice of the compiler and I'm honestly surprised that you make it work without using nvcc_wrapper. I think it's a good idea to just open a pull request on the spack side and discuss what to do there.

v-dobrev commented 2 years ago

Okay, I created an issue in Spack: https://github.com/spack/spack/issues/33684.

v-dobrev commented 2 years ago

Until https://github.com/spack/spack/issues/33684 is resolved, I pushed a temporary (?) workaround for the kokkos+wrapper issue to MFEM in https://github.com/spack/spack/pull/33603.

With that, I think we still have the issue with dealii failing with kokkos+wrapper. Do we want to try to fix that (by reverting the environment change from kokkos-nvcc-wrapper)?

Alternatively, we can try to modify the kokkos package to allow kokkos~wrapper with g++ and other non-clang compilers and then I can try to resolve the issue with MFEM in this case.

Any thoughts?

v-dobrev commented 2 years ago

Regarding MFEM with kokkos~wrapper: after commenting out these lines in the Kokkos package:

        if spec.satisfies("~wrapper+cuda") and not (
            spec.satisfies("%clang") or spec.satisfies("%cce")
        ):
            raise InstallError("Kokkos requires +wrapper when using +cuda" "without clang")

I had no issue building the following on Lassen:

./bin/spack install -j 128 --fresh mfem+cuda+petsc cuda_arch=70 ^petsc+cuda+kokkos+mumps ^kokkos~wrapper