xsdk-project / xsdk-issues

A repository under which GitHub issues not related to a specific xSDK repo can be filed.

exago build failures [previously petsc-3.16 dependency issue] #185

Closed balay closed 1 year ago

balay commented 1 year ago

Currently xsdk-0.8.0 uses petsc-3.18.0 [and exago-1.4.1].

However exago has:

    depends_on("petsc@3.16.0:3.16", when="@1.3.0:")

Is it possible to provide an exago release compatible with petsc-3.18.0?

balay@xsdk:/data/balay/spack>./bin/spack spec xsdk@0.8.0+cuda+exago cuda_arch=70 ^cuda@11.6.0 ^openmpi
==> Error: No version for 'petsc' satisfies '@3.16.6' and '@3.18.0'
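
For reference, a minimal sketch of why the concretizer rejects this spec, assuming a simplified model of Spack version ranges (inclusive bounds, with the upper bound matching as a prefix). The helper names `parse` and `in_range` are hypothetical, and this is not Spack's actual implementation:

```python
def parse(version):
    """Turn a dotted version string such as '3.16.6' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def in_range(version, low, high):
    """Simplified model of a 'low:high' range.

    The upper bound is compared as a prefix, so '3.16' admits 3.16.0,
    3.16.6, and so on, but not 3.17 or 3.18.
    """
    v, lo, hi = parse(version), parse(low), parse(high)
    return v >= lo and v[:len(hi)] <= hi


# exago requires petsc@3.16.0:3.16, while xsdk@0.8.0 pins petsc 3.18.0.
print(in_range("3.16.6", "3.16.0", "3.16"))  # True
print(in_range("3.18.0", "3.16.0", "3.16"))  # False
```

Under this model no single petsc version satisfies both the exago range and the xsdk pin, which is what the error reports.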

cameronrutherford commented 1 year ago

Asher has left PNNL so no need to give him more emails.

I have created an ExaGO issue to track this, and I still owe you an update to #181. @abhyshr this should be easy enough given that the PETSc API doesn't change too much over time.

abhyshr commented 1 year ago

@balay by when do you want this done?

balay commented 1 year ago

@abhyshr Perhaps it's a simple fix - a patch file would be sufficient - so that we can continue with xsdk testing.

abhyshr commented 1 year ago

What version of HIOP are you planning to include in the release? ExaGO depends on HIOP so I think we need to ensure compatibility with it, right?

balay commented 1 year ago

~hiop@0.7.0~

hiop@0.7.1

balay commented 1 year ago

Updating the status of exago in xsdk here.

petsc-3.18 fixes are now in the exago develop branch. However - we still have outstanding build issues.

Default exago build in spack is without raja. Here are the build errors in this mode.

spack-build-out-oneapi.txt spack-build-out-gcc.txt

@CameronRutherford suggested exago,hiop builds should be with +raja. This also fails. Here is the log

spack-build-out.txt

balay commented 1 year ago

Also adding this issue here.

There were build breakages with exago+ipopt - as ipopt has a dependency on mumps~mpi - this installs a conflicting mpi.h (for this sequential build) - in a xsdk build that is always a +mpi build. exago build picks up this alternate mpi.h and breaks compiles.

Current workaround is to use exago~ipopt
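
A rough illustration of the failure mode described above, assuming the usual first-match-wins rule for include search paths. The directory names and the `resolve_header` helper are hypothetical, and the actual search order in the xsdk build is not shown in this thread:

```python
import os
import tempfile


def resolve_header(name, include_dirs):
    """Return the first directory, in search order, that contains the header."""
    for directory in include_dirs:
        if os.path.isfile(os.path.join(directory, name)):
            return directory
    return None


with tempfile.TemporaryDirectory() as root:
    # Hypothetical install prefixes: a sequential MUMPS that ships its own
    # mpi.h, and the real MPI used by the rest of the +mpi build.
    seq_mumps = os.path.join(root, "mumps-seq", "include")
    real_mpi = os.path.join(root, "openmpi", "include")
    for directory in (seq_mumps, real_mpi):
        os.makedirs(directory)
        open(os.path.join(directory, "mpi.h"), "w").close()

    # If the sequential MUMPS directory is searched first, its mpi.h wins.
    picked_bad = resolve_header("mpi.h", [seq_mumps, real_mpi])
    # With the real MPI directory first, the expected header is found.
    picked_good = resolve_header("mpi.h", [real_mpi, seq_mumps])

print(picked_bad == seq_mumps, picked_good == real_mpi)
```

Building with exago~ipopt sidesteps this by keeping the sequential MUMPS headers out of the build entirely.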

balay commented 1 year ago

Another build issue:

exago build fails on MacOS [as default xcode compilers don't support openmp]

spack-build-out(13).txt

Current workaround is to disable exago [from xsdk] on MacOS

balay commented 1 year ago

+cuda build also fails

spack-build-out.txt

balay commented 1 year ago

To update, right now exago is broken in xsdk.

Hoping exago~raja build issues in develop branch can be fixed [as a build with raja appears to have very specific dependencies - that are breaking in xsdk build]. And enabling exago+raja can be deferred to the next release cycle.

Likely this breakage is introduced by:

16c3f58af7e37e0344e2516ff6cd0a1b0d553082
    Add Crusher CI + build with hiop@develop on Crusher, Ascent, Summit, Newell and Marianas

As I have a successful build with the ( exago develop) snapshot prior to this change + "Building with Petsc 3.18.0" + "Fix disable logging"

balay commented 1 year ago

Perhaps the following fix (for exago/develop)?

exago-noraja.patch.txt

abhyshr commented 1 year ago

Thanks for the diagnosis, Satish. I've pushed a branch xsdk-build-failures for fixing the builds. This has a fix for the +raja build. I noticed that the error log was in a portion of the code that's activated when the GPU is not being used. We always enable GPU when using RAJA. @CameronRutherford may be able to point to the specific spack option to enable for GPU in the exago script.

pelesh commented 1 year ago

Also adding this issue here.

There were build breakages with exago+ipopt - as ipopt has a dependency on mumps~mpi - this installs a conflicting mpi.h (for this sequential build) - in a xsdk build that is always a +mpi build. exago build picks up this alternate mpi.h and breaks compiles.

Current workaround is to use exago~ipopt

FWIW Ipopt should not depend on MUMPS and ExaGO should depend on ipopt~mumps. The MUMPS interface in Ipopt has been broken for some time. We haven't been able to get correct results with Ipopt when MUMPS is the linear solver.

pelesh commented 1 year ago

Perhaps the following fix (for exago/develop)?

exago-noraja.patch.txt

This is probably not the best way to move forward. As a short term solution, I would suggest the default ExaGO build to include RAJA and Umpire.

Umpire-specific code probably should not have been implemented in the opflow_hiop.cpp file in the first place, and we'll likely need to refactor that. Carving out Umpire calls with preprocessor directives would probably make it work for the xSDK release, but it is bubble-gum-and-duct-tape work; it will make later refactoring harder. I think it is easier just to make ExaGO depend on RAJA/Umpire for now.

pelesh commented 1 year ago

Hoping exago~raja build issues in develop branch can be fixed [as a build with raja appears to have very specific dependencies - that are breaking in xsdk build]. And enabling exago+raja can be deferred to the next release cycle.

@balay, could you clarify which dependencies pulled in by exago+raja are breaking the xsdk build?

cameronrutherford commented 1 year ago

@pelesh @abhyshr I have been working with Satish on this, and I was hoping that getting build without RAJA working would be easiest path forward for xSDK release. I have Jaelyn working on this.

Only other issue would be OpenMP + MacOS issue that Ryan is working on.

@balay only error in +cuda log that I can see is for MPI headers (which I think you have submitted spack fixes for), so perhaps that build would work this time around.

I am hoping to add spack ci and spack test support to ExaGO soon, and so hopefully this will let ExaGO support minimal configured builds better in future.

balay commented 1 year ago

Here is the prior log with raja failure.

spack-build-out.txt

And @CameronRutherford suggested:

Fix is to downgrade camp to version before they started versioning based on day of year

[and that didn't really work, currently don't have the log]

We are fast approaching our deadlines - and I fear we might not have time to deal with raja and all its dependencies [if only specific versions of raja/camp/umpire are compatible - but then if they don't build or introduce other incompatibilities on our test boxes]

So I'm hoping we can simplify this exago build [i.e. reduce dependencies] for this release - by getting fixes for exago~raja working.

And aim for exago+raja for the next xsdk release

And If we can't get this sorted soon - might have to consider skipping exago from the current xsdk release.

Note: fixing the macos issue is not a priority - but getting things working on LCFs [i.e. crusher, perlmutter etc.] is.

That part of testing work is ongoing - so far exago has been failing on workstations - so it's likely broken on LCFs.

cc: @ulrikeyang @balos1

pelesh commented 1 year ago

And If we can't get this sorted soon - might have to consider skipping exago from the current xsdk release.

What I would suggest as a path forward is to enable a minimalistic ExaGO build, which includes only the PFLOW module. This would require only minor changes in CMake and could be done quickly. The minimalistic build would still depend on PETSc and MPI, and would be a bona fide xSDK member. More importantly, it would not require any hacks in the ExaGO code.

Having such build would allow us to build ExaGO without HiOp, Ipopt, RAJA/Umpire, CUDA, and HIP, and to defer all related issues for the next xSDK release.

All other solutions would take too long for xSDK release or would introduce some ugly hacks to ExaGO imho.

balay commented 1 year ago

What I would suggest as a path forward is to enable minimalistic ExaGO build, which includes only PFLOW module

If you can provide a branch where this works - and the spack patch to disable this module - I can try it in xsdk build

balay commented 1 year ago

BTW: When raja is enabled - should both hiop and exago have raja enabled? Or would it be exago+raja hiop~raja?

So far I've been trying exago+raja hiop+raja

cameronrutherford commented 1 year ago

BTW: When raja is enabled - should both hiop and exago have raja enabled? Or would it be exago+raja hiop~raja?

So far I've been trying exago+raja hiop+raja

HiOp+raja goes with ExaGO+raja.

This is specified in Spack of ExaGO.

You should be able to do ExaGO~raja and HiOp+raja in theory...

balay commented 1 year ago

I've pushed a branch xsdk-build-fixes for fixing the builds

I have one successful build with it. (with exago+raja hiop+raja). Will try other builds.

cameronrutherford commented 1 year ago

And If we can't get this sorted soon - might have to consider skipping exago from the current xsdk release.

What I would suggest as a path forward is to enable a minimalistic ExaGO build, which includes only the PFLOW module. This would require only minor changes in CMake and could be done quickly. The minimalistic build would still depend on PETSc and MPI, and would be a bona fide xSDK member. More importantly, it would not require any hacks in the ExaGO code.

Having such build would allow us to build ExaGO without HiOp, Ipopt, RAJA/Umpire, CUDA, and HIP, and to defer all related issues for the next xSDK release.

All other solutions would take too long for xSDK release or would introduce some ugly hacks to ExaGO imho.

@pelesh this would be reasonable. If we can define what that minimal spec is, adding a spack/cmake configuration shouldn't be too hard.

cameronrutherford commented 1 year ago

Here is the prior log with raja failure.

spack-build-out.txt

And @CameronRutherford suggested:

Fix is to downgrade camp to version before they started versioning based on day of year

[and that didn't really work, currently don't have the log]

We are fast approaching our deadlines - and I fear we might not have time to deal with raja and all its dependencies [if only specific versions of raja/camp/umpire are compatible - but then if they don't build or introduce other incompatibilities on our test boxes]

So I'm hoping we can simplify this exago build [i.e. reduce dependencies] for this release - by getting fixes for exago~raja working.

And aim for exago+raja for the next xsdk release

And If we can't get this sorted soon - might have to consider skipping exago from the current xsdk release.

Note: fixing the macos issue is not a priority - but getting things working on LCFs [i.e. crusher, perlmutter etc.] is.

That part of testing work is ongoing - so far exago has been failing on workstations - so it's likely broken on LCFs.

cc: @ulrikeyang @balos1

@balay if we can get the error log with the tagged version of camp that could help.

We have had success on Crusher recently and so it would be helpful to see the errors.

balay commented 1 year ago

@balay if we can get the error log with the tagged version of camp that could help.

With the fix in xsdk-build-fixes - I'm not seeing that error anymore [with latest camp]. So perhaps this is not really a camp version issue.

[balay@pj01 spack.x]$ ./bin/spack find -vL camp umpire raja hiop exago
-- linux-fedora37-skylake / oneapi@2022.2.0 ---------------------
7nswbigz3vrs2jcs6n2ldx4mdood336x camp@2022.03.2~cuda~ipo+openmp~rocm~tests build_system=cmake build_type=RelWithDebInfo
aqznxna5u3vho5syulosv7ced6zkzlas exago@1.5.0~cuda+hiop~ipo~ipopt+mpi+python+raja~rocm build_system=cmake build_type=RelWithDebInfo patches=6289d0b
cit46ag57ipgkb4t45viwmcsmgrb2x75 hiop@0.7.1~cuda~cusolver+deepchecking~ginkgo~ipo~jsrun~kron+mpi+raja~rocm~shared~sparse build_system=cmake build_type=RelWithDebInfo
qaeq37act4brdrptdfp7ngac2kopgiqv raja@2022.03.0~cuda+examples+exercises~ipo+openmp~rocm+shared~tests build_system=cmake build_type=RelWithDebInfo
ph2jo6hh72xvezjups3sa2jckpdnauvn umpire@2022.03.1+c~cuda+device_alloc~deviceconst+examples~fortran~ipo~numa~openmp~rocm+shared build_system=cmake build_type=RelWithDebInfo tests=none
==> 5 installed packages

I'm trying out other builds to see if they also work.

https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines/681993599

pelesh commented 1 year ago

@pelesh this would be reasonable. If we can define what that minimal spec is, adding a spack/cmake configuration shouldn't be too hard.

I created an issue at ExaGO gitlab suggesting how minimalistic ExaGO could be built. It seems this could be done quickly and cleanly.

balay commented 1 year ago

https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3254652436

https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3254652437

https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3254652438

https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3254652440

pelesh commented 1 year ago

@balay: Let us try mini-ExaGO. The module that is causing the troubles is already obsolete and support for it is likely to get discontinued.

balay commented 1 year ago

@balay: Let us try mini-ExaGO.

Ok. Let me know how I can use this [from spack]

For now I'm using xsdk-build-fixes branch - with the noraja fixes I posted earlier. (and this builds fine, macos is disabled)

https://gitlab.com/xsdk-project/spack-xsdk/-/commit/efc170d7bd9061a83f6e56f8e989b79cc842268a

https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines/682173755

cameronrutherford commented 1 year ago

@balay can you try and pin RAJA with 0.14.0, Umpire 6.0.0 and Camp 0.2.3? That is the blessed version configuration that we have been testing with, and perhaps would get pipelines to pass...

We are still working on a version that will be supported for exago~ipopt~hiop~raja, and that should be the more maintainable solution for your tests.

balay commented 1 year ago

Here is the build with pin RAJA with 0.14.0, Umpire 6.0.0 and Camp 0.2.3

with https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines/684811482

The workstation builds are clean [except for MacOS - where currently I have exago disabled - but I might also have to disable hiop - when raja is enabled. I tried raja~threads - but that somehow triggered a sundials failure - so perhaps sundials is somehow picking up raja]

And here is the build with the exago-pflow-only branch - here using exago~hiop~ipopt~python~cuda

https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines/685059053

workstation builds are clean (hiop is enabled on MacOS, exago disabled)

balay commented 1 year ago

Will try the following:

yet to test this...

https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines/686002813

cameronrutherford commented 1 year ago

exago-pflow-only branch is set to be merged into develop with tests now passing.

We are on track to close out a few more issues, and then I think we can also have a 1.4.2 release to have something static for xSDK.

We ran into a new issue debugging MacOS issue, but I think I can make CMake changes that fix issue without reproducing first if you want me to try that.

balay commented 1 year ago

If you have changes for MacOS [exago branch,corresponding spack changes] - I can give it a try and see if it will build

balay commented 1 year ago

BTW: would version 1.5.0 be better for this than 1.4.2 - as there are some major [dependency] changes wrt petsc and hiop [esp on the spack side]. Also we are anchoring raja/umpire/camp for this version in spack.

Usually sub-minor releases (1.4.1 to 1.4.2) are for bug-fix-only updates [i.e usually drop in replacement wrt dependencies]

balay commented 1 year ago

Build failure on cori [with Intel compilers]

1 error found in build log:
     27    -- Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES)
     28    -- Could NOT find OpenMP (missing: OpenMP_C_FOUND OpenMP_CXX_FOUND)
     29    -- Looking for sgemm_
     30    -- Looking for sgemm_ - not found
     31    -- Looking for pthread.h
     32    -- Looking for pthread.h - not found
  >> 33    CMake Error at /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
     34      Could NOT find Threads (missing: Threads_FOUND)
     35    Call Stack (most recent call first):
     36      /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
     37      /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindThreads.cmake:238 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
     38      /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindBLAS.cmake:456 (find_package)
     39      /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindLAPACK.cmake:240 (find_package)

ref: balay@cori12:/tmp/balay/spack> ./bin/spack install xsdk@0.8.0%intel@19.1.2.254 ^netlib-lapack ^dealii cflags=-L/opt/cray/pe/atp/3.14.9/libApp cxxflags=-L/opt/cray/pe/atp/3.14.9/libApp |& tee spack-build.log

spack-build-out.txt

balay commented 1 year ago

Hm - pflow-only i.e exago~hiop~ipopt gives errors. [don't know why I didn't see this before]

balay@cori12:/tmp/balay/spack> ./bin/spack spec xsdk@0.8.0%intel@19.1.2.254~trilinos~dealii+hiop+exago~raja
==> Error: ExaGO needs at least one solver enabled

I see exago/package.py has:

    conflicts("~hiop~ipopt", msg="ExaGO needs at least one solver enabled")

I guess this check can now be updated to [as it's fixed by pflow-only]:

    conflicts("~hiop~ipopt @:1.4.1", msg="ExaGO needs at least one solver enabled")

It doesn't fix the above Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) error though..
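
To make the intent of the proposed change concrete, here is a small sketch that models it in plain Python, assuming `@:1.4.1` means "any version up to and including the 1.4.1 series". The function names are hypothetical and this only mimics the check rather than calling Spack. In a real package.py the version restriction might also be spelled with the directive's `when=` argument, which is untested here:

```python
def parse(version):
    """Turn a dotted version string such as '1.4.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def solver_conflict(version, hiop, ipopt, last_affected="1.4.1"):
    """Model of the proposed check: True when the conflict would fire.

    The conflict applies only to versions up to `last_affected`, and only
    when both solver variants (hiop and ipopt) are disabled.
    """
    bound = parse(last_affected)
    affected = parse(version)[:len(bound)] <= bound
    return affected and not hiop and not ipopt


print(solver_conflict("1.4.1", hiop=False, ipopt=False))  # True
print(solver_conflict("1.5.0", hiop=False, ipopt=False))  # False
print(solver_conflict("1.4.1", hiop=True, ipopt=False))   # False
```

With the version bound in place, the pflow-only configuration (exago~hiop~ipopt) is rejected for 1.4.1 and earlier but allowed for releases that include the pflow-only changes.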

cameronrutherford commented 1 year ago

Hm - pflow-only i.e exago~hiop~ipopt gives errors. [don't know why I didn't see this before]

balay@cori12:/tmp/balay/spack> ./bin/spack spec xsdk@0.8.0%intel@19.1.2.254~trilinos~dealii+hiop+exago~raja
==> Error: ExaGO needs at least one solver enabled

I see exago/package.py has:

    conflicts("~hiop~ipopt", msg="ExaGO needs at least one solver enabled")

I guess this check can now be updated to [as it's fixed by pflow-only]:

    conflicts("~hiop~ipopt @:1.4.1", msg="ExaGO needs at least one solver enabled")

It doesn't fix the above Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) error though..

Yes, we should update our conflicts clause there in the newest update as well. We also need to disable python when building either ~hiop or ~ipopt.

I agree with 1.5.0 instead of 1.4.2 - I will bring this up with my team and see what the consensus is, but that shouldn't change the timeline on the release.

w.r.t. the CMake Threads/OpenMP issue, I think we were finally able to reproduce, and so I want to try out a change locally before merging the fix into the develop branch.

cameronrutherford commented 1 year ago

What I would suggest as a path forward is to enable minimalistic ExaGO build, which includes only PFLOW module

If you can provide a branch where this works - and the spack patch to disable this module - I can try it in xsdk build

@balay this should be fixed in develop. The release should be coming soon, as we just need to update our spack package to reflect dependency issues highlighted here, and we should be good to go

balay commented 1 year ago

To update: the exago build now works on MacOS

https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3304607638

exago@1.5.0~cuda~hiop~ipo~ipopt+mpi~python~raja~rocm build_system=cmake build_type=RelWithDebInfo

However - the failure on cori persists

1 error found in build log:
     20    -- Found MPI_C: /tmp/balay/spack/lib/spack/env/intel/icc (found version "3.1")
     21    -- Found MPI_CXX: /tmp/balay/spack/lib/spack/env/intel/icpc (found version "3.1")
     22    -- Found MPI: TRUE (found version "3.1") found components: C CXX
     23    -- Found PkgConfig: /tmp/balay/spack/opt/spack/cray-cnl7-haswell/intel-19.1.2.254/pkgconf-1.8.0-uhriasntutowdps376mw4r3xntktq3ab/bin/pkg-config (found version "1.8.0")
     24    -- Checking for module 'PETSc'
     25    --   Found PETSc, version 3.18.1
  >> 26    CMake Error at /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
     27      Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)
     28    Call Stack (most recent call first):
     29      /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
     30      /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindOpenMP.cmake:544 (find_package_handle_standard_args)
     31      CMakeLists.txt:239 (find_package)

Will just disable exago on cori for now.

balay commented 1 year ago

@CameronRutherford @abhyshr - do you have an estimate for when the release would occur?

Hoping all (outstanding) pkgs can provide a release this week (if not by Monday), as they need to go through the spack PR process and be merged in; then xsdk is released (changes in this process need retesting) and the website is updated with the xsdk release. All these stages take time, and they should all be completed by the 18th [i.e. about a week from now].

cameronrutherford commented 1 year ago

To update: the exago build now works on MacOS

https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3304607638

exago@1.5.0~cuda~hiop~ipo~ipopt+mpi~python~raja~rocm build_system=cmake build_type=RelWithDebInfo

However - the failure on cori persists

1 error found in build log:
     20    -- Found MPI_C: /tmp/balay/spack/lib/spack/env/intel/icc (found version "3.1")
     21    -- Found MPI_CXX: /tmp/balay/spack/lib/spack/env/intel/icpc (found version "3.1")
     22    -- Found MPI: TRUE (found version "3.1") found components: C CXX
     23    -- Found PkgConfig: /tmp/balay/spack/opt/spack/cray-cnl7-haswell/intel-19.1.2.254/pkgconf-1.8.0-uhriasntutowdps376mw4r3xntktq3ab/bin/pkg-config (found version "1.8.0")
     24    -- Checking for module 'PETSc'
     25    --   Found PETSc, version 3.18.1
  >> 26    CMake Error at /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
     27      Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)
     28    Call Stack (most recent call first):
     29      /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
     30      /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindOpenMP.cmake:544 (find_package_handle_standard_args)
     31      CMakeLists.txt:239 (find_package)

Will just disable exago on cori for now.

I suppose that we will debug this failure in future releases?

cameronrutherford commented 1 year ago

@balay we should have a release along with a spack PR within a day. Just have to polish up the changelog and tag the new version. I will create a spack PR with exago changes that are identical to existing xSDK PR.

balay commented 1 year ago

You might want to debug this - irrespective of xsdk. [it could be a spack/config issue on cori - or it could be an exago issue]

Wrt xsdk - cori is going away in a couple of months - [i.e not something to debug for future release]

I guess I was trying to convey: do not delay the release for this issue.

balay commented 1 year ago

@balay we should have a release along with a spack PR within a day. Just have to polish up the changelog and tag the new version. I will create a spack PR with exago changes that are identical to existing xSDK PR.

sounds good.