Closed balay closed 1 year ago
Asher has left PNNL so no need to give him more emails.
I have created an ExaGO issue to track this, and I still owe you an update to #181. @abhyshr this should be easy enough given that the PETSc API doesn't change too much over time.
@balay by when do you want this done?
@abhyshr Perhaps it's a simple fix - a patch file would be sufficient - so that we can continue with xsdk testing.
What version of HIOP are you planning to include in the release? ExaGO depends on HIOP so I think we need to ensure compatibility with it, right?
~hiop@0.7.0~
hiop@0.7.1
Updating the status of exago in xsdk here.
petsc-3.18 fixes are now in the exago develop branch. However - we still have outstanding build issues.
Default exago build in spack is without raja. Here are the build errors in this mode.
spack-build-out-oneapi.txt spack-build-out-gcc.txt
@CameronRutherford suggested exago,hiop builds should be with +raja. This also fails. Here is the log
Also adding this issue here.
There were build breakages with exago+ipopt
- as ipopt has a dependency on mumps~mpi
- this installs a conflicting mpi.h (for this sequential build) - in a xsdk build that is always a +mpi build. exago build picks up this alternate mpi.h and breaks compiles.
Current workaround is to use exago~ipopt
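The header clash described above can be illustrated with a small diagnostic. This is only a sketch, not something from the thread: it assumes you have the list of include directories in the order the compiler searches them, and the helper names `dirs_providing` and `shadowed_header` are made up for illustration.

```python
import tempfile
from pathlib import Path


def dirs_providing(header, include_dirs):
    """Return the include directories that contain `header`, in search order."""
    return [str(d) for d in include_dirs if (Path(d) / header).is_file()]


def shadowed_header(header, include_dirs):
    """Return (winner, shadowed) if more than one directory provides `header`.

    The first directory on the include path wins; the rest are shadowed.
    Returns None when there is no ambiguity.
    """
    hits = dirs_providing(header, include_dirs)
    if len(hits) < 2:
        return None
    return hits[0], hits[1:]


if __name__ == "__main__":
    # Hypothetical layout: a sequential-MUMPS prefix and a real MPI prefix,
    # both shipping a file called mpi.h.
    with tempfile.TemporaryDirectory() as root:
        mumps_seq = Path(root) / "mumps-seq" / "include"
        mpi = Path(root) / "mpi" / "include"
        for d in (mumps_seq, mpi):
            d.mkdir(parents=True)
            (d / "mpi.h").write_text("/* stub */\n")
        print(shadowed_header("mpi.h", [mumps_seq, mpi]))
```

If the sequential prefix comes first on the include path, its stub `mpi.h` wins and the real MPI header is shadowed, which matches the failure mode reported here.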
Another build issue:
exago build fails on MacOS [as default xcode compilers don't support openmp]
Current workaround is to disable exago [from xsdk] on MacOS
+cuda build also fails
To update, right now exago is broken in xsdk.
Hoping exago~raja build issues in the develop branch can be fixed [as a build with raja appears to have very specific dependencies - that are breaking in xsdk build]. And enabling exago+raja can be deferred to the next release cycle.
Likely this breakage is introduced by:
16c3f58af7e37e0344e2516ff6cd0a1b0d553082
Add Crusher CI + build with hiop@develop on Crusher, Ascent, Summit, Newell and Marianas
As I have a successful build with the ( exago develop) snapshot prior to this change + "Building with Petsc 3.18.0" + "Fix disable logging"
Perhaps the following fix (for exago/develop)?
Thanks for the diagnosis, Satish. I've pushed a branch xsdk-build-failures for fixing the builds. This has fix for +raja build. I noticed that the error log was in a portion of the code that's activated when GPU is not being used. We always enable GPU when using RAJA. @CameronRutherford may be able to point to specific spack option to enable for GPU in the exago script.
FWIW Ipopt should not depend on MUMPS and ExaGO should depend on ipopt~mumps. The MUMPS interface in Ipopt has been broken for some time. We haven't been able to get correct results with Ipopt when MUMPS is the linear solver.
Perhaps the following fix (for exago/develop)?
This is probably not the best way to move forward. As a short term solution, I would suggest the default ExaGO build to include RAJA and Umpire.
Umpire-specific code probably should not have been implemented in the opflow_hiop.cpp file in the first place, and we'll likely need to refactor that. Carving out Umpire calls with preprocessor directives would probably make it work for the xSDK release, but it is bubble-gum-and-duct-tape work; it will make later refactoring harder. I think it is easier just to make ExaGO depend on RAJA/Umpire for now.
@balay, could you clarify what are the dependencies pulled by exago+raja that are breaking xsdk build.
@pelesh @abhyshr I have been working with Satish on this, and I was hoping that getting build without RAJA working would be easiest path forward for xSDK release. I have Jaelyn working on this.
Only other issue would be OpenMP + MacOS issue that Ryan is working on.
@balay only error in +cuda log that I can see is for MPI headers (which I think you have submitted spack fixes for), so perhaps that build would work this time around.
I am hoping to add spack ci and spack test support to ExaGO soon, and so hopefully this will let ExaGO support minimal configured builds better in future.
Here is the prior log with raja failure.
And @CameronRutherford suggested:
Fix is to downgrade camp to version before they started versioning based on day of year
[and that didn't really work, currently don't have the log]
We are fast approaching our deadlines - and I fear we might not have time to deal with raja and all its dependencies [if only specific versions of raja/camp/umpire are compatible - but then if they don't build or introduce other incompatibilities on our test boxes]
So I'm hoping we can simplify this exago build [i.e. reduce dependencies] for this release - by getting fixes for exago~raja working.
And aim for exago+raja for the next xsdk release.
And if we can't get this sorted soon - we might have to consider skipping exago from the current xsdk release.
Note: fixing the macos issue is not a priority - but getting things working on LCFs [i.e. crusher, perlmutter etc.] is.
That part of testing work is ongoing - so far exago has been failing on workstations - so it's likely broken on LCFs.
cc: @ulrikeyang @balos1
And If we can't get this sorted soon - might have to consider skipping exago from the current xsdk release.
What I would suggest as a path forward is to enable a minimalistic ExaGO build, which includes only the PFLOW module. This would require only minor changes in CMake and could be done quickly. The minimalistic build would still depend on PETSc and MPI, and would be a bona fide xSDK member. More importantly, it would not require any hacks in the ExaGO code.
Having such build would allow us to build ExaGO without HiOp, Ipopt, RAJA/Umpire, CUDA, and HIP, and to defer all related issues for the next xSDK release.
All other solutions would take too long for xSDK release or would introduce some ugly hacks to ExaGO imho.
What I would suggest as a path forward is to enable minimalistic ExaGO build, which includes only PFLOW module
If you can provide a branch where this works - and the spack patch to disable this module - I can try it in xsdk build
BTW: When raja is enabled - should both hiop and exago have raja enabled? Or would it be exago+raja hiop~raja?
So far I've been trying hiop+raja hiop+raja
HiOp+raja goes with ExaGO+raja.
This is specified in Spack of ExaGO.
You should be able to do ExaGO~raja and HiOp+raja in theory...
I've pushed a branch xsdk-build-fixes for fixing the builds
I have one successful build with it. (with exago+raja hiop+raja). Will try other builds.
@pelesh this would be reasonable. If we can define what that minimal spec is, adding a spack/cmake configuration shouldn't be too hard.
@balay if we can get the error log with the tagged version of camp that could help.
We have had success on Crusher recently and so it would be helpful to see the errors.
@balay if we can get the error log with the tagged version of camp that could help.
With the fix in xsdk-build-fixes - I'm not seeing that error anymore [with latest camp]. So perhaps this is not really a camp version issue.
[balay@pj01 spack.x]$ ./bin/spack find -vL camp umpire raja hiop exago
-- linux-fedora37-skylake / oneapi@2022.2.0 ---------------------
7nswbigz3vrs2jcs6n2ldx4mdood336x camp@2022.03.2~cuda~ipo+openmp~rocm~tests build_system=cmake build_type=RelWithDebInfo
aqznxna5u3vho5syulosv7ced6zkzlas exago@1.5.0~cuda+hiop~ipo~ipopt+mpi+python+raja~rocm build_system=cmake build_type=RelWithDebInfo patches=6289d0b
cit46ag57ipgkb4t45viwmcsmgrb2x75 hiop@0.7.1~cuda~cusolver+deepchecking~ginkgo~ipo~jsrun~kron+mpi+raja~rocm~shared~sparse build_system=cmake build_type=RelWithDebInfo
qaeq37act4brdrptdfp7ngac2kopgiqv raja@2022.03.0~cuda+examples+exercises~ipo+openmp~rocm+shared~tests build_system=cmake build_type=RelWithDebInfo
ph2jo6hh72xvezjups3sa2jckpdnauvn umpire@2022.03.1+c~cuda+device_alloc~deviceconst+examples~fortran~ipo~numa~openmp~rocm+shared build_system=cmake build_type=RelWithDebInfo tests=none
==> 5 installed packages
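For comparing several such listings across builds, the package lines can be split into hash, name, version, variants and key=value settings. This is a rough sketch that only assumes the line layout shown in the output above; `parse_find_line` is a hypothetical helper, not part of Spack.

```python
import re

# One package line looks like:
#   <hash> <name>@<version><~/+variants> key=value key=value ...
LINE_RE = re.compile(
    r"^(?P<hash>[a-z0-9]+)\s+(?P<name>[\w-]+)@(?P<version>[^~+\s]+)"
    r"(?P<flags>\S*)\s*(?P<rest>.*)$"
)


def parse_find_line(line):
    """Parse one package line; return None for headers and other lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # "+name" means the variant is enabled, "~name" means disabled.
    variants = {
        name: sign == "+"
        for sign, name in re.findall(r"([~+])(\w+)", m["flags"])
    }
    # Remaining tokens such as build_type=RelWithDebInfo.
    settings = dict(
        item.split("=", 1) for item in m["rest"].split() if "=" in item
    )
    return {
        "hash": m["hash"],
        "name": m["name"],
        "version": m["version"],
        "variants": variants,
        "settings": settings,
    }


if __name__ == "__main__":
    sample = (
        "7nswbigz3vrs2jcs6n2ldx4mdood336x camp@2022.03.2~cuda~ipo+openmp"
        "~rocm~tests build_system=cmake build_type=RelWithDebInfo"
    )
    print(parse_find_line(sample))
```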
I'm trying out other builds to see if they also work.
https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines/681993599
@pelesh this would be reasonable. If we can define what that minimal spec is, adding a spack/cmake configuration shouldn't be too hard.
I created an issue at ExaGO gitlab suggesting how minimalistic ExaGO could be built. It seems this could be done quickly and cleanly.
https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3254652436
https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3254652437
https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3254652438
https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3254652440
@balay: Let us try mini-ExaGO. The module that is causing the troubles is already obsolete and support for it is likely to get discontinued.
@balay: Let us try mini-ExaGO.
Ok. Let me know how I can use this [from spack]
For now I'm using the xsdk-build-fixes branch - with the noraja fixes I posted earlier. (and this builds fine, macos is disabled)
https://gitlab.com/xsdk-project/spack-xsdk/-/commit/efc170d7bd9061a83f6e56f8e989b79cc842268a
https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines/682173755
@balay can you try and pin RAJA with 0.14.0, Umpire 6.0.0 and Camp 0.2.3? That is the blessed version configuration that we have been testing with, and perhaps would get pipelines to pass...
We are still working on a version that will be supported for exago~ipopt~hiop~raja, and that should be the more maintainable solution for your tests.
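The pinned versions suggested above can be written as a single Spack spec using the `^name@version` dependency syntax. A minimal sketch follows; the root spec `exago+raja` and the helper name `pinned_spec` are assumptions for illustration, and how xsdk actually anchors these versions may differ.

```python
def pinned_spec(root, pins):
    """Build a spec string that pins dependencies with ^name@version."""
    parts = [root] + [f"^{name}@{version}" for name, version in pins.items()]
    return " ".join(parts)


# Versions taken from the suggestion above.
PINS = {"raja": "0.14.0", "umpire": "6.0.0", "camp": "0.2.3"}

if __name__ == "__main__":
    # The resulting string could be passed to `spack spec` or `spack install`.
    print(pinned_spec("exago+raja", PINS))
```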
Here is the build with RAJA pinned to 0.14.0, Umpire to 6.0.0 and Camp to 0.2.3:
https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines/684811482
The workstation builds are clean [except for MacOS - where currently I have exago disabled - but I might also have to disable hiop - when raja is enabled. I tried raja~threads - but that somehow triggered sundials failure - so perhaps sundials is somehow picking up raja]
And here is the build with the exago-pflow-only branch - here using exago~hiop~ipopt~python~cuda
https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines/685059053
workstation builds are clean (hiop is enabled on MacOS, exago disabled)
Will try the following:
raja to xsdk - this way - both normal build (with +raja default, also work with +cuda) and pflow-only (with ~raja) can be attempted
~raja ~exago [i.e hiop~raja gets built on MacOS]
yet to test this...
https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines/686002813
exago-pflow-only branch is set to be merged into develop with tests now passing.
We are on track to close out a few more issues, and then I think we can also have a 1.4.2 release to have something static for xSDK.
We ran into a new issue debugging MacOS issue, but I think I can make CMake changes that fix issue without reproducing first if you want me to try that.
If you have changes for MacOS [exago branch,corresponding spack changes] - I can give it a try and see if it will build
BTW: would version 1.5.0 be better for this than 1.4.2 - as there are some major [dependency] changes wrt petsc and hiop [esp. on the spack side]? Also we are anchoring raja/umpire/camp for this version in spack.
Usually sub-minor releases (1.4.1 to 1.4.2) are for bug-fix-only updates [i.e usually drop in replacement wrt dependencies]
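To make the versioning argument concrete, here is a small sketch that classifies a bump between two dotted versions. It assumes plain three-part numeric versions; `bump_kind` is a hypothetical helper and says nothing about ExaGO's actual release policy.

```python
def bump_kind(old, new):
    """Return which component changed first: 'major', 'minor', 'patch' or 'none'."""
    old_parts = tuple(int(p) for p in old.split("."))
    new_parts = tuple(int(p) for p in new.split("."))
    for label, a, b in zip(("major", "minor", "patch"), old_parts, new_parts):
        if a != b:
            return label
    return "none"


if __name__ == "__main__":
    # A bug-fix-only update versus one that changes dependencies.
    print(bump_kind("1.4.1", "1.4.2"))
    print(bump_kind("1.4.1", "1.5.0"))
```

Under this reading, 1.4.1 to 1.4.2 is a patch-level bump, while 1.4.1 to 1.5.0 is a minor-level bump, which fits a release that changes petsc/hiop requirements.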
Build failure on cori [with Intel compilers]
1 error found in build log:
27 -- Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES)
28 -- Could NOT find OpenMP (missing: OpenMP_C_FOUND OpenMP_CXX_FOUND)
29 -- Looking for sgemm_
30 -- Looking for sgemm_ - not found
31 -- Looking for pthread.h
32 -- Looking for pthread.h - not found
>> 33 CMake Error at /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
34 Could NOT find Threads (missing: Threads_FOUND)
35 Call Stack (most recent call first):
36 /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
37 /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindThreads.cmake:238 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
38 /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindBLAS.cmake:456 (find_package)
39 /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindLAPACK.cmake:240 (find_package)
ref: balay@cori12:/tmp/balay/spack> ./bin/spack install xsdk@0.8.0%intel@19.1.2.254 ^netlib-lapack ^dealii cflags=-L/opt/cray/pe/atp/3.14.9/libApp cxxflags=-L/opt/cray/pe/atp/3.14.9/libApp |& tee spack-build.log
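When triaging logs like the one above, it can help to list every "Could NOT find" entry with its missing variables. The following is a sketch that only assumes the message shape visible in this log; `missing_packages` is a hypothetical helper.

```python
import re

# Matches e.g. "Could NOT find Threads (missing: Threads_FOUND)".
NOT_FOUND_RE = re.compile(
    r"Could NOT find (?P<package>\S+) \(missing: (?P<missing>[^)]*)\)"
)


def missing_packages(log_text):
    """Return a list of (package, [missing variables]) in log order."""
    return [
        (m["package"], m["missing"].split())
        for m in NOT_FOUND_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    excerpt = """\
27 -- Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES)
28 -- Could NOT find OpenMP (missing: OpenMP_C_FOUND OpenMP_CXX_FOUND)
34 Could NOT find Threads (missing: Threads_FOUND)
"""
    for package, missing in missing_packages(excerpt):
        print(package, missing)
```

In this log the OpenMP entries are plain status messages and only the Threads entry is the fatal CMake error, so the list is a starting point for triage rather than a list of errors.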
Hm - pflow-only i.e. exago~hiop~ipopt gives errors. [don't know why I didn't see this before]
balay@cori12:/tmp/balay/spack> ./bin/spack spec xsdk@0.8.0%intel@19.1.2.254~trilinos~dealii+hiop+exago~raja
==> Error: ExaGO needs at least one solver enabled
I see exago/package.py has:
conflicts("~hiop~ipopt", msg="ExaGO needs at least one solver enabled")
I guess this check can now be updated to [as it's fixed by pflow-only]:
conflicts("~hiop~ipopt @:1.4.1", msg="ExaGO needs at least one solver enabled")
It doesn't fix the above Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) error though..
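The intent of the version-bounded conflict above can be modelled in plain Python. This is a simplified sketch, not Spack code: it compares dotted versions as integer tuples, whereas Spack's real range matching is richer, and `needs_solver_conflict` is a hypothetical name.

```python
def parse_version(text):
    """Turn '1.4.1' into (1, 4, 1) for simple ordering."""
    return tuple(int(p) for p in text.split("."))


def needs_solver_conflict(version, hiop, ipopt, last_affected="1.4.1"):
    """True if the 'at least one solver' conflict should trigger.

    The conflict applies only when both solvers are disabled AND the
    version is at or below `last_affected` (modelling '@:1.4.1').
    """
    both_off = (not hiop) and (not ipopt)
    return both_off and parse_version(version) <= parse_version(last_affected)


if __name__ == "__main__":
    # Old release without solvers: still rejected.
    print(needs_solver_conflict("1.4.1", hiop=False, ipopt=False))
    # Newer release with the pflow-only build: allowed.
    print(needs_solver_conflict("1.5.0", hiop=False, ipopt=False))
```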
Yes, we should update our conflicts clause there in the newest update as well. We also need to disable python when building either ~hiop or ~ipopt.
I agree with 1.5.0 instead of 1.4.2 - I will bring this up with my team and see what the consensus is, but that shouldn't change the timeline on the release.
w.r.t. the CMake Threads/OpenMP issue, I think we were finally able to reproduce, and so I want to try out a change locally before merging the fix into the develop branch.
What I would suggest as a path forward is to enable minimalistic ExaGO build, which includes only PFLOW module
If you can provide a branch where this works - and the spack patch to disable this module - I can try it in xsdk build
@balay this should be fixed in develop. The release should be coming soon, as we just need to update our spack package to reflect dependency issues highlighted here, and we should be good to go
To update: the exago build now works on MacOS
https://gitlab.com/xsdk-project/spack-xsdk/-/jobs/3304607638
exago@1.5.0~cuda~hiop~ipo~ipopt+mpi~python~raja~rocm build_system=cmake build_type=RelWithDebInfo
However - the failure on cori persists
1 error found in build log:
20 -- Found MPI_C: /tmp/balay/spack/lib/spack/env/intel/icc (found version "3.1")
21 -- Found MPI_CXX: /tmp/balay/spack/lib/spack/env/intel/icpc (found version "3.1")
22 -- Found MPI: TRUE (found version "3.1") found components: C CXX
23 -- Found PkgConfig: /tmp/balay/spack/opt/spack/cray-cnl7-haswell/intel-19.1.2.254/pkgconf-1.8.0-uhriasntutowdps376mw4r3xntktq3ab/bin/pkg-config (found version "1.8.0")
24 -- Checking for module 'PETSc'
25 -- Found PETSc, version 3.18.1
>> 26 CMake Error at /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
27 Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)
28 Call Stack (most recent call first):
29 /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
30 /global/common/software/nersc/cori-2022q1/spack/cray-cnl7-haswell/cmake-3.22.2-ugphoa4/share/cmake-3.22/Modules/FindOpenMP.cmake:544 (find_package_handle_standard_args)
31 CMakeLists.txt:239 (find_package)
Will just disable exago on cori for now.
@CameronRutherford @abhyshr - do you have an estimate for when the release would occur?
Hoping all (outstanding) pkgs can provide a release this week (if not by Monday) - as they need to go through spack PR process, merged in - and then xsdk released, (changes in this process need re testing) , website updated with the xsdk release - and all these stages will take time, and they all should be completed by 18th [i.e about a week from now]
I suppose that we will debug this failure in future releases?
@balay we should have a release along with a spack PR within a day. Just have to polish up the changelog and tag the new version. I will create a spack PR with exago changes that are identical to existing xSDK PR.
You might want to debug this - irrespective of xsdk. [it could be a spack/config issue on cori - or it could be an exago issue]
Wrt xsdk - cori is going away in a couple of months - [i.e not something to debug for future release]
I guess I was trying to convey: do not delay the release for this issue.
@balay we should have a release along with a spack PR within a day. Just have to polish up the changelog and tag the new version. I will create a spack PR with exago changes that are identical to existing xSDK PR.
sounds good.
Currently xsdk-0.8.0 is with petsc-3.18.0. [and exago-1.4.1]
However exago has:
Is it possible to provide exago release compatible with petsc-3.18.0?