william-dawson / NTPoly

A massively parallel library for computing the functions of sparse matrices.
https://william-dawson.github.io/NTPoly/
MIT License

Nix Package for NTPoly #231

Closed maxwell-gisborne closed 6 months ago

maxwell-gisborne commented 6 months ago

Hi, I am trying to package ntpoly-v2.3.1 into a nix flake, and I am having some difficulty.

I hope it is okay for me to open an issue here about this topic; I apologize if it's not.

I have managed to get it to compile now, but it's failing tests 1-11.

I used the Linux.cmake config, but to get it to compile I had to remove the -openmp option from the C++ flags, as it seemed to be confusing cc1plus. The compiler is provided by mpicxx, so I suppose it already knows it should link to MPI, but maybe this is causing problems. As seen later, I believe the errors are a failure to link properly with MPI. Since adding a -openmp flag seems to break the C++ compiler, I'm not sure how this is supposed to be done.

The CMAKE_TOOLCHAIN_FILE I am using is this:

    # Build file for a gcc, linux system.
    set(CMAKE_SYSTEM_NAME Linux)
    set(CMAKE_C_COMPILER mpicc)
    set(CMAKE_Fortran_COMPILER mpif90)
    set(CMAKE_CXX_COMPILER mpicxx)
    set(CMAKE_CXX_FLAGS "")

    # Library Files
    set(TOOLCHAIN_LIBS "-lblas")

    # Release suggestions
    set(CXX_TOOLCHAINFLAGS_RELEASE "-O3 -lgomp")
    set(F_TOOLCHAINFLAGS_RELEASE "-O3 -cpp")

    # Debug suggestions
    set(CXX_TOOLCHAINFLAGS_DEBUG "-O0 -Wall")
    set(F_TOOLCHAINFLAGS_DEBUG "-O0 -cpp -fcheck=all -Wall")

    #set(NOSWIG "yes")
    set(CMAKE_BUILD_TYPE "Debug")
    set(CMAKE_Fortran_FLAGS "-fallow-argument-mismatch")

When I run make test, tests 1-11 fail, while the rest succeed.

The output of the failed tests is:

hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.                                                                                                                                                           
--------------------------------------------------------------------------                                                                                                                                                                    
The value of the MCA parameter "plm_rsh_agent" was set to a path                                                       
that could not be found:                                                                                               

  plm_rsh_agent: ssh : rsh                                                                                             

Please either unset the parameter, or check that the path is correct                                                                                                                                                                          
--------------------------------------------------------------------------                                                                                                                                                                    
[localhost:01345] [[INVALID],INVALID] FORCE-TERMINATE AT Not found:-13 - error plm_rsh_component.c(335)                                                                                                                                       
<end of output>                                                                                                        
Test time =   0.71 sec                                                                                                 
----------------------------------------------------------                                                             
Test Failed.                                                                                                           
"Regression111" end time: Feb 27 18:43 UTC                                                                             
"Regression111" time elapsed: 00:00:00                                                                                 
----------------------------------------------------------                                                             

2/26 Testing: Regression211                                                                                            
2/26 Test: Regression211                                                                                               
Command: "/nix/store/6payx2da66dbjl6vg15csxfb5hpf3df4-bash-5.2-p15/bin/bash" "/build/source/Build/bin/RunTest.sh" "2" "1" "1" "2"                                                                                                             
Directory: /build/source/Build/UnitTests                                                                               
"Regression211" start time: Feb 27 18:43 UTC                                                                           
Output:                                                                                                                
----------------------------------------------------------                                                             
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.                                                                                                                                                           
--------------------------------------------------------------------------                                                                                                                                                                    
The value of the MCA parameter "plm_rsh_agent" was set to a path                                                       
that could not be found:                                                                                               

  plm_rsh_agent: ssh : rsh                                                                                             

Please either unset the parameter, or check that the path is correct                                                                                                                                                                          
--------------------------------------------------------------------------                                                                                                                                                                    
[localhost:01347] [[INVALID],INVALID] FORCE-TERMINATE AT Not found:-13 - error plm_rsh_component.c(335)                                                                                                                                       
<end of output>                                                                                                        
Test time =   0.11 sec                                                                                                 
----------------------------------------------------------                                                             
Test Failed.                                                                                                           
"Regression211" end time: Feb 27 18:43 UTC                                                                             
"Regression211" time elapsed: 00:00:00                                                                                 
----------------------------------------------------------                                                             

3/26 Testing: Regression121                                                                                            
3/26 Test: Regression121                                                                                               
Command: "/nix/store/6payx2da66dbjl6vg15csxfb5hpf3df4-bash-5.2-p15/bin/bash" "/build/source/Build/bin/RunTest.sh" "1" "2" "1" "2"                                                                                                             
Directory: /build/source/Build/UnitTests                                                                               
"Regression121" start time: Feb 27 18:43 UTC                                                                           
Output:                                                                                                                
----------------------------------------------------------                                                             
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.                                                                                                                                                           
--------------------------------------------------------------------------                                                                                                                                                                    
The value of the MCA parameter "plm_rsh_agent" was set to a path                                                       
that could not be found:                                                                                               

  plm_rsh_agent: ssh : rsh                                                                                             

Please either unset the parameter, or check that the path is correct                                                                                                                                                                          
--------------------------------------------------------------------------                                                                                                                                                                    
[localhost:01349] [[INVALID],INVALID] FORCE-TERMINATE AT Not found:-13 - error plm_rsh_component.c(335)                                                                                                                                       
<end of output>                                                                                                        
Test time =   0.11 sec                                                                                                 
----------------------------------------------------------                                                             
Test Failed.                                                                                                           
"Regression121" end time: Feb 27 18:43 UTC                                                                             
"Regression121" time elapsed: 00:00:00                                                                                 
----------------------------------------------------------                                                             

4/26 Testing: Regression112                                                                                            
4/26 Test: Regression112                                                                                               
Command: "/nix/store/6payx2da66dbjl6vg15csxfb5hpf3df4-bash-5.2-p15/bin/bash" "/build/source/Build/bin/RunTest.sh" "1" "1" "2" "2"                                                                                                             
Directory: /build/source/Build/UnitTests                                                                               
"Regression112" start time: Feb 27 18:43 UTC                                                                           
Output:                                                                                                                
----------------------------------------------------------                                                             
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.                                                                                                                                                           
--------------------------------------------------------------------------                                                                                                                                                                    
The value of the MCA parameter "plm_rsh_agent" was set to a path                                                       
that could not be found:                                                                                               

  plm_rsh_agent: ssh : rsh                                                                                             

Please either unset the parameter, or check that the path is correct                                                                                                                                                                          
--------------------------------------------------------------------------                                                                                                                                                                    
[localhost:01351] [[INVALID],INVALID] FORCE-TERMINATE AT Not found:-13 - error plm_rsh_component.c(335)                                                                                                                                       
<end of output>                                                                                                        
Test time =   0.11 sec                                                                                                 
----------------------------------------------------------                                                             
Test Failed.                                                                                                           

Any help would be greatly appreciated.
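With eleven near-identical failure blocks, it can help to reduce a log like the one above to just the names of the failing tests. The following is a minimal sketch, assuming the log has the layout shown in this comment, where a `Test Failed.` line is immediately followed by a `"<name>" end time:` line. The function name `failed_tests` is hypothetical and not part of CTest or NTPoly.

```shell
# Print the names of failed tests from a CTest log read on stdin.
# Assumes each "Test Failed." line is followed by a line of the form
#   "<test name>" end time: ...
failed_tests() {
    awk '
        /^Test Failed\./ { failed = 1; next }
        failed && /^"[^"]+" end time:/ {
            split($0, parts, "\"")   # parts[2] is the quoted test name
            print parts[2]
            failed = 0
        }
    '
}

# Example (assumption: the log above came from CTest's log file in the
# build directory):
#   failed_tests < Testing/Temporary/LastTest.log
```

A block whose `end time` line is missing, such as a truncated final entry, is simply not reported by this sketch.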

william-dawson commented 6 months ago

Thanks for working on the nix flake. 1) If the first set of tests works, there is probably no issue with OpenMP linking, so I wouldn't worry about it. In fact, CMake is set to search for OpenMP itself if the flag is not provided (you might see some useful output about this during the CMake configure step). 2) For nix, is the build done in a Docker container? The error sounds something like this one: https://github.com/open-mpi/ompi/issues/3625. Maybe it can be fixed by installing ssh in the container.
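The `plm_rsh_agent: ssh : rsh` line in the log suggests Open MPI looked for `ssh` and then `rsh` and found neither. A quick way to confirm this from inside the build environment is to check the search path before running the tests. This is a minimal sketch; the helper name `find_rsh_agent` is hypothetical, and it only inspects `PATH`, not Open MPI's own configuration.

```shell
# Print the first of the given commands that is found on PATH.
# Returns non-zero if none of them is available.
find_rsh_agent() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Check the two agents named in the error message.
if agent=$(find_rsh_agent ssh rsh); then
    echo "launcher agent available: $agent"
else
    echo "neither ssh nor rsh is on PATH; the rsh launcher cannot start" >&2
fi
```

If the check reports that neither command is available, adding a package that provides `ssh` to the build inputs is the fix suggested in this comment.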

maxwell-gisborne commented 6 months ago

Thank you for replying.

(1) It is tests 1 to 11 that are failing, and tests 12 to 26 that are passing. So I suppose that means it's the first set which is failing.

(2) I am not using a Docker container, but Nix containerizes its build environments, so perhaps it's the same problem.

maxwell-gisborne commented 6 months ago

After adding openssh to the build environment, the same tests are failing, but now with a different error message.

They now report

At line 7 of file dense_includes/CheckMemoryPoolValidity.f90
Fortran runtime error: Allocatable argument 'this' is not allocated

repeated a few times followed by

Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

william-dawson commented 6 months ago

Great. Installing openssh seems to have helped, because now we're actually getting into the code.

It seems like there is actually a bug in the 2.7.1 version, I can reproduce this on my machine. Fortunately the bug doesn't exist in the 3.0 series. My guess is that #188 fixed it. I will backport whatever fix was needed and release a v2.7.2 for you. Sorry for the trouble and thanks for finding this.

maxwell-gisborne commented 6 months ago

Okay, thanks.

I would like to package a version compatible with BigDFT. Should I choose 3.0.0 or 3.1.0_bigdft? What is the difference?

maxwell-gisborne commented 6 months ago

I've installed v3.0.0 and all tests pass :)

Thank you for your help.

william-dawson commented 6 months ago

For the latest release of BigDFT (1.9.4) it is using NTPoly 3.0.0, so I recommend that you start with that. The _bigdft version was a prerelease so I could test out some new features.

Thank you for your contributions! I'm looking forward to there being a BigDFT nix package!