xiaoyeli / superlu

Supernodal sparse direct solver. https://portal.nersc.gov/project/sparse/superlu/
Other
272 stars 95 forks source link

Segfault in Windows tests #134

Closed gruenich closed 8 months ago

gruenich commented 9 months ago

While adding the Windows runner, we found out that the tests are segfaulting using MSVC. #133 contains a more detailed analysis of the issue. Valgrind can also detect it on Linux.

Without this fix, we cannot enable the tests on Windows and the Windows runner is only testing configure and building.

wo80 commented 9 months ago

Due to the lack of communication, I'm still not sure if @xiaoyeli has gotten the point that this is not a problem with Windows. It's a problem with the tests passing on Linux and Mac, though there is clearly something going wrong.

Let me add some more thoughts to my analysis in https://github.com/xiaoyeli/superlu/pull/133#issuecomment-1826879216. I think there are two possibilities:

1) The value of -1 for urow is invalid. This would mean that the error has to be fixed in the SuperLU code. 2) The (at least to me) more plausible possibility is that the value is valid and it's an indicator for "stop processing, no more data to come here, the matrix was singular". This raises a few questions:

The answer to that last question is easy: the tests are way too complex. Ideally, if a test fails, it would give a hint about what goes wrong. The tests as they are now, have four nested for loops (imat, iequed, ifact and equil), but only one indicator for failure (nfails). This is absolutly useless when it comes to help finding the cause of a test failure. You have to debug the tests (which is not what tests are made for).

One way to improve this would be to add more output. This will then at least give a hint at what stage the error occurred (in particular in case of a segfault). Maybe add an additional command line option -v to enable more verbose output.

xiaoyeli commented 9 months ago

I'll use valgrind to fix the testers. I merged #133 to temporarily not run testers on Windows

xiaoyeli commented 8 months ago

@wo80 @gruenich I finally got around to fix this. It's related to testing code error while testing singular matrices. Thanks for your detailed report.

wo80 commented 8 months ago

Thanks @xiaoyeli. All tests are passing now. I suggest you undo https://github.com/xiaoyeli/superlu/pull/133/commits/0559d56084de64c7d2384e49165f79b2fd0a601f , so that all platforms are tested properly.

wo80 commented 8 months ago

It might be worth also updating the (c|d|s|z)gst01.c files: https://github.com/xiaoyeli/superlu/blob/c173a6642c1dbb726c58777145396be3b194c747/TESTING/cgst01.c#L38-L40 Looking at the changes in c173a6642c1dbb726c58777145396be3b194c747, it seems n is not the number of columns but rather the rank of A.

In any case, I think that the method should use exit(-1) or at least print an error message in case urow is invalid. As we have seen, Linux and Mac did not report any problem, so checking this explicitly would be a good idea. Maybe it's also worth opening a separate issue for this (a general review/cleanup of the tests).