xiaoyeli / superlu

Supernodal sparse direct solver. https://portal.nersc.gov/project/sparse/superlu/
Other
281 stars 96 forks source link

Draft: add windows test runner #133

Closed gruenich closed 11 months ago

gruenich commented 1 year ago

Open issues:

  1. Compiling the SuperLU library fails without an error message.
  2. The optional dependency WinGetOps should be added to the runner to run the tests

Until #131 is merged, this based on top of this branch.

gruenich commented 1 year ago

This branch would add two Windows runners, one with WinGetOpt and one without. The tests are executed for the former. For d_tests are segfaulting.

As pointed out in #108, reverting cf93 fixes the issue.
@xiaoyeli I suggest to revert all changed made as part of #117 as part of this pull request. We have to reach out to @evan-dsa to re-create his changes in a way that they a) do not segfault SuperLU's test suite and b) cover not only double but also the other three types s, c, and z. Do you agree? Should I update this pull request to revert #117 completely?

xiaoyeli commented 1 year ago

All 4 versions should be correct now. Do you still see segfaults?

gruenich commented 1 year ago

Thank you for the promt reply. I rebased this branch to include your changes and dropped the reversal of the problematic patch from pull request 117. Unfortunately, we get even more segfaults, not only for double but also for the other types.

xiaoyeli commented 1 year ago

The tests are fine with Linux and MacOS. I have no experience with Windows, so I don't know what goes wrong with Windows.

wo80 commented 1 year ago

The tests are fine with Linux and MacOS. I have no experience with Windows, so I don't know what goes wrong with Windows.

Well I hope you are not suggesting to ignore the error.


To tackle the problem, I'd suggest to focus on the test code first, then maybe the SuperLU code.

dtest crashes for imat=7 in dgst01 line 101:

https://github.com/xiaoyeli/superlu/blob/f530ef30e72ba1e7bbc382d23fe110c4faed1def/TESTING/dgst01.c#L94-L101

For imat=7 the 9x9 test matrix has all columns > 5 set to zero. At loop k=5 in dgst01 the value for urow gets -1 at some point, which produces a garbage offset for superno. I also checked this on Debian/GCC and it's the same.

One might be lucky and the memory at superno = Lstore->col_to_sup[urow] for urow=-1 is some reasonable number, but I have to express my full support for MSVC producing an access violation exception here!

wo80 commented 1 year ago

Valgrind also catches this correctly:

Invalid read of size 4
   at 0x10C19D: dgst01
   by 0x10B4CA: main
 Address 0x51005dc is 4 bytes before a block of size 40 alloc'd
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117D9F: superlu_malloc
   by 0x117FA3: int32Malloc
   by 0x12588F: dLUMemInit
   by 0x11D830: dgstrf
   by 0x118D35: dgssv
   by 0x10B422: main

...

For lists of detected and suppressed errors, rerun with: -s
ERROR SUMMARY: 40 errors from 12 contexts (suppressed: 0 from 0)

That 4 bytes before a block ... sounds a lot like accessing an integer array at index -1.

@gruenich Maybe add valgrind to the Ubuntu workflow?

BTW, the test command line was ./d_test -t "LA" -n 9 -s 2 -l 10000000, both for valgrind and the debugger in the previous post.

gruenich commented 12 months ago

Moving the segfault issue out of the merge request. I opened #134 for this. I hoped we can fix this here, but I don't want to block other improvements in this merge requests.

I propose to merge this.