yangao07 / abPOA

abPOA: an SIMD-based C library for fast partial order alignment using adaptive band
MIT License
118 stars 18 forks source link

abPOA crashes with seeding enabled #66

Closed glennhickey closed 5 months ago

glennhickey commented 6 months ago

To reproduce

wget -q http://public.gi.ucsc.edu/~hickey/debug/abpoa_fail_mar21.fa
abpoa -r 1 -m 0 -S abpoa_fail_mar21.fa > out
[main] CMD:  abpoa -r 1 -m 0 -S abpoa_fail_mar21.fa
Segmentation fault (core dumped)

It runs through without -S

abpoa -r 1 -m 0 abpoa_fail_mar21.fa > out
[main] CMD:  abpoa -r 1 -m 0 abpoa_fail_mar21.fa
[abpoa_main] Real time: 164.640 sec; CPU: 164.630 sec; Peak RSS: 6.884 GB.

I used abpoa version 8a2dc2fc49b10bb1b57ef8738b7f8c6a48cbad79 Thanks!

yangao07 commented 6 months ago

Hi Glenn, for this data, it runs through on my machine, but clearly uses a lot of memory

[main] CMD:  /home/gaoy1/program/abPOA/bin/abpoa -r 1 -m 0 -S abpoa_fail_mar21.fa
[abpoa_main] Real time: 177.372 sec; CPU: 173.506 sec; Peak RSS: 57.480 GB.

I guess this error is caused by running of memory, I will double-check if this memory usage is correct.

yangao07 commented 6 months ago

Also, the MSA result of seeding run seems to be off.

glennhickey commented 6 months ago

Thanks for looking into this. I ran a few more experiments:

Compile flags | Seed | Memory (G) | Time (s)| Output Size (b)
------------------------------------------------------------
native        | No   | 6.884      | 258     | 2,141,347
native        | Yes  | 52.181     | 295     | CRASH
------------------------------------------------------------
sse2          | No   | 6.902      | 352     | 2,138,334
sse2          | Yes  | 8.379      | 78      | CRASH
------------------------------------------------------------
sse41         | No   | 6.901      | 292     | 2,138,334
sse41         | Yes  | 8.379      | 70      | CRASH
------------------------------------------------------------
avx2          | No   | 6.884      | 224     | 2,141,347
avx2          | Yes  | 52.181     | 296     | CRASH
------------------------------------------------------------
gdb           | No   | 6.895      | 316     | 2,152,875
gdb           | Yes  | 30.511     | 930     | 24,155,897

The only way I can get it to run through with seeding with compiling with gdb defined. And then it runs 3 times longer and the output is 10x larger. It's also a bit concerning that even without seeding, the output changes between gdb/sse/avx, though the difference is much smaller.

Anyway, even without the crash, as you say, this example shows that there is a problem with seeding. And, which is concerning to me, this problem may manifest in bad output (very hard to detect) rather than crashes depending on the system.

For the record I'm using

Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
1.5 TB RAM

and simd_check reports (when built with -march=native)

__AVX2__

=======================================
==== AVX2 (256 bits) will be used. ====
=======================================

which is consistent with the avx2 and native results being equivalent above

thanks again for your help.

yangao07 commented 6 months ago

I do see thoes difference between sse4/avx2/native in non-seeding mode on my machine. Hopefully I can figure out this and the seeding mode by this weekend.

yangao07 commented 6 months ago

@glennhickey The instruction-dependent difference should be fixed now. I tested the sse41/avx2/avx512 on my machine. Also the seg-fault in seeding mode should also be fixed. It was caused by a huge maxtrix size (>32G). Please also test on your side and let me know if you see anything wrong.

But the alignment result of seeding mode is still off. This is because the seeding/anchor-searching by abPOA are not optimized to deal with divergent input sequences like this dataset. So the sub-sequence vs sub-graph are largely off, leading to a huge POA graph. In short, the seeding mode should not be enabled unless you are very sure about the consequence.

glennhickey commented 5 months ago

Thanks @yangao07 for looking at this. I just tried cd178fdd64c22ae1800d8600224c82684329c894 and confirm:

glennhickey commented 5 months ago

I'll dig out data to reproduce when I have a chance, but a heads up that updating to this commit fails Cactus's unit tests

[simd_abpoa_align_sequence_to_subgraph] Error in cg_backtrack.
yangao07 commented 5 months ago

I did see this error duing debugging the above issues, but I thought I sovled it :(

I'll dig out data to reproduce when I have a chance, but a heads up that updating to this commit fails Cactus's unit tests

[simd_abpoa_align_sequence_to_subgraph] Error in cg_backtrack.
glennhickey commented 5 months ago

Thanks here's how to reproduce

wget -q http://public.gi.ucsc.edu/~hickey/debug/abpoa_fail_apr8.fa
wget -q http://public.gi.ucsc.edu/~hickey/debug/abpoa_fail_apr8.fa.mat
abpoa abpoa_fail_apr8.fa -O 4,24 -E 2,1 -b 10 -f 0.010000 -t abpoa_fail_apr8.fa.mat -r 1 -m 0 > abpoa_fail_apr8.fa.out
[simd_abpoa_align_sequence_to_subgraph] Error in cg_backtrack.
yangao07 commented 5 months ago

This error should be fixed now.

glennhickey commented 5 months ago

This commit passes the tests. thanks!