yangao07 / abPOA

abPOA: an SIMD-based C library for fast partial order alignment using adaptive band
MIT License
118 stars 18 forks source link

progressive mode does not work with seeding #35

Closed glennhickey closed 2 years ago

glennhickey commented 2 years ago

There seems to be a conflict between -p and -S, where abpoa exits 1 and prints no output if the two options are used together. Could you please clarify if this is a bug or intended behaviour? If it's the latter, I think it would be good to add an informative error message.

Example. t.fa

1>
AGGTTCTATGGCTATACAAATTCAGCCGCCCATGTATGAAGAGAGTTTCTTAGTTATGCAGTAGATAGCTCAATCATCAGCAGTTTTATTGCAGCACACAACATCTATCAAAGGGTATCCTCTTCTTGCCTAGAGATTAGCACTAGTGGCATGATTATGACTTGCTATGCATACTTTCGTGGGTAATCATTTTTAAATTCCTGATTTCCTTTACATTCAG
>2
AGGTTCTATATTAGTCATGTTATACAAAATCATTTGCCCATGTATTATCAGAATCTCTTAGTTATGCAGTAGATGTAGATAGCTTGATCATTGGCAGTTTTGTTGCAGTATACAACAGATGATTACATCATAAATGGTCTCTGAGGTAAGTGTATCTGGTAATCTTGTATCAAAGGGTATCCTCTTTTTGCCTAGACATTAGCGCCAGTGGCATCATCATGATGCTATGCTACTTTCATTGCAGAGAGCCGGGGTAATCGTTTCTGAATTCCTGAGTTCCTTTGCATTCAG

works fine with all combinations of -p or -S by themselves:

abpoa t.fa -m 0 -r 1
[main] CMD:  abpoa -m 0 -r 1 t.fa
>Multiple_sequence_alignment
AGGTTCTAT----G----GCTATACAAATTCAGCCGCCCATGTATGAAGAGAGTTTCTTAGTTATGCA------GTAGATAGCTCAATCATCAGCAGTTTTATTGCAGCACACAACA------------------------------------------------TCTATCAAAGGGTATCCTCTTCTTGCCTAGAGATTAGCACTAGTGGCATGATTATGACTTGCTATGCATACTTTCGT------------GGGTAATCATTTTTAAATTCCTGATTTCCTTTACATTCAG
AGGTTCTATATTAGTCATGTTATACAAAATCATTTGCCCATGTATTATCAGAATCTCTTAGTTATGCAGTAGATGTAGATAGCTTGATCATTGGCAGTTTTGTTGCAGTATACAACAGATGATTACATCATAAATGGTCTCTGAGGTAAGTGTATCTGGTAATCTTGTATCAAAGGGTATCCTCTTTTTGCCTAGACATTAGCGCCAGTGGCATCATCATGA--TGCTATGC-TACTTTCATTGCAGAGAGCCGGGGTAATCGTTTCTGAATTCCTGAGTTCCTTTGCATTCAG
[abpoa_main] Real time: 0.001 sec; CPU: 0.001 sec; Peak RSS: 0.003 GB.
abpoa t.fa -m 0 -r 1 -S
[main] CMD:  abpoa -m 0 -r 1 -S t.fa
>Multiple_sequence_alignment
AGGTTCTAT----G----GCTATACAAATTCAGCCGCCCATGTATGAAGAGAGTTTCTTAGTTATGCA------GTAGATAGCTCAATCATCAGCAGTTTTATTGCAGCACACAACA------------------------------------------------TCTATCAAAGGGTATCCTCTTCTTGCCTAGAGATTAGCACTAGTGGCATGATTATGACTTGCTATGCATACTTTCGT------------GGGTAATCATTTTTAAATTCCTGATTTCCTTTACATTCAG
AGGTTCTATATTAGTCATGTTATACAAAATCATTTGCCCATGTATTATCAGAATCTCTTAGTTATGCAGTAGATGTAGATAGCTTGATCATTGGCAGTTTTGTTGCAGTATACAACAGATGATTACATCATAAATGGTCTCTGAGGTAAGTGTATCTGGTAATCTTGTATCAAAGGGTATCCTCTTTTTGCCTAGACATTAGCGCCAGTGGCATCATCATGA--TGCTATGC-TACTTTCATTGCAGAGAGCCGGGGTAATCGTTTCTGAATTCCTGAGTTCCTTTGCATTCAG
[abpoa_main] Real time: 0.002 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.
abpoa t.fa -m 0 -r 1 -p
[main] CMD:  abpoa -m 0 -r 1 -p t.fa
>Multiple_sequence_alignment
AGGTTCTAT----G----GCTATACAAATTCAGCCGCCCATGTATGAAGAGAGTTTCTTAGTTATGCA------GTAGATAGCTCAATCATCAGCAGTTTTATTGCAGCACACAACA------------------------------------------------TCTATCAAAGGGTATCCTCTTCTTGCCTAGAGATTAGCACTAGTGGCATGATTATGACTTGCTATGCATACTTTCGT------------GGGTAATCATTTTTAAATTCCTGATTTCCTTTACATTCAG
AGGTTCTATATTAGTCATGTTATACAAAATCATTTGCCCATGTATTATCAGAATCTCTTAGTTATGCAGTAGATGTAGATAGCTTGATCATTGGCAGTTTTGTTGCAGTATACAACAGATGATTACATCATAAATGGTCTCTGAGGTAAGTGTATCTGGTAATCTTGTATCAAAGGGTATCCTCTTTTTGCCTAGACATTAGCGCCAGTGGCATCATCATGA--TGCTATGC-TACTTTCATTGCAGAGAGCCGGGGTAATCGTTTCTGAATTCCTGAGTTCCTTTGCATTCAG
[abpoa_main] Real time: 0.002 sec; CPU: 0.006 sec; Peak RSS: 0.003 GB.

but not with -p and -S together:

abpoa t.fa -m 0 -r 1 -p -S
[main] CMD:  abpoa -m 0 -r 1 -p -S t.fa
[abpoa_add_graph_sequence] seq_l: 0 start: 0    end: 0.
echo $?
1
yangao07 commented 2 years ago

Hi Glenn,

Just find out that this is a bug in the progressive mode, fixed now.

Yan

glennhickey commented 2 years ago

Thanks for your fast response! Here's a small example that still fails after your fix, though:

u.fa

>0
CTGGGCTAGT
>1
CTCAGCAAGT
>2
CTTGGATTTC

A segfault this time but once again, it only fails with both -p -S

abpoa -r 1 -m 0 u.fa -p -S
[main] CMD:  abpoa -r 1 -m 0 -p -S u.fa
Segmentation fault (core dumped)

abpoa -r 1 -m 0 u.fa -p
[main] CMD:  abpoa -r 1 -m 0 -p u.fa
>Multiple_sequence_alignment
CTGGGCTAGT
CTCAGCAAGT
CTTGGATTTC
[abpoa_main] Real time: 0.001 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.

abpoa -r 1 -m 0 u.fa -S
[main] CMD:  abpoa -r 1 -m 0 -S u.fa
>Multiple_sequence_alignment
CTGGGCTAGT
CTCAGCAAGT
CTTGGATTTC
[abpoa_main] Real time: 0.001 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.

abpoa -r 1 -m 0 u.fa
[main] CMD:  abpoa -r 1 -m 0 u.fa
>Multiple_sequence_alignment
CTGGGCTAGT
CTCAGCAAGT
CTTGGATTTC
[abpoa_main] Real time: 0.001 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.
yangao07 commented 2 years ago

This is another bug. These two bugs are because of no minimizer or no minimizer hit existing.

Thanks for catching them.

Yan

glennhickey commented 2 years ago

Thanks again! Here's another one:

v.fa

>0
TAGTAGTCCA
>1
TAGGACGATTTAGGAATCTA
>2
CAGGAATCTT

Works without -s -P

abpoa v.fa -m 0 -r 1
[main] CMD:  abpoa -m 0 -r 1 v.fa
>Multiple_sequence_alignment
TAG-------TAG---TCCA
TAGGACGATTTAGGAATCTA
----------CAGGAATCTT
[abpoa_main] Real time: 0.001 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.

But last row disappears with -S -p

abpoa v.fa -m 0 -r 1 -S -p
[main] CMD:  abpoa -m 0 -r 1 -S -p v.fa
>Multiple_sequence_alignment
TAG-------TAG---TCCA
TAGGACGATTTAGGAATCTA
--------------------
[abpoa_main] Real time: 0.001 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.
yangao07 commented 2 years ago

Another silly bug, fixed now. Thanks, Glenn!

Also one more thing, the progressive mode will work only if the seeding mode is enabled because they are using the same set of minimizers. Otherwise, abPOA skips the progressive tree building if the seeding mode is disabled.

Right now I think separating them makes more sense, working on that now.

glennhickey commented 2 years ago

thanks for the clarification. here's another case

w.fa

>0
AGGA
>1
AGAA
>2
AGGA
abpoa w.fa -m 0 -r 1 -S 
[main] CMD:  abpoa -m 0 -r 1 -S w.fa
>Multiple_sequence_alignment
AGGA
AGAA
AGGA
[abpoa_main] Real time: 0.001 sec; CPU: 0.005 sec; Peak RSS: 0.003 GB.
abpoa w.fa -m 0 -r 1 -S -p
[main] CMD:  abpoa -m 0 -r 1 -S -p w.fa
>Multiple_sequence_alignment
AGGA
----
----
[abpoa_main] Real time: 0.001 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.
yangao07 commented 2 years ago

Kind of left this bug there in the last commit by mistake. I think we are good this time. :)

Yan

glennhickey commented 2 years ago

Thanks, this version gets through cactus's integration tests except for one edge case -- it doesn't work always work on 1-sequence cases. Easy enough to work around, but perhaps you still want to fix?

w.fa

>0
AAAAACGTACCACGATACACAGATATAGAACCCCCCCCCCCCCCCCCTAGAGAG
abpoa -m 0 -r 1 w.fa
[main] CMD:  abpoa -m 0 -r 1 w.fa
>Multiple_sequence_alignment
AAAAACGTACCACGATACACAGATATAGAACCCCCCCCCCCCCCCCCTAGAGAG
[abpoa_main] Real time: 0.000 sec; CPU: 0.004 sec; Peak RSS: 0.004 GB.
abpoa -m 0 -r 1 w.fa -p
[main] CMD:  abpoa -m 0 -r 1 -p w.fa
[abpoa_add_graph_sequence] seq_l: 0 start: 0    end: 0.
yangao07 commented 2 years ago

Right, just fixed it. Rare case, but should better be fixed.

glennhickey commented 2 years ago

It is still sometimes removing sequences in the output...

wget http://public.gi.ucsc.edu/~hickey/debug/abpoa_fail_feb24.fa

abpoa abpoa_fail_feb24.fa -m 0 -r 1 | tail -1 | sed -e 's/-//g'
[main] CMD:  abpoa -m 0 -r 1 abpoa_fail_feb24.fa
[abpoa_main] Real time: 0.040 sec; CPU: 0.044 sec; Peak RSS: 0.025 GB.
GTTTGGAGACAGGTAAAATCAGGGTTAGAATAGGGTAGTGTTAGGGTAGTGTGTGGGTGTGGGTGTGTGGGTGTGGTGTGTGGGTGTGGTGTGTGTGGGTGTGGTGTGTGGGTGGGGGGGGGGGGGGGGGGGGGGGGGGGTGTGGGGTGTGGGTGTGGGTGTGGGTGTGGGTGTGGTGTGGGTGTGGTGTGTGGGTGTGG

# last sequence disappears with -p
abpoa abpoa_fail_feb24.fa -m 0 -r 1 -p | tail -1 | sed -e 's/-//g'
[main] CMD:  abpoa -m 0 -r 1 -p abpoa_fail_feb24.fa
[abpoa_main] Real time: 0.017 sec; CPU: 0.021 sec; Peak RSS: 0.027 GB.
yangao07 commented 2 years ago

This is actually a fatal bug that may result in wrong progressive tree orders. I have to admit that I haven't tested the progressive mode on enough datasets. Thanks so much, Glenn.

glennhickey commented 2 years ago

Thanks! At least the bugs were very easy to find -- better it fail completely than seem to work but do the wrong clustering. Anyway, the current version seems to be working, but I will still try it on a much bigger dataset than I have so far...