yangao07 / abPOA

abPOA: an SIMD-based C library for fast partial order alignment using adaptive band
MIT License

Better global alignment when aligning in other direction #78

Open glennhickey opened 1 week ago

glennhickey commented 1 week ago

@adamnovak has been picking through the HPRC graph and finding suspect alignments. Here is one from CHM13#0#chr3:164033777-164033842. If I align it with abpoa in its forward orientation (on chm13) I get region-2-abpoa (where gaps are transparent).
But if I reverse complement it, I get region-2-abpoa rev, which seems much cleaner, i.e. there is only 1 gap per row except in 3 cases, where the gap seems more properly placed on the right.
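To put a number on the "gaps per row" comparison rather than judging it by eye, a small helper can count the gap runs in each MSA row. This is only a sketch: the function name `gap_runs_per_row` and the toy rows are made up for illustration and are not the alignment from this issue.

```python
import re

def gap_runs_per_row(rows):
    """Count maximal runs of '-' in each aligned row (hypothetical helper)."""
    return [len(re.findall(r"-+", row)) for row in rows]

# Toy rows, NOT the real alignment from this issue.
toy_msa = [
    "AC--GT",  # one gap run
    "A-C-GT",  # two gap runs
    "ACGTGT",  # no gaps
]
print(gap_runs_per_row(toy_msa))  # [1, 2, 0]
```

Running it on the forward and reverse-complement MSAs (after converting them to plain aligned strings) would give a per-row count that can be compared directly.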

Are these alignments somehow scoring equivalently, even though by eye one seems much better? If not, is this expected or a bug? Do you have any suggestions on how it could be improved?

All the information to reproduce is here (see README for command lines): https://public.gi.ucsc.edu/~hickey/debug/abpoa_direction_oct17_2024/

Thanks so much!

yangao07 commented 1 week ago

The difference has two causes: 1) you used seeding and a progressive tree to order the input sequences, which does not work well for sequences from this repeat region. I did get fewer gaps on the forward strand with seeding disabled. 2) the alignment with more than one gap per row in the first MSA is actually optimal, even though its reverse complement gets a one-gap alignment, because some gaps are not penalized when they already exist in the partial order alignment graph. So the input order is very important in determining the number of gaps in the alignment.
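One way to read point 2 is sketched below with a toy sequence-to-graph aligner. This is not abPOA's code: it uses a plain linear gap penalty with arbitrary scores (abPOA's actual DP is banded and SIMD-based), and the function name, graph encoding, and sequences are all assumptions made for illustration. The graph first holds only the path A-C-G-T; once an earlier sequence has added a shortcut edge from A to T, a later read "AT" follows that edge and pays no gap penalty, even though its MSA row would still be drawn as `A--T`.

```python
def align_to_graph(bases, preds, query, match=2, mismatch=-4, gap=-4):
    """Global alignment score of `query` against a DAG (toy, linear gaps).

    bases: node labels; node v (1-based) has label bases[v - 1].
    preds: dict node -> list of predecessor nodes; 0 is a virtual start.
    Nodes must be numbered in topological order.
    """
    n, m = len(bases), len(query)
    S = [[None] * (m + 1) for _ in range(n + 1)]
    S[0] = [j * gap for j in range(m + 1)]  # query prefix before any node
    for v in range(1, n + 1):
        for j in range(m + 1):
            best = float("-inf")
            for p in preds[v]:
                best = max(best, S[p][j] + gap)  # skip node v (deletion)
                if j > 0:
                    sub = match if bases[v - 1] == query[j - 1] else mismatch
                    best = max(best, S[p][j - 1] + sub)  # align v to query[j-1]
            if j > 0:
                best = max(best, S[v][j - 1] + gap)  # insertion in the query
            S[v][j] = best
    has_successor = {p for ps in preds.values() for p in ps}
    sinks = [v for v in range(1, n + 1) if v not in has_successor]
    return max(S[v][m] for v in sinks)

bases = "ACGT"
chain = {1: [0], 2: [1], 3: [2], 4: [3]}              # only path A-C-G-T
with_shortcut = {1: [0], 2: [1], 3: [2], 4: [3, 1]}   # plus existing edge A->T

print(align_to_graph(bases, chain, "AT"))          # -4: two deleted nodes are penalized
print(align_to_graph(bases, with_shortcut, "AT"))  # 4: the A->T edge is free
```

Because the graph is built up incrementally, whether such a shortcut edge exists when a given sequence is aligned depends on which sequences came before it, which is one way the input order can change the number of gaps in the final MSA.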

Although I don't know which is better, they are all expected results.

Forward strand without seeding: image

Reverse strand without seeding: image

yangao07 commented 1 week ago

But I do agree that these are not truly optimal alignments. It may not be easy, but I will try to improve it.

glennhickey commented 6 days ago

Thanks for the quick follow-up. By eye it still seems that the reverse with seeding is the best. I understand that the difference between the different scenarios is explainable by the order, and it's not reflected in the current scoring scheme.

I'm still not sure I understand the difference when aligning the different strands -- shouldn't the order be unaffected?

In any case, it does seem like there is room for future improvements -- we are happy to test any ideas you come up with!

yangao07 commented 6 days ago

The difference between the strands arises because abPOA always puts gaps in the left-most position. To get the same result, gaps would have to be put on the right side for the reverse-complement strand.
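The effect of left-most gap placement can be reproduced with a toy pairwise aligner. The sketch below is not abPOA's implementation: it is a plain Needleman-Wunsch with a linear gap penalty and arbitrary scores, and the names `align`, `revcomp`, and `gaps_left` are hypothetical. In a repeat such as `ACACAC`, deleting one `AC` unit scores the same wherever the gap goes, so the tie-breaking rule alone decides the placement. Left-aligning on the reverse-complement strand and flipping the result back gives right-aligned gaps in forward coordinates.

```python
def revcomp(seq):
    """Reverse complement; '-' is kept so aligned rows can be flipped too."""
    return seq.translate(str.maketrans("ACGT-", "TGCA-"))[::-1]

def align(ref, qry, gaps_left=True, match=2, mismatch=-4, gap=-4):
    """Toy global alignment; ties are broken so gaps go left or right."""
    n, m = len(ref), len(qry)
    score = lambda a, b: match if a == b else mismatch
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + score(ref[i - 1], qry[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    # Traceback runs from the end of both sequences to the start. Preferring
    # matches on ties defers gaps toward the start (left-most placement);
    # preferring gaps on ties emits them near the end (right-most placement).
    i, j, out_r, out_q = n, m, [], []
    while i > 0 or j > 0:
        moves = []
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + score(ref[i - 1], qry[j - 1]):
            moves.append("M")
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            moves.append("D")
        if j > 0 and S[i][j] == S[i][j - 1] + gap:
            moves.append("I")
        move = moves[0] if gaps_left else moves[-1]
        if move == "M":
            out_r.append(ref[i - 1]); out_q.append(qry[j - 1]); i -= 1; j -= 1
        elif move == "D":
            out_r.append(ref[i - 1]); out_q.append("-"); i -= 1
        else:
            out_r.append("-"); out_q.append(qry[j - 1]); j -= 1
    return "".join(reversed(out_r)), "".join(reversed(out_q))

ref, qry = "ACACAC", "ACAC"
print(align(ref, qry, gaps_left=True)[1])    # --ACAC
print(align(ref, qry, gaps_left=False)[1])   # ACAC--
# Left-most gaps on the reverse-complement strand, flipped back to forward:
print(revcomp(align(revcomp(ref), revcomp(qry), gaps_left=True)[1]))  # ACAC--
```

All three alignments have the same score under this toy scheme, so a consistent result across strands requires mirroring the tie-breaking rule, as described above.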