yangao07 / abPOA

abPOA: an SIMD-based C library for fast partial order alignment using adaptive band
MIT License
118 stars 18 forks source link

MSA anchoring #70

Open rmhubley opened 5 months ago

rmhubley commented 5 months ago

I have a naive question about your MSA methodology. Why would the aligner choose to align short segments (<20 bp) at the start of the alignment and then allow a large deletion (>1000bp) before aligning again? Perhaps I am misunderstanding the parameters to abPOA.

Example: I have a 100 homologous sequences -- a few full length (~2200 bp) and many fragments (~300 bp in length). I used local alignment as there is always a chance in these datasets that small portions of the edges of these sequences could be unrelated sequence, although in this dataset it doesn't look to be the case. I ran abPOA as:

../bin/abpoa family-seqs.fa -t comparison.matrix -O 25,0 -E 5,0  -m 1 -r 1

Where the comparison matrix is simply:

    A   G   C   T   N      
A   9  -6 -15 -17  -1     
G  -6  10 -15 -15  -1     
C -15 -15  10  -6  -1     
T -17 -15  -6   9  -1     
N  -1  -1  -1  -1  -1     

Here is the start of the MSA with the first 11 sequences anchoring with minor or no overlap to other sequences before entering a large deletion:

gi|526567    1 AAGC------------------------------------------------------------------------------------------------    4 [1]
gi|527347    1   GC------------------------------------------------------------------------------------------------    2 [2]
gi|526255    1     ATGACA---------------------T--------------------------------------------------------------------    7 [3]
gi|526360    1         CA---------------------T--------------------------------------------------------------------    3 [4]
gi|527067    1           TG----------------------------------------------------------------------------------------    2 [5]
gi|527527    1             ATA-------------------------------------------------------------------------------------    3 [6]
gi|527416    1                C------------------------------------------------------------------------------------    1 [7]
gi|526343    1                 CCACCTGGAAG-------------------------------------------------------------------------    11 [8]
gi|527241    1                            TGCTT--------------------------------------------------------------------    5 [9]
gi|530654    1                                 CTA-----------------------------------------------------------------    3 [10]
gi|525985    1                                    TACAT------------------------------------------------------------    5 [11]
gi|525696    1                                         CCTGTCTTCTACCATCTCCGAACTCCAGATACTCCAACAGCGAAAGGGATCTGGGCCCAA    60 [12]
gi|525698    1                                         CCTGTCTTCCACCATCTCCGAACTCCAGATACTCCTACAGCGAAAGGGATCTGGGCCCAA    60 [13]
gi|525697    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [14]
gi|525699    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [15]

The gap in the first 11 sequences goes on for about 1000 bp in the alignment before starting backup up again:

gi|526567    5 --------------------------------------------------------------------------------ATAT------AGAG-T--T-    14 [1]
gi|527347    3 --------------------------------------------------------------------------------ATGT------CAA-----T-    10 [2]
gi|526255    8 -----------------------------------------------------------------------------------TAT---GAG------T-    14 [3]
gi|526360    4 -----------------------------------------------------------------------------------TGT---GGGG-----C-    11 [4]
gi|527067    3 -------------------------------------------------------------------------------TGTATAT-----AAT-G--T-    14 [5]
gi|527527    4 ------------------------------------------------------------------------AA------ATGT------AAA-------    12 [6]
gi|527416    2 --------------------------------------------------------------------------------GTGA------AAGTA-----    10 [7]
gi|526343   12 ---------------------------------------------------------------------------------------CCAAAGT-TCATG    23 [8]
gi|527241    6 -----------------------------------------------------------------------------------T------AAAT-T--T-    12 [9]
gi|530654    3 ----------------------------------------------------------------------------------------------------    3 [10]
gi|525985    6 ------------------GGCAGAGGAAAAAGTG-CACAAATTGGGACCAGAAAAT---ATAA---A--GGGGAG---TTGTGT------AGAT-T--T-    65 [11]
gi|525696 1040 CAGATGGTTAAATGGGAGGACAGAAAATGAAT-G-CACAAG--TGGACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAT---GAAAT-C--T-    1127 [12]
gi|525698 1040 CAGATGGTTAAATGGGAGGGCAGAAAATGAAT-G-CACAAG--TGGATCTATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAT---GAAAT-C--T-    1127 [13]
gi|525697 1042 CAGATGGTTAAATGGGAGGGCAGAAAATGAAT-G-CACAAG--TGGACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGTAT---GAAAT-C--T-    1129 [14]
gi|525699 1040 CAGATGGTTAAATGGGAGGGCAGAAAATGTAG-G-CACAAG--GGGACCAATAAAT-GAATGATCTATTGAGAAGCATCTGTGCAT---GAAAT-C--T-    1127 [15]
gi|525700  749 CAGATGGGTAAATGCGAGGGCAGAGAATGAAT-G-CACAAG--GGTACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAC---CAAAT-C--TG    837 [16]

Certainly there was enough opportunity to align the first two bases of gi|527347 ('GC') somewhere much closer. Perhaps my affine gap parameters (values I use typically in pairwise alignment) are not stringent enough in this context? Any other ideas?

yangao07 commented 5 months ago

The above MSA result may look weird, but it is indeed correct given your input. The local alignment in abPOA aligns part of the sequence to the current graph with the best local alignment score, the other part will be left unaligned (like soft clipping in cigars of SAM file).

In your case, the first several bases of the first 11 sequences are considered unaligned when the longer sequence is aligned to the graph. So the gaps in the first 11 sequences have no gap penalty.

My initial impression is that the order of the input sequences is crucial. Have you tried placing shorter sequences at the end of the input file? Or, if you want all the bases to be included in the alignment, you can try to use global alignment mode.

rmhubley commented 5 months ago

I would understand placing the unaligned bases in the MSA if this was a global alignment, however with a local alignment wouldn't those simply be omitted from the output? What about the column with the T's in it, those are aligned (at least among three sequences), wouldn't the gap parameters be applied between those T's and the bulk of the aligned sequence ~1000bp later? I guess I am confused as to what is considered aligned and what isn't in this output.

gi|526567    1 AAGC------------------------------------------------------------------------------------------------    4 [1]
gi|527347    1   GC------------------------------------------------------------------------------------------------    2 [2]
gi|526255    1     ATGACA---------------------T--------------------------------------------------------------------    7 [3]
gi|526360    1         CA---------------------T--------------------------------------------------------------------    3 [4]
gi|527067    1           TG----------------------------------------------------------------------------------------    2 [5]
gi|527527    1             ATA-------------------------------------------------------------------------------------    3 [6]
gi|527416    1                C------------------------------------------------------------------------------------    1 [7]
gi|526343    1                 CCACCTGGAAG-------------------------------------------------------------------------    11 [8]
gi|527241    1                            TGCTT--------------------------------------------------------------------    5 [9]
gi|530654    1                                 CTA-----------------------------------------------------------------    3 [10]
gi|525985    1                                    TACAT------------------------------------------------------------    5 [11]
gi|525696    1                                         CCTGTCTTCTACCATCTCCGAACTCCAGATACTCCAACAGCGAAAGGGATCTGGGCCCAA    60 [12]
gi|525698    1                                         CCTGTCTTCCACCATCTCCGAACTCCAGATACTCCTACAGCGAAAGGGATCTGGGCCCAA    60 [13]
gi|525697    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [14]
gi|525699    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [15]
yangao07 commented 5 months ago

I could not reproduce your above MSA, so I can only guess what is going on there: The Ts were considered aligned for the top sequences, but not the bottom ones (longer ones). So the gaps are not penalized when the longer sequences are added to the MSA.

yangao07 commented 5 months ago

One more thing you may want to know is that the local mode is good for consensus calling purposes, as the unaligned part is unlikely to be part of the consensus, but not for the MSA purpose. This is why the MSA always looks weird for local mode.