yangao07 / abPOA

abPOA: an SIMD-based C library for fast partial order alignment using adaptive band
MIT License
118 stars 18 forks source link

Align expanded sequences, trim them and then compute the consensus sequence #30

Open AndreaGuarracino opened 2 years ago

AndreaGuarracino commented 2 years ago

Hi Yaon,

I've expanded few sequences with flanking nucleotides, then I've aligned them, obtaining the MSA and consensus sequence.

However, after aligning the expanded sequences, I would like to first delete the additional nucleotides and then calculate the consensus sequence with only the unexpanded sequences, without having to recalculate the whole alignment.

Is it possible?

yangao07 commented 2 years ago

I don't fully get your idea. Do all the aligned sequences have flanking nucleotides? Or only a part of the sequences have?

Then, I guess what you want is to obtain the alignment/consensus for only a subset of all the input aligned sequences that have no flanking nucleotides, am I right?

Yan (I know Yao is very famous:) )

AndreaGuarracino commented 2 years ago

All the aligned sequences have flanking nucleotides.

Let me try with an example.

These are the original sequences.

example_abpoa.fa

>Sequence1
AAAAAAAATTTT
>Sequence2
TTGGTTTTTT

abpoa --match 1 --mismatch 9 --gap-open 16,41 --gap-ext 2,1 example_abpoa.fa -r 2

>Multiple_sequence_alignment
------AAAAAAAATTTT
TTGGTT--------TTTT
------AAAAAAAATTTT

Instead of aligning the original sequences, I am extending them with few flanking nucleotides upstream and downstream, and then aligning the expanded sequences. For example, adding 4 flanking nucleotides:

example_abpoa_padded.fa

>Sequence1
GGGGAAAAAAAATTTTGGGG
>Sequence2
TTTTTTGGTTTTTTGGGG

abpoa --match 1 --mismatch 9 --gap-open 16,41 --gap-ext 2,1 example_abpoa_padded.fa -r 2

>Multiple_sequence_alignment
------GGGGAAAAAAAATTTTGGGG
TTTTTTGG--------TTTTTTGGGG
TTTTTTGGGGAAAAAAAATTTTGGGG

Concerning the sequences, I am taking this alignment and removing the added flanking nucleotides. However, regarding the consensus sequence, I can't just trim the flanking nucleotides out (this leads to rare, but blocking bugs).

I would like to trim the expanded sequences in the abPOA graph (where the alignment was already computed with the expanded sequences) and then re-ask for the consensus sequence generation using the trimmed graph.

Yan (I know Yao is very famous:) )

LOL, mutations, you've got me!

yangao07 commented 2 years ago
>Multiple_sequence_alignment
------GGGGAAAAAAAATTTTGGGG
TTTTTTGG--------TTTTTTGGGG
TTTTTTGGGGAAAAAAAATTTTGGGG

I guess, from this new graph, what you want is the consensus of the original (unexpanded) sequences, right? Then the graph should be transformed to:

>Multiple_sequence_alignment
----------AAAAAAAATTTT----
----TTGG--------TTTTTT----
>Consensus_sequence
----TTGG--AAAAAAAATTTT----

I think this is possible. We need to record all the flanking-base nodes and then manipulate their weight/base before calling the consensus:

  1. w = w-1, if w > 1
  2. set base as - if w == 1

Edit: this manipulation may not be correct, I need to think about it.

Then for the final consensus, we just need to remove all the - symbols.

AndreaGuarracino commented 2 years ago

Hi @yangao07, any update on this?

yangao07 commented 2 years ago

See above, is that what you want?

AndreaGuarracino commented 2 years ago

I apologize, I didn't realize you were waiting for confirmation from me. Yes, I want the consensus of the original (unexpanded) sequences, as you wrote.

>Multiple_sequence_alignment
----------AAAAAAAATTTT----
----TTGG--------TTTTTT----
>Consensus_sequence
----TTGG--AAAAAAAATTTT---- <---- This one, generated from the trimmed (unexpanded) sequences

As I said, the consensus sequence can't be simply trimmed (this leads to rare bugs) but has to be generated starting from the original (unexpanded) sequences, using the alignment that we've got when we aligned the same sequences together with the flanking nucleotides.

After your considerations, do you think is it still doable?