yangao07 / abPOA

abPOA: an SIMD-based C library for fast partial order alignment using adaptive band
MIT License
118 stars 18 forks source link

abPOA fails to capture difference in prefix #57

Closed TanyaDvorkina closed 9 months ago

TanyaDvorkina commented 9 months ago

Hi,

Thank you for developing abPOA!

I am trying to use abPOA to build consensus and cluster short TR sequences (~40-70bp in length). I've noticed that if the difference in sequences is in prefix, then abPOA identifies only one cluster instead of two.

I generated three different dataset of 20 reads using this function:

def generate_subreads(num, where = "start"):
    seqs = []
    repeat = "CTAT"
    insertion="TCTA"
    if where == "start":
        for i in range(num//2):
            seqs.append(SeqRecord(Seq(insertion + repeat*15), id = str(i), name= str(i), description=""))
    elif where == "middle":
        for i in range(num//2):
            seqs.append(SeqRecord(Seq(repeat*8 + insertion + repeat*7), id = str(i), name= str(i), description=""))
    else:
        for i in range(num//2):
            seqs.append(SeqRecord(Seq(repeat*15 + insertion), id = str(i), name= str(i), description=""))

    for i in range(num//2):
        seqs.append(SeqRecord(Seq(repeat*15), id = str(num//2 + i), name= str(num//2 + i), description=""))

    return seqs

and run: aligner.msa(seqs, out_cons=True, out_msa=True, max_n_cons=2, min_freq=0.3)

and here is the output of the aligner.msa on these datasets: synthetic_output.txt

Is there a problem with abpoa algorithm on such corner case?

Thank you, Tatiana

yangao07 commented 9 months ago

Thanks for the feedback, this is a bug related to this corner case. Should have been fixed in the latest commit. Also, since this function is not thoroughly tested, let me know when you see anything weird in the future.