yangao07 / abPOA

abPOA: an SIMD-based C library for fast partial order alignment using adaptive band
MIT License
111 stars 18 forks source link

Remove low coverage bases from consensus sequence? #67

Open jiaan-yu opened 2 months ago

jiaan-yu commented 2 months ago

Hi Yan,

I use abPOA to generate a consensus from four sequences. Here is the distribtuion of n_seq / n_cov in each base in the consensus sequence. My question is, is there a way to remove low coverage bases? For this consensus sequence, the coverage should be 0.25/0.5/0.75/1; however, the low coverage bases at 0.25 are clearly errors from the input sequences.

FDE4BB4C-7937-4d47-A04A-D5C5C64982E7

Cheers, Jiaan

yangao07 commented 2 months ago

I guess you mean the consesus called by abPOA may not be as accurate as you expected? Can you share your input sequences here so that I can have more ideas about that?

jiaan-yu commented 2 months ago

Hi Yan,

Here is a section of the MSA output from abPOA (everything at default). I outlined the expected behaviour in green box and unexpected in red. 1508C185-64B2-4202-AECE-31478121347C Also attached is the input fastq file. input.fastq.txt

Cheers

yangao07 commented 2 months ago

Thanks for the clarification! abPOA did have an option to output consensus sequence with the most common/frequent base at each position, but was removed now. Right now, the consensus sequence is a travsal with max edge weight on the graph.

I can re-add this option. Will post the update here.

yangao07 commented 2 months ago

The latest abPOA has an option to chose a different consensus calling method: -a1, which selects the most frequent base at each position. The default is still heaviest bundling.

abpoa -a1 input.fastq.txt -r2

image

jiaan-yu commented 2 months ago

Hi Yan,

Thanks so much for your swift response! I don't know if you have benchmarked the performance of two consensus algo, here is my findings: Sequences are generated with fixed and independent errors The performance of default algo (-a0) is superior when the number of input sequences is >= 7; on the contrary, the performance of frequency algo (-a1) is much better when the number of input sequences is low... It is good to know when to use which cons method.

image

You may close this issue.

Cheers, Jiaan

yangao07 commented 2 months ago

Hi Jiaan,

Thanks for the information. Actually I wasn't aware of this "input-number-dependent" performance difference. This is interesting to me. I guess you have a template sequence as the ground truth, and simulated read sets with different read counts, right? Do you mind providing more details about your simulation, or maybe simply sharing your scripts so that I can look more into this?

jiaan-yu commented 2 months ago

Hi Yan, Here is how I tested abPOA:

(1) I randomly subset sequences from a reference fasta (E.coli etc.) as templates. The length of the templates can be 1K to 5K; but the consensus accuracy is independent of the length anyway.

(2) Simlord to insert errors into the templates https://bitbucket.org/genomeinformatics/simlord/src/master/.

simlord \
    --read-reference $template_fasta \
    -n 10 -ps 0.2 -pi 0.2 -pd 0.2 -t 0.18 \
    --no-sam $error_fasta

-n 10 for 10 repeats; -t 0.18 for overall error rate

(3) Generate consensus sequence with abPOA: abpoa -Q -r5 $error_fasta > $ccs_fastq

(4) I use edit distance (a python package) to calculate read accuracy. https://pypi.org/project/editdistance/

import editdistance
accuracy = 1-editdistance.eval(ccs_fastq_seq, template_fasta_seq) / float(len(template_fasta_seq))

Cheers