Open jiaan-yu opened 2 months ago
I guess you mean the consesus called by abPOA may not be as accurate as you expected? Can you share your input sequences here so that I can have more ideas about that?
Hi Yan,
Here is a section of the MSA output from abPOA (everything at default). I outlined the expected behaviour in green box and unexpected in red.
Also attached is the input fastq file. input.fastq.txt
Cheers
Thanks for the clarification! abPOA did have an option to output consensus sequence with the most common/frequent base at each position, but was removed now. Right now, the consensus sequence is a travsal with max edge weight on the graph.
I can re-add this option. Will post the update here.
The latest abPOA has an option to chose a different consensus calling method: -a1
, which selects the most frequent base at each position. The default is still heaviest bundling.
abpoa -a1 input.fastq.txt -r2
Hi Yan,
Thanks so much for your swift response! I don't know if you have benchmarked the performance of two consensus algo, here is my findings: Sequences are generated with fixed and independent errors The performance of default algo (-a0) is superior when the number of input sequences is >= 7; on the contrary, the performance of frequency algo (-a1) is much better when the number of input sequences is low... It is good to know when to use which cons method.
You may close this issue.
Cheers, Jiaan
Hi Jiaan,
Thanks for the information. Actually I wasn't aware of this "input-number-dependent" performance difference. This is interesting to me. I guess you have a template sequence as the ground truth, and simulated read sets with different read counts, right? Do you mind providing more details about your simulation, or maybe simply sharing your scripts so that I can look more into this?
Hi Yan, Here is how I tested abPOA:
(1) I randomly subset sequences from a reference fasta (E.coli etc.) as templates. The length of the templates can be 1K to 5K; but the consensus accuracy is independent of the length anyway.
(2) Simlord to insert errors into the templates https://bitbucket.org/genomeinformatics/simlord/src/master/.
simlord \
--read-reference $template_fasta \
-n 10 -ps 0.2 -pi 0.2 -pd 0.2 -t 0.18 \
--no-sam $error_fasta
-n 10 for 10 repeats; -t 0.18 for overall error rate
(3) Generate consensus sequence with abPOA:
abpoa -Q -r5 $error_fasta > $ccs_fastq
(4) I use edit distance (a python package) to calculate read accuracy. https://pypi.org/project/editdistance/
import editdistance
accuracy = 1-editdistance.eval(ccs_fastq_seq, template_fasta_seq) / float(len(template_fasta_seq))
Cheers
Hi Yan,
I use abPOA to generate a consensus from four sequences. Here is the distribtuion of n_seq / n_cov in each base in the consensus sequence. My question is, is there a way to remove low coverage bases? For this consensus sequence, the coverage should be 0.25/0.5/0.75/1; however, the low coverage bases at 0.25 are clearly errors from the input sequences.
Cheers, Jiaan