yangao07 / abPOA

abPOA: an SIMD-based C library for fast partial order alignment using adaptive band
MIT License

optimum parameters for ONT reads #68

Open asylvz opened 2 months ago

asylvz commented 2 months ago

Hi,

I'm trying to use the library to generate a consensus from ONT reads for multiple clusters of reads. Each cluster has around 10-30 reads. However, I'm not sure which parameters to use for minimizer-based seeding and partitioning to balance accuracy and speed.

I'd be happy if you could suggest a set of parameters that balances speed, memory and accuracy.

Thank you, Arda

yangao07 commented 2 months ago

Hi, if you can share a few example input datasets, I think I may be able to give you some suggestions in terms of parameters.

asylvz commented 2 months ago

Actually, this is not for a specific scenario; I'll use it in my algorithm and am currently testing it with ONT data from some samples (reads can be retrieved from the CRAMs here: https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1KG_ONT_VIENNA/hg38/).

Basically it should be fast enough for 20-30K long ONT reads. I'm currently using wtdbg2 for this.

asylvz commented 2 months ago

I'm also sending a sample cluster of reads. This is one of the large clusters (25 reads), so not all of them are that large. H2-s218243_1350.fasta.zip

yangao07 commented 2 months ago

I am not sure which scenario you are referring to. Since you mentioned wtdbg2: if you need a consensus sequence after the assembly step, I think wtdbg2 has its own POA consensus-calling module. abPOA generally takes reads with unified boundaries, performs end-to-end global alignment, and then generates a consensus sequence based on the alignment result.

asylvz commented 2 months ago

I actually want to generate a consensus, but since POA algorithms are slower, I had to use wtdbg2. Your algorithm seems to be much faster, so I wanted to test it. For ONT reads of 20-30K, which w, k, min-w, etc. would you suggest?
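For readers unfamiliar with what k and w control in minimizer-based seeding, here is a minimal Python sketch. It is not abPOA's implementation: it ranks k-mers lexicographically, whereas real tools typically hash k-mers and consider both strands, and the function name `minimizers` is made up for illustration. It only shows the general idea that a larger w selects fewer seeds, while k sets the seed length. It does not cover min-w.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) minimizers of seq.

    Illustrative only: in every window of w consecutive k-mers, the
    lexicographically smallest k-mer is kept (ties go to the leftmost).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Smallest k-mer in this window; position breaks ties.
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add(best)
    return [(i, kmers[i]) for i in sorted(picked)]


# Toy sequence: a wider window keeps fewer seeds.
dense = minimizers("ACGTACGT", k=3, w=2)
sparse = minimizers("ACGTACGT", k=3, w=4)
```

With real reads, fewer seeds generally mean less work per read but fewer anchor points, which is the speed/accuracy trade-off the question is about.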

yangao07 commented 2 months ago

For your data H2-s218243_1350.fasta.zip, I see that the read lengths vary a lot and that the reads are not all from the same strand.
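One quick way to check how much the read lengths in a cluster vary is to summarize them before running the consensus step. The sketch below is a plain-Python FASTA reader with a length summary; the helper names `read_fasta` and `length_summary` are hypothetical and not part of abPOA, and the demo input is a toy string rather than the attached cluster.

```python
import io
import statistics


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def length_summary(records):
    """Return count, min, max and median read length for (name, seq) pairs."""
    lengths = [len(seq) for _, seq in records]
    return {
        "n": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "median": statistics.median(lengths),
    }


# Toy input standing in for a cluster file.
demo = ">r1 some description\nACGT\nAC\n>r2\nA\n"
records = list(read_fasta(io.StringIO(demo)))
summary = length_summary(records)
```

For a real cluster, replace the `io.StringIO(demo)` handle with `open("cluster.fasta")` (file name assumed). A large gap between the minimum and maximum suggests the reads do not share boundaries.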

Since I don't know how you obtained this cluster of reads (based on mapping position?), I can only suggest that you run abpoa -Ss in.fasta > cons.fa and see whether the consensus sequence meets your expectations.
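Since the original question involves many clusters, the suggested command can be wrapped in a small script. This is a sketch under assumptions: it uses only the `-Ss` flags quoted above, assumes an `abpoa` executable is on the PATH, and the helper names and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_cmd(fasta_path, exe="abpoa"):
    """Build the argument list for the command suggested in the thread."""
    return [exe, "-Ss", str(fasta_path)]


def consensus_path(fasta_path):
    """Hypothetical naming scheme: cluster.fasta -> cluster.cons.fa."""
    return Path(fasta_path).with_suffix(".cons.fa")


def run_cluster(fasta_path, exe="abpoa"):
    """Run abpoa on one cluster and write its stdout to the consensus file."""
    out_path = consensus_path(fasta_path)
    with open(out_path, "w") as out:
        # check=True raises CalledProcessError if abpoa exits with an error.
        subprocess.run(build_cmd(fasta_path, exe), stdout=out, check=True)
    return out_path
```

Calling `run_cluster(p)` for each `p` in `Path("clusters").glob("*.fasta")` (directory name assumed) would produce one consensus file per cluster, which can then be inspected as suggested.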