DeondeJager opened this issue 1 month ago
The main thing to watch out for is memory - `polypolish polish` loads both input `sam` files into memory. As a rule of thumb, `sam` files from `polyalign filtered` will be at minimum the same size as the input `fastq` files, but can be several times larger if the genome is repetitive and reads align in many places. 100x Illumina coverage of a 2.6 Gbase genome gives `fastq` files around 260 GiB each, so `sam` files >260 GiB each, meaning you'll need >520 GiB of memory.
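That estimate can be sketched as a quick back-of-envelope calculation (my own rough model, not from polypolish itself: sequence plus quality strings dominate the file, headers are ignored, and reads split evenly across the two paired files):

```python
def fastq_gb(genome_bases, coverage):
    """Approximate size of one paired-end FASTQ file, in GB.

    Assumes reads split evenly between the two paired files and that
    the quality line doubles the sequence payload; headers are ignored.
    """
    bases_per_file = genome_bases * coverage / 2
    return 2 * bases_per_file / 1e9  # sequence line + quality line

genome = 2.6e9   # 2.6 Gbase genome
cov = 100        # 100x Illumina coverage
per_file = fastq_gb(genome, cov)
# Filtered SAM is at least this big, and polypolish polish holds
# both SAM files in memory at once.
print(f"~{per_file:.0f} GB per FASTQ, so >{2 * per_file:.0f} GB memory")
```

which reproduces the ~260 GB per file and >520 GB memory figures above.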
`polyalign splitfiltered` will help. This splits the `sam` files per sequence in the genome assembly - so if your largest contig is 1/10 of the genome, memory usage will be roughly 10x lower.
```shell
python3 -m polyalign splitfiltered genome.fasta illumina1.fq illumina2.fq outputname
python3 -m polyalign splitfasta genome.fasta outputname
for fasta in outputname/*.fasta; do
    filename=$(basename "$fasta")
    filename="${filename%.*}"
    polypolish polish "$fasta" "outputname_1/${filename}_1.sam" "outputname_2/${filename}_2.sam" >> polishedgenome.fasta
done
```
Using this approach, I've polished the human genome with 100x coverage Illumina in ~8h and <32GiB memory usage on a desktop computer.
Two extra comments:
I haven't seen any performance improvement of `polyalign` past 4 parallel processes; it's limited by the speed at which Python can read the output from `bwa`. Fewer, faster CPUs are probably better than many slower ones.
`polyalign splitfiltered` writes two output files per contig, but most operating systems limit a process to 1024 open files at any one time, and opening and closing files incurs quite a lot of OS overhead. So if you have more than ~500 contigs you will probably see a big performance drop.
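Given that, it's worth counting contigs before running `splitfiltered`. A minimal sketch (the demo filename here is hypothetical; point `count_contigs` at your own assembly):

```python
from pathlib import Path

def count_contigs(fasta_path):
    """Count FASTA records, i.e. header lines starting with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

# Demo on a tiny throwaway assembly; substitute your genome.fasta.
demo = Path("demo_assembly.fasta")
demo.write_text(">contig1\nACGT\n>contig2\nGGCC\n>contig3\nTTAA\n")
n = count_contigs(demo)
# splitfiltered opens two SAM files per contig; past ~500 contigs the
# open/close churn and the 1024-open-files limit start to hurt.
print(f"{n} contigs -> {2 * n} split SAM files")
demo.unlink()
```

On Linux you can compare the doubled count against `ulimit -n` to see how close you are to the per-process limit.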
Thanks Richard, really useful tips and that block of code is very helpful. Appreciate it! I'll report back here with how it went.
I am going to be trying out `polyalign filtered`, and then `polypolish polish`, on a 2.6 Gb mammalian genome assembly soon (on a high-performance cluster). I realise this is far outside the scope of what either tool was developed for, but I'm interested to see how it goes. Any tips or thoughts before I begin, @zephyris? 😬