DeondeJager opened this issue 1 month ago
The main thing to watch out for is memory - `polypolish polish` loads both input `sam` files into memory. As a rule of thumb, `sam` files from `polyalign filtered` will be at minimum the same size as the input `fastq` files, but can be several times larger if the genome is repetitive and reads align in many places. 100x Illumina coverage of a 2.6 Gbase genome gives `fastq` files around 260 GiB each, so `sam` files >260 GiB each, meaning you'll need >520 GiB of memory.
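That estimate can be sketched as a quick back-of-envelope calculation (my own rough model, not from polypolish itself: sequence plus quality strings dominate the file, headers are ignored, and reads split evenly across the two paired files):

```python
def fastq_gb(genome_bases, coverage):
    """Approximate size of one paired-end FASTQ file, in GB.

    Assumes reads split evenly between the two paired files and that
    the quality line doubles the sequence payload; headers are ignored.
    """
    bases_per_file = genome_bases * coverage / 2
    return 2 * bases_per_file / 1e9  # sequence line + quality line

genome = 2.6e9   # 2.6 Gbase genome
cov = 100        # 100x Illumina coverage
per_file = fastq_gb(genome, cov)
# Filtered SAM is at least this big, and polypolish polish holds
# both SAM files in memory at once.
print(f"~{per_file:.0f} GB per FASTQ, so >{2 * per_file:.0f} GB memory")
```

which reproduces the ~260 GB per file and >520 GB memory figures above.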
`polyalign splitfiltered` will help. This splits the `sam` files per sequence in the genome assembly - so if your largest contig is 1/10 of the genome, memory usage will be roughly 10x lower.
```shell
python3 -m polyalign splitfiltered genome.fasta illumina1.fq illumina2.fq outputname
python3 -m polyalign splitfasta genome.fasta outputname
for fasta in outputname/*.fasta; do
    filename=$(basename "$fasta")
    filename="${filename%.*}"
    polypolish polish "$fasta" "outputname_1/${filename}_1.sam" "outputname_2/${filename}_2.sam" >> polishedgenome.fasta
done
```
Using this approach, I've polished the human genome with 100x coverage Illumina in ~8h and <32GiB memory usage on a desktop computer.
Two extra comments:
I haven't seen any performance improvement of `polyalign` past 4 parallel processes; it's limited by the speed at which Python can read the output from `bwa`. Fewer, faster CPUs are probably better than many slower ones.
`polyalign splitfiltered` writes two output files per contig, but most operating systems limit a process to 1024 open files at any one time, and opening and closing files incurs quite a lot of OS overhead. So if you have more than ~500 contigs you will probably see a big performance drop.
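Given that, it's worth counting contigs before running `splitfiltered`. A minimal sketch (the demo filename here is hypothetical; point `count_contigs` at your own assembly):

```python
from pathlib import Path

def count_contigs(fasta_path):
    """Count FASTA records, i.e. header lines starting with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

# Demo on a tiny throwaway assembly; substitute your genome.fasta.
demo = Path("demo_assembly.fasta")
demo.write_text(">contig1\nACGT\n>contig2\nGGCC\n>contig3\nTTAA\n")
n = count_contigs(demo)
# splitfiltered opens two SAM files per contig; past ~500 contigs the
# open/close churn and the 1024-open-files limit start to hurt.
print(f"{n} contigs -> {2 * n} split SAM files")
demo.unlink()
```

On Linux you can compare the doubled count against `ulimit -n` to see how close you are to the per-process limit.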
Thanks Richard, really useful tips and that block of code is very helpful. Appreciate it! I'll report back here with how it went.
I am going to be trying out `polyalign filtered`, and then `polypolish polish`, on a 2.6 Gb mammalian genome assembly soon (on a high-performance cluster). I realise this is far outside the scope of what either tool was developed for, but I'm interested to see how it goes. Any tips or thoughts before I begin, @zephyris? 😬