xiezhq / ISEScan

A Python pipeline to identify IS (Insertion Sequence) elements in genomes and metagenomes
Apache License 2.0

Replace phmmer by diamond #31

Closed oschwengers closed 3 years ago

oschwengers commented 3 years ago

Hi @xiezhq, the phmmer-based lookup of single AA is responsible for a large part of the overall runtime. As phmmer internally behaves more or less like a normal blastp, it might be worth replacing it with a faster solution such as diamond: https://github.com/bbuchfink/diamond (http://www.diamondsearch.org/index.php)

For the small protein database clusters.single.faa, the homology search of all AA seqs of a Pseudomonas aeruginosa genome took less than 1 s and only 18 MB of RAM with 2 cores:

time -v diamond blastp --query paeruginosa.faa --db clusters.single.dmnd --threads 2 --out diamond.tsv --outfmt 6 --header
    User time (seconds): 1.03
    System time (seconds): 0.03
    Percent of CPU this job got: 181%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.58
    Maximum resident set size (kbytes): 18624
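
For illustration, here is a minimal Python sketch of how such a diamond call could be wrapped and its tabular output reduced to one best hit per query protein. It is not ISEScan's code: the function names `run_diamond` and `best_hits` and the `max_evalue` cutoff are hypothetical, and the sketch assumes the default 12-column `--outfmt 6` layout written without a header line.

```python
import subprocess

# Default column order of DIAMOND/BLAST tabular output (--outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def run_diamond(query_faa, db, out_tsv, threads=2):
    """Run `diamond blastp` (hypothetical wrapper; needs diamond on PATH)."""
    cmd = [
        "diamond", "blastp",
        "--query", query_faa,
        "--db", db,
        "--threads", str(threads),
        "--out", out_tsv,
        "--outfmt", "6",
    ]
    subprocess.run(cmd, check=True)


def best_hits(tsv_text, max_evalue=1e-10):
    """Keep the highest-bitscore hit per query.

    max_evalue is an illustrative cutoff, not a value taken from ISEScan.
    """
    best = {}
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        # Skip comment lines and anything that is not a 12-column record.
        if line.startswith("#") or len(fields) != len(OUTFMT6_COLUMNS):
            continue
        hit = dict(zip(OUTFMT6_COLUMNS, fields))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] > max_evalue:
            continue
        previous = best.get(hit["qseqid"])
        if previous is None or hit["bitscore"] > previous["bitscore"]:
            best[hit["qseqid"]] = hit
    return best
```

With this sketch, one would call `run_diamond("paeruginosa.faa", "clusters.single.dmnd", "diamond.tsv")` and then pass the contents of `diamond.tsv` to `best_hits`. Whether the resulting hits match what the current phmmer step reports would still need to be checked.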

Are there any good reasons against this? Otherwise, it might significantly reduce overall runtimes. Best regards

xiezhq commented 3 years ago

Thanks, oschwengers. It may be worth trying, but it is not a priority on my to-do list.