zheminzhou / PEPPAN

Phylogeny Enhanced Prediction of PAN-genome
https://doi.org/10.1101/2020.01.03.894154
GNU General Public License v3.0

Neighborhood based paralog splitting does not finish #2

Open marade opened 4 years ago

marade commented 4 years ago

For ~200 bacterial genomes of ~6 Mb each, the neighborhood-based paralog splitting step alone has been running for over 24 hours on a c5.2xlarge EC2 instance, while the previous steps finished in a timely fashion. Notably, CPU usage for the entire period is very low (less than 1%), while memory usage stays fairly constant at 40%, which suggests the step is waiting on something other than the CPU.

zheminzhou commented 4 years ago

Hi, thank you for the report. This is certainly much slower than in my tests. From your description, the bottleneck is most likely in the I/O.

PEPPA writes and reads a lot of data from the file system. This did not seem to be an issue in my tests, even when I used a mounted network drive, but I have not tested it on an AWS instance yet. I have made a small update to PEPPA to improve its I/O performance. However, please do not expect too much.
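One way to check the I/O hypothesis independently of PEPPA is to time many small writes and reads in the directory the run uses. The following is a minimal sketch using only the Python standard library; `probe_io` and its parameters are hypothetical names chosen for illustration and are not part of PEPPA.

```python
import os
import tempfile
import time


def probe_io(directory, n_files=200, size_bytes=64 * 1024):
    """Write, sync, and read back many small files; return timings in seconds.

    Hypothetical helper for illustration only, not part of PEPPA.
    """
    payload = os.urandom(size_bytes)
    # The temporary directory is created inside `directory` and removed on exit.
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        paths = [os.path.join(tmp, f"probe_{i}.bin") for i in range(n_files)]

        start = time.perf_counter()
        for path in paths:
            with open(path, "wb") as handle:
                handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())  # force the data out to the device
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        bytes_read = 0
        for path in paths:
            with open(path, "rb") as handle:
                bytes_read += len(handle.read())
        read_s = time.perf_counter() - start

    return {"write_s": write_s, "read_s": read_s, "bytes_read": bytes_read}
```

Calling `probe_io(".")` from the working directory on the EC2 instance and again on a local machine gives a rough comparison. The read timing is probably optimistic, because freshly written files are usually served from the operating system's page cache.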

marade commented 4 years ago

Thanks, I appreciate the prompt support. Perhaps you could add some sort of debugging capability so that the issue can be isolated? I'm not eager to run something for hours and not get an answer.
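As an illustration of the kind of debugging output that could help here, the sketch below logs wall-clock time and CPU time for a block of work. A large gap between the two would point to waiting (I/O, locks, or an idle subprocess) rather than computation. `step_timer` is a hypothetical helper, not an existing PEPPA option, and where it would wrap PEPPA's steps is left open.

```python
import os
import sys
import time
from contextlib import contextmanager


def _cpu_seconds():
    """CPU time of this process plus child processes that have been waited on."""
    t = os.times()
    return t.user + t.system + t.children_user + t.children_system


@contextmanager
def step_timer(name, stream=sys.stderr):
    """Log wall-clock and CPU time for the enclosed block (hypothetical helper)."""
    record = {"name": name}
    wall_start = time.perf_counter()
    cpu_start = _cpu_seconds()
    try:
        yield record
    finally:
        record["wall_s"] = time.perf_counter() - wall_start
        record["cpu_s"] = _cpu_seconds() - cpu_start
        print(
            f"[{name}] wall={record['wall_s']:.2f}s cpu={record['cpu_s']:.2f}s",
            file=stream,
            flush=True,
        )
```

A block wrapped as `with step_timer("paralog splitting"):` would print one line when it finishes or fails. Child-process CPU time is only included once the children have exited and been waited on, so a step that is still running its workers would under-report.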