zheminzhou / PEPPAN

Phylogeny Enhanced Prediction of PAN-genome
https://doi.org/10.1101/2020.01.03.894154
GNU General Public License v3.0

Out of memory error #4

Open SWittouck opened 4 years ago

SWittouck commented 4 years ago

Dear Zhemin,

Thank you for making PEPPA publicly available and for putting the publication on bioRxiv, it's a very nice read!

I managed to install PEPPA successfully and tried to do a test run on 73 genomes of the order Lactobacillales. After a few minutes I got an out of memory error (memory was indeed full) and the job aborted. Is there anything I can do to solve this? I have 16GB of memory and was using all 16 threads I have available.

Best wishes, Stijn

zheminzhou commented 4 years ago

Because of the limitations of multi-threading in Python, part of the parallel computation is handled by multiple processes, and all data in memory is replicated in each process. Please try running PEPPA with fewer processes (e.g., 4). I will close this issue for now, but please re-open it if you still get an out-of-memory error.
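To make the memory arithmetic behind this advice concrete, here is a minimal sketch. It assumes, as the reply describes, that every process holds its own full copy of the data. The function name, the 3 GB per-process footprint, and the 1 GB reserve are hypothetical values chosen for illustration; they are not measurements of PEPPA.

```python
def max_worker_processes(total_mem_gb, per_process_gb, reserve_gb=1.0):
    """Largest process count whose combined data copies fit in memory.

    Assumes each process keeps a full copy of the data (per_process_gb)
    and that reserve_gb is left free for the operating system.
    Always returns at least 1, even if a single copy does not fit.
    """
    if per_process_gb <= 0:
        raise ValueError("per_process_gb must be positive")
    usable_gb = total_mem_gb - reserve_gb
    return max(1, int(usable_gb // per_process_gb))


# Hypothetical numbers: a 16 GB machine and 3 GB of data per process.
# Sixteen processes would need roughly 16 * 3 = 48 GB in total.
print(max_worker_processes(16, 3))
```

If even a single process runs out of memory, as reported later in this thread, lowering the process count cannot help and the footprint of one copy is the limiting factor.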

SWittouck commented 4 years ago

Dear Zhemin,

Thank you for your suggestion, I will try this.

Best wishes, Stijn

SWittouck commented 4 years ago

Dear Zhemin,

I tried to run with fewer threads, as you suggested, even down to a single thread. Unfortunately, the issue remained. I have attached the log file with the error; it seems to occur in the BLASTn step.

Best wishes, Stijn

peppa.log

zheminzhou commented 4 years ago

I have published PEPPA on PyPI with a formal version number, 1.0. The code in this version has been revised to optimize memory usage. You can install it under Python >= 3.5 with "pip install bio-peppa"; the executable is named 'PEPPA' by default. I hope this solves the memory leak problem.

SWittouck commented 4 years ago

Hi Zhemin,

I installed PEPPA version 1.0 using pip, as you suggested. It didn't fix the problem: I still got out-of-memory errors, no matter how many threads I used. However, I took a closer look at how PEPPA works, and it seems to me that it is not suited to datasets above the genus level. My genome dataset is at the order level, and I think the blastn searches are not sensitive enough for that. When I set --clust_identity to 0.5, --clust_match_prop to 0.6 and --match_identity to 0.5, the error no longer occurred! So I'm still not sure what caused the error, and I think my dataset is outside the scope of PEPPA anyway, but at least the error is resolved. Thank you for your help!

I have one additional remark: I found a bug in PEPPA_parser.py. On line 64, there is one ] too many.

Best regards, Stijn

zheminzhou commented 4 years ago

Thank you for the bug report (again) and for the solution you found. PEPPA allows "--match_identity" to go as low as 0.4, so your value of 0.5 is fine. However, your "--clust_identity" and "--clust_match_prop" values are certainly outside the range I have tested. I think the phylogeny-based paralog splitting will still be able to handle this, but I am not sure.

I will push the fix for the bug in PEPPA_parser.py later this week.
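For readers who want to reuse the relaxed thresholds from this thread, the following is a minimal sketch that collects them into a command-line option list. The three flag names, the values 0.5, 0.6 and 0.5, and the 0.4 lower limit for --match_identity come from the comments above. The helper name and the check that each value lies in (0, 1] are assumptions made for illustration. Input and output arguments are deliberately left out because the thread does not show them.

```python
MIN_MATCH_IDENTITY = 0.4  # lower limit for --match_identity stated in this thread


def relaxed_options(clust_identity=0.5, clust_match_prop=0.6, match_identity=0.5):
    """Return the three threshold options as a list of command-line tokens.

    Hypothetical helper: it only formats and sanity-checks the values;
    it does not run PEPPA.
    """
    values = {
        "--clust_identity": clust_identity,
        "--clust_match_prop": clust_match_prop,
        "--match_identity": match_identity,
    }
    for flag, value in values.items():
        # Assumption: each threshold is a fraction in (0, 1].
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{flag} must be in (0, 1], got {value}")
    if match_identity < MIN_MATCH_IDENTITY:
        raise ValueError(
            f"--match_identity must be at least {MIN_MATCH_IDENTITY}, got {match_identity}"
        )
    tokens = []
    for flag, value in values.items():
        tokens.extend([flag, str(value)])
    return tokens


print(" ".join(relaxed_options()))
```

The resulting tokens would be appended to a PEPPA command line alongside whatever input arguments the installed version requires.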