wshuai294 / PStrain

PStrain profiles strains in metagenomic data. It infers strain abundance and genotype for each species. It also has a single-species mode: given a BAM and VCF, it can phase the variants for any species.
MIT License

Threading and parallelisation across species #11

Closed andrewjmc closed 2 years ago

andrewjmc commented 2 years ago

Hello,

Can I check that -p parallelises over samples, each of whose pipelines is mostly single-threaded, and that -n only parallelises bowtie (not GATK/picard/metaphlan etc.)? Thus the maximum number of processes will be p * n at the bowtie stage, and p subsequently? In that case, I assume maximum efficiency would be gained by setting -p to the number of processors and -n to 1?

Having tested overnight with two pharyngeal samples (cached metaphlan runs, and -p 8), I found that the pipeline took most of its time at the iterative pulp stage - presumably working through species (~60) one by one (two samples in parallel, I hope!). It took about 7 hours.

I have >200 samples in total, and even with 32 processes this will exceed the run-time of my HPC jobs. What could be useful for me (and others, I suspect) is a way of batching the pipeline by species after metaphlan, BAM and VCF steps (keeping the advantage of multi-threading and competitive read mapping across all species). For example, if PStrain could produce intermediate output, and a list of the species eligible for strain-calling, it could then be run from a batch job per-species -- many, like me, will have access to much greater computing power by parallelising tasks as batch jobs rather than running as single large jobs.

It may be that single_species.py already does this -- though on first look I believe it is designed for full separation (i.e. totally separate species references, alignments and VCFs).

Final separate question, is there any way of getting strain model fitting statistics out of the pipeline (e.g. some measure of how well the deconvoluted phased strains and RAs account for the reads in each sample-species combination)? I assume this might be extractable from stdout/log in some way (though interestingly, no log is produced in my output folder, despite the code suggesting it should be produced).

Thanks again for such a useful tool!

Andrew

andrewjmc commented 2 years ago

I guess an alternative approach could be to batch across samples and do a final merge, if that is easier.

wshuai294 commented 2 years ago

Hi,

Thank you for your kind suggestions.

Yes, we can run in batches across samples for parallelization: add all your samples to the config file, and PStrain will treat each sample as a thread. So if you set "-p 10", it will run ten samples at the same time. Please let me know if that is efficient enough; if not, I can make further updates.
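Roughly, the run would look like the following (a sketch only -- apart from -p/-n, which are the options discussed above, the config-file and output arguments are placeholders for whatever your actual command uses):

```bash
# All samples listed in one config file; -p 10 processes ten samples in
# parallel, while -n sets the bowtie threads used within each sample.
python PStrain.py -p 10 -n 2 -c all_samples_config.txt -o ./results
```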

For model-fitting statistics, I have revised the log output functions. The final loss value of the model for each sample will be written to the log file in the output folder. It will look like "[2021-12-06 23:36:23,350-pipeline_V30.py-INFO:Final loss score is 57.814152 for gene_k4_2_1 in the optimization.]".

Please kindly check, and let me know if you have further advice.

Best, shuai

andrewjmc commented 2 years ago

Thanks Shuai, the difficulty I foresee is that I can only get jobs with 32 CPUs. That would mean 7 or 8 rounds of processing batches of 32 samples, and if the pulp steps take 12 hours per sample, I will exceed a 72-hour batch job. I could run it more quickly and efficiently if I could break it down into an array job, e.g.:

A. Running a single job for each sample, and merging sample results in another job at the end
B. Running a single job for each species (merging results from species would be trivial, as they're just rows of the RA files)

I can see that merge.py can be called from the command line.

This suggests to me that the pipeline is already set up for option A. Would the following work?

  1. Make a separate config file for each sample
  2. Run PStrain separately for each sample (into same output dir)
  3. Concatenate config files into a single config file
  4. Run merge with the large config file to join results together
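For concreteness, something like this (a sketch only -- apart from -p/-n, the PStrain and merge.py arguments here are placeholders for whatever the real options are):

```bash
# 1. One config file per sample: configs/sampleA.txt, configs/sampleB.txt, ...

# 2. Run PStrain once per sample (in practice one HPC job per sample),
#    all writing into the same output directory.
for cfg in configs/*.txt; do
    python PStrain.py -p 1 -n 8 -c "$cfg" -o results/
done

# 3. Concatenate the per-sample configs into one big config file.
cat configs/*.txt > all_samples_config.txt

# 4. Merge the per-sample results using the combined config.
python merge.py -c all_samples_config.txt -o results/
```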

Thanks so much,

Andrew

wshuai294 commented 2 years ago

Hi,

Option A will work, and the four steps are exactly right. You can run each sample separately and then use merge.py at the end to merge the results for the analysis.

Waiting for your good news. Best, shuai

andrewjmc commented 2 years ago

Thanks Shuai. A partial success. I batched into sets of 8 samples across 28 eight-threaded jobs, all writing into the same output directory.

Two have completed without error and three are still running. However, 13 gave errors at the GATK step for every sample (for the remainder, the PBS log files are strangely missing).

##### ERROR MESSAGE: Unable to create a temporary BAM schedule file.  Please make sure Java can write to the default temp directory or use -Djava.io.tmpdir= to instruct it to use a different temp directory instead.

Presumably not all GATK runs needed to create temporary BAM schedule files. The default Java temp dir on Linux is /tmp, which is not writable on our HPC system. Rather than editing the script code, I have exported the environment variable

export _JAVA_OPTIONS="-Djava.io.tmpdir=$TMPDIR"

I think this might sort it. I'm posting here as it might be relevant for other users on HPC systems.
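In case it helps anyone else, in a PBS job script the export just needs to go before the PStrain call, e.g. (the directives and the PStrain arguments here are illustrative):

```bash
#!/bin/bash
#PBS -l select=1:ncpus=8:mem=32gb
#PBS -l walltime=24:00:00

# Point Java (and therefore GATK/picard) at the per-job scratch space,
# since /tmp is not writable on this cluster.
export _JAVA_OPTIONS="-Djava.io.tmpdir=$TMPDIR"

cd "$PBS_O_WORKDIR"
python PStrain.py -p 8 -n 1 -c config_batch.txt -o results/   # placeholder arguments
```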

wshuai294 commented 2 years ago

Hi Andrew,

Thanks for sharing this valuable experience. It is the first time I have seen this GATK error.

The log file name is based on the submission time, so if you submit all the batches at the same time, their log file names might conflict. Submitting them at different times, or giving each batch its own output folder (and moving the results into the same folder at the end), should solve this problem.
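For example, each batch could run into its own folder and the results could be gathered at the end, something like this (paths and flags are illustrative only):

```bash
# Each batch writes to its own output folder, so the time-stamped
# log files from different batches cannot collide.
BATCH=batch_01                                   # set per submitted job
mkdir -p results_${BATCH}
python PStrain.py -p 8 -n 1 -c config_${BATCH}.txt -o results_${BATCH}/

# Once all batches are done, gather everything into one folder before merging.
mkdir -p results_all
mv results_batch_*/* results_all/
```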

Best, shuai

andrewjmc commented 2 years ago

Good suggestion, thanks!

I will run again. I suspect the deepest samples failed (due to the lack of a writable temp dir), so it might still take a while to complete!

andrewjmc commented 2 years ago

All has worked well, although the 24-hour batch jobs were not long enough and I had to submit another three batch jobs for the final 21 samples. In 24 hours these have managed to process 10, and 11 remain. I suspect those being "left behind" are the samples which take a long time.

I would like to better understand the parallelisation -- are the time-consuming pulp steps parallelised within a sample if threads are available? E.g. if I run PStrain for a single sample with -n 1 -p 8, will 8 threads be used to parallelise the strain optimisations? Or is the parallelisation only across samples?

If the latter is the case, it would be helpful if the pulp steps could be run in parallel for multiple species within a sample where threads are available.

Thanks,

Andrew

NB -- I have just sussed out that a single sample can still use 2 processors (as reported by my HPC, which may simply reflect 4 cores presenting as 8 because of hyperthreading, or something I don't understand!), so I should probably have run only four samples per job across twice the number of jobs.

wshuai294 commented 2 years ago

Hi Andrew

It is the latter case. Running each pulp optimisation in its own process would be better, and I will make this change when I have time; currently I am a little busy. Thank you very much for the kind suggestion!

Best, Shuai

andrewjmc commented 2 years ago

Thanks for the clarification! I think parallelisation of pulp processing will be most useful to people with large and diverse metagenomes where a sample processed through a single thread could take a long time. For most users it might not matter. No pressure to implement unless you think it will help advance the pipeline for a range of users.

Four samples remaining for me now, then it's time to merge!

andrewjmc commented 2 years ago

Worked perfectly, and all merged!

wshuai294 commented 2 years ago

Happy to hear this!