Closed SHuang-Broad closed 1 year ago
With lots of contigs (e.g. reads aligned to the transcriptome rather than the genome), the multiprocessing becomes much slower, as a new process has to be started for rather small units of work. That overhead outweighs what is gained by using multiple processes. Another reason is that multiprocessing passes the data back to the main process by pickling it, which imposes a size limit on the dataframe that is sometimes problematic for very large bams. If there are many contigs then --huge is turned on automatically (https://github.com/wdecoster/nanoget/blob/bba616c75a3d30ebe547dd62e2fb5eeefe8cac6f/nanoget/extraction_functions.py#L157), and if the number of reads per contig is too large you would get an error.
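The startup-and-pickling overhead is easy to see with a toy benchmark (generic Python, nothing nanoget-specific; `tiny_task` merely stands in for per-contig extraction):

```python
import time
from concurrent.futures import ProcessPoolExecutor

def tiny_task(contig):
    # Stand-in for extracting read stats from one contig; cheap on purpose.
    return len(contig)

def run_serial(contigs):
    # Single sequential pass in the main process: no dispatch overhead.
    return [tiny_task(c) for c in contigs]

def run_pooled(contigs, workers=4):
    # One task per contig: process dispatch plus pickling every result
    # back to the main process can cost more than the work itself.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tiny_task, contigs))

if __name__ == "__main__":
    contigs = [f"contig_{i}" for i in range(3000)]  # e.g. full GRCh38
    t0 = time.time(); pooled = run_pooled(contigs); t1 = time.time()
    serial = run_serial(contigs); t2 = time.time()
    assert pooled == serial
    print(f"pooled: {t1 - t0:.3f}s  serial: {t2 - t1:.3f}s")
```

With thousands of near-empty contigs the pooled variant is typically slower despite using more cores, which is the efficiency trade-off described above.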
7+ hours is indeed unacceptable, but I'm not sure if I can come up with a quick fix.
Thanks for the explanation!
Now I have a follow-up observation, which refers to exactly that line of the code you linked: https://github.com/wdecoster/nanoget/blob/bba616c75a3d30ebe547dd62e2fb5eeefe8cac6f/nanoget/extraction_functions.py#L157
This line essentially makes the human BAM processing a single-threaded process. Take for example the GRCh38 reference, which has over 3,000 contigs in its full set. Even the "no_alt" analysis set has 195 contigs. Both are greater than the threshold of 100.
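For reference, the dispatch decision at that line can be sketched like this (a hypothetical helper mirroring the linked check, not the actual source):

```python
# Simplified sketch of nanoget's dispatch decision; the real check is in
# nanoget/extraction_functions.py at the line linked above.
def runs_in_separate_processes(n_contigs, huge=False, threshold=100):
    # False -> a single sequential pass over the whole BAM.
    return not (n_contigs > threshold or huge)

print(runs_in_separate_processes(3000))  # full GRCh38: False (single pass)
print(runs_in_separate_processes(195))   # "no_alt" set: also False
```

Both human contig counts exceed the threshold of 100, so both references fall onto the single-pass path.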
So I did a bit of benchmarking and hacking. Here's what I've found from an experiment with a single data point.
--huge
NanoPlot -t 16 -c orangered --N50 --tsv_stats --no_supplementary --verbose --bam /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam
2022-08-30 02:48:16,580 NanoPlot 1.35.5 started with arguments Namespace(threads=16, verbose=True, store=False, raw=False, huge=False, outdir='.', prefix='', tsv_stats=True, info_in_report=False, legacy=False, maxlength=None, minlength=None, drop_outliers=False, downsample=None, loglength=False, percentqual=False, alength=False, minqual=None, runtime_until=None, readtype='1D', barcoded=False, no_supplementary=True, color='orangered', colormap='Greens', plots=['kde', 'dot'], listcolors=False, listcolormaps=False, no_N50=False, N50=True, title=None, font_scale=1, dpi=100, hide_stats=False, fastq=None, fasta=None, fastq_rich=None, fastq_minimal=None, summary=None, bam=['/cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam'], ubam=None, cram=None, pickle=None, feather=None, path='./')
2022-08-30 02:48:16,580 Python version is: 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) [GCC 9.3.0]
2022-08-30 02:48:16,677 Nanoget: Starting to collect statistics from bam file /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam.
[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam.bai
2022-08-30 02:48:17,190 Nanoget: Bam file /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam contains 8735006 mapped and 14594 unmapped reads.
2022-08-30 02:48:17,190 Nanoget: lots of contigs (>100) or --huge, not running in separate processes
[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam.bai
[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam.bai
2022-08-30 09:46:04,951 Nanoget: bam /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam contains 8206803 primary alignments.
2022-08-30 09:46:09,053 Reduced DataFrame memory usage from 1153.5245666503906Mb to 942.2059011459351Mb
2022-08-30 09:46:17,396 Nanoget: Gathered all metrics of 8206803 reads
Note it took 7 hours.
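As an aside, the "Reduced DataFrame memory usage" log line presumably comes from downcasting numeric columns. A generic sketch of that technique (an assumed pattern, not nanoget's exact code; the column names are made up):

```python
import numpy as np
import pandas as pd

def reduce_memory(df):
    # Downcast 64-bit numeric columns to the smallest dtype that fits.
    for col in df.select_dtypes(include=["integer"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=["float"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df

# Hypothetical per-read metrics table.
df = pd.DataFrame({"lengths": np.arange(1000, dtype="int64"),
                   "quals": np.linspace(5.0, 30.0, 1000)})
before = df.memory_usage(deep=True).sum()
df = reduce_memory(df)
after = df.memory_usage(deep=True).sum()
print(f"Reduced DataFrame memory usage from {before}B to {after}B")
```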
Nanoget
After changing that threshold to 5000 (an arbitrary number)
if len(chromosomes) > 5000 or kwargs["huge"]:
I ran the following test code:
import time
from nanoget import get_input

start = time.time()
df = get_input("bam", ["/home/shuang/nanoplot/NA24385_LR.bam"], threads=8)
end = time.time()
print(f"Time consumed in working: {end - start} seconds")
And the wall-clock time is
Time consumed in working: 3651.3282177448273 seconds
Throughout the whole process the CPU usage was high, which is good. I've also monitored the memory usage, which never exceeded 10GB for this 120GB BAM. That's roughly a 7x speedup over the ~7 hours (~25,000 seconds) above.
Thanks, Steve
That's a nice speedup. The threshold of 100 is also entirely arbitrary; I by no means investigated where the tipping point for efficiency is. Do you plan to open a PR, or should I make the change? AFAIK you shouldn't use the "full" human genome with all those 3,000 contigs for alignment.
I agree that the full GRCh38 is not the optimal reference to use. We actually use the "no_alt" analysis set with 195 contigs.
I did make a fork and changed the threshold to another arbitrary value, 3500, instead of 100. I'm perfectly happy to make a PR here or over at Nanoget; how should I do it?
Thanks, Steve
Yes, I saw your fork, that's why I asked. In the absence of any reason against it, 200 sounds good yes. Maybe it will have to be tuned again at some point. You would have to fork nanoget and open the PR there.
👍 Yep, agreed on further tuning. Changing that number would need a good reason for the human reference, but for other organisms it may be totally different. There, however, I agree with your original judgment: there's no easy fix.
Thanks, it is now on PyPI as nanoget v1.17.0.
Yay!
I'll do a follow-up PR here to update the setup.py.
Thanks!
Thank you!
Oh, I also just realized that
are now out of sync. Syncing them again might avoid confusing users if it's not too much of a hassle.
Thanks again!
I find that for CRAM files, the contig-number threshold has not been updated yet.
I have changed the number for CRAM in v1.19.2.
Hi,
I have a simple question: when should we use the option --huge? I looked at the source code in Nanoget (https://github.com/wdecoster/nanoget/blob/bba616c75a3d30ebe547dd62e2fb5eeefe8cac6f/nanoget/nanoget.py#L79) and it looks like it avoids using multiple threads if that option is turned on.
So is the following understanding correct? If my single (aligned) BAM is 100GB (please let me know if this doesn't fit the definition of huge) and that option is turned on, it will use a single process_bam call to process it; but if that option is turned off, it'll process all the input BAMs separately using a pool of executors, which is still just one in my case.
Now onto process_bam, which gets the --huge option from nanoget.get_input(...). This function also tries to use just one thread if --huge is turned on, whereas if it is turned off, it scatters by using one thread per chromosome. This is a bit puzzling to me, because wouldn't one want to make use of multiple threads, in the case of a large input, to speed things up? Can you please correct me if I'm mistaken?
We do sometimes process high-coverage BAMs, and NanoPlot takes 7+ hours on them. We'd love to understand what's causing this.
Thank you. Steve