wdecoster / NanoPlot

Plotting scripts for long read sequencing data
http://nanoplot.bioinf.be
MIT License

when to use --huge? #306

Closed SHuang-Broad closed 1 year ago

SHuang-Broad commented 1 year ago

Hi,

I have a simple question: when should we use the option --huge?

I looked at the source code in Nanoget (https://github.com/wdecoster/nanoget/blob/bba616c75a3d30ebe547dd62e2fb5eeefe8cac6f/nanoget/nanoget.py#L79), and it looks like it avoids multi-threading when that option is turned on.

So is the following understanding correct? If my single (aligned) BAM is 100 GB (please let me know if that doesn't fit the definition of huge) and the option is turned on, it will use a single process_bam call to process it; but if the option is turned off, it will process all input BAMs separately using a pool of executors, which in my case is still just one.

Now onto process_bam, which receives the --huge option from nanoget.get_input(...). This function also uses just one thread if --huge is turned on, whereas if it is turned off, it scatters the work using one thread per chromosome. This is a bit puzzling to me: wouldn't one want to use multiple threads precisely in the case of a large input, to speed things up? Please correct me if I'm mistaken.

    samfile = check_bam(bam)
    chromosomes = samfile.references
    if len(chromosomes) > 100 or kwargs["huge"]:
        logging.info("Nanoget: lots of contigs (>100) or --huge, not running in separate processes")
        datadf = pd.DataFrame(
            data=extract_from_bam(bam, None, kwargs["keep_supp"]),
            columns=["readIDs", "quals", "aligned_quals", "lengths",
                     "aligned_lengths", "mapQ", "percentIdentity"]) \
            .dropna(axis='columns', how='all') \
            .dropna(axis='index', how='any')

    else:
        unit = chromosomes
        with cfutures.ProcessPoolExecutor(max_workers=kwargs["threads"]) as executor:
            datadf = pd.DataFrame(
                data=[res for sublist in executor.map(extract_from_bam,
                                                      repeat(bam),
                                                      unit,
                                                      repeat(kwargs["keep_supp"]))
                      for res in sublist],
                columns=["readIDs", "quals", "aligned_quals", "lengths",
                         "aligned_lengths", "mapQ", "percentIdentity"]) \
                .dropna(axis='columns', how='all') \
                .dropna(axis='index', how='any')
    logging.info(f"Nanoget: bam {bam} contains {datadf['lengths'].size} primary alignments.")
    return ut.reduce_memory_usage(datadf)

We do sometimes process high-coverage BAMs, and NanoPlot takes 7+ hours on them. We'd love to understand what's causing this.

Thank you. Steve

wdecoster commented 1 year ago

With lots of contigs (e.g. reads aligned to the transcriptome rather than the genome), the multiprocessing becomes much slower, as a new process has to be started for rather small units of work. The overhead of that outweighs what is gained by using multiple threads. Another reason is that multiprocessing passes the data back to the main process by pickling it, which imposes a size limit on the dataframe and is sometimes problematic for very large BAMs. If there are many contigs, the single-process behaviour of --huge is effectively turned on automatically (https://github.com/wdecoster/nanoget/blob/bba616c75a3d30ebe547dd62e2fb5eeefe8cac6f/nanoget/extraction_functions.py#L157), and if the number of reads per contig is too large you would otherwise get an error.
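The overhead trade-off can be seen even without a BAM file. Below is a minimal, self-contained sketch (toy functions, not nanoget's actual code) that contrasts scattering thousands of tiny per-contig tasks over a process pool against doing the same trivial work in one process:

import time
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat

def extract(_bam, contig):
    # Stand-in for extract_from_bam: each contig yields only a tiny bit of work.
    return [len(contig)]

def scattered(bam, contigs, workers=4):
    # One task per contig, results flattened, mirroring the pool-based branch.
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return [r for sub in executor.map(extract, repeat(bam), contigs) for r in sub]

def in_one_process(bam, contigs):
    # Everything handled in the current process, like the single-process branch.
    return [r for c in contigs for r in extract(bam, c)]

if __name__ == "__main__":
    contigs = [f"contig{i}" for i in range(3000)]  # e.g. a full-GRCh38-sized contig list
    t0 = time.time()
    scattered("toy.bam", contigs)
    t1 = time.time()
    in_one_process("toy.bam", contigs)
    t2 = time.time()
    print(f"per-contig processes: {t1 - t0:.2f}s, single process: {t2 - t1:.2f}s")

With work units this small, the per-task process startup and pickling cost dominates, which is exactly the regime the contig-count check is meant to avoid.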

7+ hours is indeed unacceptable, but I'm not sure if I can come up with a quick fix.

SHuang-Broad commented 1 year ago

Thanks for the explanation!

Now I have a follow-up observation concerning exactly the line of code you linked: https://github.com/wdecoster/nanoget/blob/bba616c75a3d30ebe547dd62e2fb5eeefe8cac6f/nanoget/extraction_functions.py#L157

This line essentially makes human BAM processing single-threaded. Take the GRCh38 reference, for example: the full set has over 3,000 contigs, and even the "no_alt" analysis set has 195 contigs. Both exceed the threshold of 100.
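For a quick check of which code path a given BAM will hit, one can count the contigs in the header. A small sketch using pysam (the file path is a placeholder):

import pysam

bam = "NA24385_LR.bam"  # placeholder path
with pysam.AlignmentFile(bam, "rb") as samfile:
    n_contigs = len(samfile.references)

print(f"{n_contigs} contigs; single-process path taken: {n_contigs > 100}")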

So I did a bit of benchmarking and hacking. Here's what I found in an experiment with one data point.

not using --huge

NanoPlot -t 16 -c orangered --N50 --tsv_stats --no_supplementary --verbose --bam /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam
2022-08-30 02:48:16,580 NanoPlot 1.35.5 started with arguments Namespace(threads=16, verbose=True, store=False, raw=False, huge=False, outdir='.', prefix='', tsv_stats=True, info_in_report=False, legacy=False, maxlength=None, minlength=None, drop_outliers=False, downsample=None, loglength=False, percentqual=False, alength=False, minqual=None, runtime_until=None, readtype='1D', barcoded=False, no_supplementary=True, color='orangered', colormap='Greens', plots=['kde', 'dot'], listcolors=False, listcolormaps=False, no_N50=False, N50=True, title=None, font_scale=1, dpi=100, hide_stats=False, fastq=None, fasta=None, fastq_rich=None, fastq_minimal=None, summary=None, bam=['/cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam'], ubam=None, cram=None, pickle=None, feather=None, path='./')
2022-08-30 02:48:16,580 Python version is: 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) [GCC 9.3.0]
2022-08-30 02:48:16,677 Nanoget: Starting to collect statistics from bam file /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam.
[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam.bai
2022-08-30 02:48:17,190 Nanoget: Bam file /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam contains 8735006 mapped and 14594 unmapped reads.
2022-08-30 02:48:17,190 Nanoget: lots of contigs (>100) or --huge, not running in separate processes
[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam.bai
[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam.bai
2022-08-30 09:46:04,951 Nanoget: bam /cromwell_root/fc-36ec993a-12dc-4644-9554-33f5d5508088/submissions/5817e74f-405d-41f5-8005-d6d7c7eb87a6/PBCCSWholeGenome/a6e39415-9276-4186-9b85-72003fb1a5c9/call-MergeAllReads/cacheCopy/NA24385_LR.bam contains 8206803 primary alignments.
2022-08-30 09:46:09,053 Reduced DataFrame memory usage from 1153.5245666503906Mb to 942.2059011459351Mb
2022-08-30 09:46:17,396 Nanoget: Gathered all metrics of 8206803 reads

Note it took roughly 7 hours (02:48 to 09:46).

hacking Nanoget

After changing that threshold to 5000 (an arbitrary number):

if len(chromosomes) > 5000 or kwargs["huge"]:

I ran the following test code:

import time
from nanoget import get_input

start = time.time()

# 120 GB aligned BAM, processed with 8 worker threads
df = get_input("bam", ["/home/shuang/nanoplot/NA24385_LR.bam"], threads=8)

end = time.time()
print(f"Time consumed in working: {end - start} seconds")

And the wall-clock time was Time consumed in working: 3651.3282177448273 seconds, i.e. roughly one hour, about a 7x speedup over the run above.

Throughout the whole process, CPU usage was high, which is good. I also monitored memory usage, which never exceeded 10 GB for this 120 GB BAM.
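The thread doesn't say how memory was tracked; as one possibility, here is a small sketch using Python's standard resource module to report peak resident set size after the run (note that ru_maxrss is in KiB on Linux and bytes on macOS):

import resource
import sys

def peak_rss_gib(who=resource.RUSAGE_SELF):
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    raw = resource.getrusage(who).ru_maxrss
    return raw / 1024**3 if sys.platform == "darwin" else raw / 1024**2

# Call after get_input(...) has returned:
print(f"main process peak: {peak_rss_gib():.2f} GiB, "
      f"finished children peak: {peak_rss_gib(resource.RUSAGE_CHILDREN):.2f} GiB")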

Thanks, Steve

wdecoster commented 1 year ago

That's a nice speedup. The threshold of 100 is also entirely arbitrary; I by no means investigated where the tipping point for efficiency lies. Do you plan to open a PR, or should I make the change? AFAIK you shouldn't use the "full" human genome with all those 3,000 contigs for alignment anyway.

SHuang-Broad commented 1 year ago

I agree that the full GRCh38 is not the optimal reference to use. We actually use the "no_alt" analysis set with 195 contigs.

I did in fact make a fork with another arbitrary threshold of 3500 instead of 100. I'm perfectly happy to open a PR here or over at Nanoget; how should I do it?

Thanks, Steve

wdecoster commented 1 year ago

Yes, I saw your fork, that's why I asked. In the absence of any reason against it, 200 sounds good, yes. Maybe it will have to be tuned again at some point. You would have to fork nanoget and open the PR there.

SHuang-Broad commented 1 year ago

👍 Yep, agreed on further tuning. Changing that number for the human reference would need a good reason, but for other organisms it may be totally different. There, however, I agree with your original judgment: there's no easy fix.

SHuang-Broad commented 1 year ago

https://github.com/wdecoster/nanoget/pull/27

wdecoster commented 1 year ago

Thanks, it is now on PyPI as nanoget v1.17.0.

SHuang-Broad commented 1 year ago

Yay!

SHuang-Broad commented 1 year ago

I'll do a follow-up PR here to update the setup.py.
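For context, such a follow-up would mainly bump the minimum nanoget version. A sketch of the kind of change, assuming (not verified here) that NanoPlot's setup.py lists nanoget under install_requires:

# Sketch of the relevant fragment of a setup.py (assumed structure,
# not the actual NanoPlot file): raise the nanoget floor to pick up the fix.
from setuptools import setup

setup(
    name="NanoPlot",
    install_requires=[
        "nanoget>=1.17.0",  # version with the raised contig threshold
        # ...other dependencies left unchanged...
    ],
)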

wdecoster commented 1 year ago

Thanks!

SHuang-Broad commented 1 year ago

Thank you!

SHuang-Broad commented 1 year ago

Oh, I also just realized that

are now out of sync. Syncing them again might avoid confusing users if it's not too much of a hassle.

Thanks again!

ymcki commented 11 months ago

I find that for CRAM files, the contig-number threshold has not been updated yet.

wdecoster commented 11 months ago

I have changed the number for CRAM in v1.19.2.