wdecoster / NanoPlot

Plotting scripts for long read sequencing data
http://nanoplot.bioinf.be
MIT License

runs very slowly without "--only-report" option #352

Open chucknordy opened 6 months ago

chucknordy commented 6 months ago

Hi, I've recently started using NanoPlot, really liking it, but I'm wondering about one thing.

If I use this command line, with the --only-report option, then of course I only get the one HTML output file (along with the stats and the log), but that one file contains the full output, with all of the various sub-plots.

time NanoPlot --threads 16 --verbose \
    --outdir "${name}-plots" \
    --fastq "${name}.fastq.gz" \
    --info_in_report \
    --only-report \
    --N50 --legacy hex dot kde

If, however, I use the same command line but simply omit the --only-report option, it produces many more output files (PNG and HTML versions of each subplot separately), as expected. What I didn't expect is that the run time is much longer for the second command. Is there some obvious reason for this? It seems strange, since presumably the same information is being processed either way.

The output log here, for a small FASTQ input of only 594 Nanopore reads, shows that it took 45 minutes to run the latter way, whereas it took only seconds to run the first way. From the timestamps, it looks as if the simple act of saving each PNG file is taking many minutes, which makes no sense to me. Please let me know if I'm missing something obvious, thank you!

2024-01-11 18:26:27,754 NanoPlot 1.42.0 started with arguments Namespace(threads=16, verbose=True, store=False, raw=False, huge=False, outdir='E.884-63H-6-plots', no_static=False, prefix='', tsv_stats=False, only_report=False, info_in_report=True, maxlength=None, minlength=None, drop_outliers=False, downsample=None, loglength=False, percentqual=False, alength=False, minqual=None, runtime_until=None, readtype='1D', barcoded=False, no_supplementary=False, color='#4CB391', colormap='Greens', format=['png'], plots=['kde', 'dot'], legacy=['hex', 'dot', 'kde'], listcolors=False, listcolormaps=False, no_N50=False, N50=True, title='E.884-63H-6', font_scale=1, dpi=100, hide_stats=False, fastq=['E.884-63H-6.fastq.gz'], fasta=None, fastq_rich=None, fastq_minimal=None, summary=None, bam=None, ubam=None, cram=None, pickle=None, feather=None, path='E.884-63H-6-plots/')
2024-01-11 18:26:27,755 Python version is: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
2024-01-11 18:26:27,794 Nanoget: Starting to collect statistics from plain fastq file.
2024-01-11 18:26:27,795 Nanoget: Decompressing gzipped fastq E.884-63H-6.fastq.gz
2024-01-11 18:26:28,061 Reduced DataFrame memory usage from 0.009185791015625Mb to 0.009185791015625Mb
2024-01-11 18:26:28,078 Nanoget: Gathered all metrics of 594 reads
2024-01-11 18:26:28,090 Calculated statistics
2024-01-11 18:26:28,091 Using sequenced read lengths for plotting.
2024-01-11 18:26:28,094 NanoPlot:  Valid color #4CB391.
2024-01-11 18:26:28,094 NanoPlot:  Valid colormap Greens.
2024-01-11 18:26:28,095 NanoPlot:  Creating length plots for Read length.
2024-01-11 18:26:28,095 NanoPlot: Using 594 reads with read length N50 of 3355bp and maximum of 14210bp.
2024-01-11 18:32:54,405 Saved E.884-63H-6-plots/WeightedHistogramReadlength  as png (or png for --legacy)
2024-01-11 18:39:20,796 Saved E.884-63H-6-plots/WeightedLogTransformed_HistogramReadlength  as png (or png for --legacy)
2024-01-11 18:45:46,793 Saved E.884-63H-6-plots/Non_weightedHistogramReadlength  as png (or png for --legacy)
2024-01-11 18:52:12,864 Saved E.884-63H-6-plots/Non_weightedLogTransformed_HistogramReadlength  as png (or png for --legacy)
2024-01-11 18:58:40,142 Saved E.884-63H-6-plots/Yield_By_Length  as png (or png for --legacy)
2024-01-11 18:58:40,143 Created length plots
2024-01-11 18:58:40,144 NanoPlot: Creating Read lengths vs Average read quality plots using 594 reads.
2024-01-11 19:05:06,692 Saved E.884-63H-6-plots/LengthvsQualityScatterPlot_dot  as png (or png for --legacy)
2024-01-11 19:11:33,302 Saved E.884-63H-6-plots/LengthvsQualityScatterPlot_kde  as png (or png for --legacy)
NanoPlot needs seaborn and matplotlib with --legacy
2024-01-11 19:11:33,309 Created LengthvsQual plot
2024-01-11 19:11:33,309 Writing html report.
2024-01-11 19:11:33,349 Finished!

real    45m8.805s
user    0m4.128s
sys     0m7.028s
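
The timestamps in the log above can be checked mechanically. The following is a minimal Python sketch, not part of NanoPlot: it takes four lines copied verbatim from the log and computes the time elapsed between consecutive entries. The function names `parse_timestamp` and `gaps_in_seconds` are hypothetical helpers introduced only for this illustration.

```python
from datetime import datetime

# Four consecutive lines copied verbatim from the log above.
LOG = """\
2024-01-11 18:26:28,095 NanoPlot: Using 594 reads with read length N50 of 3355bp and maximum of 14210bp.
2024-01-11 18:32:54,405 Saved E.884-63H-6-plots/WeightedHistogramReadlength  as png (or png for --legacy)
2024-01-11 18:39:20,796 Saved E.884-63H-6-plots/WeightedLogTransformed_HistogramReadlength  as png (or png for --legacy)
2024-01-11 18:45:46,793 Saved E.884-63H-6-plots/Non_weightedHistogramReadlength  as png (or png for --legacy)
"""


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' stamp (first 23 characters)."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def gaps_in_seconds(log_text):
    """Return the seconds elapsed between each pair of consecutive log lines."""
    stamps = [parse_timestamp(line) for line in log_text.splitlines() if line.strip()]
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


for gap in gaps_in_seconds(LOG):
    print(f"{gap:.1f} s between consecutive log entries")
```

Each gap comes out at roughly 386 seconds, and the log contains seven such "Saved ... as png" entries, which accounts for nearly all of the 45 minutes of wall time. The `time` output points the same way: about 11 seconds of combined user and sys CPU time against 45 minutes of real time suggests the process spent most of the run waiting rather than computing.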

wdecoster commented 6 months ago

That's a very interesting observation and I totally did not expect that to be the case. Remarkable. Maybe --only-report should be the default, then. Hrm.

wshropshire commented 6 months ago

Just FYI, I was running into a similar run-time issue, and adding the '--no_static' option drastically improved the wall time. It looks like there might be an issue with the kaleido Python module.
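
One way to test the kaleido hypothesis outside NanoPlot is to time an HTML export against a static PNG export of the same figure. The sketch below assumes that plotly and kaleido are installed and that NanoPlot's static images are produced through plotly's `write_image`; neither assumption is confirmed in this thread. `time_call` and `compare_exports` are hypothetical helper names, and the read lengths are made-up placeholder values.

```python
import time
from pathlib import Path


def time_call(fn, *args, **kwargs):
    """Call fn and return the wall-clock seconds it took."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start


def compare_exports(outdir):
    """Time plotly's HTML export against its static PNG export.

    Assumption: plotly is installed, and write_image can reach kaleido.
    """
    import plotly.graph_objects as go  # imported here so the timing helper works without plotly

    # Placeholder read lengths, for illustration only.
    lengths = [1200, 3355, 2800, 14210, 900, 4100]
    fig = go.Figure(data=go.Histogram(x=lengths))

    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    return {
        "html_seconds": time_call(fig.write_html, str(outdir / "hist.html")),
        "png_seconds": time_call(fig.write_image, str(outdir / "hist.png")),
    }
```

Calling `compare_exports("export-test")` in the same environment where NanoPlot is slow should show whether the PNG step alone takes minutes. If it does, the delay sits in the static export stack rather than in NanoPlot's own processing.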

wdecoster commented 5 months ago

Yes, that seems to be the case. Hrm, static images can be helpful too... I think. Or maybe everyone only cares about the html report.

wshropshire commented 5 months ago

Possibly, the most useful output for us is the statistics, but it is nice to have the html files as supplementary.

chucknordy commented 5 months ago

Thanks for the responses, folks. I can confirm that using the --no_static option (and not the --only-report option) also allows very fast runtimes. So indeed, the generation of the static PNG files seems to be the problem. But really, I don't see those as particularly important (at least, not worth waiting for -- with things as-is, it would be much quicker to just create the HTML output and then grab screenshots, if one really wanted static images for a slide deck or something).

For the next release, I'd recommend making the --no_static behavior the default, getting rid of that option, and instead providing, say, a --static option for those who really want PNGs and have lots of patience. :-)

wdecoster commented 5 months ago

> Possibly, the most useful output for us is the statistics, but it is nice to have the html files as supplementary.

I don't know which input format you are using, but if it is BAM or CRAM, have you considered using cramino?

> For the next release, I'd recommend making the --no_static behavior the default, getting rid of that option, and instead providing, say, a --static option for those who really want PNGs and have lots of patience. :-)

Yes, that sounds like a good suggestion :)

TBradley27 commented 2 months ago

For what it is worth, when I tested --only-report on WGS data at an average of 27X coverage in a single BAM or CRAM file, I observed only marginal gains (maybe about 5%?) in run time compared with running similarly sized data without --only-report. I did not perform extensive benchmarking, though.
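
For anyone who wants to compare the options more systematically, a small timing harness along these lines might help. It is only a sketch: it uses the flags that appear in this thread (`--threads`, `--fastq`, `--outdir`, `--no_static`, `--only-report`), assumes NanoPlot is on the PATH, and runs each variant just once. The names `build_command`, `VARIANTS`, and `benchmark`, and the `bench-` output directory prefix, are hypothetical.

```python
import shutil
import subprocess
import time

# Flag combinations discussed in this thread; the keys are labels for this sketch only.
VARIANTS = {
    "default": (),
    "no_static": ("--no_static",),
    "only_report": ("--only-report",),
}


def build_command(fastq, outdir, extra_flags=()):
    """Assemble a NanoPlot argument list from flags seen in this thread."""
    return ["NanoPlot", "--threads", "16", "--fastq", fastq, "--outdir", outdir, *extra_flags]


def benchmark(fastq, variants=VARIANTS):
    """Run each variant once and return wall-clock seconds keyed by variant label."""
    if shutil.which("NanoPlot") is None:
        raise RuntimeError("NanoPlot was not found on PATH")
    timings = {}
    for label, flags in variants.items():
        command = build_command(fastq, f"bench-{label}", flags)
        start = time.perf_counter()
        subprocess.run(command, check=True)  # raises if NanoPlot exits with an error
        timings[label] = time.perf_counter() - start
    return timings
```

Running `benchmark("reads.fastq.gz")` with a real input file in place of the placeholder name returns one timing per variant. A single run per variant is noisy, so repeating the measurement a few times would give a fairer comparison. For BAM or CRAM input, the `--fastq` flag in `build_command` would need to be swapped for the matching input option.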