wdecoster / NanoPlot

Plotting scripts for long read sequencing data
http://nanoplot.bioinf.be
MIT License

NanoPlot memory overflow with --barcoded option #317

Closed yoshinak1 closed 1 year ago

yoshinak1 commented 1 year ago

Hi, I am having memory overflow issues with some of our sequence files from a GridION. I just noticed that some sequence files with barcoded reads make NanoPlot's memory usage extremely high and crash the PC. It seems NanoPlot is not releasing memory; usage climbs to >50 GB. The sequence file is about 4 GB. There are 96 barcodes in total, and only 9 of them are finished by the time a 16 GB machine is close to crashing. Is there any way I can reduce the memory usage? NanoPlot version 1.40.2

yoshinak1 commented 1 year ago

Log file

2022-12-02 09:43:32,195 NanoPlot 1.40.2 started with arguments Namespace(threads=8, verbose=False, store=False, raw=False, huge=False, outdir='nano_test3', no_static=False, prefix='', tsv_stats=False, info_in_report=False, maxlength=None, minlength=None, drop_outliers=False, downsample=None, loglength=False, percentqual=False, alength=False, minqual=None, runtime_until=None, readtype='1D', barcoded=True, no_supplementary=False, color='#4CB391', colormap='Greens', format=['png'], plots=['kde', 'dot'], legacy=None, listcolors=False, listcolormaps=False, no_N50=False, N50=False, title=None, font_scale=1, dpi=100, hide_stats=False, fastq=None, fasta=None, fastq_rich=None, fastq_minimal=None, summary=['seq_sum_without_last_line.txt'], bam=None, ubam=None, cram=None, pickle=None, feather=None, path='nano_test3/')
2022-12-02 09:43:32,196 Python version is: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0]
2022-12-02 09:43:32,221 Nanoget: Collecting metrics from summary file seq_sum_without_last_line.txt for 1D sequencing
2022-12-02 09:43:32,223 Nanoget: Extracting metrics per barcode.
2022-12-02 09:44:26,541 Nanoget: Finished collecting statistics from summary file seq_sum_without_last_line.txt
2022-12-02 09:44:30,195 Reduced DataFrame memory usage from 687.0875244140625Mb to 579.0405921936035Mb
2022-12-02 09:44:31,999 Nanoget: Gathered all metrics of 6294190 reads
2022-12-02 09:44:36,758 Calculated statistics
2022-12-02 09:45:06,329 Using sequenced read lengths for plotting.
2022-12-02 09:45:06,850 Processing unclassified
2022-12-02 09:45:07,137 NanoPlot: Valid color #4CB391.
2022-12-02 09:45:07,138 NanoPlot: Valid colormap Greens.
2022-12-02 09:45:07,185 NanoPlot: Creating length plots for Read length.
2022-12-02 09:45:07,186 NanoPlot: Using 975276 reads maximum of 52125bp.
2022-12-02 09:45:07,868 Saved nano_test3/unclassified_WeightedHistogramReadlength as png (or png for --legacy)
2022-12-02 09:45:08,455 Saved nano_test3/unclassified_WeightedLogTransformed_HistogramReadlength as png (or png for --legacy)
2022-12-02 09:45:09,024 Saved nano_test3/unclassified_Non_weightedHistogramReadlength as png (or png for --legacy)
2022-12-02 09:45:09,593 Saved nano_test3/unclassified_Non_weightedLogTransformed_HistogramReadlength as png (or png for --legacy)
2022-12-02 09:45:10,735 Saved nano_test3/unclassified_Yield_By_Length as png (or png for --legacy)
2022-12-02 09:45:10,736 Created length plots
2022-12-02 09:45:10,794 NanoPlot: Creating Read lengths vs Average read quality plots using 975276 reads.
2022-12-02 09:45:12,019 Saved nano_test3/unclassified_LengthvsQualityScatterPlot_dot as png (or png for --legacy)
2022-12-02 09:45:13,830 Saved nano_test3/unclassified_LengthvsQualityScatterPlot_kde as png (or png for --legacy)
2022-12-02 09:45:13,832 Created LengthvsQual plot
2022-12-02 09:45:13,832 Nanoplotter: Creating heatmap of reads per channel using 975276 reads.
2022-12-02 09:45:14,417 Saved nano_test3/unclassified_ActivityMap_ReadsPerChannel as png (or png for --legacy)
2022-12-02 09:45:14,419 Created spatialheatmap for succesfull basecalls.
2022-12-02 09:45:14,419 Nanoplotter: Creating timeplots using 975276 (full) or 10000 (subsampled dataset) reads.
2022-12-02 09:45:15,106 Saved nano_test3/unclassified_CumulativeYieldPlot_Gigabases as png (or png for --legacy)
2022-12-02 09:45:15,683 Saved nano_test3/unclassified_CumulativeYieldPlot_NumberOfReads as png (or png for --legacy)
2022-12-02 09:45:16,304 Saved nano_test3/unclassified_NumberOfReads_Over_Time as png (or png for --legacy)
2022-12-02 09:45:17,002 Saved nano_test3/unclassified_ActivePores_Over_Time as png (or png for --legacy)
2022-12-02 09:45:17,738 Saved nano_test3/unclassified_TimeLengthViolinPlot as png (or png for --legacy)
2022-12-02 09:45:18,420 Saved nano_test3/unclassified_TimeQualityViolinPlot as png (or png for --legacy)
2022-12-02 09:45:19,075 Saved nano_test3/unclassified_TimeSequencingSpeed_ViolinPlot as png (or png for --legacy)
2022-12-02 09:45:19,079 Created timeplots.
2022-12-02 09:45:19,079 Processing barcode08
(repeat ...)

wdecoster commented 1 year ago

Hi,

Thanks for reporting this. I understand how this could happen and will think of a way to fix it.

Regards, Wouter

yoshinak1 commented 1 year ago

Hi Wouter, Thank you for your reply. I split the sequence file by barcode and ran NanoPlot on each part. That works fine, but I lose the single report file (nanoplot-report.html). With too many barcodes, though, that nanoplot-report.html also crashes my browser (Chrome). Anyway, thank you for your response. I look forward to the next version.

Yoshi
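
For anyone wanting to try the same workaround, here is a minimal sketch of splitting a summary file by barcode. It is not taken from this thread: the function name `split_summary_by_barcode` is hypothetical, and it assumes a tab-separated summary file with a barcode column called `barcode_arrangement` (adjust `column` if your file differs). Rows are streamed one at a time, so the whole file is never loaded into memory.

```python
import csv
from pathlib import Path


def split_summary_by_barcode(summary_path, outdir, column="barcode_arrangement"):
    """Stream a tab-separated summary file and write one file per barcode.

    Assumes (not confirmed in the thread) that `column` holds the barcode label.
    Returns a dict mapping each barcode to the number of rows written.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    handles, writers, counts = {}, {}, {}
    try:
        with open(summary_path, newline="") as src:
            reader = csv.DictReader(src, delimiter="\t")
            for row in reader:
                barcode = row[column]
                if barcode not in writers:
                    # First row for this barcode: open its file and copy the header.
                    handle = open(outdir / f"{barcode}.txt", "w", newline="")
                    writer = csv.DictWriter(
                        handle,
                        fieldnames=reader.fieldnames,
                        delimiter="\t",
                        lineterminator="\n",
                    )
                    writer.writeheader()
                    handles[barcode] = handle
                    writers[barcode] = writer
                    counts[barcode] = 0
                writers[barcode].writerow(row)
                counts[barcode] += 1
    finally:
        # Close every per-barcode file, even if a row failed to parse.
        for handle in handles.values():
            handle.close()
    return counts
```

Each resulting file can then be passed to its own NanoPlot run via `--summary`, so only one barcode's reads are held in memory at a time.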

wdecoster commented 1 year ago

A report is created per barcode in the latest version (v1.41.0), now available through pip. This makes reports smaller and prevents all plots from being stored in memory until the report can be created at the very end. Feedback appreciated!
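
To illustrate why a per-barcode report keeps memory bounded, here is a rough sketch of the general pattern. It is not NanoPlot's actual code: `write_report_per_group` and the `render_plots` callable are hypothetical names used only for illustration. Each group's plots are written to disk as soon as they are rendered, instead of being collected for one combined report at the end.

```python
from pathlib import Path


def write_report_per_group(groups, render_plots, outdir):
    """Write one HTML report per group and return the report paths.

    `groups` maps a group name (e.g. a barcode) to its data.
    `render_plots` is a hypothetical callable returning a list of HTML fragments.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, data in groups.items():
        # Only the current group's plots are held in memory here.
        fragments = render_plots(name, data)
        report = outdir / f"{name}-report.html"
        report.write_text(
            "<html><body>\n" + "\n".join(fragments) + "\n</body></html>\n"
        )
        written.append(report)
        # `fragments` is rebound on the next iteration, so the previous
        # group's plots become eligible for garbage collection.
    return written
```

With this structure, peak memory scales with the largest single group rather than with the total number of groups.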