wdecoster / NanoPlot

Plotting scripts for long read sequencing data
http://nanoplot.bioinf.be
MIT License
429 stars 47 forks source link

If a FASTQ file contains onlye reads with 0 quality, the quality plotting crashes #124

Closed jvhaarst closed 5 years ago

jvhaarst commented 5 years ago

I have converted a set of PacBio Sequel files to FASTQ, and as those do not contain quality info (they are all zeroes), the plotting of the quality plots crashes:

2019-03-11 15:39:09,082 NanoPlot 1.20.0 started with arguments Namespace(N50=False, alength=False, bam=None, barcoded=False, color='#4CB391', cram=None, downsample=None, dpi=100, drop_outliers=False, fasta=None, fastq=['m54072_190118_111145.fastq.gz'], fastq_minimal=None, fastq_rich=None, font_scale=1, format='png', listcolors=False, loglength=False, maxlength=None, minlength=None, minqual=None, no_N50=False, outdir='.', percentqual=False, pickle=None, plots=['kde', 'dot'], prefix='m54072_190118_111145.', raw=False, readtype='1D', runtime_until=None, store=True, summary=None, threads=48, title=None, verbose=True)
2019-03-11 15:39:09,082 Python version is: 3.6.7 | packaged by conda-forge | (default, Feb 20 2019, 02:51:38)  [GCC 7.3.0]
2019-03-11 15:39:09,084 Nanoplotter: valid output format png
2019-03-11 15:39:10,057 Nanoget: Starting to collect statistics from plain fastq file.
2019-03-11 15:39:10,070 Nanoget: Decompressing gzipped fastq m54072_190118_111145.fastq.gz
2019-03-11 16:08:40,256 Reduced DataFrame memory usage from 9.625076293945312Mb to 6.416717529296875Mb
2019-03-11 16:08:40,411 Nanoget: Gathered all metrics of 420526 reads
2019-03-11 16:08:41,038 Calculated statistics
2019-03-11 16:08:41,039 Using sequenced read lengths for plotting.
2019-03-11 16:08:41,104 Nanoplotter: Valid color #4CB391.
2019-03-11 16:08:41,113 Nanoplotter: Creating length plots for Read length.
2019-03-11 16:08:41,115 Nanoplotter: Using 420526 reads maximum of 121963bp.
2019-03-11 16:08:50,188 Created length plots
2019-03-11 16:08:50,213 Nanoplotter: Creating Read lengths vs Average read quality plots using statistics from 420526 reads.
/home/haars001/miniconda2/envs/albacore/lib/python3.6/site-packages/seaborn/axisgrid.py:1735: UserWarning: Attempting to set identical bottom==top results
in singular transformations; automatically expanding.
bottom=0, top=-0.0
  ax_joint.set_ylim(ylim)
/home/haars001/miniconda2/envs/albacore/lib/python3.6/site-packages/seaborn/axisgrid.py:1735: UserWarning: Attempting to set identical bottom==top results
in singular transformations; automatically expanding.
bottom=0, top=-0.0
  ax_joint.set_ylim(ylim)
/home/haars001/miniconda2/envs/albacore/lib/python3.6/site-packages/statsmodels/nonparametric/kernels.py:128: RuntimeWarning: invalid value encountered in true_divide
  return (1. / np.sqrt(2 * np.pi)) * np.exp(-(Xi - x)**2 / (h**2 * 2.))
/home/haars001/miniconda2/envs/albacore/lib/python3.6/site-packages/matplotlib/contour.py:1557: UserWarning: Warning: converting a masked element to nan.
  self.zmax = float(z.max())
/home/haars001/miniconda2/envs/albacore/lib/python3.6/site-packages/matplotlib/contour.py:1558: UserWarning: Warning: converting a masked element to nan.
  self.zmin = float(z.min())
/home/haars001/miniconda2/envs/albacore/lib/python3.6/site-packages/matplotlib/contour.py:1203: RuntimeWarning: invalid value encountered in less
  under = np.nonzero(lev < self.zmin)[0]
/home/haars001/miniconda2/envs/albacore/lib/python3.6/site-packages/matplotlib/contour.py:1205: RuntimeWarning: invalid value encountered in greater
  over = np.nonzero(lev > self.zmax)[0]
/home/haars001/miniconda2/envs/albacore/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py:488: RuntimeWarning: invalid value encountered in true_divide
  binned = fast_linbin(X, a, b, gridsize) / (delta * nobs)
/home/haars001/miniconda2/envs/albacore/lib/python3.6/site-packages/statsmodels/nonparametric/kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
  FAC1 = 2*(np.pi*bw/RANGE)**2
2019-03-11 16:09:07,711 Created LengthvsQual plot
2019-03-11 16:09:07,712 Writing html report.
2019-03-11 16:09:15,382 Finished!
jvhaarst commented 5 years ago

As I do not know where it would make the most sense to change the action, I haven't tried a pull request. One could argue that creating quality plots without any quality info makes no sense, so checking in nanoget whether the qualities are all zeroes, and then dropping that column is one way to go. Another way would be to let nanoplotter just create the plots, but check there whether the y-limits are the same, and if so, change that so matplotlib doesn't error out.

wdecoster commented 5 years ago

Hmm based on the log file it seems the plotting gave a set of warnings - without errors/crashing the program? It may be that the plot doesn't really make sense though. What I think would be the way forward is that you convert the reads to fasta and use --fasta as input for NanoPlot, in this scenario. Then no plots with quality scores are attempted.

Alternatively, there might be a way in nanoget to check if the qualities are all 0, but starting there that probably doesn't suffice: if the qualities are all 7 or something else we might have other problems.

jvhaarst commented 5 years ago

Yeah, next time I'll extract FASTA from the BAM files, but I still think it would be better if these warnings didn't occur. The "problem" is that all the values are the same, so that df.quals,std() == 0. That in itself is a quick test, but I think a user would be surprised if they didn't get the quality graphs. The might be non-informative as all the values are the same, but they gave NanoPlot FASTQ for a reason. I'm happy with any decision, I'm already very happy that NanoPlot exists.

wdecoster commented 5 years ago

Yes, I agree that there should be a cleaner solution than all those warnings. By extension, bivariate plots with one axis for which all values are the same don't really make sense, so probably the appropriate place to solve this is nanoplotter. I'm not entirely convinced that the plot shouldn't be created at all. I'll see if I can find an appropriate solution. Thanks for your feedback - highly appreciated!

wdecoster commented 5 years ago

Thank you for highlighting this issue. It turns out that depending on the bivariate plot type selected (with --plot) also real errors happened in this case. I have added a check in nanoplotter that will skip those bivariate plots if all values are the same.

This change is included in nanoplotter 1.5.0.

Happy to get any feedback!