wdecoster / cramino

A *fast* tool for BAM/CRAM quality evaluation, intended for long reads
MIT License

Dorado BAM output and stats #16

Closed mattloose closed 1 year ago

mattloose commented 1 year ago

Hi Wouter,

Have you run this on an aligned BAM file as output by dorado? It seems to me that the calculated statistics reflect the alignments and not the underlying read data: for example, the Yield and N50 calculations are incorrect, as is Mean coverage.

This might be an issue with the dorado BAM file but I think it may also be that you are really looking at alignments?

Cheers

Matt

wdecoster commented 1 year ago

Hi Matt,

I don't expect that a dorado bam would cause a problem, but I haven't tried. And indeed, cramino works (by default) with the primary and supplementary alignments, not the secondary ones or unaligned reads. This can be changed with --ubam, added relatively recently. When you say incorrect, do you mean by a lot, and is this compared with the metrics based on fastq? I can imagine that differences can happen due to softclipping adapters, for example.

Wouter
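The default record selection described above (primary and supplementary alignments counted, secondary alignments and unaligned reads skipped) can be written in terms of the SAM FLAG bits. This is a minimal sketch for illustration, not cramino's actual code; the helper names `is_counted_by_default` and `is_primary` are hypothetical.

```python
# SAM FLAG bits as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def is_counted_by_default(flag: int) -> bool:
    """Hypothetical helper: keep primary and supplementary alignments,
    skip secondary alignments and unmapped reads."""
    return not (flag & FLAG_UNMAPPED) and not (flag & FLAG_SECONDARY)


def is_primary(flag: int) -> bool:
    """A primary alignment is mapped and neither secondary nor supplementary."""
    return not (flag & (FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY))
```

Under this selection a read split into one primary and two supplementary alignments contributes three records, which is why per-record statistics can differ from per-read statistics computed on the fastq.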

mattloose commented 1 year ago

Thanks for the prompt response.

It's out by a lot here...

This is the output of trusty NanoStat on the fastq extracted from the BAM file using samtools:

General summary:
Mean read length: 36,689.1
Mean read quality: 12.3
Median read length: 7,677.0
Median read quality: 18.5
Number of reads: 708,681.0
Read length N50: 129,276.0
STDEV read length: 64,679.2
Total bases: 26,000,838,396.0

This is the relevant output from cramino (and also NanoPlot) from the aligned bam file:

Number of reads 878325
Yield [Gb] 57.81
Mean coverage 18.72
Yield [Gb] (>25kb) 54.63
N50 182736
N75 112611
Median length 16377.00
Mean length 65823
Median identity 98.63
Mean identity 94.89

The NanoStat from the fastq is in line with what I expect to see.

Cheers

Matt
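For scale: the cramino yield is roughly 2.2 times the NanoStat total (57.81 Gb against 26.0 Gb), while the record count is only about 1.24 times higher (878,325 against 708,681). That pattern would be consistent with long reads being counted more than once. The sketch below shows how N50 is commonly computed and how counting one long read several times shifts it; the function name `n50` and the toy lengths are illustrative assumptions, not taken from either tool.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L
    contain at least half of the total bases (0 for no reads)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy data: five reads, each counted once.
per_read = [50, 30, 20, 10, 10]

# Same data, but the longest read is counted three times,
# as if every one of its alignment records carried its full length.
per_record = [50, 50, 50, 30, 20, 10, 10]
```

With these toy values `n50(per_read)` is 30 and `n50(per_record)` is 50, and the summed yield rises from 120 to 220 bases.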

mattloose commented 1 year ago

An update: having done some digging, I believe that dorado is outputting soft-clipped supplementary alignments, which leads tools to miscalculate lengths if they assume only hard clipping is in use. Using hard clipping instead will cause downstream issues with methylation tags.
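The effect can be illustrated from the CIGAR string alone. In SAM, soft-clipped bases (`S`) remain in the record's SEQ field, while hard-clipped bases (`H`) are removed from it, so a soft-clipped supplementary record still carries the whole read. The following is a minimal sketch under that reading of the SAM specification; it is not cramino's implementation, and `cigar_lengths` and its dictionary keys are hypothetical names.

```python
import re

# One CIGAR operation: a count followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar: str) -> dict:
    """Return three views of read length derived from a CIGAR string."""
    counts = {}
    for n, op in CIGAR_OP.findall(cigar):
        counts[op] = counts.get(op, 0) + int(n)
    # Operations that consume read bases inside the alignment itself.
    query_ops = sum(counts.get(op, 0) for op in "MI=X")
    soft = counts.get("S", 0)
    hard = counts.get("H", 0)
    return {
        "seq_len": query_ops + soft,        # bases stored in SEQ
        "aligned_len": query_ops,           # read bases covered by this alignment
        "full_read_len": query_ops + soft + hard,
    }


# A 1000-base read split into a primary and a supplementary alignment.
primary = cigar_lengths("600M400S")
supp_soft = cigar_lengths("600S400M")   # soft-clipped supplementary
supp_hard = cigar_lengths("600H400M")   # hard-clipped supplementary
```

Summing `seq_len` over the primary and the soft-clipped supplementary record gives 2000 bases for a 1000-base read, whereas summing `aligned_len` gives 1000. With the hard-clipped form, `seq_len` of the supplementary record is already 400, so a tool that sums SEQ lengths only gets the right answer under hard clipping.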

wdecoster commented 1 year ago

This will be fixed in the next release, tentatively this weekend.

mattloose commented 1 year ago

Hey Wouter - any update on this?

Thanks

Matt

wdecoster commented 1 year ago

Pushing a release today, thanks for your patience. Softclips will be ignored now.