Closed by fellen31 6 months ago
I think the problem is that if you are running --ubam on an actual uBAM, there is no such thing as a secondary alignment or a soft-clipped base. The intention of --ubam is to represent all the reads as they come off the sequencer. Having reads with a length of 0 is definitely not desirable, though. Any thoughts on how this could be improved?
I have added a suggestion in #30.
One thing to consider is an aligned BAM that also contains some unaligned reads. For example, add 100 unaligned reads to test-data/small-test-phased.bam, and #30 will report 7416 alignments without --ubam and 7516 alignments with --ubam, while the current version reports 7416 without and 8205 with (since it includes both unaligned reads and secondary alignments).
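To make the three counting behaviours concrete, here is a minimal sketch that reduces each record to its SAM FLAG value. The function names are hypothetical and the filters are my reading of the numbers above, not the tool's actual code; supplementary alignments are left untouched because the thread does not say how they are handled.

```python
# SAM FLAG bits, as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100


def count_default(flags):
    """Assumed behaviour without --ubam: skip unmapped and secondary records."""
    return sum(1 for f in flags if not f & UNMAPPED and not f & SECONDARY)


def count_ubam_proposed(flags):
    """Assumed behaviour of #30 with --ubam: keep unaligned reads, still skip secondary."""
    return sum(1 for f in flags if not f & SECONDARY)


def count_ubam_current(flags):
    """Assumed current behaviour with --ubam: count every record."""
    return len(flags)


# Toy input (not the real test file): 5 primary, 2 secondary, 3 unaligned records.
toy_flags = [0] * 5 + [SECONDARY] * 2 + [UNMAPPED] * 3

print(count_default(toy_flags), count_ubam_proposed(toy_flags), count_ubam_current(toy_flags))
```

On the toy input the proposed --ubam count exceeds the default count by exactly the number of unaligned reads, which mirrors the 7416 versus 7516 difference reported for the 100 added reads.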
Another is whether you would rather represent all alignments in an aligned BAM as unaligned reads (like running samtools reset), but I think that calls for a different flag than --ubam.
samtools reset small-test-phased.bam | grep -v '^@^' | wc -l
6180
--ubam already functions properly, and will continue to, on a BAM that only contains unaligned reads, as you say, since there you never have any secondary alignments or soft-clipped bases.
Yeah, I think you are right; there are indeed two scenarios to keep in mind.
I agree a different flag makes most sense then...
When running with --ubam on a file with aligned reads (test-data/small-test-phased.bam), secondary alignments get a length of 0 and are included in the median read length calculation, the arrow output (out.arrow), etc., unless you also run with --min-read-len 1.
However, running with --ubam will no longer remove soft-clipped bases from the primary alignments. Compare the output to running without --ubam.
If secondary alignments are filtered out by default, should they not also be filtered out by default when running with --ubam? And should soft-clipped bases then not be removed from primary alignments when running with --ubam?
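The two length definitions in question can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: the helper names are hypothetical, and it assumes the zero-length secondary alignments come from records whose SEQ field is stored as '*', which the SAM format allows.

```python
import re

# One CIGAR operation: a count followed by an operator letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def sequenced_length(seq):
    """Length of the stored read; a SEQ of '*' means no sequence was stored."""
    return 0 if seq == "*" else len(seq)


def length_without_softclips(seq, cigar):
    """Stored read length minus the soft-clipped (S) bases in the CIGAR."""
    if seq == "*":
        return 0
    clipped = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "S")
    return len(seq) - clipped


primary = ("ACGTACGTAC", "2S6M2S")   # 10 stored bases, 4 of them soft-clipped
secondary = ("*", "6M")              # sequence omitted from the record
unaligned = ("ACGTAC", "*")          # no CIGAR at all

print(sequenced_length(primary[0]), length_without_softclips(*primary))
print(sequenced_length(secondary[0]), length_without_softclips(*secondary))
print(sequenced_length(unaligned[0]), length_without_softclips(*unaligned))
```

Under these assumptions the primary alignment has length 10 as sequenced and 6 after soft-clip removal, the secondary record has length 0 either way, and the unaligned read is unaffected, which is why filtering secondary records before computing lengths avoids the zero-length entries.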