wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0
273 stars 30 forks source link

Coverm filter inverse paired reads #159

Closed Rridley7 closed 1 year ago

Rridley7 commented 1 year ago

Hello, I had a question about the filtering module of coverm, I'm guessing the answer may have to do with the SAM flags within the files passed to coverm filter. When running coverm filter on a bam file (mapped with only paired reads) with the --inverse --proper-pairs-only flags, I would expect that the output file contains all singletons and unmapped pairs of the original bam file, and any other reads not passing filtering, thus being 'proper unmapped pairs'. However, when I look to output all of these unmapped reads to a fastq file using samtools fastq -1 r1.fq -2 r2.fq, I get an uneven number of reads in the forward and reverse file. Is there a reason for this?

I understand this issue may not be on the side of coverm, but rather on the samtools call, and can ask this elsewhere if that is the case. Thanks!

Example:

coverm filter --inverse --min-read-percent-identity 0.95 --min-read-aligned-percent 0.7 -t 24 --proper-pairs-only -b coverm-genome.S03_5e9863-t_1.fq.ump.bam  -o out_filtered.bam

samtools collate  -O out_filtered.bam | samtools fastq -1 r1.fq -2 r2.fq -

wc -l < r1.fq
#42333736
wc -l < r2.fq
#42487636
wwood commented 1 year ago

Hi,

Good point. The way coverm is interpreting this is that you want a BAM file with only proper pairs that don't pass the --min-read-percent-identity 0.95 --min-read-aligned-percent 0.7 thresholds. It doesn't "invert" the --proper-pairs flag.

I suggest taking a two step approach - grab reads that fail in the same way you did above, and then concatenate unpaired reads and unmapped reads with samtools and its flags for filtering.

Does that make sense? I understand this is a bit confusing, and the documentation could be better - will fix. ben

Rridley7 commented 1 year ago

Thanks for this quick and helpful response! I see, I think that makes sense. One follow up question, I guess I would then expect the bam file to still contain a matching number of reads in the forward and reverse file with this proper pairs flag, rather than having a different number between the files. I.e., if only proper pairs are passed to this output bam file, and the bam file contains output of properly paired reads which do not meet the identity and alignment requirements, should the fwd and rev fastq files have the same number of reads? Could you explain this piece a bit more?

wwood commented 1 year ago

Good q.

I think what is happening is that they are aligned as a proper pair, but then since you used thresholds which apply to a single side of each pair, then only one of the pair is present in the output BAM. Does that make sense?

I have to admit it has been a while since I looked at that code though. Perhaps it would be worth verifying by inspecting the mappings of now-singleton mappings in the original file?

Rridley7 commented 1 year ago

I see, yes that does make sense. I'll definitely inspect the singletons to see if that is the case. Thanks!