Potential bug in compute_fraglen.py

mrmilhaven commented 2 years ago

Describe the bug A clear and concise description of what the bug is. Fragment Length Distribution is incorrect, even when using NEAT's gen_reads.py

To Reproduce Steps to reproduce the behavior:

Create NEAT dataset with fragment length distribution as mean = 300 and SD = 10 (test.fa is 1Mb of random nucleotides, no Ns): python ~/NEAT/gen_reads.py -R 100 -r test.fa -c 10 --pe 300 10 -o NEAT_test

2.Map reads (I used BWA): bwa mem -M test.fa NEAT_test_read1.fq.gz NEAT_test_read2.fq.gz > NEAT_test.bam

Compute fragment distribution as sanity check: python ~/NEAT/utilities/compute_fraglen.py -i NEAT_test.bam -o NEAT_test
open pickle file: import pandas pandas.read_pickle("NEAT_test.p") [[100], [1.0]]

Expected behavior I would expect the same distribution that I gave gen_reads.py (mean of 300 with SD of 10) to come back out.

Potential Reason I think the bug has to do with the count_frags method: First 9 fields in BAM from BAM file: NEAT_test-test-1 99 test 9039 60 100M = 9254 315

First 9 fields of BAM from script(adding print line within count_frags for loop, approx. line 96): NEAT_test-test-1 99 0 9038 60 100M 0 9253 100

I believe that converting from an iterable pysam type to a string (line 95) causes some of the fields to change. This also implies that , since mate mapping(field 8) is always equal to the reference (field 3), every read will look like it was mate mapped, per your code (line 108).

I think this can be fixed by using pysams inherent parameters, such as item.mate_is_unmapped or item.template_length instead of splitting the line by tabs.

joshfactorial commented 2 years ago

Can you duplicate this bug on our main development repo: https://github.com/ncsa/NEAT/issues? That will help us keep things together. I'm currently investigating those utilities, so this is timely.

mrmilhaven commented 2 years ago

Duplicated

zstephens / neat-genreads

Potential bug in compute_fraglen.py #99