Describe the bug
A clear and concise description of what the bug is.
Fragment Length Distribution is incorrect, even when using NEAT's gen_reads.py
To Reproduce
Steps to reproduce the behavior:
Create NEAT dataset with fragment length distribution as mean = 300 and SD = 10 (test.fa is 1Mb of random nucleotides, no Ns):
python ~/NEAT/gen_reads.py -R 100 -r test.fa -c 10 --pe 300 10 -o NEAT_test
2.Map reads (I used BWA):
bwa mem -M test.fa NEAT_test_read1.fq.gz NEAT_test_read2.fq.gz > NEAT_test.bam
Compute fragment distribution as sanity check:
python ~/NEAT/utilities/compute_fraglen.py -i NEAT_test.bam -o NEAT_test
open pickle file:
import pandas
pandas.read_pickle("NEAT_test.p")
[[100], [1.0]]
Expected behavior
I would expect the same distribution that I gave gen_reads.py (mean of 300 with SD of 10) to come back out.
Potential Reason
I think the bug has to do with the count_frags method:
First 9 fields in BAM from BAM file:
NEAT_test-test-1 99 test 9039 60 100M = 9254 315
First 9 fields of BAM from script(adding print line within count_frags for loop, approx. line 96):
NEAT_test-test-1 99 0 9038 60 100M 0 9253 100
I believe that converting from an iterable pysam type to a string (line 95) causes some of the fields to change. This also implies that , since mate mapping(field 8) is always equal to the reference (field 3), every read will look like it was mate mapped, per your code (line 108).
I think this can be fixed by using pysams inherent parameters, such as item.mate_is_unmapped or item.template_length instead of splitting the line by tabs.
Can you duplicate this bug on our main development repo: https://github.com/ncsa/NEAT/issues? That will help us keep things together. I'm currently investigating those utilities, so this is timely.
Describe the bug A clear and concise description of what the bug is. Fragment Length Distribution is incorrect, even when using NEAT's gen_reads.py
To Reproduce Steps to reproduce the behavior:
2.Map reads (I used BWA): bwa mem -M test.fa NEAT_test_read1.fq.gz NEAT_test_read2.fq.gz > NEAT_test.bam
Compute fragment distribution as sanity check: python ~/NEAT/utilities/compute_fraglen.py -i NEAT_test.bam -o NEAT_test
open pickle file: import pandas pandas.read_pickle("NEAT_test.p") [[100], [1.0]]
Expected behavior I would expect the same distribution that I gave gen_reads.py (mean of 300 with SD of 10) to come back out.
Potential Reason I think the bug has to do with the count_frags method: First 9 fields in BAM from BAM file: NEAT_test-test-1 99 test 9039 60 100M = 9254 315
First 9 fields of BAM from script(adding print line within count_frags for loop, approx. line 96): NEAT_test-test-1 99 0 9038 60 100M 0 9253 100
I believe that converting from an iterable pysam type to a string (line 95) causes some of the fields to change. This also implies that , since mate mapping(field 8) is always equal to the reference (field 3), every read will look like it was mate mapped, per your code (line 108).
I think this can be fixed by using pysams inherent parameters, such as item.mate_is_unmapped or item.template_length instead of splitting the line by tabs.