zstephens / neat-genreads

NEAT read simulation tools
Other
96 stars 27 forks source link

trouble with computeGC.py #71

Open RichardCorbett opened 4 years ago

RichardCorbett commented 4 years ago

Hi there,

I ran this, which gave me a very large file with depths at each genomic location: /home/rcorbett/bin/bedtools2//bin/genomeCoverageBed -d -ibam GM12878_DNA_1.bam -g /projects/alignment_references/Homo_sapiens/hg19a/genome/bwa_64/hg19a.fa > depths.txt

and then this command python ~/bin/neat-genreads/utilities/computeGC.py -i depths.txt -r /projects/alignment_references/Homo_sapiens/hg19a/genome/bwa_64/hg19a.fa -o GC.p

which printed many lines but finished with these:

88.00% 0.0 0
90.00% 0.0 0
92.00% 0.0 0
94.00% 0.0 0
96.00% 0.0 0
98.00% 0.0 0
100.00% 0.0 0
Traceback (most recent call last):
  File "/home/rcorbett/bin/neat-genreads/utilities/computeGC.py", line 103, in <module>
    avgCov = allMean/float(runningTot)
ZeroDivisionError: float division by zero

I'm sure I'm doing something wrong, can you help me out? thanks Richard

zstephens commented 4 years ago

Greetings,

Do the reference contig names in the depths.txt file exactly match those found in hg19a.fa?

RichardCorbett commented 4 years ago

Wow. Thanks for the quick check in.

Chromosome header lines in the fasta have other strings. Might be the problem?

>1 CM000663.1 Homo sapiens chromosome 1, GRCh37 primary reference assembly
>2 CM000664.1 Homo sapiens chromosome 2, GRCh37 primary reference assembly
>3 CM000665.1 Homo sapiens chromosome 3, GRCh37 primary reference assembly
>4 CM000666.1 Homo sapiens chromosome 4, GRCh37 primary reference assembly
head depths.txt
1   1   0
1   2   0
1   3   0
1   4   0
1   5   0
zstephens commented 4 years ago

Indeed, computeGC.py expects the refnames to be identical (and if none of the lines in depths.txt have a valid name, nothing gets counted, leading to the eventual division by zero error).

I've run into this before. I think bedtools truncates reference names after the first space (but only if they're above some length? I'm not sure exactly). I've considered implementing bandaid logic in computeGC.py: e.g. if a contig name from depths.txt isn't found in the reference fasta, maybe try again but with a trimmed name. But I started thinking up all the pathological ways that could introduce bugs: e.g. if you fed it a reference with contigs like 1 CM000663.1 and 1 CM000664.1, how to handle trimming down to the same string?

Since the script itself instructs the user to use bedtools, it debatably has a responsibility to account for this kind of quirk. I'll dig into it a bit more and see if I can find the precise logic they use for trimming. In the meantime, the issue could be resolved by find-replacing the contig names in your hg19a.fa file with their simplified equivalents.

RichardCorbett commented 4 years ago

Thanks @zstephens ,

I suppose one way around this would be to stick with the fasta spec, not specific as it is. This post suggests the first string before a space is the "ID" but its far from definitive: https://www.biostars.org/p/11254/#11258

I stripped the strings after the first space in my fasta and computeGC.py worked as advertised. thanks Richard