Open RichardCorbett opened 4 years ago
Greetings,
Do the reference contig names in the depths.txt file exactly match those found in hg19a.fa?
Wow. Thanks for the quick check in.
Chromosome header lines in the fasta have other strings. Might be the problem?
>1 CM000663.1 Homo sapiens chromosome 1, GRCh37 primary reference assembly
>2 CM000664.1 Homo sapiens chromosome 2, GRCh37 primary reference assembly
>3 CM000665.1 Homo sapiens chromosome 3, GRCh37 primary reference assembly
>4 CM000666.1 Homo sapiens chromosome 4, GRCh37 primary reference assembly
head depths.txt
1 1 0
1 2 0
1 3 0
1 4 0
1 5 0
Indeed, computeGC.py expects the refnames to be identical (and if none of the lines in depths.txt have a valid name, nothing gets counted, leading to the eventual division by zero error).
I've run into this before. I think bedtools truncates reference names after the first space (but only if they're above some length? I'm not sure exactly). I've considered implementing bandaid logic in computeGC.py: e.g. if a contig name from depths.txt isn't found in the reference fasta, maybe try again but with a trimmed name. But I started thinking up all the pathological ways that could introduce bugs: e.g. if you fed it a reference with contigs like 1 CM000663.1
and 1 CM000664.1
, how to handle trimming down to the same string?
Since the script itself instructs the user to use bedtools, it debatably has a responsibility to account for this kind of quirk. I'll dig into it a bit more and see if I can find the precise logic they use for trimming. In the meantime, the issue could be resolved by find-replacing the contig names in your hg19a.fa file with their simplified equivalents.
Thanks @zstephens ,
I suppose one way around this would be to stick with the fasta spec, not specific as it is. This post suggests the first string before a space is the "ID" but its far from definitive: https://www.biostars.org/p/11254/#11258
I stripped the strings after the first space in my fasta and computeGC.py worked as advertised. thanks Richard
Hi there,
I ran this, which gave me a very large file with depths at each genomic location:
/home/rcorbett/bin/bedtools2//bin/genomeCoverageBed -d -ibam GM12878_DNA_1.bam -g /projects/alignment_references/Homo_sapiens/hg19a/genome/bwa_64/hg19a.fa > depths.txt
and then this command
python ~/bin/neat-genreads/utilities/computeGC.py -i depths.txt -r /projects/alignment_references/Homo_sapiens/hg19a/genome/bwa_64/hg19a.fa -o GC.p
which printed many lines but finished with these:
I'm sure I'm doing something wrong, can you help me out? thanks Richard