Open robertwhbaldwin opened 3 years ago
Hi, Based on the size of the haploid genome at 650Mb, your coverage estimate should be set to 11x and not 5.5x. However, it would be great if you could you clarify the point about ploidy of the genome? Is your genome a haploid or diploid in nature? Hardip
Thanks for the quick response. It is biologically diploid. I think the haploid size is 650 Mb (that's what's reported on ncbi by flow cytometry) I'll try it with 11. The example I provided (above) had it set to 9x. So would it really fail like this if I was off by 2x? The genome size is only an estimate. Could be 700 Mb for example.
We use findGSE under the hood to get the read-depth estimates of heterozygous k-mers. Typically if the genome has a read depth of 10x, the heterozygous k-mers should have a read-depth of 5x. However, heterozygosity rate in the genome, sequencing error rates etc would come into play and affect calculations of identifying the clear range for heterozygous k-mer peak. Generally we recommend read-depth of ~30 for best results. However, I would love to see your results with 11x read depth. We haven't tested it fully with low read-depths and your results would help the community to decide if the software is able to handle low read depth.
It didn't work at 11 either. I'm going to try every int between 4 and 15 before giving up completely. Below is the full log for the attempt at 11 X coverage. Is there any information here that might be helpful in terms of solving this problem? Thanks - Robert
/usr/bin/python3 /home/robert/tools/kmer2snp.py population --k 31 --cfile ../cov11 --faqfile ../fastq
Namespace(b=None, cfile='../cov11', faqfile='../fastq', k='31', output_dir=None, ref=None, which='population')
debug ./
/data/MAHI/results/F003
sh /home/robert/tools/Kmer2SNP/script/run_dsk.sh 31 11 /data//F003/F-003-_S82_L002_R1_001.fastq.gz,/data//F003/F-003-_S82_L002_R2_001.fastq.gz
[DSK: nb solid kmers: 634411108 ] 120 % elapsed: 8 min 33 sec remaining: 0 min 0 sec cpu: 792.9 % mem: [ 167, 4326, 4327] MB
config
kmer_size : 31
mini_size : 10
solidity_kind : sum
abundance_min : 2
abundance_max : 2147483647
available_space : 440529
estimated_sequence_number : 39344243
estimated_sequence_volume : 5474
estimated_kmers_number : 4524587945
estimated_kmers_volume : 34519
max_disk_space : 438529
max_memory : 5000
nb_passes : 1
nb_partitions : 50
nb_bits_per_kmer : 64
nb_cores : 10
minimizer_type : lexicographic (kmc2 heuristic)
repartition_type : unordered
nb_cores_per_partition : 1
nb_partitions_in_parallel : 10
nb_cached_items_per_core_per_part : 131072
nb_banks : 2
dsk
bank
bank_uri : /data//F_003_i/F-003-.fastq.gz,/data//F_003_i/F-003-_S82_L002_R2_001.fastq.gz
bank_size : 14439111736
bank_total_nt : 6825942985
sequences
seq_number : 46825672
seq_size_min : 35
seq_size_max : 151
seq_size_mean : 145.8
seq_size_deviation : 18.7
kmers
kmers_nb_valid : 5419760084
kmers_nb_invalid : 1412741
stats
temp_files
nb_superkmers : 492327595
avg_superk_length : 11.01
minimizer_density : 2.09
totalsize(MB) : 5409
tmp_filebiggest(MB) : 185
tmp_filesmallest(MB) : 98
tmp_filemean(MB) : 108.2
histogram
cutoff : 4
nb_ge_cutoff : 299723344
ratio_weak_volume : 0.26
first_peak : 325
kmers
solidity_kind : sum
thresholds : 2 2
kmers_nb_distinct : 1260218289
kmers_nb_solid : 634411108
kmers_nb_weak : 625807181
kmers_percent_weak : 49.7
partitions
nb_partitions : 50
nb_items : 634411108
part_biggest : 17378595
part_smallest : 11732855
part_mean : 12688222.2
kind
vector : 50
fillsolid_time : 249.380
1.read : 37.396
2.sort : 53.160
3.dump : 158.824
time : 512.983
fill_partitions : 202.766
fill_solid_kmers : 310.217
[parsing ] 100 % elapsed: 2 min 1 sec remaining: 0 min 0 sec
stats
kmer_size : 31
nb_kmers : 634411108
sh /home/robert/tools/Kmer2SNP/script/run_findgse.sh 31 11 /data//F003/F-003-_S82_L002_R1_001.fastq.gz,/data//F003/F-003-_S82_L002_R2_001.fastq.gz
Error in findGSE(histo = args[1], sizek = args[2], outdir = args[3], exp_hom = strtoi(args[4])) :
No k-mer freq peak was found with the cutoff given by -exp_hom.
You may need to increase the cutoff!
In addition: Warning message:
In dir.create(file.path(path), showWarnings = T) : '.' already exists
Execution halted
cat: 'v1*txt': No such file or directory
Traceback (most recent call last):
File "/home/robert/tools/kmer2snp.py", line 129, in
Still no luck. But there is some output in the results folder, including this histogram (below). It says there's a "main peak at kmer freq 5". Can someone explain what this means? What is this telling me about what I should be using as cov? I've already tried 5 and 10. Thanks - Robert
Do you know the heterozygous rate of this species? And I notice you use population mode, how many samples do you have and the estimated coverage for each of them?
It seems for this low coverage data, findGSE cannot estimate the coverage range of heterozygous kmer. You can skip findGSE step and provide coverage range by yourself. For each sample, in the sample folder, you need to write a hete.para file.
For example, write the following three lines heterozygous rate is 0.00070857 heterozygous lowCoverage is 14 heterozygous highCoverage is 45 in sample1/hete.para.
But base on the histogram you provided, I cannot estimate the coverage range.
Thanks for your answers and advice. I have 47 samples all at around 10X coverage (range 8-14) based on the estimated genome size and the number of bases in the fastq files. I don't know about the heterozygous rate but it's going to be greater than for human (it's a wild fish species).
I was getting the error below. The reason for this error was stated to be incorrect coverage given as input. I thought that I had estimated coverage properly. My genome is ~650 Mb. For the sample failing I have 150 bp reads X (23 X 2 million reads) = 6.9 billion base. So 6.9 billion divided by 650 million ~ 11 X coverage. But the genome is a haploid genome. So the diploid coverage would be about 11/2 or 5.5. Is this right? Can someone please clear this up? Thanks
sh /home/robert/tools/Kmer2SNP/script/run_findgse.sh 31 9 /data/MAHI/F003/F-003-_S82_L002_R1_001.fastq.gz,/data/MAHI/F003/F-003i_S82_L002_R2_001.fastq.gz Error in findGSE(histo = args[1], sizek = args[2], outdir = args[3], exp_hom = strtoi(args[4])) : No k-mer freq peak was found with the cutoff given by -exp_hom. You may need to increase the cutoff! In addition: Warning message: In dir.create(file.path(path), showWarnings = T) : '.' already exists Execution halted cat: 'v1*txt': No such file or directory Traceback (most recent call last): File "/home/robert/tools/Kmer2SNP/kmer2snp.py", line 129, in
run_population(args)
File "/home/robert/tools/Kmer2SNP/kmer2snp.py", line 48, in run_population
run_single(args.k, covMap[sampleID], faqFile, args.b)
File "/home/robert/tools/Kmer2SNP/kmer2snp.py", line 70, in run_single
c1, c2, r = tools.read_findGSE_result("hete.para")
File "/home/robert/tools/Kmer2SNP/libprism/local/tools.py", line 271, in read_findGSE_result
r = float(words[3])
IndexError: list index out of range