yanboANU / Kmer2SNP

22 stars 2 forks source link

what is homozygous coverage? #10

Open robertwhbaldwin opened 3 years ago

robertwhbaldwin commented 3 years ago

I was getting the error below. The reason for this error was stated to be incorrect coverage given as input. I thought that I had estimated coverage properly. My genome is ~650 Mb. For the sample failing I have 150 bp reads X (23 X 2 million reads) = 6.9 billion base. So 6.9 billion divided by 650 million ~ 11 X coverage. But the genome is a haploid genome. So the diploid coverage would be about 11/2 or 5.5. Is this right? Can someone please clear this up? Thanks

sh /home/robert/tools/Kmer2SNP/script/run_findgse.sh 31 9 /data/MAHI/F003/F-003-_S82_L002_R1_001.fastq.gz,/data/MAHI/F003/F-003i_S82_L002_R2_001.fastq.gz Error in findGSE(histo = args[1], sizek = args[2], outdir = args[3], exp_hom = strtoi(args[4])) : No k-mer freq peak was found with the cutoff given by -exp_hom. You may need to increase the cutoff! In addition: Warning message: In dir.create(file.path(path), showWarnings = T) : '.' already exists Execution halted cat: 'v1*txt': No such file or directory Traceback (most recent call last): File "/home/robert/tools/Kmer2SNP/kmer2snp.py", line 129, in run_population(args) File "/home/robert/tools/Kmer2SNP/kmer2snp.py", line 48, in run_population run_single(args.k, covMap[sampleID], faqFile, args.b) File "/home/robert/tools/Kmer2SNP/kmer2snp.py", line 70, in run_single c1, c2, r = tools.read_findGSE_result("hete.para") File "/home/robert/tools/Kmer2SNP/libprism/local/tools.py", line 271, in read_findGSE_result r = float(words[3]) IndexError: list index out of range

hp2048 commented 3 years ago

Hi, Based on the size of the haploid genome at 650Mb, your coverage estimate should be set to 11x and not 5.5x. However, it would be great if you could you clarify the point about ploidy of the genome? Is your genome a haploid or diploid in nature? Hardip

robertwhbaldwin commented 3 years ago

Thanks for the quick response. It is biologically diploid. I think the haploid size is 650 Mb (that's what's reported on ncbi by flow cytometry) I'll try it with 11. The example I provided (above) had it set to 9x. So would it really fail like this if I was off by 2x? The genome size is only an estimate. Could be 700 Mb for example.

hp2048 commented 3 years ago

We use findGSE under the hood to get the read-depth estimates of heterozygous k-mers. Typically if the genome has a read depth of 10x, the heterozygous k-mers should have a read-depth of 5x. However, heterozygosity rate in the genome, sequencing error rates etc would come into play and affect calculations of identifying the clear range for heterozygous k-mer peak. Generally we recommend read-depth of ~30 for best results. However, I would love to see your results with 11x read depth. We haven't tested it fully with low read-depths and your results would help the community to decide if the software is able to handle low read depth.

robertwhbaldwin commented 3 years ago

It didn't work at 11 either. I'm going to try every int between 4 and 15 before giving up completely. Below is the full log for the attempt at 11 X coverage. Is there any information here that might be helpful in terms of solving this problem? Thanks - Robert

/usr/bin/python3 /home/robert/tools/kmer2snp.py population --k 31 --cfile ../cov11 --faqfile ../fastq Namespace(b=None, cfile='../cov11', faqfile='../fastq', k='31', output_dir=None, ref=None, which='population') debug ./ /data/MAHI/results/F003 sh /home/robert/tools/Kmer2SNP/script/run_dsk.sh 31 11 /data//F003/F-003-_S82_L002_R1_001.fastq.gz,/data//F003/F-003-_S82_L002_R2_001.fastq.gz [DSK: nb solid kmers: 634411108 ] 120 % elapsed: 8 min 33 sec remaining: 0 min 0 sec cpu: 792.9 % mem: [ 167, 4326, 4327] MB config
kmer_size : 31 mini_size : 10 solidity_kind : sum abundance_min : 2 abundance_max : 2147483647 available_space : 440529 estimated_sequence_number : 39344243 estimated_sequence_volume : 5474 estimated_kmers_number : 4524587945 estimated_kmers_volume : 34519 max_disk_space : 438529 max_memory : 5000 nb_passes : 1 nb_partitions : 50 nb_bits_per_kmer : 64 nb_cores : 10 minimizer_type : lexicographic (kmc2 heuristic) repartition_type : unordered nb_cores_per_partition : 1 nb_partitions_in_parallel : 10 nb_cached_items_per_core_per_part : 131072 nb_banks : 2 dsk
bank
bank_uri : /data//F_003_i/F-003-.fastq.gz,/data//F_003_i/F-003-_S82_L002_R2_001.fastq.gz bank_size : 14439111736 bank_total_nt : 6825942985 sequences
seq_number : 46825672 seq_size_min : 35 seq_size_max : 151 seq_size_mean : 145.8 seq_size_deviation : 18.7 kmers
kmers_nb_valid : 5419760084 kmers_nb_invalid : 1412741 stats
temp_files
nb_superkmers : 492327595 avg_superk_length : 11.01 minimizer_density : 2.09 totalsize(MB) : 5409 tmp_filebiggest(MB) : 185 tmp_filesmallest(MB) : 98 tmp_filemean(MB) : 108.2 histogram
cutoff : 4 nb_ge_cutoff : 299723344 ratio_weak_volume : 0.26 first_peak : 325 kmers
solidity_kind : sum thresholds : 2 2 kmers_nb_distinct : 1260218289 kmers_nb_solid : 634411108 kmers_nb_weak : 625807181 kmers_percent_weak : 49.7 partitions
nb_partitions : 50 nb_items : 634411108 part_biggest : 17378595 part_smallest : 11732855 part_mean : 12688222.2 kind
vector : 50 fillsolid_time : 249.380 1.read : 37.396 2.sort : 53.160 3.dump : 158.824 time : 512.983 fill_partitions : 202.766 fill_solid_kmers : 310.217 [parsing ] 100 % elapsed: 2 min 1 sec remaining: 0 min 0 sec stats
kmer_size : 31 nb_kmers : 634411108 sh /home/robert/tools/Kmer2SNP/script/run_findgse.sh 31 11 /data//F003/F-003-_S82_L002_R1_001.fastq.gz,/data//F003/F-003-_S82_L002_R2_001.fastq.gz Error in findGSE(histo = args[1], sizek = args[2], outdir = args[3], exp_hom = strtoi(args[4])) : No k-mer freq peak was found with the cutoff given by -exp_hom. You may need to increase the cutoff! In addition: Warning message: In dir.create(file.path(path), showWarnings = T) : '.' already exists Execution halted cat: 'v1*txt': No such file or directory Traceback (most recent call last): File "/home/robert/tools/kmer2snp.py", line 129, in run_population(args) File "/home/robert/tools/kmer2snp.py", line 48, in run_population run_single(args.k, covMap[sampleID], faqFile, args.b) File "/home/robert/tools/kmer2snp.py", line 70, in run_single c1, c2, r = tools.read_findGSE_result("hete.para") File "/home/robert/tools/Kmer2SNP/libprism/local/tools.py", line 271, in read_findGSE_result r = float(words[3]) IndexError: list index out of range

robertwhbaldwin commented 3 years ago

Still no luck. But there is some output in the results folder, including this histogram (below). It says there's a "main peak at kmer freq 5". Can someone explain what this means? What is this telling me about what I should be using as cov? I've already tried 5 and 10. Thanks - Robert

Screenshot from 2021-08-11 11-30-24

yanboANU commented 3 years ago

Do you know the heterozygous rate of this species? And I notice you use population mode, how many samples do you have and the estimated coverage for each of them?

yanboANU commented 3 years ago

It seems for this low coverage data, findGSE cannot estimate the coverage range of heterozygous kmer. You can skip findGSE step and provide coverage range by yourself. For each sample, in the sample folder, you need to write a hete.para file.

For example, write the following three lines heterozygous rate is 0.00070857 heterozygous lowCoverage is 14 heterozygous highCoverage is 45 in sample1/hete.para.

But base on the histogram you provided, I cannot estimate the coverage range.

robertwhbaldwin commented 3 years ago

Thanks for your answers and advice. I have 47 samples all at around 10X coverage (range 8-14) based on the estimated genome size and the number of bases in the fastq files. I don't know about the heterozygous rate but it's going to be greater than for human (it's a wild fish species).