ythuang0522 / homopolish

High-quality Nanopore-only genome polisher
GNU General Public License v3.0
65 stars 12 forks source link

homopolish modpolish Index error #47

Open giriarteS opened 2 years ago

giriarteS commented 2 years ago

Can modpolish be used after racon+medaka? I am trying to polish my assembly (flye+6 rounds of racon+medaka) with modpolish but I get the following error: "python3 /mnt/DATA/git_repos/homopolish/homopolish.py modpolish -a medaka/consensus.fasta -q /mnt/Backup/Genomes_test/raw_dna/minion/F.verticillioides/MRC826E/MRC826E_trim_fastq.gz -s /mnt/DATA/git_repos/homopolish/fungi.msh -o homopolish -t 50

Downloaded GCA_003316995.2_ASM331699v2_genomic.fna.gz Downloaded GCA_003316975.2_ASM331697v2_genomic.fna.gz Downloaded GCA_011035275.1_ASM1103527v1_genomic.fna.gz Downloaded GCA_011036225.1_ASM1103622v1_genomic.fna.gz Downloaded GCA_000149555.1_ASM14955v1_genomic.fna.gz Downloaded GCF_000149555.1_ASM14955v1_genomic.fna.gz Downloaded GCA_003317015.2_ASM331701v2_genomic.fna.gz Downloaded GCA_011036015.1_ASM1103601v1_genomic.fna.gz Downloaded GCA_011036905.1_ASM1103690v1_genomic.fna.gz

Reference = [/mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_003316995.2_ASM331699v2_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_003316975.2_ASM331697v2_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_011035275.1_ASM1103527v1_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_011036225.1_ASM1103622v1_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_000149555.1_ASM14955v1_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCF_000149555.1_ASM14955v1_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_003317015.2_ASM331701v2_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_011036015.1_ASM1103601v1_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_011036905.1_ASM1103690v1_genomic.fna.gz.fna.gz] Query = [medaka/consensus.fasta] Kmer size = 16 Fragment length = 3000 Threads = 1 ANI output file = homopolish/debug/ANI.txt

INFO [thread 0], skch::main, Count of threads executing parallel_for : 1 INFO [thread 0], skch::Sketch::build, window size for minimizer sampling = 24 INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 30851510 INFO [thread 0], skch::Sketch::index, unique minimizers = 4853959 INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 1098191) ... (7088, 1) INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup. INFO [thread 0], skch::main, Time spent sketching the reference : 26.7354 sec INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 75.2743 sec INFO [thread 0], skch::main, Time spent post mapping : 0.0381457 sec INFO [thread 0], skch::main, ready to exit the loop INFO, skch::main, parallel_for execution finished

load Homo

Reference = [homopolish/debug/All_homologous_sequences.fna.gz] Query = [medaka/consensus.fasta] Kmer size = 16 Fragment length = 3000 Threads = 1 ANI output file = homopolish/debug/truth_ANI.txt

INFO [thread 0], skch::main, Count of threads executing parallel_for : 1 INFO [thread 0], skch::Sketch::build, window size for minimizer sampling = 24 INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 30851510 INFO [thread 0], skch::Sketch::index, unique minimizers = 4853959 INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 1098191) ... (7088, 1) INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup. INFO [thread 0], skch::main, Time spent sketching the reference : 26.885 sec INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 75.9234 sec INFO [thread 0], skch::main, Time spent post mapping : 0.0205193 sec INFO [thread 0], skch::main, ready to exit the loop INFO, skch::main, parallel_for execution finished [M::mm_idx_gen::0.8991.00] collected minimizers [M::mm_idx_gen::1.0612.05] sorted minimizers [M::main::1.0612.05] loaded/built the index for 83 target sequence(s) [M::mm_mapopt_update::1.1341.98] mid_occ = 100 [M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 83 [M::mm_idx_stat::1.1761.94] distinct minimizers: 4248851 (99.24% are singletons); average occurrences: 1.018; average spacing: 9.991 [M::worker_pipeline::19.96514.14] mapped 49912 sequences [M::main] Version: 2.17-r941 [M::main] CMD: minimap2 -cx asm5 --cs=long -t 50 medaka/consensus.fasta homopolish/debug/All_homologous_sequences.fna.gz [M::main] Real time: 20.085 sec; CPU: 282.508 sec; Peak RSS: 4.439 GB 127.97561192512512 Traceback (most recent call last): File "/mnt/DATA/git_repos/homopolish/homopolish.py", line 102, in main() File "/mnt/DATA/git_repos/homopolish/homopolish.py", line 77, in main getPos(fixData,FLAGS.debug) File "/mnt/DATA/git_repos/homopolish/modules/mod_polish.py", line 80, in getPos H_misAry,H_AllAry = getPileUpAry(fixData,fixData.output_dir+"debug",fixData.output_dir+"debug/All_homologous_sequences.fna.gz")
File "/mnt/DATA/git_repos/homopolish/modules/mod_polish.py", line 146, in getPileUpAry misAry,posData_ary = MismatchPileup(fixData.output_dir+"debug/truth.paf",len(fixData.seq))#get SNP position ATCG array File "/mnt/DATA/git_repos/homopolish/modules/mod_polish.py", line 440, in MismatchPileup coverage[start_pos] += 1 IndexError: index 2852431 is out of bounds for axis 0 with size 2852431"

Gloria

ythuang0522 commented 2 years ago

Hi @giriarteS,

We haven't tested modpolish on fungi yet. Have you ever successfully run the original homopolish polish? If so, can you give us your fasta genome and fastq reads for debugging? modpolish is a new feature for genomes extensively edited by modifications untrained in the ONT basecalling. If your genomes are not extensively modified, you don't need to run modpolish.

giriarteS commented 2 years ago

I run homopolish polish but I got the following for each contig: python3 /mnt/DATA/git_repos/homopolish/homopolish.py polish -a medaka/consensus.fasta -m R9.4.pkl -s /mnt/DATA/git_repos/homopolish/fungi.msh -o homopolish -t 50

[2022/05/25 17:24] INFO: Stage: Select closely-related genomes TIME Select closely-related genomes: 0 MINS 0 SECS. This contig contig_75 closely-related genome is less than 5, not to polish... [2022/05/25 17:24] INFO: RUN-ID: contig_197

giriarteS commented 2 years ago

I successfully run homopolish polish: homopolish.py polish -a medaka/consensus.fasta -m R9.4.pkl -l reference/F.verticillioides/GCF_000149555.1_ASM14955v1_genomic.fna.gz -o homopolish -t 25

ythuang0522 commented 2 years ago

We found the error was due to an insufficient number of related genomes (<5), a bug in modpolish but not original homopolish. It looks to us there are six in GenBank but only one in RefSeq. Therefore, as you have done, the local db option -l can specify the related genomes downloaded on your own. However, I would suggest you concatenate all the six related genomes into one fasta not just provided only one.

giriarteS commented 2 years ago

I polished my genomes with several rounds of racon, then medaka, then homopolish. If I want to use illumina reads to polish with Pilon, should homopolish be run before or after pilon?

ythuang0522 commented 2 years ago

It was trained on Medaka-polished genomes. We would recommend running homopolish right after Medaka and then followed by pilon.