ythuang0522 / homopolish

High-quality Nanopore-only genome polisher
GNU General Public License v3.0
65 stars 12 forks source link

bacterial isolates: less than 5 closely-related genomes error message & problems with -g (genus) #44

Open Dwrgi opened 2 years ago

Dwrgi commented 2 years ago

Hi, I have just installed homopolish and was trying to run it on assemblies of a set of Polaromonas isolates, but I keep getting the 'less than 5 closely-related genomes' error message and I don't understand why?

The sizes of the contigs suggest to me that I have one whole (circular) chromosom and one or a few plasmids, so there ought to be at least one contig that matches against Polarmononas (or some bacterium at least!) in the database.

Assemblies are from Nanopore MinIon sequences that have been through Flye assembly+polish and 1x Medaka polish.


My run:

(homopolish) sara@VB-ubuntu:~$ python3 homopolish/homopolish.py polish -a Polaromonas_isolates/medaka_polished_assemblies/Pol_01_consensus.fasta -s homopolish/bacteria.msh -m homopolish/R9.4.pkl -o Polaromonas_isolates/homopolished_assemblies

The resulting output:

[2022/05/03 13:19] INFO: RUN-ID: contig_3 contig_3 /home/sara/Polaromonas_isolates/homopolished_assemblies/debug [2022/05/03 13:19] INFO: Stage: Select closely-related genomes TIME Select closely-related genomes: 1 MINS 26 SECS. This contig contig_3 closely-related genome is less than 5, not to polish...

[2022/05/03 13:20] INFO: RUN-ID: contig_2 contig_2 /home/sara/Polaromonas_isolates/homopolished_assemblies/debug [2022/05/03 13:20] INFO: Stage: Select closely-related genomes TIME Select closely-related genomes: 0 MINS 14 SECS. This contig contig_2 closely-related genome is less than 5, not to polish...

[2022/05/03 13:20] INFO: RUN-ID: contig_1 contig_1 /home/sara/Polaromonas_isolates/homopolished_assemblies/debug [2022/05/03 13:20] INFO: Stage: Select closely-related genomes TIME Select closely-related genomes: 0 MINS 17 SECS. This contig contig_1 closely-related genome is less than 5, not to polish... TIME Total: 1 MINS 58 SECS.

I also tried using the -g option and specifying the genus as Polarmonas but that threw up an error message that I don't know how to handle:

(homopolish) sara@VB-ubuntu:~$ python3 homopolish/homopolish.py polish -a Polaromonas_isolates/medaka_polished_assemblies/Pol_01_consensus.fasta -g Polaromonas -m homopolish/R9.4.pkl -o Polaromonas_isolates/homopolished_test

Traceback (most recent call last): File "homopolish/homopolish.py", line 64, in main() File "homopolish/homopolish.py", line 46, in main FLAGS.output_dir, FLAGS.minimap_args, FLAGS.mash_threshold, FLAGS.download_contig_nums, FLAGS.debug, FLAGS.meta, FLAGS.local_DB_path,FLAGS.coverage,FLAGS.distance) File "/home/sara/homopolish/modules/polish_interface.py", line 326, in polish_genome out = genus_species_polish(out, assembly_name, output_dir_debug,mash_screen, assembly, model_path, sketch_path, genus_species, threads, output_dir, minimap_args, mash_threshold, download_contig_nums, debug, meta) TypeError: genus_species_polish() missing 2 required positional arguments: 'coverage' and 'distance'

Thank you!

ythuang0522 commented 2 years ago

Hi @Dwrgi, homopolish first searches against the RefSeq for closely-related genomes (>95% Mash identity). If less than five it won't polish the genome. We suspect the phylogenetic distances between your isolate and other Polarmononas genomes in RefSeq are below this threshold. You might consider relaxing the threshold to 0.9.

  --mash_threshold MASH_THRESHOLD
                        Mash output threshold. [0.95]

If still less than five, the --genus is indeed one way to skip the similarity search. The error was due to a bug in this module without handling new parameters added in the last version. Will push a fixed version soon to reenable this genus-specified module.

Dwrgi commented 2 years ago

Thank you, Yao-Ting! I had to go down to 0.8 for it to work, but at least now I know that I can run it successfully. I guess I will have to wait for --genus to be sorted.

I ran into another problem, which I think is a bug: Running homopolish from Home

(homopolish) sara@VB-ubuntu:~$ python3 homopolish/homopolish.py polish -a Polaromonas_isolates/medaka_polished_assemblies/Pol_01_consensus.fasta -s homopolish/bacteria.msh -m homopolish/R9.4.pkl -o Polaromonas_isolates/homopolished_0-80 --mash_threshold 0.80

I go the following error message (more output below for context)

FileNotFoundError: [Errno 2] No such file or directory: '/home/sara/homopolish/homopolish/R9.4.pkl'

Moving into ~/homopolish and running it from there solved the problem, so it looks like the code adds an extra /homopolish/ somehow.

This worked:

(homopolish) sara@VB-ubuntu:~/homopolish$ python3 homopolish.py polish -a ~/Polaromonas_isolates/medaka_polished_assemblies/Pol_01_consensus.fasta -s bacteria.msh -m R9.4.pkl -o ~/Polaromonas_isolates/homopolished_0-80 --mash_threshold 0.80

More of the output leading up to the error for context:

[...] [2022/05/05 12:20] INFO: RUN-ID: contig_1 contig_1 /home/sara/Polaromonas_isolates/homopolished_0-80/debug [2022/05/05 12:20] INFO: Stage: Select closely-related genomes TIME Select closely-related genomes: 0 MINS 17 SECS. [2022/05/05 12:21] INFO: Stage: Download closely-related genomes INFO: 15 homologous sequence need to download: Downloaded GCF_000015505.1_ASM1550v1_genomic.fna.gz Downloaded GCF_000709345.1_Polaromonas_sp._genomic.fna.gz Downloaded GCF_000751355.1_Polaromonas_sp_strain_CG9_12_genomic.fna.gz Downloaded GCF_000282655.1_Polaromonas.strCF318_v1.0_genomic.fna.gz Downloaded GCF_001598235.1_ASM159823v1_genomic.fna.gz Downloaded GCF_003711205.1_ASM371120v1_genomic.fna.gz Downloaded GCF_009664225.1_ASM966422v1_genomic.fna.gz Downloaded GCF_900103405.1_IMG-taxon_2636416056_annotated_assembly_genomic.fna.gz Downloaded GCF_900112285.1_IMG-taxon_2609459740_annotated_assembly_genomic.fna.gz Downloaded GCF_900116715.1_IMG-taxon_2615840640_annotated_assembly_genomic.fna.gz Downloaded GCF_000013865.1_ASM1386v1_genomic.fna.gz Downloaded GCF_001476815.1_ASM147681v1_genomic.fna.gz Downloaded GCF_003570885.1_ASM357088v1_genomic.fna.gz Downloaded GCF_900107605.1_IMG-taxon_2675903230_annotated_assembly_genomic.fna.gz Downloaded GCF_900113035.1_IMG-taxon_2634166330_annotated_assembly_genomic.fna.gz

Reference = [/home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_000751355.1_Polaromonas_sp_strain_CG9_12_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_000709345.1_Polaromonas_sp._genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_000015505.1_ASM1550v1_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_000282655.1_Polaromonas.strCF318_v1.0_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_001598235.1_ASM159823v1_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_003711205.1_ASM371120v1_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_009664225.1_ASM966422v1_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_900103405.1_IMG-taxon_2636416056_annotated_assembly_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_900112285.1_IMG-taxon_2609459740_annotated_assembly_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_900116715.1_IMG-taxon_2615840640_annotated_assembly_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_000013865.1_ASM1386v1_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_001476815.1_ASM147681v1_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_003570885.1_ASM357088v1_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_900107605.1_IMG-taxon_2675903230_annotated_assembly_genomic.fna.gz.fna.gz, /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/homologous_sequences/GCF_900113035.1_IMG-taxon_2634166330_annotated_assembly_genomic.fna.gz.fna.gz] Query = [/home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/contig_1.fasta] Kmer size = 16 Fragment length = 3000 Threads = 1 ANI output file = /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/ANI.txt

INFO [thread 0], skch::main, Count of threads executing parallel_for : 1 INFO [thread 0], skch::Sketch::build, window size for minimizer sampling = 24 INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 5898061 INFO [thread 0], skch::Sketch::index, unique minimizers = 4182496 INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 3247755) ... (284, 1) INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup. INFO [thread 0], skch::main, Time spent sketching the reference : 12.9034 sec INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 8.37616 sec INFO [thread 0], skch::main, Time spent post mapping : 0.00503978 sec INFO [thread 0], skch::main, ready to exit the loop INFO, skch::main, parallel_for execution finished

TIME Download closely-related genomes time: 0 MINS 53 SECS. [M::mm_idx_gen::0.2711.00] collected minimizers [M::mm_idx_gen::0.3291.00] sorted minimizers [M::main::0.3291.00] loaded/built the index for 1 target sequence(s) [M::mm_mapopt_update::0.3401.00] mid_occ = 100 [M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 1 [M::mm_idx_stat::0.3441.00] distinct minimizers: 388043 (99.36% are singletons); average occurrences: 1.013; average spacing: 10.002 [M::worker_pipeline::81.2391.00] mapped 854 sequences [M::main] Version: 2.17-r941 [M::main] CMD: minimap2 -cx asm5 --cs=long -t 1 /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/contig_1.fasta /home/sara/Polaromonas_isolates/homopolished_0-80/debug/contig_1/All_homologous_sequences.fna.gz [M::main] Real time: 81.257 sec; CPU: 81.243 sec; Peak RSS: 0.405 GB 81.2698929309845 [2022/05/05 12:23] INFO: Stage: Homologous retrieval TIME Homologous retrieval: 0 MINS 6 SECS. [2022/05/05 12:23] INFO: Stage: Prediction Traceback (most recent call last): File "homopolish/homopolish.py", line 64, in main() File "homopolish/homopolish.py", line 46, in main FLAGS.output_dir, FLAGS.minimap_args, FLAGS.mash_threshold, FLAGS.download_contig_nums, FLAGS.debug, FLAGS.meta, FLAGS.local_DB_path,FLAGS.coverage,FLAGS.distance) File "/home/sara/homopolish/modules/polish_interface.py", line 329, in polish_genome out = without_genus(out, assembly_name, output_dir_debug, mash_screen, assembly, model_path, sketch_path, genus_species, threads, output_dir, minimap_args, mash_threshold, download_contig_nums, debug, meta,coverage,distance) File "/home/sara/homopolish/modules/polish_interface.py", line 275, in without_genus out.append(check_homopolish(paf, contig_name, contig_output_dir, contig, minimap_args, threads, download_path, model_path)) File "/home/sara/homopolish/modules/polish_interface.py", line 131, in check_homopolish finish = homopolish(contig_name, minimap_args, threads, db_path, model_path, contig_output_dir, dataframe) File "/home/sara/homopolish/modules/polish_interface.py", line 91, in homopolish result = prediction.predict(dataframe, model_path, threads, contig_output_dir) File "/home/sara/homopolish/modules/prediction.py", line 19, in predict model = joblib.load(model) File "/home/sara/anaconda3/envs/homopolish/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 577, in load with open(filename, 'rb') as f: FileNotFoundError: [Errno 2] No such file or directory: '/home/sara/homopolish/homopolish/R9.4.pkl'

ythuang0522 commented 2 years ago

The -g option has been fixed in the github version but not conda yet as we planned to release a major update soon. You may clone the github repo for your needs.

We can't reproduce the path error in our env. It looked to me this is possibly due to the system-defined path searching rule in your local env. i.e., we can't be sure how your env interpreted -s homopolish/bacteria.msh. Personally, I would avoid using this sort of system-defined searching behavior and define the path on my own. e.g., by specifying the full path or ./homopolish/bacteria.msh relative to where I execute.

I hope this helps.