ythuang0522 / homopolish

High-quality Nanopore-only genome polisher
GNU General Public License v3.0

ValueError: Input contains NaN #25

Closed JamesYang1209 closed 2 years ago

JamesYang1209 commented 3 years ago

Hi, I assembled my genome with canu and corrected the sequence with racon and medaka. When I run homopolish as the last correction step, I get the error shown below.

$ python homopolish.py polish -m R9.4.pkl -s bacteria.msh -o 03.SeqPolish/ -a consensus.fasta
......
......
[2021/04/20 11:01] INFO: Stage: Homologous retrieval
TIME Homologous retrieval: 0 MINS 32 SECS.
[2021/04/20 11:02] INFO: Stage: Prediction
Traceback (most recent call last):
  File "homopolish.py", line 52, in <module>
  main()
  File "homopolish.py", line 38, in main
      FLAGS.output_dir, FLAGS.minimap_args, FLAGS.mash_threshold, FLAGS.download_contig_nums, FLAGS.debug, FLAGS.meta)
  File "modules/polish_interface.py", line 303, in polish_genome
      finish = homopolish(contig_name, minimap_args, threads, db_path, model_path, contig_output_dir, dataframe)
  File "modules/polish_interface.py", line 90, in homopolish
      result = prediction.predict(dataframe, model_path, threads, contig_output_dir)
  File "modules/prediction.py", line 23, in predict
      result_prob = parallel(jobs)
  File "python3.6/site-packages/joblib/parallel.py", line 1029, in __call__
      if self.dispatch_one_batch(iterator):
  File "python3.6/site-packages/joblib/parallel.py", line 847, in dispatch_one_batch
      self._dispatch(tasks)
  File "python3.6/site-packages/joblib/parallel.py", line 765, in _dispatch 
      job = self._backend.apply_async(batch, callback=cb)
  File "python3.6/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
      result = ImmediateResult(func)
  File "python3.6/site-packages/joblib/_parallel_backends.py", line 572, in __init__
      self.results = batch()
  File "python3.6/site-packages/joblib/parallel.py", line 253, in __call__
      for func, args, kwargs in self.items]
  File "python3.6/site-packages/joblib/parallel.py", line 253, in <listcomp>
      for func, args, kwargs in self.items]
  File "python3.6/site-packages/sklearn/svm/base.py", line 620, in _predict_proba
      X = self._validate_for_predict(X)
  File "python3.6/site-packages/sklearn/svm/base.py", line 454, in _validate_for_predict
      accept_large_sparse=False)
  File "python3.6/site-packages/sklearn/utils/validation.py", line 542, in check_array
      allow_nan=force_all_finite == 'allow-nan')
  File "python3.6/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
      raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I am not sure what causes this error; the input FASTA looks normal. Could you kindly help me solve this problem? Thank you.
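For anyone debugging the same traceback: the final frames show that scikit-learn's input validation rejects any feature matrix containing NaN or an infinite value. Below is a minimal sketch of how such entries could be located before calling the model. It is not homopolish's code; `find_non_finite` and the `features` rows are hypothetical and use only the standard library (with a pandas or NumPy array, `numpy.isfinite` does the same job).

```python
import math


def find_non_finite(rows):
    """Return (row, column, value) for every NaN or infinite entry."""
    bad = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isfinite(value):
                bad.append((i, j, value))
    return bad


# Hypothetical feature rows; in practice these would come from the
# dataframe that is passed to the prediction step.
features = [
    [0.25, 0.50, 0.25],
    [float("nan"), 0.10, 0.90],
    [0.00, float("inf"), 1.00],
]

for i, j, value in find_non_finite(features):
    print(f"row {i}, column {j}: {value}")
```

Knowing which rows and columns are affected makes it easier to tell whether the problem comes from the input contig or from the retrieved homologous sequences.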

chengjun109 commented 3 years ago

Hi @JamesYang1209

We have released a new version (0.0.2) that fixes this problem. Please give it a try.

JamesYang1209 commented 3 years ago

Thanks for the update. It did solve my problem. However, it looks like -d has become a required argument?

Traceback (most recent call last):
  File "/home/james/tools/homopolish-0.2/homopolish.py", line 55, in <module>
    main()
  File "/home/james/tools/homopolish-0.2/homopolish.py", line 41, in main
    FLAGS.output_dir, FLAGS.minimap_args, FLAGS.mash_threshold, FLAGS.download_contig_nums, FLAGS.debug, FLAGS.meta, FLAGS.local_DB_path)
  File "/home/james/tools/homopolish-0.2/modules/polish_interface.py", line 326, in polish_genome
    shutil.rmtree(contig_output_dir_debug)
NameError: name 'contig_output_dir_debug' is not defined

ythuang0522 commented 3 years ago

That variable is only used in debugging mode, which should not be mandatory. We have pushed a fixed version. Please reinstall. Sorry for the inconvenience.

JamesYang1209 commented 3 years ago

Thanks for the quick fix. However, I get some warnings with some genomes.

/usr/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NZ_CP008850.1&rettype=fasta

Can these warnings be safely ignored? Thank you.

ythuang0522 commented 3 years ago

Hi James, can you provide some information about your genome, such as the N50 and the number of contigs? That module is activated when the program suspects your contig is a plasmid rather than a chromosome; it then retrieves plasmids via the NCBI eutils API instead of FTP. We haven't seen this warning before. If it is reproducible, we will need you to provide the contig sequence for debugging.
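A 429 response means the server is throttling the client, so one generic workaround is to retry with an increasing delay. The sketch below illustrates that idea only; it is not how homopolish talks to NCBI. `fetch_with_backoff`, its parameters, and the injected `sleep` callable are hypothetical names, and `fetch` stands for any zero-argument function that performs the download and raises `urllib.error.HTTPError` on failure.

```python
import time
import urllib.error


def fetch_with_backoff(fetch, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on HTTP 429, wait and retry with exponential backoff."""
    for attempt in range(max_tries):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            # Give up on any other HTTP error, or once retries are exhausted.
            if err.code != 429 or attempt == max_tries - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Spacing out requests, or registering an NCBI API key for a higher rate limit, may avoid the throttling in the first place.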

schorlton commented 2 years ago

@ythuang0522, I'm hitting this error too, so perhaps I can help. The full log is below, and the debug folder for this contig can be found here. I'm running homopolish 0.3.1. Let me know if you need any other info. Thanks for your help in troubleshooting and for the great software!

Query = [/data/homopolish_fail/homopolish/debug/contig_18/contig_18.fasta]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = /data/homopolish_fail/homopolish/debug/contig_18/ANI.txt
>>>>>>>>>>>>>>>>>>
INFO [thread 0], skch::main, Count of threads executing parallel_for : 1
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling  = 24
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 10649502
INFO [thread 0], skch::Sketch::index, unique minimizers = 870210
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 242459) ... (236, 1)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 9.79632 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 19.1334 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.0104783 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished
[M::mm_idx_gen::0.187*1.00] collected minimizers
[M::mm_idx_gen::0.232*1.00] sorted minimizers
[M::main::0.232*1.00] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.242*1.00] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.251*1.00] distinct minimizers: 670654 (98.91% are singletons); average occurrences: 1.014; average spacing: 9.991; total length: 6792935
[M::worker_pipeline::95.572*1.00] mapped 2071 sequences
[M::main] Version: 2.22-r1101
[M::main] CMD: minimap2 -cx asm5 --cs=long -t 1 /data/homopolish_fail/homopolish/debug/contig_18/contig_18.fasta /data/homopolish_fail/homopolish/debug/contig_18/All_homologous_sequences.fna.gz
[M::main] Real time: 95.577 sec; CPU: 95.490 sec; Peak RSS: 0.918 GB
TIME Download closely-related genomes time: 0 MINS 38 SECS.
[2021/09/04 16:17] INFO: Stage: Homologous retrieval
TIME Homologous retrieval: 4 MINS 11 SECS.
[2021/09/04 16:21] INFO: Stage: Prediction
Traceback (most recent call last):
  File "/homopolish/homopolish.py", line 58, in <module>
    main()
  File "/homopolish/homopolish.py", line 42, in main
    FLAGS.output_dir, FLAGS.minimap_args, FLAGS.mash_threshold, FLAGS.download_contig_nums, FLAGS.debug, FLAGS.meta, FLAGS.local_DB_path)
  File "/homopolish/modules/polish_interface.py", line 329, in polish_genome
    out = without_genus(out, assembly_name, output_dir_debug, mash_screen, assembly, model_path, sketch_path, genus_species, threads, output_dir, minimap_args, mash_threshold, download_contig_nums, debug, meta)
  File "/homopolish/modules/polish_interface.py", line 275, in without_genus
    out.append(check_homopolish(paf, contig_name, contig_output_dir, contig, minimap_args, threads, download_path, model_path))
  File "/homopolish/modules/polish_interface.py", line 130, in check_homopolish
    finish = homopolish(contig_name, minimap_args, threads, db_path, model_path, contig_output_dir, dataframe)
  File "/homopolish/modules/polish_interface.py", line 90, in homopolish
    result = prediction.predict(dataframe, model_path, threads, contig_output_dir)
  File "/homopolish/modules/prediction.py", line 23, in predict
    result_prob = parallel(jobs)
  File "/opt/conda/envs/bugseq/lib/python3.7/site-packages/joblib/parallel.py", line 1051, in __call__
    while self.dispatch_one_batch(iterator):
  File "/opt/conda/envs/bugseq/lib/python3.7/site-packages/joblib/parallel.py", line 866, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/envs/bugseq/lib/python3.7/site-packages/joblib/parallel.py", line 784, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/envs/bugseq/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/opt/conda/envs/bugseq/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/opt/conda/envs/bugseq/lib/python3.7/site-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/opt/conda/envs/bugseq/lib/python3.7/site-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/opt/conda/envs/bugseq/lib/python3.7/site-packages/sklearn/svm/base.py", line 620, in _predict_proba
    X = self._validate_for_predict(X)
  File "/opt/conda/envs/bugseq/lib/python3.7/site-packages/sklearn/svm/base.py", line 454, in _validate_for_predict
    accept_large_sparse=False)
  File "/opt/conda/envs/bugseq/lib/python3.7/site-packages/sklearn/utils/validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/opt/conda/envs/bugseq/lib/python3.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Traceback (most recent call last):
  File "/bugseq/lib/python/nextflow.py", line 67, in run_cmd
    result.check_returncode()
  File "/opt/conda/envs/bugseq/lib/python3.7/subprocess.py", line 444, in check_returncode
    self.stderr)
subprocess.CalledProcessError: Command '['python3', '/homopolish/homopolish.py', 'polish', '-a', 'consensus.fasta', '-s', 'refseq.msh', '-m', '/homopolish/R9.4.pkl', '-o', 'homopolish']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "command.py", line 55, in <module>
    main(input_assembly, mash_sketch, output_dir, metadata)
  File "command.py", line 31, in main
    output_dir,
  File "/bugseq/lib/python/nextflow.py", line 67, in run_cmd
    result.check_returncode()
  File "/opt/conda/envs/bugseq/lib/python3.7/subprocess.py", line 444, in check_returncode
    self.stderr)
subprocess.CalledProcessError: Command '['python3', '/homopolish/homopolish.py', 'polish', '-a', 'consensus.fasta', '-s', 'refseq.msh', '-m', '/homopolish/R9.4.pkl', '-o', 'homopolish']' returned non-zero exit status 1.

ythuang0522 commented 2 years ago

Thanks for providing us the contig. Will get back to you later.

ythuang0522 commented 2 years ago

@schorlton I ran the program with contig_18.fasta and it finished without any error (see below). However, the 20 related genomes you retrieved look completely different from mine: for example, GCF_001545205.1 and GCF_001545185.1 in your run are not among the ones found in mine (e.g., GCF_006364795.1). Are you using the default bacteria.msh for screening related genomes?
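A comparison like this can be scripted by extracting assembly accessions from the names of the downloaded files in each run and diffing the two sets. This is a rough sketch under assumptions: `accessions` and the regular expression are hypothetical helpers, and the example names are only the accessions and file names quoted in this thread.

```python
import re

# Matches assembly accessions such as GCF_006364795.1 (GCA_ also accepted).
ACCESSION = re.compile(r"GC[AF]_\d+\.\d+")


def accessions(names):
    """Collect the assembly accessions found in a list of names."""
    found = set()
    for name in names:
        match = ACCESSION.search(name)
        if match:
            found.add(match.group(0))
    return found


run_a = accessions(["GCF_001545205.1", "GCF_001545185.1"])
run_b = accessions([
    "GCF_005154325.1_ASM515432v1_genomic.fna.gz",
    "GCF_006364795.1_ASM636479v1_genomic.fna.gz",
])

print("shared:", sorted(run_a & run_b))
print("only in run A:", sorted(run_a - run_b))
print("only in run B:", sorted(run_b - run_a))
```

An empty intersection, as here, suggests the two runs screened against different sketches.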

python3 homopolish.py polish -a contig_18.fasta -s bacteria.msh -m R9.4.pkl -d -o contig18

[2021/09/05 23:32] INFO: RUN-ID: contig_18
contig_18
/home/ythuang/homopolish/contig18/debug
[2021/09/05 23:32] INFO: Stage: Select closely-related genomes
TIME Select closely-related genomes: 0 MINS 12 SECS.
[2021/09/05 23:33] INFO: Stage: Download closely-related genomes
 INFO: 20 homologous sequence need to download: 
Downloaded GCF_005154325.1_ASM515432v1_genomic.fna.gz
Downloaded GCF_006364795.1_ASM636479v1_genomic.fna.gz
...
TIME Homologous retrieval: 0 MINS 22 SECS.
[2021/09/05 23:35] INFO: Stage: Prediction
TIME Prediction: 0 MINS 0 SECS.
[2021/09/05 23:35] INFO: Stage: Polish
TIME Polish: 0 MINS 6 SECS.
TIME Total: 2 MINS 59 SECS.

schorlton commented 2 years ago

I am not. Sorry for the omission. Could you please try with this mash sketch? Thanks!

ythuang0522 commented 2 years ago

The bug had been fixed but reappeared due to merge errors. We have pushed a corrected version to GitHub. Please pull the latest one; it should work with your own sketch. Thanks for reporting this issue.

python3 homopolish.py polish -a contig_18.fasta -s refseq.genomes%2Bplasmid.k21s1000.msh -m R9.4.pkl -d -o contig18

[2021/09/06 10:00] INFO: RUN-ID: contig_18
contig_18
/home/ythuang/homopolish/contig18/debug
[2021/09/06 10:00] INFO: Stage: Select closely-related genomes
TIME Select closely-related genomes: 0 MINS 5 SECS.
...
[2021/09/06 10:03] INFO: Stage: Homologous retrieval
TIME Homologous retrieval: 0 MINS 39 SECS.
[2021/09/06 10:04] INFO: Stage: Prediction
TIME Prediction: 0 MINS 1 SECS.
[2021/09/06 10:04] INFO: Stage: Polish
TIME Polish: 0 MINS 8 SECS.
TIME Total: 3 MINS 17 SECS.

schorlton commented 2 years ago

Thanks! Can I suggest that you tag a new minor release with the bug fix?

ythuang0522 commented 2 years ago

Done. Tagged as v0.3.2. If there are no further issues, I will close this one.

schorlton commented 2 years ago

Awesome, thanks again!