Failure at homologous retrieval stage: "DataFrame is empty!"

ezherman commented 1 year ago

Hi,

Thank you for releasing this awesome package! I'm excited to use it for my data. I was hoping you could help me with the following message, after which the program stops. I was expecting there to be prediction and polishing steps, in addition to homologous retrieval.

[2022/12/07 23:48] INFO: Stage: Homologous retrieval
DataFrame is empty!
TIME Homologous retrieval: 0 MINS 18 SECS.
TIME Total: 1 MINS 48 SECS.

Below you can find the command that I ran from within the homopolish directory. I am running homopolish v0.4.1.

python3 homopolish.py polish -a flye_medaka_assembly/consensus.fasta --genus pseudomonas_aeruginosa -m R9.4.pkl -o flye_medaka_homopolish -t 4 &> issue.log

Below are the contents of issue.log:

Genus: pseudomonas_aeruginosa
[92m[2022/12/07 23:59] INFO: Stage: Select closely-related genomes and download
[0m[92m[2022/12/08 00:00] INFO: Stage: Download closely-related genomes
[0m[92m INFO: 20 homologous sequence need to download: 
[0mDownloaded GCA_002026345.1_ASM202634v1_genomic.fna.gz
Downloaded GCA_002937155.1_ASM293715v1_genomic.fna.gz
Downloaded GCA_003591005.1_ASM359100v1_genomic.fna.gz
Downloaded GCA_007005445.1_ASM700544v1_genomic.fna.gz
Downloaded GCA_016215605.1_ASM1621560v1_genomic.fna.gz
Downloaded GCA_024330085.1_ASM2433008v1_genomic.fna.gz
Downloaded GCA_001020905.1_ASM102090v1_genomic.fna.gz
Downloaded GCA_003031265.1_ASM303126v1_genomic.fna.gz
Downloaded GCA_003547145.1_ASM354714v1_genomic.fna.gz
Downloaded GCA_004294885.1_ASM429488v1_genomic.fna.gz
Downloaded GCA_014145305.1_ASM1414530v1_genomic.fna.gz
Downloaded GCA_023155845.1_ASM2315584v1_genomic.fna.gz
Downloaded GCA_025811215.1_ASM2581121v1_genomic.fna.gz
Downloaded GCA_000935205.1_ASM93520v1_genomic.fna.gz
Downloaded GCA_003031245.1_ASM303124v1_genomic.fna.gz
Downloaded GCA_003226555.1_ASM322655v1_genomic.fna.gz
Downloaded GCA_003591275.1_ASM359127v1_genomic.fna.gz
Downloaded GCA_013415845.1_ASM1341584v1_genomic.fna.gz
Downloaded GCA_016584445.1_ASM1658444v1_genomic.fna.gz
Downloaded GCA_025811355.1_ASM2581135v1_genomic.fna.gz
>>>>>>>>>>>>>>>>>>
Reference = [/mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_000935205.1_ASM93520v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_001020905.1_ASM102090v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_002026345.1_ASM202634v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_002937155.1_ASM293715v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_003031245.1_ASM303124v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_003031265.1_ASM303126v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_003226555.1_ASM322655v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_003547145.1_ASM354714v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_003591005.1_ASM359100v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_003591275.1_ASM359127v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_004294885.1_ASM429488v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_007005445.1_ASM700544v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_013415845.1_ASM1341584v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_014145305.1_ASM1414530v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_016215605.1_ASM1621560v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_016584445.1_ASM1658444v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_023155845.1_ASM2315584v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_024330085.1_ASM2433008v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_025811215.1_ASM2581121v1_genomic.fna.gz.fna.gz, /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/homologous_sequences/GCA_025811355.1_ASM2581135v1_genomic.fna.gz.fna.gz]
Query = [flye_medaka_assembly/consensus.fasta]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/ANI.txt
>>>>>>>>>>>>>>>>>>
INFO [thread 0], skch::main, Count of threads executing parallel_for : 1
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling  = 24
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 8526211
INFO [thread 0], skch::Sketch::index, unique minimizers = 3746071
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 2406126) ... (1008, 1)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 10.0661 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 1.79509 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.0006606 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished
[M::mm_idx_gen::0.167*0.73] collected minimizers
[M::mm_idx_gen::0.184*0.98] sorted minimizers
[M::main::0.184*0.98] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.191*0.98] mid_occ = 100
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.196*0.98] distinct minimizers: 618983 (98.93% are singletons); average occurrences: 1.015; average spacing: 9.994
[M::worker_pipeline::3.433*2.46] mapped 2357 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -cx asm5 --cs=long -t 4 /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/contig_1/contig_1.fasta /mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug/homologous/All_homologous_sequences.fna.gz
[M::main] Real time: 3.444 sec; CPU: 8.463 sec; Peak RSS: 0.386 GB
[95mTIME Download closely-related genomes time: 0 MINS 57 SECS.
[0m[92m[2022/12/08 00:01] INFO: RUN-ID: contig_1
[0m[92m[2022/12/08 00:01] INFO: Stage: Homologous retrieval
[0m[95mTIME Homologous retrieval: 0 MINS 18 SECS.
[0m[95mTIME Total: 1 MINS 49 SECS.
[0m
contig_1
/mnt/c/Users/elh605/homopolish/flye_medaka_homopolish/debug
3.4507524967193604
DataFrame is empty!

Thanks in advance!

ezherman commented 1 year ago

Further details:

The failure did not occur when I ran homopolish using a local NCBI database containing 53 Pseudomonas aeruginosa complete assemblies. However, quast shows the same results before and after running homopolish. Is this issue specific to my assembly? I would have expected some indels to be removed by homopolish.

Hereby the output after running homopolish with the local database:

[2022/12/08 15:44] INFO: RUN-ID: contig_1
contig_1
/mnt/c/Users/elh605/assemble-cf-isolates/data/long-read-seq-sup/barcode24/flye_medaka_homopolish_assembly/debug

[M::mm_idx_gen::0.182*0.74] collected minimizers
[M::mm_idx_gen::0.204*1.02] sorted minimizers
[M::main::0.204*1.02] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.211*1.02] mid_occ = 100
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.217*1.02] distinct minimizers: 618983 (98.93% are singletons); average occurrences: 1.015; average spacing: 9.994
[M::worker_pipeline::41.389*3.25] mapped 56 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -cx asm5 --cs=long -t 4 /mnt/c/Users/elh605/assemble-cf-isolates/data/long-read-seq-sup/barcode24/flye_medaka_homopolish_assembly/debug/contig_1/contig_1.fasta /mnt/c/Users/elh605/assemble-cf-isolates/data/long-read-seq-sup/barcode24/flye_medaka_homopolish_assembly/debug/contig_1/All_homologous_sequences.fna.gz
[M::main] Real time: 41.398 sec; CPU: 134.460 sec; Peak RSS: 1.188 GB
41.644439935684204
[2022/12/08 15:45] INFO: Stage: Homologous retrieval
TIME Homologous retrieval: 6 MINS 55 SECS.
[2022/12/08 15:52] INFO: Stage: Prediction
TIME Prediction: 0 MINS 2 SECS.
[2022/12/08 15:52] INFO: Stage: Polish
TIME Polish: 0 MINS 1 SECS.
TIME Total: 7 MINS 42 SECS.

ythuang0522 commented 1 year ago

@ezherman We have polished quite a few Pseudomonas aeruginosa genomes successfully using the default NCBI database. Can you provide the problematic contig for us to debug?

ezherman commented 1 year ago

Sure @ythuang0522, thank you for offering that! You can find the assembly here.

The reads come from a R9.4.1 pore and were called with the superaccuracy model in Guppy v6.3.9. The reads were filtered with filtlong v0.2.1, assembled with flye v2.9.1 and polished with medaka v1.7.2. Do you require any further details/data?

ezherman commented 1 year ago

Hi @ythuang0522, do you happen to have any updates on this? I am getting the same error with other Pseudomonas aeruginosa genomes from the same sequencing run. The download link expires tomorrow, but I can create a new one if needed. Thanks!

ythuang0522 commented 1 year ago

We just found that you added the wrong argument in your command. You shouldn't add --genus in your script, which will completely ignore the similarity of related genomes. You should just let the program search for most-related genomes automatically. My student confirmed your genome can be polished as expected. homopolish polish -a yourgenome.fasta -s bacteria.msh -m R9.4.pkl -o youroutput

ezherman commented 1 year ago

Thanks so much @ythuang0522, to your student as well!

I initially chose to use --genus instead of -s because the process was being killed on my local machine with -s (probably a memory issue?). I tried using -s on my university's cluster, which worked as expected 🎉.

For reproducibility, it would be good to solve the issue on my local machine with the -s option. I will open a separate issue for this.

ythuang0522 / homopolish

Failure at homologous retrieval stage: "DataFrame is empty!" #56