Closed by DDavila10 3 years ago
Can you provide the full command line and one short contig of your genome? It looks to me like your MAGs are indeed highly fragmented (N50=~10-20kb). The current version has difficulty identifying closely-related genomes for such short contigs (our next release attempts to address this issue). But these short contigs, even without polishing, should still be written to the output folder. It would be helpful if you could provide more info. --Yao-Ting
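To check how fragmented an assembly is before polishing, the N50 mentioned above can be computed directly from the FASTA file. The following is a minimal sketch, not part of homopolish; the function names `read_fasta_lengths` and `n50` are made up for illustration, and it assumes a plain, uncompressed FASTA input.

```python
def read_fasta_lengths(lines):
    """Return {contig_name: sequence_length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the contig name.
            header = line[1:].split()
            name = header[0] if header else "unnamed_%d" % len(lengths)
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half of the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly
```

For a file on disk, something like `with open("assembly.fasta") as fh: print(n50(read_fasta_lengths(fh).values()))` would print the N50, where `assembly.fasta` is a placeholder path.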
Hi Yao-Ting, thanks a lot for replying and the help!
This is the full command line that I am using:
python3 homopolish.py polish -a ~/mice-0476/all_data/racon_4x_medaka_merged_mice_0476_Nanopore_metagenome.fasta -s bacteria.msh -m homopolish/R10.3.pkl -t 14 -o racon_4x_medaka_homopolish
and one short contig of the genome is this:
contig_100 GTTTATAGGCAATAATCAATCACGTTATATATAGAATGTCAAATAATAGTCATTATAAATAGTCATTGAATTAAAAAGAATAACTATACAAAAAATAAATTTGATGTATAATAATACATAAGGATAAATTTTTATAATTATAGATATTATCCCTGAAAGTCGTGAAATAGTTACTCAAAACATATTCTGTGAGATAACTCATGGGGTATTATAGGCTGAAACCATTCATTTTACTGGATATATTGTCCTTTTAGTATTGCTTTTTATTGTCTGGATAATGGGGTTAAAACCCTCTCGCCCTAAGGTGAAACCTTAGGTCAACCCCATAGCTTCAAGTTTAGAAAAAAGAAAATAGAAATACTAAAGCTTGAGGTGAACAATATGGAAAATTACAGAAAATCGTCCTATTGCATATATGATATAAAAAATACCATTTGGTATGGATAACAAATACCGTAAAAAAAACCGGTAATAACAGGGCAAATAGCCGTTAGGACAAGAAACCATATCGGAATGATGGACATCATCGAACATATAGGCTCCACATCGGGTATTGATAATGCGATATACTCTTCCACGGATTTGGCTACCGCACAGAAGATTATATCCCTGGCCAGGTACCTCCTTGCCACTAACGGGCAGTCCCTCCCGGGCATCCTGACATGGCAGTATAACCATCCCATTCCCTATGAGGATGGGATCACTGAGCATGTGTACCATAACCTTTTTGTGCAGCTGGGAACCGATGAAAGCCTGCAGCAGAGTTTTTTCAAGCACCGCTGCGCATCCCTTTCTGCCAGTCTGAGAACCAGAATGAAGCCAGATATGGATTTAACAAGGCTCATGACGGTCTGAAAACAATAAAACTGCTAACACTGTATTTCATAGAGAGCAGACAGCCAGTTGCCTTTACAAAGCAGCCTGGAATTCTCCCGCAGATGACATCCATTACAAACGCAATAAACCAGCTTTCAGCACTGGGTGTGCACACAACGTAGATTATTACTGACAATGGTCACTATTCCGAGCAGAACTTCGCAGAACTGCTCCTTGCAGGGTTTGATTTCATCAGATTGCAAAAACCAGCGGTCAAATAAAAGGATCAGGCCAGAAATCGATAAACAACGTGAAGCGCTGGACAATTTCAAAAGCGTATGTCGTTTGATACATCTACCCATGGGGTTTCCGTTCCCTCATGAATGGATTTTCGAAGAATCACAAATATGCCAGCCATAAAAGCGGTGCGCAAAAGGGGGATGCCGAAACCTTTACATGCAGGATTTATCTCAATATATATTTTAATCATTCCGGGGCAGTCAGCAGATAAAGCCGCATTTGAGGCTGACCTGTTTGAACTTAAGACCCTGCTTGAATTTGGCACTCCTGTTGATGAATTATCAGATTTATCGCAGGCAAAGGTAAAGAAATATTTTTCCATCAAAAAATGGGGCGGCAAAACCATTGTGGTTCCCAATAACAAAGCCATCGCAGACGCGAACAAATACCACGGTTACTTTGTCCTTGTGTCCAATAAGGAAAAAGATCCTTTTGAATGCCTGCGCAAGTACCGTAAAAGGGAAACGATAGAATCCTTTTTTGAAGCCGGGAAACAGCATGCTGACGGTAACAGGGTAAGGGTATGGAATACCGACACCCTCCGCGGTCGTATGTTTGTCCAGTTTGTGTCCCTCTGCTATTATGAGTACCCGAATGAAGAAATCCGAAAACTCGAAAAAAGCCTGGGCAAGGAAAACGGCGATGCAGCACATGATACCAAAGCCGTGCTGAACAATGAATCCAAGTTAAAGTCATGGCTTTGCAATACCCCTCTGTATCTACAGCTGCAATGGTTTGATACAGTGGAAAGCGTTGGCATATCGGCAAAGCTCAAGAGCCGCAGATGGACTACGGAGATCACATCGCGCGATGAGCTTTATTTACAAAAACTTGGGGTGACTATCAGTTGATGTTTTGAGTATATATTTTACG
ACTTTTAGGATATTATATGATCTCTCAGCAATTTGAGTAGCCATGCCAGCAAAATTTTTGTAATGCACTGCTACGAAAATAAAGCTCGTAGCAAAGAATTGAACCTTGAAACAAATATTTTCAACAATTATGAAGGCATCACTAAAAAGCAGTCACAATATTGGTTGCCAACAGTGAAATATGGATGATATGAAGATGGACTAACCTTGGACTTCTACATGATAGCCATGTTTCTCTAATTTACGTATGTATCGGGAAAGCTATTTTTGTTCACAGCTTTTGAGTTTGGCTTCAAAACATGATTCATTGAAAAGAGTGCCTTGCTTTAACATGGTGTAAATGATTACAAGAAGTTTTCTTGCAAGGGCAATAATAGCCTTTTAGCCTCCTTTTTCTGTTTGAACTTCCAATACTAGGCGGACAGAAAGGTATTGCGTTTTCCCGCGATTACCCAGGTAATTTCGCAAAGAATACTTCTTATGTAAGGATTGCCTTTTGTGATAGATGTGCTTTTCCGTTTCCCGGCACTTTCATTATGTGCCAAGGCACAGGCCTACCCATGAGCAGATATCCTCCACAGTTTTGAACGGCTTTATATCAATGCCGATCTCAGCAATGATTACACAGGAAGCCGTTGTTCTGATTCCATAAATGTTACTTAGCTGAGATAAGAACGCTACATGCAGGAAGGAGTTCGTGGAGCAGGCTCGAATTTGGAATATTTATCAGTATAGGATAAATCAAGATTCCAAAAGCGCTCAATGATAACCTAAGTGGAATAAGCAAGAACATCTGGATTGGGATAATATTTATGAAGATTAGCAACAACAAACTTTTGGTAGTCAGAATCACTGCCGCAGTTAACAGATAACAAATAAAATCACCTATAATCGAAAAAATATATCTTACAAATAAGCTCGTATAGCACCACATTTTTCGATAGGTTTGCCAAGTGGTTTCGGCAAAAAAGTGGGGGATTTTAATAAAAAAGGAATCAGAAAGCCTTATATATACTAGTATACCGTTTTTTAATTTCTCATATACCAAGCACTCACTATTTGCATCATTGCTGCGTTGTCGTCAGTCAACGATGCCACCGCATCGTTGACTGACGACAACGCAGCAATGATGCAAATATCAAGCATTTTGTAAACGGTGTACTAGCATAAAGATTCCAAGAGATTATTTATTAAAAATGGAGGTTTAAGGATGTATCGTGAAATTTCAAAATTGATTATGTATCAAGACGTAAAACAAGAAAGTATACTTTATAAGATGGGAGAAGTATTTCATTGCTTTCAAGAAGAAAAAAGTTCAAAACAGGAACTAATTAAAAAAAATTTATATACAATTGAGAAGATTATTAGATATAGCAACTGATTATGGTTTCAATTATAATTTATGGCATAATTATTTAACGTTTTATCTTGTTACGAATGAAAATCCATATAGTCTAACATGTGAAAAAGTTGGTGATGAAGGTAGCGCAAATTATTTCGCAAAAATGATTTCCGCGTTTTTAAAAATTTATTTGACTTTGATTTTTTACCGATTGAAAAAGAGTTAGGTATTGATTGTTTTACACAAATTGCTCATTATAAGTCAATTGAAAAAAAAGAATTTATGTATAATAAAAGAGTGTGAGTGATAAAATAATATCATTAAGTAAACGGTTGGAACAGGCACTTGATGAGAATGATTTTTTCAATGAAGTAACGGCATTTTATAAAAATTATGGTGTAGGAATGTTTGGATTAAACAAAGCATTTCGAATCAAAGAACAAACGGATATGGATATGGATTTTATACCTATTAATAATATGGATAAAGTTATGCTTGATGATTTAATAGGATATGAATTGCAAAAACAAAAATTAACTGATAATACAAAAGCTTTTGTTGAAGGACGGAAAGCAAATAATGTACTTCTTTTCGGAGATAGTGGAACAGGTAAATCAACCAGTATCAAAGCAATTGTGAATGCTTTCTATCCGCAAGGA
TTAAGAATGATTGAAATTTACAAACACCAGTTTAAATATTTATCAAATATAATTGCACAAATAAAGAAACGAAATTATAAGTTTATCATTTATATGGATGATTTATCTTTTGAAGAATATGAAATTGAATATAAATTTTTAAAGGCGCTAATTGAAGGAGGAGTAGAAACAAAACCAGATAATATATTGATCTATGCAACATCAAACAGACGACATATTATTAAAGAAACTTGGAATGATAGAAATGATATAGAAACTGAGAAGGGAATGCATCGTTCTGATACAATGGAAGAAAAACTTTCACTTGTTAATCGTTTTGGCGTAACAATCAACTATTCAAAACCATCTCAAAAAGAATATTTTCAAATTGTTATTGCACTTGCACGAAAACAAGGAATTACATTGTCAGACGAAGAATTGTGTAAAGAGGCTAATAAATGGGAATTAAGTCATGGAGGAATATCTGGACGTACAGCTCAGCAGTTTATTAATTATTTAGATGGAAAAGTGGAAGAGAAATAATCTATTAAATGAATTAAGGAGGTTTATTTGAATTCTAAAGAGATATAATGTTTTAGATTGTATACGTGGTTTTGCTTTATTAAATATGATTGCATATCACACTATTTGGGATTTAGTTTATTTGTTTGGGATAGACTGGAAATGGTATCATTCGCAAGGTGCTTACATATGGCAACAAGGAATTTGTTGGATATTTATCTTTTTATCTGGATTTTGCTGGTCTTTAAGTAAGCATCCTTTAAAACGTGGAATTATTGTGTTTTTGTGGAGGTGCTTTAATTTCTTGTGTTACAATACTTTTAATGCCACAAAACAGGGTTTTATTTGGTGTTCTAACATTGATTGGATCCTGCATGATATTAGTAACAATTTTGAATAAAATCTTGCAAAAAATTTATCCATTGACTGGAATGTCTATTTGCTGGATATTATTTATAATAACCAGAAGTATTAATGATGGTTATCTTGGCTTCGAGAAAATTCATCTATTAAAATTACCTAAATATTTATATCAAAATCTTATTTCTTCTTACTTGGGTTTTCCGAAACAGGGATTTTATTCAACAGATTATTTTTCAATTATTCCATGGTTTTTTTTATTTTTAAGTGGATATTATTTGTTTCACTATATGGAAGAAAAGAATAACCTTAATAAATCTGGCCCAAAAAGATTTTTCTGCATTATCGTGGTTTGGCAAACATTCTTTCATTTTATATATGATTCATCAGCCAATAATTTATATTATTTTGGAATTGATATGGAAGATATACAAATATCAGTTTAGATAAAAGACTCAACTAGCAGTAGAGTTTTTTATCTAAATTCTAAAAAACTGTTATACAAAAAATATAAAATTATATTAAAATAAAGTTATAATATTATTTGTTGGTTAAATAAAAGTAGGGAGGATGAATATGTTAGATAATAAAAAAATTGCGGATTTTTTTAATGAAGAAACGAAAAGAATTAGGTTATACACAGGCAGAAATTGCTCAAAAATTAAATGTATCTTTTCAGGCAGTTTCAAAATGGGAAAACGGGACACTTCCCAATATTGAAATTTTAGTAGATTTAGCAAAATTACTTAAAGTTTCTGTAAATGAAATTCTGGTGGGCAGGGAATTAAATGAAGAAAATTTTTCTTATCGAAAAGCAGGTGTAGATAATCTTATACAGATATTATAAAAAAAGAAATGTCCATTTATTTAAAATCTGATAATCAAAGAGTATTAA
This one looks fine in my test.
python3 homopolish.py polish -m R10.3.pkl -s /var/www/www/mash_sketches/bacteria.msh.gz -a contig100.fa -o racon_4x_medaka_homopolish
The output folder did contain the original contig, although no related genomes were found and so it was not polished.
more racon_4x_medaka_homopolish/contig100_homopolished.fasta
>contig_100 GTTTATAGGCAATAATCAATCACGTTATATATAGAATGTCAAATAATAGTCATTATAAATAGTCATTGAATTAAAAAGAATAACTATACAAAAAATAAATTTGATGTATAATAATACATA...
Could you also try polishing this contig on your side and see if it works?
Hi Yao-Ting - Thanks a lot for the prompt response. I tried that test with the same contig and it works. The output directory only contains the polished contig, with no further information. How do I know whether a related genome was found or not? Does this mean that the output contig file is the same as the original because there are no related genomes to use for polishing?
You can use the argument -d to keep the information for every contig after mash, such as the homologous sequences and their identity information.
Yes. The related genomes can be displayed via -d. But your MAGs are too short (N50=10kb) for any related genome to be found with the current version. You will have to wait for the next release.
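Independently of -d, one way to see whether a contig came out unchanged is to compare the input assembly with the output FASTA. The sketch below is only an illustration and not part of homopolish; the helper names are hypothetical, and it assumes that contig names are identical in both files (if the tool renames polished contigs, they will be reported as missing rather than changed).

```python
def read_fasta(lines):
    """Return {contig_name: sequence} from an iterable of FASTA lines."""
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def changed_contigs(original, polished):
    """Names of contigs present in both files whose sequences differ."""
    return sorted(
        name
        for name, seq in original.items()
        if name in polished and polished[name].upper() != seq.upper()
    )


def missing_contigs(original, polished):
    """Names of contigs in the input that do not appear in the output."""
    return sorted(set(original) - set(polished))
```

An empty result from `changed_contigs` would suggest that no contig was modified, which is consistent with no related genome having been found.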
Hi Yao-Ting and Nicole, Thanks a lot for your valuable help and time. I am looking forward to the next version of homopolish! :)
Hi all,
After using homopolish, the output file is empty, and this is the last message that I received:
[2020/11/17 23:42] INFO: RUN-ID: tig00046071_len=10956_reads=23_class=contig_suggestRepeat=no_suggestBubble=no_suggestCircular=no
[2020/11/17 23:42] INFO: Stage: Select closely-related genomes
[2020/11/17 23:42] INFO: RUN-ID: tig00046075_len=19388_reads=41_class=contig_suggestRepeat=no_suggestBubble=no_suggestCircular=no
[2020/11/17 23:42] INFO: Stage: Select closely-related genomes
[2020/11/17 23:42] INFO: RUN-ID: tig00046076_len=14681_reads=88_class=contig_suggestRepeat=no_suggestBubble=no_suggestCircular=no
[2020/11/17 23:42] INFO: Stage: Select closely-related genomes
[2020/11/17 23:42] INFO: RUN-ID: tig00046120_len=11932_reads=18_class=contig_suggestRepeat=no_suggestBubble=no_suggestCircular=no
[2020/11/17 23:42] INFO: Stage: Select closely-related genomes
TIME Total: 49 MINS 28 SECS.
Do you have any idea what is happening?
Thanks again for your help
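Since short contigs were identified earlier in this thread as the reason no related genomes are found, it may help to pull the contig lengths out of a log like the one above. This is a rough sketch under the assumption that the contig names embed a `len=` field, as they do in this log; the names `contig_lengths_from_log` and `shorter_than` are hypothetical, and the cutoff is left to the caller because no exact threshold is stated here.

```python
import re

# Matches "RUN-ID: <name>_len=<digits>" as it appears in the log above.
RUN_ID = re.compile(r"RUN-ID:\s*(\S+?)_len=(\d+)")


def contig_lengths_from_log(log_text):
    """Return [(contig_name, length)] for every RUN-ID entry in the log text."""
    return [(name, int(length)) for name, length in RUN_ID.findall(log_text)]


def shorter_than(entries, cutoff):
    """Keep only the entries whose length is below the given cutoff (in bp)."""
    return [(name, length) for name, length in entries if length < cutoff]
```

For the four entries shown in the log, this yields lengths of 10956, 19388, 14681 and 11932 bp, all within the 10-20kb range discussed at the start of the thread.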