ythuang0522 / homopolish

High-quality Nanopore-only genome polisher
GNU General Public License v3.0
65 stars 12 forks source link

Custom genome database #23

Closed snayfach closed 2 years ago

snayfach commented 3 years ago

I was wondering if it would be possible to polish my contigs using a custom database of genomes. I've identified a number of viruses on Nanopore contigs, but most don't have any similar genome in NCBI. However, I do have a large database of viruses assembled using Illumina metagenomes that I'd like to use as a reference. Would this work using your tool?

ythuang0522 commented 3 years ago

Hi Stephen, Theoretical Yes if with some tweaks. The database can be any sequences compiled via Mash. But if your db is multi-fasta MAGs, you have to split multi-fasta file into one species per file (or single contig per file if u dunno species). Then we can revise the code by retrieving genomes from your own fasta folder instead of NCBI FTP. And there must be enough related strains (>5) in your db for distinguishing errors from strain variations of the model. Not straightforward but we are happy to help if you wanna try.

snayfach commented 3 years ago

Thanks for the reply and I'd love to try it out. I'm also considering using proovframe which is based on frameshifted alignments to a protein database. Have you used that tool, and if so can you comment on how the performance may differ from homopolish?

ythuang0522 commented 3 years ago

Thanks for your info. Never try that but we started with similar idea two years ago, though on polishing draft genomes instead of reads. Using UniProt with Diamond at first. Then switch to CDS db. But at the end we realized whole-genome alignment is way better than using proteins or CDS for error correction, because we have to distinguish highly-similar gene-family members and pseudogenes from original genes. I also don't think it's a good idea to polish reads in this way prior to assembly, because it creates non-uniform error rate across the entire read, which may affect the overlapping module in some assemblers. Having said that, I'am still happy to know if it works in your case.

Internally we knew homopolish works really good on hundreds of isolate genomes. But we knew it requires further improvement for highly-fragmented MAGs, especially those without ref genomes in NCBI. So your case is of particular interests to us. Look forward to your results.

snayfach commented 3 years ago

In the coming days I may try and adapt your code to use local genomes. Will let you know if I encounter any issues. Alternatively it would be great to have this as an extra feature if you have the time :-)

ythuang0522 commented 3 years ago

Is it possible we have a sample ONT virus and couple Illumina MAGs for testing? If not possible can we know the rough length of ONT and Illumina MAGs? Thanks.

luqrei82 commented 3 years ago

Hello, I'm having problem with the same approach using my own database to polish a section of a HIV-1 gene (1.7 kb). Although I have chosen for HIV-1 reference from the HIV-1 database specific for the region I'm trying to polish, I still can't get the polisher to work due to "similarity in genome < 5". I understand that there can be a lot of variation in the reference, but how close the genome should be for Homopolish to work? Because I also tried with a concatenation of 16 Sanger sequence results of gene clones from a PCR product as a database and I still get the same message. Is there a specific procedure to create the Mash sketch?

Thank you.

ythuang0522 commented 3 years ago

Hi, are these amplicon sequencing? The minimum mash identity is set to 0.95 by default, but this was tested on genome scale and not at amplicon/gene level. Would be helpful if you can provide the HIV gene and your database sequences for us to test.

luqrei82 commented 3 years ago

Hi, thank you for reply. Here are the medaka polished consensus and the database.

hiv-db_custom_no_gap.txt medaka_consensus_modified.txt

chengjun109 commented 3 years ago

Hi @luqrei82 ,

We update the new version(0.0.2), you can use parameter ('-l') . But there is an environment you need to recreate it, or you can download pycurl package in your environment. And also, we successfully use this command "python3 homopolish.py polish -a medaka_consensus_modified.txt -l hiv-db_custom_no_gap.txt -m R9.4.pkl --minimap_args asm10" to test your data.

You can try this.

ythuang0522 commented 3 years ago

Hi @luqrei82 , we note that the HIV sequences mutated frequently such that the -minimap_args is relaxed from asm5 to asm10 for recruiting sufficient homologs in your case.

ythuang0522 commented 2 years ago

The local database has been supported via -l.