ythuang0522 / homopolish

High-quality Nanopore-only genome polisher
GNU General Public License v3.0

Could you provide an option for homopolymer-only polishing? #49

Closed piroonj closed 2 years ago

piroonj commented 2 years ago

Hi developer,

Is it possible to provide an option that polishes homopolymers only, without polishing small indels?

Best, Piroon

ythuang0522 commented 2 years ago

Not sure what you mean. The ML model has been trained to correct homopolymers while retaining strain variations (i.e., small indels), e.g., the retention of pseudogenes caused by frameshifting indels. The decision is left to the model.

piroonj commented 2 years ago

I tested this by introducing mismatches and small indels into an E. coli genome and running homopolish with -l local_ref.fasta.

GATAAGCATGATAGCTACCCGTC G AA - original (used as reference)
GATAAGCATGATAGCTACCCGTC T AA - manually change G>T
GATAAGCATGATAGCTACCCGTC T AA - homopolish (mismatch retained)

ATCAGCATGATGCCAGCAATAAGTG C CAGG - original (used as reference)
ATCAGCATGATGCCAGCAATAAGTG T CAGG - manually change C>T
ATCAGCATGATGCCAGCAATAAGTG T CAGG - homopolish (mismatch retained)

GGCGTAATACTTAACTGGCGCTACGGC T GATGGC - original (used as reference)
GGCGTAATACTTAACTGGCGCTACGGC TT GATGGC - manually insert T
GGCGTAATACTTAACTGGCGCTACGGC T GATGGC - homopolish (fix the insertion)

GGCAACCGAACCGGCAAGCCCTGCAC C GACGATGA - original (used as reference)
GGCAACCGAACCGGCAAGCCCTGCAC GACGATGA - manually delete C
GGCAACCGAACCGGCAAGCCCTGCAC C GACGATGA - homopolish (fix the deletion)

CTGAGAGGATCACAAAGGTCAT GCCAACGGCAA - original (used as reference)
CTGAGAGGATCACAAAGGTCAT TA GCCAACGGCAA - manually insert TA
CTGAGAGGATCACAAAGGTCAT GCCAACGGCAA - homopolish (fix the insertion)

AGATCCCAGTAGCGAGAAGCCA CC AGCGTAGT - original (used as reference)
AGATCCCAGTAGCGAGAAGCCA AGCGTAGT - manually delete CC
AGATCCCAGTAGCGAGAAGCCA CC AGCGTAGT - homopolish (fix the deletion)

I found that homopolish fixes all small indels, whether or not they fall in homopolymers, while mismatches are retained. In my case, I would like to fix only the indels caused by homopolymers, if possible.

ythuang0522 commented 2 years ago

Although homopolymer length is one of the 12 features in the model (see the Feature engineering section in the paper), the decision whether to correct a site is left to the trained classifier. This is mainly because ONT indel errors are not limited to homopolymers, and it is sometimes hard to tell from the draft genome whether a site is a homopolymer, e.g., Draft: ACT vs. Ref: ACCT. It looks like you are clear on what you want polished. While it's possible to train your own model, the easier way might be to compare the draft and polished genomes and roll back any corrections you don't want.
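The compare-and-roll-back idea could be sketched roughly as follows. This is only an illustration, not part of homopolish: the function names `is_homopolymer_indel` and `roll_back_non_homopolymer` are hypothetical, and "homopolymer indel" is assumed here to mean an inserted or deleted run of one repeated base that matches an adjacent base in the draft (so ACT vs. ACCT counts). It uses Python's standard `difflib.SequenceMatcher`, which is only practical for short windows; for whole genomes you would likely need a proper aligner instead.

```python
from difflib import SequenceMatcher


def is_homopolymer_indel(indel: str, left: str, right: str) -> bool:
    """Assumed definition: the indel is a run of one base that
    matches the flanking base on either side in the draft."""
    if not indel or len(set(indel)) != 1:
        return False
    base = indel[0]
    return left == base or right == base


def roll_back_non_homopolymer(draft: str, polished: str) -> str:
    """Keep only the polished edits that look like homopolymer indels;
    every other difference is rolled back to the draft sequence."""
    out = []
    matcher = SequenceMatcher(None, draft, polished, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left = draft[max(i1 - 1, 0):i1]   # draft base before the edit
        right = draft[i2:i2 + 1]          # draft base after the edit
        if tag == "equal":
            out.append(draft[i1:i2])
        elif tag == "insert":
            inserted = polished[j1:j2]
            if is_homopolymer_indel(inserted, left, right):
                out.append(inserted)      # accept the polisher's insertion
        elif tag == "delete":
            deleted = draft[i1:i2]
            if not is_homopolymer_indel(deleted, left, right):
                out.append(deleted)       # roll back: restore draft bases
        else:  # "replace": mismatch or mixed edit, keep the draft
            out.append(draft[i1:i2])
    return "".join(out)
```

With the examples earlier in this thread, the single-T removal in `...GGCTTGATGGC` would be kept, while the removal of the inserted `TA` would be rolled back to the draft.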

piroonj commented 2 years ago

Got it. Thank you for the suggestion.