ythuang0522 / homopolish

High-quality Nanopore-only genome polisher
GNU General Public License v3.0

The issues of downloading homologous_sequences #39

Closed LeeSongEpi2013 closed 2 years ago

LeeSongEpi2013 commented 2 years ago

Hello, Homopolish version 0.2.1 was installed with conda. My command is:

conda activate homopolish && homopolish polish -a 117.canu.0728.fasta -s /home/db/bacteria.msh -t 4 -o ./117_msh --download_contig_nums 5 -m R9.4.pkl -d

The debug log output is:

INFO: Stage: Download closely-related genomes
INFO: 5 homologous sequence need to download:

But the program stays stuck at downloading the reference genomes. Whenever a reference genome is about to finish downloading, the status changes to re-download. I don't know if there is something wrong with my command or something else.

ythuang0522 commented 2 years ago

Hi, can you check that the connection to the NCBI FTP (ftp://ftp.ncbi.nlm.nih.gov/genomes/) is normal? If it is stable, please download the latest version from GitHub (v0.3.2) using the following commands:

git clone https://github.com/ythuang0522/homopolish.git
cd homopolish
conda env create -f environment.yml
conda activate homopolish

If the latest version does not work, we will need you to provide the contig FASTA for debugging.
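For the connection check mentioned above, a minimal Python sketch might look like the following. It only assumes the standard-library `ftplib` module; the function name `ncbi_ftp_reachable` and the `ftp_factory` parameter are hypothetical helpers for illustration and are not part of Homopolish. It tests a single anonymous login, so it cannot prove that the link stays stable during a long download.

```python
import ftplib


def ncbi_ftp_reachable(host="ftp.ncbi.nlm.nih.gov", path="/genomes/",
                       timeout=30, ftp_factory=ftplib.FTP):
    """Return True if an anonymous FTP login and a cwd into `path` succeed.

    `ftp_factory` is a hypothetical hook so the check can be exercised
    without network access; by default it is the real ftplib.FTP class.
    """
    try:
        with ftp_factory(host, timeout=timeout) as ftp:
            ftp.login()    # anonymous login
            ftp.cwd(path)  # raises ftplib.error_perm if the path is missing
            return True
    except ftplib.all_errors:
        # Covers ftplib.Error, OSError (including timeouts) and EOFError.
        return False


# Usage (performs a real network connection):
#     print("NCBI FTP reachable:", ncbi_ftp_reachable())
```

If this returns False repeatedly, the re-download loop is more likely a network or firewall problem than a Homopolish bug.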

Yao-Ting

LeeSongEpi2013 commented 2 years ago

Thank you very much for your reply. After checking, the NCBI connection is normal. I am currently updating Homopolish according to the method you provided, but so far it still does not run properly or produce results.

I have another question about using Homopolish:
The genome has been polished with Racon and Pilon but not with Medaka. Can Homopolish still be applied to it?

LeeSongEpi2013 commented 2 years ago

117.canu.0728.zip This is my genome sequence.
Please help me find the problem.
Thanks a lot.

Lee-song

ythuang0522 commented 2 years ago

It can be run after Racon and will still improve the assembly, but our results indicate that the best quality is obtained after Racon+Medaka. Thanks for providing the contigs. I will look into it later.

LeeSongEpi2013 commented 2 years ago

OK, Thank you.

ythuang0522 commented 2 years ago

I ran v0.3.2 of the program with 117.canu.0728.fasta and it works fine for me. Because there are three contigs in your FASTA, there will be three download processes, one for polishing each of the three contigs, just in case some of them are plasmids. I am not sure if this is your problem, but you don't need to worry about duplicate downloaded genomes.
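To see how many download rounds to expect, one can count the contigs in the input FASTA beforehand. The sketch below is only an illustration and not part of Homopolish; `contig_ids` is a hypothetical helper that lists the identifier of every `>` header line.

```python
def contig_ids(fasta_path):
    """Return the sequence identifiers (first word of each '>' header)."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                # An empty header line yields an empty identifier.
                ids.append(fields[0] if fields else "")
    return ids


# Usage:
#     ids = contig_ids("117.canu.0728.fasta")
#     print(len(ids), "contigs:", ids)
```

For a three-contig assembly such as this one, the list has three entries, which matches the three download stages described above.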

(homopolish) [ythuang@bip7 homopolish]$ python3 homopolish.py polish -a 117.canu.0728.fasta -s bacteria.msh -m R9.4.pkl -d -o 117
[2021/09/06 16:52] INFO: RUN-ID: tig00000001_pilon_pilon_pilon
tig00000001_pilon_pilon_pilon
/home/ythuang/homopolish/117/debug
[2021/09/06 16:52] INFO: Stage: Select closely-related genomes
TIME Select closely-related genomes: 0 MINS 11 SECS.
[2021/09/06 16:52] INFO: Stage: Download closely-related genomes
 INFO: 20 homologous sequence need to download: 
Downloaded GCF_009736665.1_ASM973666v1_genomic.fna.gz
Downloaded GCF_002843665.1_ASM284366v1_genomic.fna.gz
Downloaded GCF_009737885.1_ASM973788v1_genomic.fna.gz
Downloaded GCF_001514375.1_ASM151437v1_genomic.fna.gz
...
[2021/09/06 16:56] INFO: Stage: Homologous retrieval
TIME Homologous retrieval: 0 MINS 0 SECS.
[2021/09/06 16:56] INFO: Stage: Prediction
TIME Prediction: 0 MINS 0 SECS.
[2021/09/06 16:56] INFO: Stage: Polish
TIME Polish: 0 MINS 0 SECS.
TIME Total: 4 MINS 13 SECS.

(homopolish) [ythuang@bip7 homopolish]$ abyss-fac 117/117_homopolished.fasta 
n   n:500   L50 min N80 N50 N20 E-size  max sum name
3   3   1   16398   3975540 3975540 3975540 3880698 3975540 4074526 117/117_homopolished.fasta

LeeSongEpi2013 commented 2 years ago

All right. Thank you very much.
I'll try again.

Lee-song

LeeSongEpi2013 commented 2 years ago

Hi, Yao-Ting. After re-installing the latest version of Homopolish, my problem was solved. Thanks again. I assume it was due to some unknown bug in the previous version.

However, I noticed that after running Homopolish, the Q value did not increase significantly, only from 17.47 to 17.55. What are the possible reasons for this?

ythuang0522 commented 2 years ago

That’s unusually low. Was it basecalled with Guppy > 3.2? We usually see Q40 after Racon + Medaka and Q50 after Homopolish.

LeeSongEpi2013 commented 2 years ago

Yes, my Guppy version is >3.2. But the genome was polished before Homopolish using Racon+Pilon with 3 iterations, not Racon+Medaka. I will try Racon+Medaka+Homopolish. Another quick question: does the choice of reference genome have any effect on the Q value assessment? Because this strain has multiple reference genomes in NCBI, I used fastANI to select the most similar ones, based on sequence similarity, as the reference.

ythuang0522 commented 2 years ago

You should try that. The lack of ground truth in your case is the other main reason. The Q score is far more sensitive than fastANI: Q30 already corresponds to 99.9% identity, not to mention Q40 or Q50 genomes. Strain variation would reduce the Q score quite a lot. Without ground truth, you should use BUSCO or CheckM to evaluate the completeness of core genes in the genome instead of Q scores.
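The relation between Q score and identity can be made concrete with the standard Phred formula, Q = -10 * log10(error rate). The sketch below assumes the Q value is computed this way from the mismatch rate against the chosen reference; the function names are hypothetical and not taken from any particular evaluation tool.

```python
import math


def q_to_identity(q):
    """Convert a Phred-scaled Q score to fractional identity."""
    return 1.0 - 10.0 ** (-q / 10.0)


def identity_to_q(identity):
    """Convert fractional identity (0 <= identity < 1) to a Phred Q score."""
    if not 0.0 <= identity < 1.0:
        raise ValueError("identity must be in [0, 1)")
    return -10.0 * math.log10(1.0 - identity)


for q in (17.5, 30, 40, 50):
    print(f"Q{q}: {q_to_identity(q) * 100:.4f}% identity")
```

Under this assumption, Q17.5 corresponds to roughly 98.2% identity, so a reference from a different strain that differs by about 1.8% would cap the measured Q near 17.5 even if the assembly itself were error-free.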

LeeSongEpi2013 commented 2 years ago

OK, thank you. The BUSCO and CheckM results showed excellent completeness of the core genes, but the Q value is 17.5. Maybe strain variation reduces the Q score quite a lot. I used the Q value to see whether a genome polished with Homopolish's strategy could reach >Q40 without using next-generation sequencing data.
I am not sure whether my understanding is correct: can the Racon + Medaka + Homopolish genome be used directly for subsequent analysis?

ythuang0522 commented 2 years ago

Yes. The quality is sufficient for most of our analyses. If there are no further questions, I will close this issue.

LeeSongEpi2013 commented 2 years ago

OK, no more questions. Thank you