ythuang0522 / homopolish

High-quality Nanopore-only genome polisher
GNU General Public License v3.0

Sequence download fails (Too Many Requests) #55

Open yari-iw opened 2 years ago

yari-iw commented 2 years ago

Hi, I'm using homopolish (polish mode) in a pipeline and I noticed that some of the results I was getting were not reproducible. The logs helped me identify that the problem comes from the sequence download step:

command: python3 homopolish.py polish -t 12 -a $assembly -s $homopolish_db -m R9.4.pkl -o .

logs:

...
[2022/08/23 14:04] INFO: Stage: Select closely-related genomes
TIME Select closely-related genomes: 0 MINS 3 SECS.
[2022/08/23 14:04] INFO: Stage: Download closely-related genomes
 INFO: 20 homologous sequence need to download:
Downloaded NZ_CP021908.1
Downloaded NZ_CP021906.1
Downloaded NZ_CP021906.1
429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NZ_CP021908.1&rettype=fasta
429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NZ_CP009362.1&rettype=fasta
429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NZ_CP011527.1&rettype=fasta
Downloaded NZ_CP035102.1
Downloaded NC_013322.1
Downloaded NZ_AP014943.1
429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NZ_CP028469.1&rettype=fasta
429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NZ_CP028471.1&rettype=fasta
429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_018965.1&rettype=fasta
Downloaded NC_019009.1
Downloaded NC_013340.1
Downloaded NZ_CP026065.1
429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_013292.1&rettype=fasta
429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NZ_CP029079.1&rettype=fasta
Downloaded NZ_CP029199.1
Downloaded NC_003140.1
Downloaded NC_021552.1
...

The error comes from the fact that, by default, homopolish downloads sequences in batches of 3, and this seems to trigger the server's request limit (the 429 responses above). Changing the variable max_pool_size (in the download.py script) from 3 to 1 lets all sequences download correctly.
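To illustrate the workaround, here is a minimal sketch of a download loop that caps the number of simultaneous requests and retries when the server answers 429. It is not the actual code from download.py: only the name max_pool_size comes from that script, while `TooManyRequests`, `fetch_with_retry`, `download_all`, and the injected `fetch` callable are hypothetical names chosen for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class TooManyRequests(Exception):
    """Hypothetical error that `fetch` raises when the server answers HTTP 429."""


def fetch_with_retry(seq_id, fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(seq_id); on a 429, wait and retry with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(seq_id)
        except TooManyRequests:
            if attempt == max_retries:
                # Give up loudly instead of silently skipping the sequence.
                raise
            sleep(base_delay * 2 ** attempt)


def download_all(seq_ids, fetch, max_pool_size=1, **retry_kwargs):
    """Download every ID with at most max_pool_size concurrent requests.

    seq_ids must be a list (it is iterated twice); fetch is any callable
    that returns the sequence for one ID or raises TooManyRequests.
    """
    with ThreadPoolExecutor(max_workers=max_pool_size) as pool:
        results = pool.map(
            lambda seq_id: fetch_with_retry(seq_id, fetch, **retry_kwargs),
            seq_ids,
        )
        # Consuming the iterator re-raises any download error here.
        return dict(zip(seq_ids, results))
```

Because a failure that survives all retries is re-raised, a run with missing sequences stops with an error rather than continuing with a smaller set of genomes, which is what made the results hard to reproduce.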

As this can be a problem for reproducibility (and can go unnoticed until proper testing is performed), would it be possible to add an option to manually set the number of requests, or to lower the number of requests by default?
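For illustration, such an option could look roughly like the sketch below. The flag name `--download-threads`, the parser name, and the helper `positive_int` are assumptions made for this example; homopolish does not necessarily expose anything under these names.

```python
import argparse


def positive_int(text):
    """Parse a command-line value as an integer >= 1 (hypothetical helper)."""
    value = int(text)
    if value < 1:
        raise argparse.ArgumentTypeError("must be at least 1")
    return value


def build_parser():
    """Build a parser with a hypothetical option for the download pool size."""
    parser = argparse.ArgumentParser(prog="download_sketch")
    parser.add_argument(
        "--download-threads",  # assumed flag name, not an existing homopolish option
        type=positive_int,
        default=1,  # conservative default: one request at a time
        help="number of sequences to download simultaneously",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("max_pool_size would be set to", args.download_threads)
```

The parsed value would then be passed to whatever currently reads max_pool_size, so users behind stricter limits can keep the default of 1 and others can raise it.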

I'm using the latest version of homopolish, cloned from GitHub earlier today (which I suppose is v0.4), but the --version option reports: Homopolish VERSION: 0.3.4

ythuang0522 commented 2 years ago

Thank you for reporting this issue. We have seen the same errors before, but they were not easily reproducible from our servers. We suspected this might be due to firewall protection or NCBI's load policy. We will test again and will very likely lower the default number of download threads from 3 to 1 to comply with their protection policy. The option will be added at that time. We also forgot to update the version number in the code and will fix both together. Thanks again for your helpful feedback.

yari-iw commented 2 years ago

Hi @ythuang0522, thank you for your quick answer.