Can I use genbank all bacterial complete genome to make kmer database?

xa6xa6 / metaOthello


Can I use genbank all bacterial complete genome to make kmer database? #2

Open alienzj opened 7 years ago

alienzj commented 7 years ago

Hello, I have encountered some tricky problems.

1. I can't download your k-mer database from Google Drive in a terminal on our HPC cluster. I tried several Google Drive command-line download tools, but all of them failed. Could you provide a direct download URL so that I can fetch it with wget or curl?
2. After that, I decided to build a k-mer database from all complete bacterial genomes in GenBank (not RefSeq). GenBank currently has nearly eight thousand complete bacterial genomes, whereas RefSeq has fewer than three thousand complete bacterial reference genomes.
3. Do the following files need to be regenerated for a custom database? (1) the bacterial reference sequence taxonomy info file, (2) the bacterial speciesId2taxoInfo_file, (3) the NCBI names file. Could you provide a tool for generating these files?

I am very much looking forward to the upcoming great tool! Thank you very much!

boulund commented 7 years ago

@alienzj, Regarding your first point, I experienced the same issue. However, I did some googling and found a Stack Overflow post where someone posted a Python script that makes it possible to download files shared from Google Drive. I made an improved version of that script here: https://bitbucket.org/boulund/scripts/raw/834e64a7d5b35dae154378e34fa2f185537955ed/download_from_gdrive.py
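
For readers who cannot open the linked script, the general approach described in the Stack Overflow thread can be sketched as below. This is a minimal standard-library sketch, not boulund's script itself. It assumes Google Drive still serves large shared files from `https://docs.google.com/uc?export=download` and sets a cookie whose name starts with `download_warning` when it wants a confirmation; that behaviour may have changed since the thread was written. The function names `find_confirm_token` and `download_from_gdrive` are hypothetical.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Assumed endpoint, taken from the approach in the Stack Overflow thread.
BASE_URL = "https://docs.google.com/uc?export=download"


def find_confirm_token(cookies):
    """Return the value of the first cookie named 'download_warning*', or None.

    `cookies` is any iterable of (name, value) pairs.
    """
    for name, value in cookies:
        if name.startswith("download_warning"):
            return value
    return None


def download_from_gdrive(file_id, destination, chunk_size=32768):
    """Download a shared Google Drive file to `destination` (hypothetical helper)."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    url = BASE_URL + "&" + urllib.parse.urlencode({"id": file_id})

    response = opener.open(url)
    token = find_confirm_token((c.name, c.value) for c in jar)
    if token is not None:
        # Large files trigger a "cannot scan for viruses" page; repeat the
        # request with the confirmation token to get the actual file.
        response.close()
        confirm = urllib.parse.urlencode({"confirm": token})
        response = opener.open(url + "&" + confirm)

    # Stream to disk in chunks so large databases do not have to fit in memory.
    with response, open(destination, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
```

Calling `download_from_gdrive(file_id, "database.tar.gz")` with the ID taken from the sharing link would then write the file locally; both arguments here are placeholders.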

As far as I can tell, you shouldn't have to redo the NCBI names file (it looks like a slightly modified version of the official NCBI Taxonomy dump file, only including the scientific names of all taxonomic nodes). Haven't looked into the other files in any detail yet, so I'll let someone else answer that.
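
If the names file ever does need to be rebuilt, the filtering described above (keeping only the scientific name of each taxonomic node) can be sketched against the official NCBI Taxonomy `names.dmp`, whose rows have the fields tax_id, name, unique name and name class, separated by a tab, a pipe and a tab. The exact layout MetaOthello expects for its names file is not stated in this thread, so the tab-separated output written by the hypothetical `write_names_file` helper is an assumption.

```python
def scientific_names(lines):
    """Yield (tax_id, name) for names.dmp rows whose class is 'scientific name'."""
    for line in lines:
        line = line.rstrip("\n")
        # Each row ends with a trailing "\t|" terminator; drop it before splitting.
        if line.endswith("\t|"):
            line = line[:-2]
        fields = line.split("\t|\t")
        if len(fields) >= 4 and fields[3] == "scientific name":
            yield fields[0], fields[1]


def write_names_file(names_dmp_path, output_path):
    """Write one 'tax_id<TAB>name' line per node (assumed output layout)."""
    with open(names_dmp_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for tax_id, name in scientific_names(src):
            dst.write(f"{tax_id}\t{name}\n")
```

Synonyms, common names and other name classes are skipped, so each taxonomic node contributes a single line.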

alienzj commented 7 years ago

@boulund Thanks very much! The script you provided is a great tool that made it easy for me to download the file. Thank you also for pointing out that the NCBI names file does not need to be regenerated.

Goohoo commented 7 years ago

@boulund Thanks for providing this wonderful script. @alienzj Sorry for the late reply, and thanks for your interest in MetaOthello. We will release a tool/script that generates all the needed files automatically in the next version.

boulund commented 7 years ago

@alienzj @Goohoo You really should be thanking the original authors of the Stack Overflow post (https://stackoverflow.com/questions/25010369/wget-curl-large-file-from-google-drive). I merely made some small adaptations to make it friendlier to use on the command line, but I agree it's a nice script! Apparently there is also a way to share files on Google Drive so that they can be accessed directly (by making them entirely public somehow). I don't know the details, but I think it's mentioned in the SO thread.