yatisht / usher

Ultrafast Sample Placement on Existing Trees

Adding Genbase sequences to usher #337

Closed by aviczhl2 1 year ago

aviczhl2 commented 1 year ago

Starting in March 2023, China began uploading most of its sequences not to GISAID but to a self-developed platform called GenBase. The sequences are free to download from NGDC by selecting the "GenBase" dataset.

I wonder whether the sequences in GenBase could be included in the UShER database.

russcd commented 1 year ago

@AngieHinrichs can you take a look? Just glancing at the database now, it looks relatively straightforward.

Two considerations that we should look into:

  1. How much data is unique to GenBase?
  2. Are any of these data posted elsewhere and therefore would require additional deduplication efforts?

AngieHinrichs commented 1 year ago

Yes, it does look really straightforward and it's easy to form a URL to download metadata for all GenBase sequences. However, at the moment I am not able to download any sequences from the website; even if I select only one or two sequences, I'm getting an empty file. I will try to contact the operators of ngdc.cncb.ac.cn.

aviczhl2 commented 1 year ago

It seems there was a bug yesterday that has been fixed today.

However, there seems to be an upper limit of 2000 sequences per download.

AngieHinrichs commented 1 year ago

Yes, manual download with a limit of 2000 is working for me too today. I will download sequences that way for now, but I hope there is an automated solution. The site has some download files but they are either outdated (2022 & earlier) or mostly GenBank with very few GenBase sequences, as far as I can tell. I emailed the Contact addresses for the search page and for GenBase asking if there could be compressed fasta downloads or an API to query sequences.

aviczhl2 commented 1 year ago

The system works very badly, and I can't find an API either.

Select "GenBase" in the database option to show all GenBase sequences. Sequences that have also been submitted to other platforms have a "related_ID" giving their ID on GISAID or GenBank, so if you sort by related_ID and exclude sequences with any related_ID other than None, you get the sequences unique to GenBase.
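The related_ID filter described above could be sketched in Python roughly as follows. This is only an illustration: it assumes the metadata is exported as tab-separated text, and the column names `accession` and `related_ID`, the set of "empty" values, and the sample accessions are all placeholders rather than the site's actual export format.

```python
import csv
import io

# Hypothetical column names; the real export from the NGDC site may differ.
ACCESSION_COL = "accession"
RELATED_COL = "related_ID"

# Values treated as "no related ID" (an assumption about the export).
EMPTY_VALUES = {"", "none", "null", "-"}


def unique_genbase_rows(tsv_text):
    """Return rows whose related-ID field is empty, i.e. GenBase-only records."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if (row.get(RELATED_COL) or "").strip().lower() in EMPTY_VALUES
    ]


# Made-up sample rows, for illustration only.
sample = (
    "accession\trelated_ID\n"
    "C_AA000001.1\tNone\n"
    "C_AA000002.1\tEPI_ISL_0000000\n"
    "C_AA000003.1\t\n"
)
unique = unique_genbase_rows(sample)
print([row[ACCESSION_COL] for row in unique])
```

Filtering on the field value rather than sorting by hand in the web interface gives the same result and can be repeated on every download.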

However, there still seems to be no way to query the "create date" of sequences, only a "view the latest data" option that shows the sequences with the most recent create date.

After the initial build, either download the "view the latest data" results daily, or download all GenBase sequences weekly and de-duplicate against the previous week's result. I guess these two are the best options in the current situation...
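The weekly de-duplication step amounts to a set difference between two snapshots of accession lists. A minimal sketch, assuming each snapshot has already been reduced to a list of accession strings (the function name and the sample accessions are made up for illustration):

```python
def new_accessions(previous, current):
    """Return accessions present in `current` but not in `previous`, sorted."""
    return sorted(set(current) - set(previous))


# Made-up snapshots, for illustration only.
last_week = ["C_AA000001.1", "C_AA000002.1"]
this_week = ["C_AA000001.1", "C_AA000002.1", "C_AA000003.1"]
print(new_accessions(last_week, this_week))
```

Because the comparison uses full accession strings, a new version of an existing record (a changed suffix such as `.2`) would also be reported as new under this sketch.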

AngieHinrichs commented 1 year ago

I emailed the contact listed for GenBase on https://ngdc.cncb.ac.cn/databasecommons/database/id/8197 and he replied that there is an API to fetch one GenBase sequence at a time (e.g. https://ngdc.cncb.ac.cn/genbase/api/file/fasta?acc=C_AA004835.1). So I wrote a script (in production for the first time in today's build) that fetches metadata for all sequences in GenBase, GWH, CNGBdb and NMDC from CNCB, compares it to the previous day's metadata, and fetches new GenBase sequences one at a time with delays so I don't DoS the server. I updated my script that combines sequences from all sources, identifies sequences not already in the tree that pass quality filters, and makes input for UShER, to also check for new CNCB sequences. The deduplication could still use a little work -- there are some sequences in both GISAID and GenBase that may appear twice in the tree.
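For readers who want to try something similar, the one-at-a-time fetching with delays might look roughly like the sketch below. It is not the script used in the build. The endpoint pattern is taken from the example URL above; the function names, the 5-second default delay, and the injectable `get` and `sleep` parameters are assumptions made for illustration and testing.

```python
import time
import urllib.parse
import urllib.request

# Endpoint pattern taken from the example URL quoted in the comment above.
FASTA_API = "https://ngdc.cncb.ac.cn/genbase/api/file/fasta"


def fasta_url(accession):
    """Build the single-sequence FASTA URL for one GenBase accession."""
    return FASTA_API + "?" + urllib.parse.urlencode({"acc": accession})


def http_get(url):
    """Default downloader: return the response body as text."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read().decode("utf-8")


def fetch_new_sequences(accessions, delay_seconds=5.0, get=http_get, sleep=time.sleep):
    """Fetch sequences one at a time, pausing between requests.

    The default delay is an arbitrary placeholder, not a documented rate limit.
    """
    records = {}
    for i, acc in enumerate(accessions):
        if i > 0:
            sleep(delay_seconds)  # pause between requests, not before the first
        records[acc] = get(fasta_url(acc))
    return records


# Dry run with a stub downloader so nothing is sent to the server.
demo = fetch_new_sequences(
    ["C_AA004835.1"], get=lambda url: ">stub\n", sleep=lambda s: None
)
print(list(demo))
```

Passing the downloader and sleep function as parameters keeps the loop testable offline; calling `fetch_new_sequences(accessions)` with the defaults would issue real requests.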

If all goes well, then hopefully by tomorrow the 2023-04-13 tree will be available including GenBase sequences.