williamgilpin / pypdb

A Python API for the RCSB Protein Data Bank (PDB)
MIT License

Search by taxonomy #11

Closed mirix closed 3 years ago

mirix commented 4 years ago

Is there a way to retrieve a list of PDBs for a given organism?

Similar to PDB's internal "Source Organism Taxonomy Name".

williamgilpin commented 4 years ago

Hello, not currently, but I think it is possible to implement this, in principle. Looking at the API, I see an option "ExpressionOrganismQuery". Does this sound like the right query type for this purpose?

mirix commented 4 years ago

Thanks, William. Not really. There are two different things here: the source organism and the expression organism.

The source organism is the one whose DNA actually encodes the protein.

The expression organism is the one in which the protein was expressed.

For instance, you can genetically modify a bacterium to express a human protein. The bacterium would be the expression organism, but Homo sapiens would be the source organism.

williamgilpin commented 4 years ago

Thank you, I understand now. I looked through the API here, and I can't find anything that looks like what we want. There's a drop-down list of possible XML queries at the bottom of the page, and (as far as I can tell) none of them are what we want.

For now, I would suggest a keyword search. It's a bit crude, but if there isn't an existing query type for this, then this might take a bit longer to add.

mirix commented 4 years ago

Thanks again William.

I have tried this:

from pypdb import *

covid_pdb = Query('SARS-CoV-2').search()

plist = []
for pdb in covid_pdb:
    taxa = pypdb.list_taxa(pdb)
    if taxa == 'Severe acute respiratory syndrome-related coronavirus':
        plist.append([pdb])

But it is extremely slow and ends up like this:

/home/qrl/.conda/envs/coronavirus/lib/python3.6/site-packages/pypdb/pypdb.py:409: UserWarning: Retrieval failed, returning None
  warnings.warn("Retrieval failed, returning None")
Traceback (most recent call last):
  File "chembl_search.py", line 67, in <module>
    taxa = pypdb.list_taxa(pdb)
  File "/home/qrl/.conda/envs/coronavirus/lib/python3.6/site-packages/pypdb/pypdb.py", line 1377, in list_taxa
    all_info = get_all_info(pdb_id)
  File "/home/qrl/.conda/envs/coronavirus/lib/python3.6/site-packages/pypdb/pypdb.py", line 527, in get_all_info
    out = to_dict( get_info(pdb_id) )['molDescription']['structureId']
TypeError: 'NoneType' object is not subscriptable

However, it works if given a list as opposed to individual PDB IDs.
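The difference between passing a list and a bare ID suggests the function loops over its argument, so a string would be walked character by character. A minimal sketch of a more forgiving loop under that assumption: `filter_by_taxon` and its `lookup` parameter are hypothetical names, not part of pypdb, and `lookup` stands in for a call such as `list_taxa` that is assumed to take a list of IDs and return a list of taxon names.

```python
def filter_by_taxon(pdb_ids, target, lookup):
    """Return (matches, failed) for the given PDB IDs.

    `lookup` is a hypothetical stand-in for a taxonomy call; it is assumed
    to accept a list of IDs and return a list of taxon names.
    """
    matches, failed = [], []
    for pdb_id in pdb_ids:
        try:
            # Wrap the ID in a one-element list rather than passing a bare string
            taxa = lookup([pdb_id])
        except (TypeError, KeyError):
            # A failed retrieval shows up as a TypeError in the traceback above
            failed.append(pdb_id)
            continue
        if taxa is None:
            failed.append(pdb_id)
        elif target in taxa:
            # Membership test, because the lookup is assumed to return a list
            matches.append(pdb_id)
    return matches, failed
```

Collecting the failed IDs separately means one bad retrieval does not abort the whole run, and the failures can be retried later.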

mirix commented 4 years ago

The following works:

import requests

if __name__ == '__main__':

    url = 'http://www.rcsb.org/pdb/rest/search'

    query_text = """
<?xml version="1.0" encoding="UTF-8"?>

<orgPdbCompositeQuery version="1.0">
 <queryRefinement>
  <queryRefinementLevel>0</queryRefinementLevel>
  <orgPdbQuery>
    <version>head</version>
    <queryType>org.pdb.query.simple.OrganismQuery</queryType>
    <description>Organism Search: Organism Name=Severe acute respiratory syndrome coronavirus 2 </description>
    <organismName>Severe acute respiratory syndrome coronavirus 2</organismName>
  </orgPdbQuery>
 </queryRefinement>
   <queryRefinement>
    <queryRefinementLevel>1</queryRefinementLevel>
    <conjunctionType>and</conjunctionType>
    <orgPdbQuery>
     <version>head</version>
     <queryType>org.pdb.query.simple.ResolutionQuery</queryType>
     <description>Resolution is between 0.0 and 4.0 </description>
     <refine.ls_d_res_high.comparator>between</refine.ls_d_res_high.comparator>
     <refine.ls_d_res_high.min>0.0</refine.ls_d_res_high.min>
     <refine.ls_d_res_high.max>4.0</refine.ls_d_res_high.max>
    </orgPdbQuery>
 </queryRefinement>
</orgPdbCompositeQuery>

"""
    header = {'Content-Type': 'application/x-www-form-urlencoded'}

    response = requests.post(url, data=query_text, headers=header)

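    # Start with an empty list so the final print works even if the request fails
    pdbids = []
    if response.status_code == 200:
        # The response body lists one PDB ID per line
        for pdbid in response.text.splitlines():
            pdbids.append([pdbid])
    else:
        print("Failed to retrieve results")

    print(pdbids)
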
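
To reuse the query above for other organisms, the XML could be generated from parameters instead of being hard-coded. This is only a sketch: `build_organism_query` and `parse_pdb_ids` are hypothetical helper names, the XML layout is copied from the query above, and the response is assumed to be one ID per line as in that script.

```python
from xml.sax.saxutils import escape


def build_organism_query(organism, res_min=0.0, res_max=4.0):
    """Build the composite organism + resolution query as an XML string."""
    name = escape(organism)  # protect characters such as & and < in the name
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<orgPdbCompositeQuery version="1.0">
 <queryRefinement>
  <queryRefinementLevel>0</queryRefinementLevel>
  <orgPdbQuery>
    <version>head</version>
    <queryType>org.pdb.query.simple.OrganismQuery</queryType>
    <description>Organism Search: Organism Name={name}</description>
    <organismName>{name}</organismName>
  </orgPdbQuery>
 </queryRefinement>
 <queryRefinement>
  <queryRefinementLevel>1</queryRefinementLevel>
  <conjunctionType>and</conjunctionType>
  <orgPdbQuery>
    <version>head</version>
    <queryType>org.pdb.query.simple.ResolutionQuery</queryType>
    <description>Resolution is between {res_min} and {res_max}</description>
    <refine.ls_d_res_high.comparator>between</refine.ls_d_res_high.comparator>
    <refine.ls_d_res_high.min>{res_min}</refine.ls_d_res_high.min>
    <refine.ls_d_res_high.max>{res_max}</refine.ls_d_res_high.max>
  </orgPdbQuery>
 </queryRefinement>
</orgPdbCompositeQuery>"""


def parse_pdb_ids(body):
    """Split a newline-delimited response body into a flat list of IDs."""
    return [line.strip() for line in body.splitlines() if line.strip()]
```

The string returned by `build_organism_query` can be passed as `data=` to the same `requests.post` call, and `parse_pdb_ids(response.text)` gives a flat list of strings rather than a list of one-element lists.
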
williamgilpin commented 4 years ago

It looks like the RCSB has a new API, so I will need to overhaul the whole package in the near future. I will add a taxonomy search at that time, since this looks like it will be tricky to solve using the legacy API. Apologies for the delay; I'll post again when I have an update I'm happy with.
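
As a rough illustration of what a source-organism lookup might look like as a JSON request body for the newer search service, here is a sketch that only builds the payload and sends nothing. The attribute names, the `exact_match` operator, and the overall payload layout are assumptions that do not come from this thread and should be checked against the current RCSB Search API documentation; `organism_payload` is a hypothetical helper.

```python
import json

# Assumed attribute names; verify against the current RCSB search schema
SOURCE_ATTR = "rcsb_entity_source_organism.ncbi_scientific_name"
HOST_ATTR = "rcsb_entity_host_organism.ncbi_scientific_name"


def organism_payload(name, expression_host=False):
    """Build a JSON-serializable query for entries matching an organism name.

    By default the source organism is searched; set expression_host=True
    to search the organism in which the protein was expressed instead.
    """
    attribute = HOST_ATTR if expression_host else SOURCE_ATTR
    return {
        "query": {
            "type": "terminal",
            "service": "text",
            "parameters": {
                "attribute": attribute,
                "operator": "exact_match",
                "value": name,
            },
        },
        "return_type": "entry",
    }


print(json.dumps(organism_payload("Homo sapiens"), indent=2))
```

Keeping the source and expression attributes as separate constants makes the distinction discussed earlier in this thread explicit in the query itself.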

williamgilpin commented 4 years ago

I just pushed a minor update that allows searching organisms by their NCBI TaxID. I tried the COVID-19 TaxID and nothing came up, so I would guess that I'll need to implement a CompositeQuery class like the one in the example you provided.

For other source organisms, however, the following will work:

found_pdbs = Query('6239', 'TreeEntityQuery').search()  # TaxID for C. elegans
print(found_pdbs[:5])

williamgilpin commented 3 years ago

@mirix This is now fixed in the latest GitHub version; it might not make it into the PyPI or conda versions for a while, but do let me know if you really need it in one of those:

q = Query("Dicty", query_type="OrganismQuery")
print(q.search()[:10])
# Returns ['1B4S', '1B99', '1BUX', '1C0F', '1C0G', '1D0X', '1D0Y', '1D0Z', '1D1A', '1D1B']