williamgilpin / pypdb

A Python API for the RCSB Protein Data Bank (PDB)
MIT License
309 stars 77 forks

Searching #13

Closed bhavaygg closed 3 years ago

bhavaygg commented 4 years ago

Hello,

Searching only returns 100 results. Is it possible to get all the results and download them?

williamgilpin commented 4 years ago

Hi @Chokerino, thanks for the heads-up. I will take a look at this today.

williamgilpin commented 4 years ago

Hi, can you please provide an example? When I run certain types of queries, I get more than 100 results:

found_pdbs = Query('6239', 'TreeEntityQuery').search()  # TaxID for C. elegans
print(len(found_pdbs))  # Returns 447

bhavaygg commented 4 years ago

Then it probably depends on the type of query. For example:

found_pdbs = Query('Protease bound with agonist').search()  # length is 100

The online search returns about 160k results for the same query.

williamgilpin commented 4 years ago

Thank you! It looks like the issue is the structure of the API. I tried out different parameter combinations here, and "Protease bound with agonist" is being incorrectly treated as a sequence of keywords rather than as a general search query.

This is probably fixable, but it requires determining if RCSB has exposed a method for doing a standard text search, like the default on their website. I'll see what I can find.

bonetwo2 commented 3 years ago

So this is still open, huh?
When I just search for the word "kinase" on https://www.rcsb.org/ I get 30991 hits. But when I do

import pypdb

found_pdbs = pypdb.Query('kinase').search()
print(len(found_pdbs))

I get 11622.

bonetwo2 commented 3 years ago

Here's my workaround with their newer API. The Legacy API you're using is not maintained and is going to be taken down in December.

import json
import urllib.parse
from urllib.request import urlopen

url = 'https://search.rcsb.org/rcsbsearch/v1/query'

json_query_string = '''
{
  "query": {
    "type": "terminal",
    "service": "text",
    "parameters": {
      "value": "kinase"
    }
  },
  "return_type": "entry"
}
'''

def basic_search(req_url, json_str, print_query=True, read_and_load=True):
    # Build the GET URL from the req_url argument (not the global `url`)
    query = urllib.parse.quote(json_str)
    url_query = req_url + '?json=' + query
    if print_query:
        print(url_query)
    response = urlopen(url_query)
    if read_and_load:
        # Parse the JSON response body into a dict
        return json.loads(response.read())
    return response

basic_search_results = basic_search(url, json_query_string)
print(basic_search_results['total_count'])

Output:

https://search.rcsb.org/rcsbsearch/v1/query?json=%0A%7B%0A%20%20%22query%22%3A%20%7B%0A%20%20%20%20%22type%22%3A%20%22terminal%22%2C%0A%20%20%20%20%22service%22%3A%20%22text%22%2C%0A%20%20%20%20%22parameters%22%3A%20%7B%0A%20%20%20%20%20%20%22value%22%3A%20%22kinase%22%0A%20%20%20%20%7D%0A%20%20%7D%2C%0A%20%20%22return_type%22%3A%20%22entry%22%0A%7D%0A
30991
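
The workaround above only reports total_count. To actually collect every hit, as asked in the original question, the query probably needs explicit paging. Below is a rough sketch of how that might look. Note the assumptions: I am assuming paging is controlled by a "request_options" -> "pager" object with "start" and "rows", and that each response carries a "result_set" list whose items have an "identifier" field. Neither is shown in this thread, so check both against the current RCSB Search API documentation. The helper names (build_query, build_url, page_starts, fetch_all_ids) are made up for illustration.

```python
import json
import urllib.parse
from urllib.request import urlopen

# Endpoint taken from the workaround above
SEARCH_URL = 'https://search.rcsb.org/rcsbsearch/v1/query'


def build_query(text, start=0, rows=100):
    """Build a full-text query dict that asks for one page of results."""
    return {
        "query": {
            "type": "terminal",
            "service": "text",
            "parameters": {"value": text},
        },
        # ASSUMPTION: paging lives under request_options.pager
        "request_options": {"pager": {"start": start, "rows": rows}},
        "return_type": "entry",
    }


def build_url(query, base_url=SEARCH_URL):
    """URL-encode the query dict into the ?json= parameter."""
    return base_url + '?json=' + urllib.parse.quote(json.dumps(query))


def page_starts(total_count, rows=100):
    """Offsets of every page needed to cover total_count hits."""
    return list(range(0, total_count, rows))


def fetch_all_ids(text, rows=100):
    """Fetch every page and collect the entry IDs (needs network access)."""
    first = json.loads(urlopen(build_url(build_query(text, 0, rows))).read())
    # ASSUMPTION: hits are listed under "result_set" with an "identifier" key
    ids = [hit["identifier"] for hit in first.get("result_set", [])]
    for start in page_starts(first["total_count"], rows)[1:]:
        page = json.loads(urlopen(build_url(build_query(text, start, rows))).read())
        ids.extend(hit["identifier"] for hit in page.get("result_set", []))
    return ids
```

Calling fetch_all_ids('kinase') would then issue one request per page. With tens of thousands of hits that is a lot of requests, so a larger rows value or a short pause between calls may be kinder to the server.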

williamgilpin commented 3 years ago

Thanks very much, @bonetwo2, for the comment and code. I was unable to find a fix with the old API, and I agree that I will need to do a major refactor soon anyway.

williamgilpin commented 3 years ago

@Chokerino the latest GitHub version should resolve this problem: the results should exactly match those returned by the online interface. It may be some time before we update the PyPI and conda versions to match the development version.

Please feel free to re-open this issue if you run into problems. Thank you.