williamgilpin / pypdb

A Python API for the RCSB Protein Data Bank (PDB)
MIT License

[RCSB Search: Text Service] - Added full support #23

Closed (lacoperon closed 3 years ago)

lacoperon commented 3 years ago

Added Support for QueryGraph-Based Searches

The new API allows building search queries as boolean aggregations of search query nodes. I implemented this as perform_search_with_graph, within search_client.py.

Example Question

For example, you can ask "I would like structures with a resolution under 4 angstroms, published after 2019, solved using CryoEM, from either Homo sapiens or Mus musculus".

Example Syntax for Group Query

(This specific example asks for structures under 4 angstroms that are Homo sapiens or Mus musculus, but in principle you can nest groups as deeply as you like.)

I added this example to EXAMPLES.md.

from pypdb.clients.search.search_client import QueryNode, QueryGroup, perform_search_with_graph
from pypdb.clients.search.search_client import LogicalOperator, ReturnType, SearchService
from pypdb.clients.search.operators import text_operators

# QueryNode associated with structures with under 4 Angstroms of resolution
# ("under" means a LESS comparison against the resolution attribute)
under_4A_resolution_operator = text_operators.ComparisonOperator(
    value=4,
    attribute="rcsb_entry_info.resolution_combined",
    comparison_type=text_operators.ComparisonType.LESS)
under_4A_query_node = QueryNode(SearchService.TEXT,
                                under_4A_resolution_operator)

# QueryNode associated with entities containing 'Mus musculus' lineage
is_mus_operator = text_operators.ExactMatchOperator(
            value="Mus musculus",
            attribute="rcsb_entity_source_organism.taxonomy_lineage.name")
is_mus_query_node = QueryNode(SearchService.TEXT, is_mus_operator)

# QueryNode associated with entities containing 'Homo sapiens' lineage
is_human_operator = text_operators.ExactMatchOperator(
            value="Homo sapiens",
            attribute="rcsb_entity_source_organism.taxonomy_lineage.name")
is_human_query_node = QueryNode(SearchService.TEXT, is_human_operator)

# QueryGroup associated with being either human or `Mus musculus`
is_human_or_mus_group = QueryGroup(
    queries = [is_mus_query_node, is_human_query_node],
    logical_operator = LogicalOperator.OR
)

# QueryGroup associated with being ((Human OR Mus) AND (Under 4 Angstroms))
is_under_4A_and_human_or_mus_group = QueryGroup(
    queries = [is_human_or_mus_group, under_4A_query_node],
    logical_operator = LogicalOperator.AND
)

return_type = ReturnType.ENTRY

results = perform_search_with_graph(
  query_object=is_under_4A_and_human_or_mus_group,
  return_type=return_type)
print(results) # Huzzah
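
For readers wondering what the nested groups amount to on the wire, here is a minimal standalone sketch of how such a graph might serialize into the JSON shape used by the RCSB Search API (terminal nodes nested inside group nodes). The `Terminal` and `Group` classes are hypothetical stand-ins written for this illustration, not pypdb's own classes, and the operator strings ("less", "exact_match") are assumed from the RCSB Search API rather than taken from this PR.

```python
import json
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Terminal:
    """Hypothetical stand-in for a text-service QueryNode (not pypdb's class)."""
    attribute: str
    operator: str
    value: Union[str, int, float]

    def to_dict(self) -> dict:
        return {
            "type": "terminal",
            "service": "text",
            "parameters": {
                "attribute": self.attribute,
                "operator": self.operator,
                "value": self.value,
            },
        }


@dataclass
class Group:
    """Hypothetical stand-in for a QueryGroup (not pypdb's class)."""
    logical_operator: str  # "and" or "or"
    nodes: List[Union["Group", Terminal]]

    def to_dict(self) -> dict:
        # Groups serialize recursively, so they can nest to any depth.
        return {
            "type": "group",
            "logical_operator": self.logical_operator,
            "nodes": [node.to_dict() for node in self.nodes],
        }


LINEAGE = "rcsb_entity_source_organism.taxonomy_lineage.name"

under_4a = Terminal("rcsb_entry_info.resolution_combined", "less", 4)
is_mus = Terminal(LINEAGE, "exact_match", "Mus musculus")
is_human = Terminal(LINEAGE, "exact_match", "Homo sapiens")

# ((Mus OR Human) AND (under 4 Angstroms)), mirroring the example above
query = Group("and", [Group("or", [is_mus, is_human]), under_4a])

payload = {"query": query.to_dict(), "return_type": "entry"}
payload_json = json.dumps(payload, indent=2)
print(payload_json)
```

The recursion in `Group.to_dict` is the whole trick: a group is just a logical operator plus a list of children, each of which is either a terminal or another group.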

Bugfixes

Fixed a bug in which you couldn't correctly query for structure resolution (QueryNode values needed to support integers).
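
The fix itself isn't shown in this thread, so the following is only a sketch of why the value's type matters once a query is serialized to JSON: a numeric value and its string form produce different payloads. The `QueryValue` alias and `terminal_parameters` helper are hypothetical names for illustration, not part of pypdb.

```python
import json
from typing import Union

# Hypothetical alias: values a query node might carry.
QueryValue = Union[str, int, float]


def terminal_parameters(attribute: str, operator: str, value: QueryValue) -> dict:
    """Build the parameters dict for one terminal node, keeping the value's type."""
    return {"attribute": attribute, "operator": operator, "value": value}


as_number = json.dumps(
    terminal_parameters("rcsb_entry_info.resolution_combined", "less", 4))
as_string = json.dumps(
    terminal_parameters("rcsb_entry_info.resolution_combined", "less", "4"))

print(as_number)  # the value is emitted as a JSON number
print(as_string)  # the value is emitted as a JSON string
```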

Tests + mypy

All tests pass, and all files pass mypy type checking (run with: mypy --namespace-packages pypdb/clients/search/search_client_test.py, or another path).

williamgilpin commented 3 years ago

Thank you. I pushed a minor update adding clients to setup.py, as well as an empty __init__.py. I had thought that the latter was not necessary for Python 3, but it looks like it actually needs to be in each subdirectory to avoid that pesky ModuleNotFoundError.
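
A rough sketch of why the empty __init__.py matters at packaging time: tools that discover regular packages (setuptools' `find_packages` behaves this way) only pick up directories containing an __init__.py, and skip the whole subtree below a directory that lacks one. The stdlib-only function below approximates that rule; `find_regular_packages` is a hypothetical helper, and the directory layout is illustrative, inferred from the import paths in this PR rather than from the repository itself.

```python
import os
import tempfile
from typing import List


def find_regular_packages(root: str) -> List[str]:
    """List dotted names of directories under root that contain __init__.py.

    A directory without __init__.py is skipped together with its children.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root:
            continue
        if "__init__.py" not in filenames:
            dirnames[:] = []  # prune: do not descend into a non-package
            continue
        rel = os.path.relpath(dirpath, root)
        found.append(rel.replace(os.sep, "."))
    return sorted(found)


def touch(path: str) -> None:
    """Create an empty file at path."""
    with open(path, "w"):
        pass


with tempfile.TemporaryDirectory() as tmp:
    search_dir = os.path.join(tmp, "pypdb", "clients", "search")
    os.makedirs(search_dir)
    touch(os.path.join(tmp, "pypdb", "__init__.py"))
    touch(os.path.join(search_dir, "__init__.py"))

    # pypdb/clients has no __init__.py yet, so it and its children are missed.
    before = find_regular_packages(tmp)

    touch(os.path.join(tmp, "pypdb", "clients", "__init__.py"))
    after = find_regular_packages(tmp)

print(before)
print(after)
```

With the marker file missing from the middle directory, only the top-level package is discovered; adding it makes the subpackages visible too, which would be consistent with the ModuleNotFoundError described above.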

williamgilpin commented 3 years ago

I just had a chance to play around with these updates. Thank you, this is really amazing! I fixed a few minor bugs, and extended setup.py.

Eventually I will migrate all the legacy Query functions in pypdb.py to use perform_search under the hood; most of the functions in the older versions can be deprecated eventually.

Thanks again, this is really amazing!

lacoperon commented 3 years ago

Totally yeah, that was the hope (that we can use the new functions under-the-hood in the meantime).

I assume the other critical thing we're currently missing is the Data API, which is in the works (probably in the next week? IDK depends how motivated I feel).