williamgilpin / pypdb

A Python API for the RCSB Protein Data Bank (PDB)
MIT License
309 stars, 77 forks

Implemented FASTA fetching, autoresolving sequence type for `SequenceOperator` searches #29

Closed (lacoperon closed this 3 years ago)

lacoperon commented 3 years ago

Changes

This reimplements the `get_blast` functionality on top of the new API. I also reimplemented `get_blast` itself, which now emits a deprecation warning.

In principle, this PR should resolve #26.
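
For readers unfamiliar with the pattern, here is a minimal sketch of how a legacy function can keep working while warning callers that it is deprecated. The names `search_by_sequence` and `get_blast_compat` are hypothetical stand-ins for illustration, not the actual pypdb functions, and the stub performs no network request.

```python
import warnings
from typing import List


def search_by_sequence(sequence: str) -> List[str]:
    """Hypothetical stand-in for the new search-API code path."""
    # A real implementation would query the RCSB search service here;
    # this stub only echoes its input so the sketch stays self-contained.
    return [f"hit-for-{sequence}"]


def get_blast_compat(sequence: str) -> List[str]:
    """Hypothetical deprecated wrapper that forwards to the new code path."""
    warnings.warn(
        "get_blast_compat is deprecated; use search_by_sequence instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    return search_by_sequence(sequence)
```

Existing callers get the same return value as before, and the warning becomes visible when warnings are enabled, for example under `python -W default` or in a pytest run.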

Example Usage

Let's say I want to find any structures that are similar in sequence to the first polymer sequence in 6TML's FASTA file. I would do so using the following code:

from pypdb.clients.fasta.fasta_client import get_fasta_from_rcsb_entry
from pypdb.clients.search.search_client import perform_search
from pypdb.clients.search.search_client import SearchService, ReturnType
from pypdb.clients.search.operators.sequence_operators import SequenceOperator

# Fetches FASTA results from RCSB, as a list of `FastaSequence` objects.
fasta_sequence_list = get_fasta_from_rcsb_entry("6TML")
# Let's arbitrarily pick the first element in the list to search with
sequence_of_interest = fasta_sequence_list[0].sequence

# Performs sequence search ('BLAST'-like) using the FASTA sequence
results = perform_search(
    search_service=SearchService.SEQUENCE,
    return_type=ReturnType.ENTRY,
    search_operator=SequenceOperator(
        sequence=sequence_of_interest,
        identity_cutoff=0.99,
        evalue_cutoff=1000,
        # Note: the search SequenceType is autoresolved
        # (this fails with ambiguous sequences like "AAAAA").
    ),
    return_with_scores=True
)

print(results)
# [ScoredResult(entity_id='6TMK', score=1.0), ScoredResult(entity_id='6TML', score=1.0), ScoredResult(entity_id='6TMJ', score=1.0), ScoredResult(entity_id='6TMG', score=1.0)]
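
To illustrate why a sequence like "AAAAA" cannot be autoresolved, here is a rough sketch of one way a sequence type could be guessed from its alphabet. This is an assumption for illustration, not the implementation in this PR: `GuessedSequenceType` and `guess_sequence_type` are hypothetical names, and the heuristic simply treats anything outside the nucleotide alphabets as protein.

```python
from enum import Enum


class GuessedSequenceType(Enum):
    """Hypothetical enum; the type names used by pypdb may differ."""
    DNA = "dna"
    RNA = "rna"
    PROTEIN = "protein"


DNA_LETTERS = set("ACGT")
RNA_LETTERS = set("ACGU")


def guess_sequence_type(sequence: str) -> GuessedSequenceType:
    """Guess the sequence type from the set of letters it contains."""
    letters = set(sequence.upper())
    if not letters:
        raise ValueError("Cannot resolve the type of an empty sequence.")

    fits_dna = letters <= DNA_LETTERS
    fits_rna = letters <= RNA_LETTERS
    if fits_dna and fits_rna:
        # Only A, C, or G present: valid as DNA, RNA, and protein alike,
        # so there is no basis for choosing (for example "AAAAA").
        raise ValueError(f"Ambiguous sequence type for {sequence!r}.")
    if fits_dna:
        return GuessedSequenceType.DNA
    if fits_rna:
        return GuessedSequenceType.RNA
    # Heuristic fallback: any other letters are assumed to be amino acids.
    return GuessedSequenceType.PROTEIN
```

Under this heuristic a short peptide made only of A, C, G, and T would be misread as DNA, which is one reason a caller might prefer to pass the sequence type explicitly when it is known.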

Tests + mypy

Tests pass with `pytest`. Type checking passes with `mypy --namespace-packages pypdb/path/to/file.py` for every changed file.