DeepDrugDomain is a comprehensive Python toolkit aimed at simplifying and accelerating the process of drug-target interaction (DTI) and drug-target affinity (DTA) prediction using deep learning. With a flexible preprocessing pipeline and modular design, DeepDrugDomain supports innovative research and development in computational drug discovery.
DeepDrugDomain provides a suite of features designed to support researchers in computational drug discovery. By combining these capabilities, it aims to meet the current demands of drug discovery while remaining adaptable to future challenges.
For now, you can use the following environment for usage and development:
```shell
conda create --name deepdrugdomain python=3.11
conda activate deepdrugdomain
pip install dgl -f https://data.dgl.ai/wheels/repo.html
conda install -c conda-forge rdkit
pip install git+https://github.com/yazdanimehdi/deepdrugdomain.git
```
```python
import torch
from torch.utils.data import DataLoader

import deepdrugdomain as ddd
# ModelFactory and OptimizerFactory live inside the package; adjust the
# import paths below if your installed version lays out its modules differently.
from deepdrugdomain.models import ModelFactory
from deepdrugdomain.optimizers import OptimizerFactory

seed = 4  # used for the dataset split below

# Set device to GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = ModelFactory.create("attentionsitedti")

# Build the model's default preprocessing pipeline for the dataset columns
preprocesses = ddd.data.PreprocessingList(
    model.default_preprocess("SMILES", "pdb_id", "Label"))
dataset = ddd.data.DatasetFactory.create(
    "human", file_paths="data/human/", preprocesses=preprocesses)
datasets = dataset(split_method="random_split",
                   frac=[0.8, 0.1, 0.1], seed=seed, sample=0.1)

collate_fn = model.collate
data_loader_train = DataLoader(
    datasets[0], batch_size=64, shuffle=True, num_workers=0,
    pin_memory=True, drop_last=True, collate_fn=collate_fn)
data_loader_val = DataLoader(
    datasets[1], drop_last=False, batch_size=32,
    num_workers=4, pin_memory=False, collate_fn=collate_fn)
data_loader_test = DataLoader(
    datasets[2], drop_last=False, batch_size=32,
    num_workers=4, pin_memory=False, collate_fn=collate_fn)

criterion = torch.nn.BCELoss()
optimizer = OptimizerFactory.create(
    "adam", model.parameters(), lr=1e-3, weight_decay=0.0)
scheduler = None

model.to(device)

train_evaluator = ddd.metrics.Evaluator(["accuracy_score"], threshold=0.5)
test_evaluator = ddd.metrics.Evaluator(
    ["accuracy_score", "f1_score", "auc", "precision_score", "recall_score"],
    threshold=0.5)

epochs = 3000
accum_iter = 1

# Baseline evaluation before training
print(model.evaluate(data_loader_val, device, criterion, evaluator=test_evaluator))

for epoch in range(epochs):
    print(f"Epoch {epoch}:")
    model.train_one_epoch(data_loader_train, device, criterion, optimizer,
                          num_epochs=epochs, scheduler=scheduler,
                          evaluator=train_evaluator, grad_accum_steps=accum_iter)
    print(model.evaluate(data_loader_val, device, criterion, evaluator=test_evaluator))

print(model.evaluate(data_loader_test, device, criterion, evaluator=test_evaluator))
```
The example folder contains a collection of scripts and notebooks demonstrating various capabilities of DeepDrugDomain. Below is an overview of what each example covers:
The following table lists the preprocessing methods supported by the package, detailing the data conversion, settings options, and the models that use them:
Method | Converts From | Converts To | Settings Options | Used in Models |
---|---|---|---|---|
smiles_to_encoding | SMILES | Encoding Tensor | one_hot: bool, embedding_dim: Optional[int], max_sequence_length: Optional[int], replacement_dict: Dict[str, str], token_regex: Optional[str], from_set: Optional[Dict[str, int]] | DrugVQA, AttentionDTA |
smile_to_graph | SMILES | Graph | node_featurizer: Callable, edge_featurizer: Optional[Callable], consider_hydrogen: bool, fragment: bool, hops: int | AMMVF, AttentionSiteDTI, FragXsiteDTI, CSDTI |
smile_to_fingerprint | SMILES | Fingerprint | method: str, Refer to Supported Fingerprinting Methods table for detailed settings. | AMMVF |
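As an illustration of what `smiles_to_encoding` does conceptually, the sketch below maps each SMILES character to an integer index and pads to a fixed length. This is a simplified stand-in, not the library's implementation:

```python
def encode_smiles(smiles, vocab, max_len, pad_idx=0):
    """Map each SMILES character to an integer index, then pad/truncate."""
    encoded = [vocab.get(ch, pad_idx) for ch in smiles][:max_len]
    return encoded + [pad_idx] * (max_len - len(encoded))

# Tiny illustrative vocabulary; index 0 is reserved for padding/unknowns
vocab = {ch: i + 1 for i, ch in enumerate("CNO()=1c")}
print(encode_smiles("C1=CC=CC=C1", vocab, 16))
# -> [1, 7, 6, 1, 1, 6, 1, 1, 6, 1, 7, 0, 0, 0, 0, 0]
```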
For detailed information on fingerprinting methods, please see the Supported Fingerprinting Methods section.
Method Name | Description | Settings Options |
---|---|---|
RDKit | Converts SMILES to RDKit fingerprints, capturing molecular structure information. | radius: Optional[int], nBits: Optional[int] |
Morgan | Generates circular fingerprints, representing the environment of each atom in a molecule. | radius: Optional[int], nBits: Optional[int] |
Daylight | Traditional method to encode molecular features, focusing on specific substructure patterns. | nBits: Optional[int] |
ErG | Extended reduced graph-based approach, emphasizing molecular topology. | nBits: Optional[int], atom_dict: Optional[AtomDictType], bond_dict: Optional[BondDictType] |
RDKit2D | Two-dimensional variant of RDKit, detailing planar molecular structures. | nBits: Optional[int], atom_dict: Optional[AtomDictType], bond_dict: Optional[BondDictType] |
PubChem | Utilizes PubChem's approach to fingerprinting, highlighting unique chemical structures. | nBits: Optional[int] |
AMMVF | Custom fingerprinting method specific to the AMMVF model. | num_finger: Optional[int], fingerprint_dict: Optional[FingerprintDictType], edge_dict: Optional[Dict] |
Custom | Allows for user-defined fingerprinting techniques, adaptable to specific research requirements. | custom_fingerprint: Optional[Callable], consider_hydrogen: bool |
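All of these methods ultimately produce a fixed-length bit vector. The sketch below shows the general idea of folding hashed molecular features into `nBits` positions; it is a conceptual illustration only, not RDKit's or the Morgan algorithm's actual procedure:

```python
import hashlib

def fold_features(features, n_bits=16):
    """Fold a set of string features into a fixed-length bit vector by
    hashing each feature deterministically and setting the resulting bit."""
    bits = [0] * n_bits
    for feat in features:
        idx = int(hashlib.md5(feat.encode()).hexdigest(), 16) % n_bits
        bits[idx] = 1
    return bits

fp = fold_features(["C-C", "C=O", "C-N"], n_bits=16)
print(len(fp), sum(fp))
```

Note that folding is lossy: two different features can collide on the same bit, which is why larger `nBits` settings reduce collisions at the cost of longer vectors.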
Method | Converts From | Converts To | Settings Options | Used in Models |
---|---|---|---|---|
contact_map_from_pdb | PDB ID | Contact Map | pdb_path: str, method: str, distance_threshold: float, normalize_distance: bool | DrugVQA |
sequence_to_fingerprint | Protein Sequence | Fingerprint | method: str, Refer to Supported Protein Fingerprinting Methods for settings. | DrugVQA-Sequence |
kmers | Protein Sequence | Kmers Encoded Tensor | ngram: int, word_dict: Optional[dict], max_length: Optional[int] | AMMVF, CSDTI |
protein_pockets_to_dgl_graph | PDB ID | Binding Pocket Graph | pdb_path: str, protein_size_limit: int | AttentionSiteDTI, FragXsiteDTI |
word2vec | Protein Sequence | Word2Vec Tensor | model_path: str, vec_size: int, k: int, update_vocab: Optional[bool] | AMMVF |
sequence_to_one_hot | Protein Sequence | Encoding Tensor | amino_acids: str, max_sequence_length: Optional[int], one_hot: bool | AttentionDTA |
sequence_to_motif | Protein Sequence | Motif Tensor | ngram: int, word_dict: Optional[dict], max_length: Optional[int], one_hot: bool, number_of_combinations: Optional[int] | WideDTA |
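The `kmers` preprocessing splits a protein sequence into overlapping n-grams and maps each to an integer index from a word dictionary. A simplified pure-Python sketch of that idea (not the library's implementation):

```python
def kmer_encode(sequence, ngram, word_dict):
    """Split `sequence` into overlapping n-grams and look up each in
    `word_dict`, assigning a new index to any n-gram seen for the first time."""
    tokens = []
    for i in range(len(sequence) - ngram + 1):
        kmer = sequence[i:i + ngram]
        if kmer not in word_dict:
            word_dict[kmer] = len(word_dict)
        tokens.append(word_dict[kmer])
    return tokens

word_dict = {}
print(kmer_encode("MKVLAK", 3, word_dict))  # -> [0, 1, 2, 3]
print(sorted(word_dict))                    # -> ['KVL', 'LAK', 'MKV', 'VLA']
```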
For detailed information on protein fingerprinting methods, please see the Supported Protein Fingerprinting Methods section.
Method Name | Description | Settings Options |
---|---|---|
Quasi | A protein fingerprinting method that captures quasi-sequence information. | [] |
AAC | Encodes protein sequences based on amino acid composition. | [] |
PAAC | Generates pseudo amino acid composition fingerprints for proteins. | [] |
CT | A method focusing on the composition, transition, and distribution of amino acids in sequences. | [] |
Custom | Allows for user-defined protein fingerprinting techniques, adaptable to specific research needs. | custom settings as required |
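As an example of the simplest of these, amino acid composition (AAC) is just the normalized frequency of each residue in the sequence. A minimal sketch:

```python
def aac(sequence, amino_acids="ACDEFGHIKLMNPQRSTVWY"):
    """Amino acid composition: fraction of each standard residue in the sequence."""
    n = len(sequence)
    return [sequence.count(aa) / n for aa in amino_acids]

comp = aac("AACD")
print(comp[:4])  # A: 0.5, C: 0.25, D: 0.25, E: 0.0
```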
Method | Converts From | Converts To | Settings Options |
---|---|---|---|
interaction_to_binary | Binary | Binary Tensor | [] |
ic50_to_binary | IC50 | Binary | threshold: float |
Kd_to_binary | Kd | Binary | threshold: float |
value_to_log | Float | Log | [] |
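These label transforms are simple thresholding and log operations. A sketch of the general idea, with illustrative function bodies (the IC50 threshold unit and the log base are assumptions; the library's settings may differ):

```python
import math

def ic50_to_binary(ic50_nm, threshold=100.0):
    """Label a compound active (1) if its IC50 is below the threshold."""
    return 1 if ic50_nm < threshold else 0

def value_to_log(value):
    """Log-transform an affinity value (base 10 here; conventions vary)."""
    return math.log10(value)

print(ic50_to_binary(30.0), ic50_to_binary(500.0))  # 1 0
print(value_to_log(1000.0))                         # 3.0
```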
A `PreprocessingObject` is configured with the following parameters:

- `attribute`: the key or column name in the input dataset that contains the data to be preprocessed. For example, `attribute="SMILES"` applies the preprocessing to the "SMILES" column of the dataset.
- `from_dtype`: the data type or format of the input data before preprocessing. For example, `from_dtype="smile"` indicates that the input is in SMILES (Simplified Molecular Input Line Entry System) format, a textual representation of chemical structures.
- `to_dtype`: the desired data type or format after preprocessing. For example, `to_dtype="graph"` converts the input (here, SMILES strings) into a graph representation, which is widely used in molecular modeling and cheminformatics.
- `preprocessing_settings`: a dictionary of settings or options for the preprocessing step, allowing the process to be customized to the requirements of the model or the nature of the dataset.
- `in_memory` flag: controls whether the preprocessed data is stored entirely in the system's memory (RAM).
  - `True`: loads and stores the entire dataset in memory. This speeds up data retrieval during training but requires significant memory resources; ideal for datasets that fit comfortably in RAM.
  - `False`: the data is not stored in memory but is processed and loaded during training iterations. This approach is more memory-efficient and suitable for large datasets, but data access is slower.
- `online` flag: indicates whether preprocessing is performed in real time (online) or done once and stored.
  - `True`: preprocessing occurs in real time on each data access. This is beneficial for datasets requiring dynamic transformations during training.
  - `False`: the data is preprocessed once and stored in its final form. This is efficient for computationally expensive preprocessing steps on static datasets.

In DeepDrugDomain, `PreprocessingObject` can be configured with these flags to optimize data handling:
```python
import deepdrugdomain as ddd
from dgllife.utils import CanonicalAtomFeaturizer

feat = CanonicalAtomFeaturizer()
preprocess_drug = ddd.data.PreprocessingObject(
    attribute="SMILES", from_dtype="smile", to_dtype="graph",
    preprocessing_settings={"fragment": False, "node_featurizer": feat},
    in_memory=True, online=False)
```
DeepDrugDomain provides support for a variety of datasets, each tailored for specific use cases in drug discovery. The table below details the datasets available:
Dataset Name | Description | Use Case |
---|---|---|
Celegans | Consists of chemical-genetic interaction data in C. elegans organisms. | DTI |
Human | Encompasses human protein-target interaction datasets. | DTI |
DrugBankDTI | A comprehensive drug-target interaction dataset from DrugBank. | DTI |
Kiba | Combines kinase inhibitor bioactivity data across multiple sources. | DTA, DTI |
Davis | Focuses on kinase inhibitor target affinity profiles. | DTA, DTI |
IBM_BindingDB | Derived from BindingDB, focuses on binding affinity of drug-like molecules. | DTA, DTI |
BindingDB | Contains measured binding affinities for protein-ligand complexes. | DTA, DTI |
DrugTargetCommon | A curated set of drug-target interactions from various databases. | DTA, DTI |
All TDC Datasets | Includes all datasets from the Therapeutics Data Commons (TDC). | All drug discovery tasks |
All of the datasets listed above support the package's split methods. For example:
```python
import deepdrugdomain as ddd

# Define the PreprocessingObject list
preprocess = [...]
preprocesses = ddd.data.PreprocessingList(preprocess)

# Load the dataset and split it
dataset = ddd.data.DatasetFactory.create(
    "human", file_paths="data/human/", preprocesses=preprocesses)
datasets = dataset(split_method="random_split", frac=[0.8, 0.1, 0.1], seed=4)
```
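Under the hood, a random split of this kind shuffles the dataset indices with a fixed seed and cuts them by the given fractions. A minimal sketch of that logic (not DeepDrugDomain's actual implementation):

```python
import random

def random_split(n, frac, seed):
    """Shuffle indices 0..n-1 with a fixed seed and cut them by fractions."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    splits, start = [], 0
    for f in frac[:-1]:
        end = start + int(n * f)
        splits.append(indices[start:end])
        start = end
    splits.append(indices[start:])  # remainder goes to the last split
    return splits

train, val, test = random_split(100, [0.8, 0.1, 0.1], seed=4)
print(len(train), len(val), len(test))  # 80 10 10
```

Fixing the seed makes the split reproducible across runs, which matters when comparing models on the same benchmark.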
Disclaimer: This implementation of DeepDrugDomain is not an official version and may contain inaccuracies or differences compared to the original models. While efforts have been made to ensure reliability, the models provided may not perform at the same level as officially published versions and should be used with this understanding.
The following table lists the models supported by our package and the prediction tasks each model supports:
Model | Supported Tasks |
---|---|
AttentionSiteDTI | DTI, DTA |
FragXsiteDTI | DTI, DTA |
DrugVQA | DTI, DTA |
CSDTI | DTI, DTA |
AMMVF | DTI, DTA |
AttentionDTA | DTI, DTA |
DeepDTA | DTI, DTA |
WideDTA | DTI, DTA |
GraphDTA | DTI, DTA |
DGraphDTA | DTI, DTA |
Contribution: We are actively looking to add new models to the package. Feel free to add a model and submit a pull request!
For now, please read the docstrings inside each module for more information.
We welcome contributions to DeepDrugDomain! Please check out our Contribution Guidelines for more details on how to contribute.
```bibtex
@misc{my2024ddd,
  author = {Mehdi Yazdani-Jahromi},
  title = {From Data to Discovery: The DeepDrugDomain Framework for Predicting Drug-Target Interactions and Affinity},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.13974011},
  howpublished = {\url{https://github.com/yazdanimehdi/DeepDrugDomain}}
}
```