zchwang / DeepRMSD-Vina_Optimization

DeepRMSD+Vina is a computational framework that integrates ligand binding pose optimization and screening.

DeepRMSD+Vina

This algorithm is based on deep learning and a classical scoring function (Vina score) and is designed to optimize ligand conformations.

Install

pytorch >= 1.10
conda install -c conda-forge spyrmsd
conda install numpy pandas

Contact

Liangzhen Zheng, Shanghai Zelixir Biotech Company Ltd, astrozheng@gmail.com

Zechen Wang, Shandong University, wangzch97@163.com

Citation

If you find our scripts useful, please consider citing the following paper:

@article{wang2023fully,
  title={A fully differentiable ligand pose optimization framework guided by deep learning and a traditional scoring function},
  author={Wang, Zechen and Zheng, Liangzhen and Wang, Sheng and Lin, Mingzhi and Wang, Zhihao and Kong, Adams Wai-Kin and Mu, Yuguang and Wei, Yanjie and Li, Weifeng},
  journal={Briefings in Bioinformatics},
  volume={24},
  number={1},
  pages={bbac520},
  year={2023},
  publisher={Oxford University Press}
}

Run optimization

1. Prepare structural files for proteins and ligands.

The algorithm optimizes multiple poses of a ligand simultaneously. All poses must be generated by the same docking program and placed in the same directory in PDBQT format. The PDBQT files for proteins and ligands can be generated with MGLTools, for example:

pythonsh prepare_receptor4.py -r protein.pdb -U lps -o protein.pdbqt
pythonsh prepare_ligand4.py -l ligand.mol2 -U lps -o ligand.pdbqt

2. Prepare the input file. Each line contains a PDB code, the protein PDBQT file, and the directory where the ligand poses (PDBQT files) are located, separated by spaces.

The content of the input file is as follows:

1gpn ./samples/1gpn/1gpn_protein_atom_noHETATM.pdbqt samples/1gpn/decoys
1syi ./samples/1syi/1syi_protein_atom_noHETATM.pdbqt samples/1syi/decoys
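For many complexes, the input file can be generated rather than written by hand. The following is a minimal sketch, not part of the repository: the helper names `build_input_lines` and `write_input_file` are hypothetical, and the sketch assumes every complex follows the same layout as the sample lines above (a `<code>_protein_atom_noHETATM.pdbqt` file and a `decoys` directory inside a directory named after the PDB code).

```python
from pathlib import Path


def build_input_lines(samples_dir):
    """Return one 'code protein.pdbqt poses_dir' line per complex found.

    Assumes the layout of the sample lines above; adjust the file and
    directory names if your data is organized differently.
    """
    lines = []
    for complex_dir in sorted(Path(samples_dir).iterdir()):
        if not complex_dir.is_dir():
            continue
        code = complex_dir.name
        protein = complex_dir / f"{code}_protein_atom_noHETATM.pdbqt"
        decoys = complex_dir / "decoys"
        # Skip directories that lack either the receptor or the poses.
        if protein.is_file() and decoys.is_dir():
            lines.append(f"{code} {protein.as_posix()} {decoys.as_posix()}")
    return lines


def write_input_file(samples_dir, out_path="inputs.dat"):
    """Write the generated lines to out_path and return them."""
    lines = build_input_lines(samples_dir)
    Path(out_path).write_text("\n".join(lines) + "\n")
    return lines
```

Calling `write_input_file("samples")` from the repository root would then produce an `inputs.dat` with one space-separated line per complex. Paths that contain spaces are not handled, since the input format uses spaces as the field separator.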

3. Run the optimization framework with the default parameters set in the scripts.

bash run_pose_optimization.sh inputs.dat

Finally, the program outputs the optimized ligand conformation ("final_optimized_cnfr.pdb") and the final score. In addition, the conformation changes and scores during optimization are recorded in the "optimized_traj.pdb" and "opt_data.csv" files, respectively.
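To take a quick look at the recorded scores, "opt_data.csv" can be loaded with Python's standard csv module. This is a minimal sketch with hypothetical helper names (`read_opt_data`, `summarize`). The column names of "opt_data.csv" are not documented here, so the sketch assumes only that the first row is a header and reports whatever columns it finds.

```python
import csv


def read_opt_data(csv_path):
    """Load the optimization log as a list of dicts, one per row.

    Assumes the first row of the file is a header; all values are
    returned as strings.
    """
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def summarize(rows):
    """Return the column names plus the first and last recorded rows."""
    if not rows:
        return [], None, None
    return list(rows[0].keys()), rows[0], rows[-1]
```

For example, `columns, first, last = summarize(read_opt_data("opt_data.csv"))` gives the available columns together with the first and last recorded rows, which can then be compared by eye.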

Using DeepRMSD+Vina for scoring only

The scoring components are placed in the "scoring" directory. You can execute the following command to perform scoring using DeepRMSD+Vina.

python scripts/run.py \
-rec_fpath $rec_fpath \
-pose_fpath $pose_fpath \
-mean_std_file ../models/r6-r1_0.3-2.0nm_train_mean_std.csv \
-model ../models/bestmodel_cpu.pth \
-out_fpath $out_fpath

where rec_fpath, pose_fpath, and out_fpath are the paths to the input protein PDBQT file, the ligand PDBQT file, and the file in which the scores will be stored, respectively. You can also run the "run_scoring.sh" script directly:

bash run_scoring.sh $rec_fpath $pose_fpath $out_fpath

A simple example to test this process:

bash run_scoring.sh samples/1bcu_protein_noHETATM.pdbqt samples/1bcu_decoys.pdbqt out.csv
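When several pose files need to be scored against the same receptor, the command above can be wrapped in a short loop. The sketch below is an illustration rather than part of the repository: `scoring_command` and `score_all` are hypothetical helpers, the `<stem>_scores.csv` output naming is an arbitrary choice, and the script is assumed to run from the directory that contains "run_scoring.sh".

```python
import subprocess
from pathlib import Path


def scoring_command(rec_fpath, pose_fpath, out_fpath):
    """Build the argument list for one run_scoring.sh call."""
    return ["bash", "run_scoring.sh",
            str(rec_fpath), str(pose_fpath), str(out_fpath)]


def score_all(rec_fpath, pose_files, out_dir, dry_run=False):
    """Score each pose file and return the commands that were built.

    With dry_run=True the commands are only assembled, not executed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for pose in pose_files:
        # Hypothetical naming scheme: one CSV per pose file.
        out_csv = out_dir / (Path(pose).stem + "_scores.csv")
        cmd = scoring_command(rec_fpath, pose, out_csv)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if the script fails.
            subprocess.run(cmd, check=True)
    return commands
```

Running `score_all(rec, poses, "scores", dry_run=True)` first is a cheap way to check the generated commands before launching the actual scoring.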

Retraining DeepRMSD

1. Generate datasets

Before training, generate the ".pkl" file containing the features and labels. We provide the "generate_features.py" script in the "retrain" directory for creating the features and labels required by DeepRMSD. You can run:

python generate_features.py -inp inputs.dat -out data_label.pkl

Each line in the "inputs.dat" file represents a protein-ligand pair and specifies the protein-ligand id, protein file, poses file, and crystal ligand file, in that order.
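A malformed line in "inputs.dat" is easier to catch before feature generation starts. The following sketch checks that every line has the four fields listed above. It is a hypothetical helper, not part of the "retrain" scripts, and it assumes the fields are separated by whitespace, as in the pose-optimization input file.

```python
def parse_retrain_inputs(text):
    """Split the contents of inputs.dat into one record per pair.

    Assumes four whitespace-separated fields per line: id, protein
    file, poses file, crystal ligand file. Blank lines are skipped.
    """
    records = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(
                f"line {lineno}: expected 4 fields, got {len(fields)}"
            )
        pair_id, protein, poses, crystal = fields
        records.append({
            "id": pair_id,
            "protein": protein,
            "poses": poses,
            "crystal_ligand": crystal,
        })
    return records
```

The function works on the file contents as a string, for example `parse_retrain_inputs(open("inputs.dat").read())`, and raises a `ValueError` that names the first offending line.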

2. Training

We provide the "train.py" script in the "retrain" directory. You can run the following command to retrain DeepRMSD:

python train.py \
-train_file $train_file \
-valid_file $valid_file \
-device cuda:0

"train_file" and "valid_file" represent the training set and validation set, respectively, generated in the previous step.