wells-wood-research / timed-design

Protein Sequence Design with Deep Learning and Tooling like Monte Carlo Sampling and Analysis

Handling of Multiple Chain Structures #79

Status: Closed (sunal1996 closed 4 months ago)

sunal1996 commented 4 months ago

At the moment, TIMED-design gives me a sequence prediction in which a multi-chain protein is processed as if it were a single chain: in the output, chain A and chain B are merged into one sequence.

Note: Aposteriori is installed via pip install aposteriori in an environment called apo. TIMED-design is run in a separate environment called timed_design. Here is an example:

PDB code of the input structure: 3W8O (HasA)

Aposteriori command used to generate the dataset:

make-frame-dataset pdb/ --voxels-per-side 21 --frame-edge-length 21 -g True -p 3 -n benchmark_set -v -r -cb True -ae CNOCACB -o .

Here, pdb/ contained 3w8o.pdb as the only protein structure. TIMED-design was then run with:


python3 predict.py --path_to_dataset benchmark_set.hdf5 --path_to_model TIMED.h5 --path_to_output .

Here is the TIMED.fasta file that was created:


>3w8oA
MMIRIYYHPEYRDMTLKDWLTEYQKWFGNINMEPGKITDDDNLGFFYPGPNSGDQYGQRSLHTDACFIFRGDLSYTDDEWPAWTLYGELDGVVFGLNLEGGAETGGYRLEHTHVSFSNLNLNSPLHEGRDGLVHLVIYGLMLGDADALLNLIDELLKEHDSELSIDSTFSELVELGIARLDPYPMPIRVTYRSDYRDMTIRDFLDRFSEWFGNIKMEPGKVHSNKNFGRFSPGPYFGTQYAWQSTCSDTCFIFEGDLYYTMFMDPANTLWGELDGVDLGYNLVGGASGPGYYLENPIVSITNLGLWSPLWQGRDSLVHLVVYGLMNGDMDELIGLVTELLRAIDPELSSDSTFEELADHGIAHLIPSC

Here is the dataset.fasta file that TIMED creates for the wild-type (WT) sequence:


>3w8oA
MSISISYSTTYSGWTVADYLADWSAYFGDVNHRPGQVVDGSNTGGFNPGPFDGSQYALKSTASDAAFIAGGDLHYTLFSNPSHTLWGKLDSIALGDTLTGGASSGGYALDSQEVSFSNLGLDSPIAQGRDGTVHKVVYGLMSGDSSALQGQIDALLKAVDPSLSINSTFDQLAAAGVAHATPAAMSISISYSTTYSGWTVADYLADWSAYFGDVNHRPGQVVDGSNTGGFNPGPFDGSQYALKSTASDAAFIAGGDLHYTLFSNPSHTLWGKLDSIALGDTLTGGASSGGYALDSQEVSFSNLGLDSPIAQGRDGTVHKVVYGLMSGDSSALQGQIDALLKAVDPSLSINSTFDQLAAAGVAHATPAA

Although 3w8o contains two chains, A and B, both output files hold a single sequence under the header >3w8oA. Chain A and chain B each contain 184 amino acids, and TIMED gives predictions for 368 residues (184 + 184), so chain B is taken into account. However, its residues are appended to chain A instead of being written as a separate chain.
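As a stopgap until the output is fixed, the merged record can be split back into per-chain records when the chain lengths are known. The following is a minimal sketch, not part of TIMED-design or aposteriori: the helper names split_merged_sequence and to_fasta are hypothetical, the >3w8oB header simply follows the >3w8oA pattern seen above, and it assumes the residues are concatenated in chain order (which matches dataset.fasta, where the same 184-residue WT sequence appears twice).

```python
from typing import Dict, Sequence


def split_merged_sequence(
    merged: str, chain_lengths: Sequence[int], chain_ids: Sequence[str]
) -> Dict[str, str]:
    """Cut a concatenated sequence into one sequence per chain.

    Hypothetical helper: assumes chains were concatenated in the order
    given by chain_ids, with no separators between them.
    """
    if len(chain_lengths) != len(chain_ids):
        raise ValueError("chain_lengths and chain_ids must have the same length")
    if sum(chain_lengths) != len(merged):
        raise ValueError(
            f"chain lengths sum to {sum(chain_lengths)}, "
            f"but the merged sequence has {len(merged)} residues"
        )

    chains: Dict[str, str] = {}
    start = 0
    for chain_id, length in zip(chain_ids, chain_lengths):
        chains[chain_id] = merged[start:start + length]
        start += length
    return chains


def to_fasta(pdb_code: str, chains: Dict[str, str]) -> str:
    """Write one FASTA record per chain, e.g. >3w8oA and >3w8oB."""
    return "".join(f">{pdb_code}{cid}\n{seq}\n" for cid, seq in chains.items())


if __name__ == "__main__":
    # Toy input standing in for the 368-residue record; for 3w8o the call
    # would use chain_lengths=[184, 184] and chain_ids=["A", "B"].
    toy = split_merged_sequence("MSISMSIS", [4, 4], ["A", "B"])
    print(to_fasta("3w8o", toy), end="")
```

For the files in this report, passing the 368-residue sequence with chain_lengths=[184, 184] would yield two 184-residue records. The length check raises an error if the supplied chain lengths do not add up to the merged sequence length.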

universvm commented 4 months ago

Bug confirmed. I believe it is due to this: https://github.com/wells-wood-research/timed-design/commit/fa0dc435f43ee046db85347fda14a808a40dc22e

universvm commented 4 months ago

Closed in #81.