westlake-repl / SaProt

Saprot: Protein Language Model with Structural Alphabet (AA+3Di)
MIT License
360 stars 35 forks

KeyError: 'XcXgXyXtXyXgXd' #59

Open 1412140736 opened 2 months ago

1412140736 commented 2 months ago

I encountered an error while using this method: KeyError: 'XcXgXyXtXyXgXd'. So, I checked the result returned by the get_struc_seq method. The seq, which should store the original amino acid sequence, is different from the result directly read from the PDB file. This is my code:

`
from Bio import PDB
from utils.foldseek_util import get_struc_seq

def get_chain_sequence(pdb_file, chain_id):
    parser = PDB.PDBParser(QUIET=True)
    structure = parser.get_structure('structure', pdb_file)
    model = structure[0]

    chain = model[chain_id]

    ppb = PDB.PPBuilder()
    peptides = ppb.build_peptides(chain)

    sequence_str = ''.join([str(peptide.get_sequence()) for peptide in peptides])

    return sequence_str

pdb_file = '../example/5jqb.pdb'
chain_id = 'A'
sequence = get_chain_sequence(pdb_file, chain_id)

parsed_seqs = get_struc_seq("../bin/foldseek", '../example/5jqb.pdb', ["A"], plddt_mask=False)["A"]
seq, foldseek_seq, combined_seq = parsed_seqs
print(f"Chain {chain_id} sequence: {sequence}")
print("seq:", seq)
print("combined_seq:", combined_seq)
`

Here is the output of the code execution.

Command: ../bin/foldseek structureto3didescriptor -v 0 --threads 1 --chain-name-mode 1 ../example/5jqb.pdb get_struc_seq_0_1725676715.414907.tsv

stdout:

Chain A sequence: SIPLGVIHNSALQVSDVDKLVCRDKLSSTNQLRSVGLNLEGNGVATDVPSATKRWGFRSGVPPKVVNYEAGEWAENCYNLEIKKPDGSECLPAAPDGIRGFPRCRYVHKVSGTGPCAGDFAFHKEGAFFLYDRLASTVIYRGTTFAEGVVAFLILPQAKKDFFSGYYSTTIRYQATGFGTNETEYLFEVDNLTYVQLESRFTPQFLLQLNETIYTSGKRSNTTGKLIWKVNPEIDTTEWAFWETLSFTVV

seq: SIPLGVIHNSALQVSDVDKLVCRDKLSSTNQLRSVGLNLEGNGVATDVPSATKRWGFRSGVPPKVVNYEAGEWAENCYNLEIKKPDGSECLPAAPDGIRGFPRCRYVHKVSGTGPCAGDFAFHKEGAFFLYDRLASTVIYRGTTFAEGVVAFLILPQAKKDFFSGYYSTTIRYQATGFGTNETEYLFEVDNLTYVQLESRFTPQFLLQLNETIYTSGKRSNTTGKLIWKVNPEIDTTEWAFWETLSFTVVXXXXXXX

combined_seq: SdIaPwLaGwVeIdHdNpSqAdLiQdVtSdDdVpDvKpLdVdCpRvDdKdLdSpSdTcNvQlLkRfSkVeGkLeNfLvElGvNvGqVqAqTqDfVpPvSrAvTlKlRqWkGaFaRaSaGdVdPdPkKdVkVdNfYgEdAyGyEdWaAeEaNeCkYeNaLeEfIeKdKePpDvGrShEgClLfPaAaAdPdDpGpIfRaGaFdPdRhCyRqYeVyHeKyVeSyGeTyGaPpCnApGhDnFfAmFaHgKnEvGqAwFwFwLdYtDhRrLmAtSmTrVtIdYhRhGgThTiFtAgEtGtVhVmAhFmLyIrLhPdQpAdKdKrDhFdFdSdGdYgYhSyTdTyIwRyYkQyAwTyGhFrGrTdNpEdTiEwYiLwFtEdVlDdNpLqTeYtVeQgLdEdSsRqFaTdPpQvFnLsLvQvLvNsEvTcIcYvTvSvGvKvRgSdNpTdTpGhKhLpIyWeKyVeNdPpEpIdDgThTdEsWdArFpWvEpTdLaSaFwTdVaVpXcXgXyXtXyXgXd
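In the output, seq appears to be the Biopython sequence followed by seven trailing X residues, and combined_seq ends with the same seven X-prefixed tokens that show up in the KeyError. A minimal sketch for locating such residues is given here. It assumes the combined sequence interleaves one amino acid letter with one 3Di letter per residue, as the output suggests. The helper names split_combined, unknown_positions and extra_suffix are hypothetical and not part of SaProt.

```python
def split_combined(combined_seq):
    """Split an interleaved AA+3Di string into (amino_acid, structure_letter) pairs."""
    if len(combined_seq) % 2 != 0:
        raise ValueError("combined sequence must have an even length")
    return [
        (combined_seq[i], combined_seq[i + 1])
        for i in range(0, len(combined_seq), 2)
    ]


def unknown_positions(combined_seq, unknown="X"):
    """Return 0-based residue indices whose amino acid letter is `unknown`."""
    pairs = split_combined(combined_seq)
    return [idx for idx, (aa, _) in enumerate(pairs) if aa == unknown]


def extra_suffix(reference_seq, foldseek_seq):
    """Return what foldseek_seq adds after reference_seq.

    Returns None when reference_seq is not a prefix of foldseek_seq,
    i.e. when the two sequences differ in some other way.
    """
    if foldseek_seq.startswith(reference_seq):
        return foldseek_seq[len(reference_seq):]
    return None


# Tail of the combined sequence posted in this issue.
tail = "TdVaVpXcXgXyXtXyXgXd"
print(unknown_positions(tail))  # [3, 4, 5, 6, 7, 8, 9]
```

Calling extra_suffix(sequence, seq) with the two variables from the script above would show whether the only difference is the trailing block of X residues.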

LTEnjoy commented 2 months ago

Hi,

Both the amino acid sequence and the foldseek sequence are obtained from the foldseek binary. Sometimes foldseek parses extra amino acids from a PDB file, which would explain why seq is longer than the sequence you read with Biopython. Maybe you could check the PDB file for more details?
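One way to check the PDB file is to list the residues in the chain that are not among the 20 standard amino acids. The sketch below does this with the fixed-column layout of ATOM and HETATM records (residue name in columns 18-20, chain ID in column 22, residue number in columns 23-26). The function name nonstandard_residues is hypothetical. Which of these records foldseek actually turns into X is not stated in this thread, so the result is only a list of candidates, and it will also include waters and ligands assigned to the chain.

```python
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def nonstandard_residues(pdb_lines, chain_id):
    """List (record, residue name, residue number) for residues of the chain
    whose name is not one of the 20 standard amino acids.

    `pdb_lines` is any iterable of PDB-format lines, e.g. an open file.
    Each residue is reported once, in order of first appearance.
    """
    seen = []
    for line in pdb_lines:
        record = line[:6].strip()
        # Only coordinate records carry residue information; skip short lines.
        if record not in ("ATOM", "HETATM") or len(line) < 26:
            continue
        if line[21] != chain_id:
            continue
        res_name = line[17:20].strip()
        res_seq = line[22:26].strip()
        entry = (record, res_name, res_seq)
        if res_name not in STANDARD_RESIDUES and entry not in seen:
            seen.append(entry)
    return seen
```

For the file in this issue, the call would look like `nonstandard_residues(open('../example/5jqb.pdb'), 'A')`. If the entries it returns line up with the seven trailing X residues, masking or removing those records before calling get_struc_seq may be worth trying. Note that PPBuilder.build_peptides defaults to aa_only=1, so Biopython drops non-standard residues, which may be why the two sequences differ.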