zjunlp / OntoProtein

[ICLR 2022] OntoProtein: Protein Pretraining With Gene Ontology Embedding
MIT License

Generating Embedding of Protein Sequence #30

Closed anonimoustt closed 7 months ago

anonimoustt commented 8 months ago

Hi, is it possible to use the https://huggingface.co/zjunlp/OntoProtein model to get the embedding of a protein sequence?

Alexzhuan commented 8 months ago

Yes, you can use the model to get the embedding of a protein sequence by applying mean pooling on the hidden states after the encoder.
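To make the mean-pooling step concrete, here is a minimal, dependency-free sketch. It assumes the encoder output has already been converted to nested lists and that an attention mask marks padded positions with 0; the function name `mean_pool` and the toy numbers are illustrative and not part of OntoProtein.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, skipping padded positions.

    hidden_states: [batch][tokens][hidden] nested lists of floats
    attention_mask: [batch][tokens], 1 for real tokens and 0 for padding
    """
    pooled = []
    for tokens, mask in zip(hidden_states, attention_mask):
        sums = [0.0] * len(tokens[0])
        count = 0
        for vec, keep in zip(tokens, mask):
            if keep:
                count += 1
                for j, value in enumerate(vec):
                    sums[j] += value
        # max(count, 1) guards against a row that is entirely padding
        pooled.append([s / max(count, 1) for s in sums])
    return pooled


# Toy batch: one sequence, three token positions, hidden size 2.
# The third position is padding and must not affect the average.
hidden_states = [[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]]
attention_mask = [[1, 1, 0]]
print(mean_pool(hidden_states, attention_mask))  # [[2.0, 3.0]]
```

With PyTorch tensors `h` (the last hidden state) and `m` (the attention mask), the same computation can likely be written as `(h * m.unsqueeze(-1)).sum(dim=1) / m.sum(dim=1, keepdim=True).clamp(min=1)`.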

anonimoustt commented 8 months ago

from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModel, EsmTokenizer, EsmModel
import torch
import numpy as np
from sklearn.pipeline import Pipeline

# GrimSqueaker/proteinBERT zjunlp/OntoProtein
tokenizer = AutoTokenizer.from_pretrained("zjunlp/OntoProtein")  # ,config=config,max_position_embeddings=320)
model = AutoModel.from_pretrained("zjunlp/OntoProtein")  # ,config=config)

# useq is not defined in this snippet
tokenizer.add_tokens(useq)

# Resizes the token-embedding matrix (vocabulary size) to 320 entries, not the hidden size
model.resize_token_embeddings(320)

def preem(seqf):
    # Tokenize protein sequences
    inputs1 = tokenizer(seqf, padding=True, truncation=True, return_tensors='pt', max_length=320)
    # Drop token_type_ids before calling the model
    inputs = {}
    for kk in inputs1:
        if kk != 'token_type_ids':
            inputs[kk] = inputs1[kk]
    # Compute token embeddings
    with torch.no_grad():
        outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state
    em = last_hidden_states  # F.normalize(
    return em

Here last_hidden_states gives embeddings of size 1024 with OntoProtein. Is it possible to reduce the vector size to 320, or to some other smaller size?

Alexzhuan commented 8 months ago

Sorry, there might be methods to compress the vector, but we are not certain how much information loss this could cause.

anonimoustt commented 8 months ago

I think UMAP works fine here.
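As a dependency-free illustration of shrinking a 1024-dimensional vector to 320 dimensions, here is a minimal sketch of a Gaussian random projection. This is a different and simpler technique than UMAP, nobody in this thread recommends it, and how much information it loses for a downstream task is untested here. The function names `make_projection` and `project` and the toy input are assumptions made for the example.

```python
import math
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build a fixed Gaussian random matrix of shape (out_dim, in_dim).

    Entries are drawn from N(0, 1/out_dim) so that vector norms are
    preserved in expectation.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(vector, matrix):
    """Map one embedding into the lower-dimensional space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Toy stand-in for one 1024-dimensional sequence embedding.
embedding = [math.sin(i) for i in range(1024)]
matrix = make_projection(in_dim=1024, out_dim=320)
reduced = project(embedding, matrix)
print(len(reduced))  # 320
```

The same matrix has to be reused for every sequence so that the reduced vectors stay comparable with each other.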

anonimoustt commented 8 months ago

Hi, is it possible to get a score for a protein, where the score would be the weight of the sequence? The higher the weight, the more important the protein. For instance, take proteins P1 and P2: if P2 has weight 0.9 and P1 has weight 0.85, then P2 is the more important sequence because it has the higher weight. Can OntoProtein assign such weights to sequences?

Alexzhuan commented 8 months ago

Sorry, our method cannot provide the importance of proteins.

zxlzr commented 7 months ago

Hi, do you have any further questions?

anonimoustt commented 7 months ago

Hi, in the https://www.zjukg.org/project/ProteinKG25/ knowledge graph data I see the relation2id file, which lists different types of relations. Is it possible to learn which relations are closest to the phosphorylation function? Secondly, I see a NOT|located_in relation type, so there are negative relations, right?

anonimoustt commented 7 months ago

Hi,

I see the following relations in the knowledge graph: ['enables_nucleotide_binding', 'enables_metal_ion_binding', 'enables_transferase_activity', 'enables', 'involved_in_signal_transduction', 'involved_in_regulation_of_transcription,_DNA-templated', 'involved_in_phosphorylation', 'involved_in', 'part_of_nucleus', 'part_of_cytoplasm', 'part_of', 'part_of_cytosol', 'part_of_membrane', 'colocalizes_with', 'involved_in_proteolysis', 'NOT|involved_in', 'part_of_integral_component_of_membrane', 'involved_in_cation_transport', 'involved_in_cellular_response_to_DNA_damage_stimulus', 'part_of_mitochondrion', 'involved_in_metabolic_process', 'involved_in_cell_cycle', 'involved_in_cell_division', 'involved_in_lipid_metabolic_process', 'enables_RNA_binding', 'acts_upstream_of_or_within', 'enables_catalytic_activity', 'enables_hydrolase_activity', 'enables_DNA_binding', 'contributes_to', 'involved_in_carbohydrate_metabolic_process', 'involved_in_translation', 'part_of_extracellular_region', 'acts_upstream_of_or_within_positive_effect', 'involved_in_protein_transport', 'NOT|enables', 'acts_upstream_of', 'part_of_ribosome', 'involved_in_transmembrane_transport', 'NOT|part_of', 'NOT|involved_in_tRNA_processing', 'is_active_in', 'located_in', 'NOT|located_in', 'acts_upstream_of_positive_effect']
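The relation names above can be sorted mechanically, for example to separate the `NOT|` relations from the rest and to pick out names that mention phosphorylation. The sketch below assumes that the `NOT|` prefix marks a negated annotation, as the NOT qualifier does in Gene Ontology annotations; the helper names and the shortened relation list are illustrative. A keyword match only finds names containing the word and says nothing about which relations are semantically closest.

```python
# A small subset of the relation names listed in this thread.
relations = [
    "involved_in_phosphorylation",
    "involved_in",
    "enables_transferase_activity",
    "located_in",
    "NOT|located_in",
    "NOT|involved_in",
    "NOT|enables",
]

NEGATION_PREFIX = "NOT|"  # assumed to mark a negated relation


def split_relations(names):
    """Return (positive, negated) relation names, preserving order."""
    positive = [n for n in names if not n.startswith(NEGATION_PREFIX)]
    negated = [n for n in names if n.startswith(NEGATION_PREFIX)]
    return positive, negated


def mentioning(names, keyword):
    """Return relation names that contain the keyword, ignoring case."""
    return [n for n in names if keyword.lower() in n.lower()]


positive, negated = split_relations(relations)
print(negated)  # ['NOT|located_in', 'NOT|involved_in', 'NOT|enables']
print(mentioning(relations, "phosphorylation"))  # ['involved_in_phosphorylation']
```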

Which relation is the most important for a protein sequence?