zjunlp / OntoProtein

[ICLR 2022] OntoProtein: Protein Pretraining With Gene Ontology Embedding
MIT License

Running Code in Google Colab #29

Closed anonimoustt closed 4 months ago

anonimoustt commented 8 months ago

Hi,

I am really interested in your work on OntoProtein. Does your code run on Google Colab? It would be helpful if you had a Google Colab tutorial for running your code.

Alexzhuan commented 8 months ago

Hi,

Thanks for your interest in our work.

At present, the code isn't ready to run on Google Colab out of the box and requires some modifications. However, by following the instructions in our README (installing the necessary Python packages, preparing the training data, and adjusting the relative file paths as needed), you should be able to run the code in the Colab environment successfully.

anonimoustt commented 8 months ago

Thanks. Do I need to download the data for the downstream tasks?

Alexzhuan commented 8 months ago

Sure, you can download just the tasks you need.

anonimoustt commented 8 months ago

Hi, I was trying to run `bash script/run_pretrain.sh`, and in `src/training_args.py` I see `swiss_seq` data as the default. I cannot find this data file in the GitHub repository.

Also, I see the zjunlp/OntoProtein Hugging Face model. Can I use this model as follows:

```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

tokenizer = BertTokenizer.from_pretrained("zjunlp/OntoProtein", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("zjunlp/OntoProtein")
```

Next, let us say I want to check whether Protein Sequence-1 and Protein Sequence-2 are related, or I want to check the class of Protein Sequence-1. If I write `model.predict(protein_sequence_1)`, will it work this way? Or is there a specific command to get the prediction?
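For what it's worth, `BertForMaskedLM` exposes no `.predict()` method: calling the model returns per-token vocabulary logits, not a sequence-level class. A minimal sketch of the API shape (using a tiny randomly-initialized config purely for illustration; swap in the real `zjunlp/OntoProtein` checkpoint for meaningful outputs):

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny random model just to show the API shape; for real predictions load
# the published checkpoint instead:
#   model = BertForMaskedLM.from_pretrained("zjunlp/OntoProtein")
config = BertConfig(vocab_size=30, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertForMaskedLM(config)
model.eval()

assert not hasattr(model, "predict")  # there is no .predict() method

input_ids = torch.randint(0, 30, (1, 12))
with torch.no_grad():
    logits = model(input_ids=input_ids).logits
print(logits.shape)  # per-token vocabulary logits: (batch, seq_len, vocab_size)
```

Sequence- or pair-level classification would need a separate classification head (e.g. `BertForSequenceClassification`) fine-tuned on labeled data.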

anonimoustt commented 8 months ago

Let us say I am working with new data such as https://huggingface.co/datasets/waylandy/phosformer_curated, where I can see the relation between a kinase enzyme and a substrate protein sequence (class 1 means related, 0 means not related).

Now I run the code:

```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

tokenizer = BertTokenizer.from_pretrained("zjunlp/OntoProtein", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("zjunlp/OntoProtein")
```

Then I try to predict with `model.predict(kinase_enzyme, substrate)`, expecting it to return the class label 0 or 1. Will it work this way, or is there a specific function to get the inferred class label? Thanks
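A note on this: a masked-LM checkpoint has no `model.predict(kinase, substrate)`. One common pattern (not something the OntoProtein repo ships) is to embed both sequences with the encoder and train a separate binary classifier on the concatenated pair. A self-contained sketch with random stand-in "embeddings" and plain logistic regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for mean-pooled encoder embeddings of (kinase, substrate) pairs;
# in practice these would come from the model's last hidden states.
dim, n = 16, 200
pairs = rng.normal(size=(n, 2 * dim))        # concat(kinase_emb, substrate_emb)
w_true = rng.normal(size=2 * dim)
labels = (pairs @ w_true > 0).astype(float)  # synthetic 0/1 "related" labels

# Plain logistic regression trained with gradient descent.
w = np.zeros(2 * dim)
for _ in range(500):
    p = 1 / (1 + np.exp(-(pairs @ w)))
    w -= 0.1 * pairs.T @ (p - labels) / n

preds = (pairs @ w > 0).astype(float)
print("train accuracy:", (preds == labels).mean())
```

With real data, the random embeddings would be replaced by OntoProtein outputs and the classifier evaluated on a held-out split.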

anonimoustt commented 8 months ago

Hi, I have installed the Python packages in Google Colab but am not sure how to proceed to execute the code. I have downloaded the ProteinKG25 data. I am trying to run the ProtBERT model with the following command:

```shell
bash run_main.sh \
    --model model_data/ProtBertModel \
    --output_file ss3-ProtBert \
    --task_name ss3 \
    --do_train True \
    --epoch 5 \
    --optimizer AdamW \
    --per_device_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --eval_step 100 \
    --eval_batchsize 4 \
    --warmup_ratio 0.08 \
    --frozen_bert False
```

I got the following error:

```
Traceback (most recent call last):
  File "/content/OntoProtein/run_downstream.py", line 9, in <module>
    from src.benchmark.datasets import dataset_mapping, output_modes_mapping
ModuleNotFoundError: No module named 'src.benchmark.datasets'
```

Can you help me run an example?

anonimoustt commented 8 months ago

I was trying to run the following command:

```shell
!bash run_main.sh --model model_data/ProtBertModel --output_file ss8-ProtBert --task_name ss8 --do_train True --epoch 5 --optimizer AdamW --per_device_batch_size 2 --gradient_accumulation_steps 8 --eval_step 100 --eval_batchsize 4 --warmup_ratio 0.08 --frozen_bert False
```

I am getting the following error:

```
Traceback (most recent call last):
  File "/content/OntoProtein/run_downstream.py", line 288, in <module>
    main()
  File "/content/OntoProtein/run_downstream.py", line 211, in main
    model = model_fn.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2600, in from_pretrained
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 430, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/content/model_data/model.safetensors'. Use `repo_type` argument if needed.
```

Alexzhuan commented 7 months ago

Hi, the two issues above seem to be related to file paths (the relevant files are inaccessible). To resolve this, you may refer to materials on accessing and importing local Python files in Colab, which will ensure that your scripts run on Colab.
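The `ModuleNotFoundError` typically means the script is not being launched from the repository root, so the `src` package is not on Python's import path. A generic stdlib sketch of the fix (using a throwaway package that mimics the layout, not the actual OntoProtein tree):

```python
import os
import sys
import tempfile

# Build a throwaway package that mimics the repo layout: src/benchmark/datasets.py
root = tempfile.mkdtemp()
pkg = os.path.join(root, "src", "benchmark")
os.makedirs(pkg)
for d in (os.path.join(root, "src"), pkg):
    open(os.path.join(d, "__init__.py"), "w").close()
with open(os.path.join(pkg, "datasets.py"), "w") as f:
    f.write("dataset_mapping = {'ss3': 'placeholder'}\n")

# Without the repo root on sys.path, `import src.benchmark.datasets` fails
# exactly like it does on Colab; adding it makes the import resolve.
sys.path.insert(0, root)

from src.benchmark.datasets import dataset_mapping
print(dataset_mapping)
```

In a Colab cell, the equivalent fix is to `%cd` into the cloned OntoProtein directory (or insert its path into `sys.path`) before running the scripts.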

anonimoustt commented 7 months ago

Hi, I am still facing the same error. I am trying to understand where you incorporate the knowledge graph into the pre-trained model. Is it the trainer.py file where you integrate the knowledge?

Alexzhuan commented 7 months ago

In our approach, we do not integrate Knowledge Graph (KG) information directly into the Protein Language Model architecture. Instead, we use the knowledge graph (ProteinKG) to enhance protein representation learning through contrastive learning, akin to Knowledge Graph Embedding (KGE) (the contrastive loss is implemented here). Specifically, we treat a protein (e.g., A4_HUMAN) as the head entity in the KG, while its associated property or function (such as DNA binding) serves as the tail entity, forming the triplet (A4_HUMAN, enable, DNA binding). Intuitively, we aim to pull the protein sequence's embedding closer to that of the associated tail entity; consequently, proteins sharing similar functions end up closer in the representation space, as facilitated by contrastive learning.
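The idea can be sketched in a few lines of numpy (a toy illustration of the principle, not the repository's implementation): score triplets TransE-style and apply a contrastive loss that favors the true tail over sampled negatives.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

def score(h, r, t):
    # TransE-style plausibility: higher when h + r is close to t.
    return -np.linalg.norm(h + r - t, axis=-1)

# Toy embeddings: protein head, relation, true GO tail, sampled negative tails.
h = rng.normal(size=dim)                       # e.g. A4_HUMAN
r = rng.normal(size=dim)                       # e.g. "enable"
t_pos = h + r + 0.01 * rng.normal(size=dim)    # true tail, e.g. "DNA binding"
t_neg = rng.normal(size=(5, dim))              # sampled negative tails

# InfoNCE-style contrastive loss over one positive and k negatives: minimizing
# it pulls the protein embedding toward its true tail entity.
logits = np.concatenate([[score(h, r, t_pos)], score(h, r, t_neg)])
loss = -logits[0] + np.log(np.exp(logits).sum())
print("contrastive loss:", loss)
```

Because `t_pos` is already close to `h + r` in this toy setup, the loss comes out small; during training, gradient descent on this loss is what shapes the representation space.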

anonimoustt commented 7 months ago

Thanks. I have a question about the data, ProteinKG25.
The protein_go_train_triplet file defines the relation between a protein sequence and a Gene Ontology term, but I see only ids:

4667 31 2588

Here, is 4667 the protein sequence, 31 the relation, and 2588 the Gene Ontology term?

I can see the protein sequences in the protein_seq.txt file, and the relation2id.txt file defines the relations. Which file holds the Gene Ontology terms here? Is it the go_type.txt file?

Thanks

Alexzhuan commented 7 months ago

ProteinKG25 has the following files:

anonimoustt commented 7 months ago

Hi, would you please provide the full meaning of MF, BP, CC? In the go_type.txt file I see only Process, Function, and Component.

In the go2id.txt file I see `GO:0018186 9758`. I do not understand this mapping.
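For reference, a go2id.txt line like `GO:0018186 9758` pairs a GO term with the integer id used in the triplet files, so ids can be joined back to readable names. A sketch with made-up mapping entries (the file formats and the relation name are assumptions inferred from this thread, not taken from the real files):

```python
# Inline stand-ins for the ProteinKG25 mapping files; real code would read
# protein_go_train_triplet, relation2id.txt, and go2id.txt from disk.
triplet_line = "4667 31 2588"
relation2id = {"enables": 31}       # assumed relation2id.txt format: name -> id
go2id = {"GO:0005515": 2588}        # assumed go2id.txt format: GO term -> id

# Invert the maps to decode triplets back into readable form.
id2relation = {v: k for k, v in relation2id.items()}
id2go = {v: k for k, v in go2id.items()}

head, rel, tail = map(int, triplet_line.split())
print(head, id2relation[rel], id2go[tail])  # protein id, relation name, GO term
```

The protein id (4667 here) indexes into protein_seq.txt the same way, one sequence per line.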

anonimoustt commented 5 months ago

Hi, is it possible to use the https://huggingface.co/zjunlp/OntoProtein model to get the embedding of a protein sequence?