westlake-repl / SaprotHub

SaprotHub: Making Protein Modeling Accessible to All Biologists

Couple of errors. #14

Open EvanKomp opened 2 weeks ago

EvanKomp commented 2 weeks ago

Hello - I am trying to run a finetuning (potentially a useful model for the hub!) and ran into a couple of errors.

  1. When testing the log-likelihood scores of the base foundation model on a multi-site mutational library, I received:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-3-9e1c691e3ebe>](https://localhost:8080/#) in <cell line: 94>()
    110     # result_df['Sequence'] = dataset_df['Sequence']
    111     # result_df['mutation'] = dataset_df['mutation']
--> 112     dataset_df['score'] = results
    113 
    114     output_path = OUTPUT_HOME / f"{timestamp}_prediction_output_{Path(dataset_csv_path).stem}.csv"

6 frames
[/usr/local/lib/python3.10/dist-packages/torch/_tensor.py](https://localhost:8080/#) in __array__(self, dtype)
    970             return self.numpy()
    971         else:
--> 972             return self.numpy().astype(dtype, copy=False)
    973 
    974     # Wrap Numpy array again in a suitable tensor when done, to support e.g.

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

results is a list of size-1 tensors. This was easily fixed with results = [r.cpu().item() for r in results] and then downloading the output manually.
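For reference, a minimal sketch of the workaround (the variable names follow the notebook cell in the traceback above; the example tensors and sequences are placeholders):

import pandas as pd
import torch

# Placeholder stand-ins for the notebook variables in the traceback above:
# `results` is a list of single-element tensors that may live on the GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
results = [torch.tensor([0.12], device=device), torch.tensor([-1.7], device=device)]
dataset_df = pd.DataFrame({"Sequence": ["SEQ_A", "SEQ_B"]})

# Move each tensor to host memory and extract its scalar before assigning
# to the DataFrame column; pandas cannot consume CUDA tensors directly.
dataset_df["score"] = [r.cpu().item() for r in results]
print(dataset_df)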

  2. When finetuning, I received the following after training completed:
---------------------------------------------------------------------------
HFValidationError                         Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/peft/config.py](https://localhost:8080/#) in _get_peft_type(cls, model_id, **hf_hub_download_kwargs)
    196             try:
--> 197                 config_file = hf_hub_download(
    198                     model_id,

9 frames
[/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py](https://localhost:8080/#) in _inner_fn(*args, **kwargs)
    105             if arg_name in ["repo_id", "from_id", "to_id"]:
--> 106                 validate_repo_id(arg_value)
    107 

[/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py](https://localhost:8080/#) in validate_repo_id(repo_id)
    153     if repo_id.count("/") > 1:
--> 154         raise HFValidationError(
    155             "Repo id must be in the form 'repo_name' or 'namespace/repo_name':"

HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/content/SaprotHub/adapters/regression/Local/Model-evc-35M'. Use `repo_type` argument if needed.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
[<ipython-input-14-6d9f4a5baa00>](https://localhost:8080/#) in <cell line: 258>()
    256 
    257 from saprot.scripts.training import finetune
--> 258 finetune(config)
    259 
    260 

[/usr/local/lib/python3.10/dist-packages/saprot/scripts/training.py](https://localhost:8080/#) in finetune(config)
     48                 config.model.kwargs.lora_kwargs.config_list = [{"lora_config_path": model.save_path}]
     49 
---> 50             model = my_load_model(config.model)
     51 
     52         else:

[/usr/local/lib/python3.10/dist-packages/saprot/utils/module_loader.py](https://localhost:8080/#) in my_load_model(config)
     35       if 'num_labels' in model_config: del model_config['num_labels']
     36       from model.saprot.saprot_regression_model import SaprotRegressionModel
---> 37       return SaprotRegressionModel(**model_config)
     38 
     39     if model_type == "saprot/saprot_pair_classification_model":

[/usr/local/lib/python3.10/dist-packages/saprot/model/saprot/saprot_regression_model.py](https://localhost:8080/#) in __init__(self, test_result_path, **kwargs)
     16         """
     17         self.test_result_path = test_result_path
---> 18         super().__init__(task="regression", **kwargs)
     19 
     20     def initialize_metrics(self, stage):

[/usr/local/lib/python3.10/dist-packages/saprot/model/saprot/base.py](https://localhost:8080/#) in __init__(self, task, config_path, extra_config, load_pretrained, freeze_backbone, gradient_checkpointing, lora_kwargs, **kwargs)
     67 
     68             self.lora_kwargs = EasyDict(lora_kwargs)
---> 69             self._init_lora()
     70 
     71         self.valid_metrics_list = {}

[/usr/local/lib/python3.10/dist-packages/saprot/model/saprot/base.py](https://localhost:8080/#) in _init_lora(self)
     94                 if i == 0:
     95                     # If i == 0, initialize a PEFT model
---> 96                     self.model = PeftModelForSequenceClassification.from_pretrained(self.model,
     97                                                                                     lora_config_path,
     98                                                                                     adapter_name=adapter_name,

[/usr/local/lib/python3.10/dist-packages/saprot/model/saprot/self_peft/peft_model.py](https://localhost:8080/#) in from_pretrained(cls, model, model_id, adapter_name, is_trainable, config, **kwargs)
    326         if config is None:
    327             config = PEFT_TYPE_TO_CONFIG_MAPPING[
--> 328                 PeftConfig._get_peft_type(
    329                     model_id,
    330                     subfolder=kwargs.get("subfolder", None),

[/usr/local/lib/python3.10/dist-packages/peft/config.py](https://localhost:8080/#) in _get_peft_type(cls, model_id, **hf_hub_download_kwargs)
    201                 )
    202             except Exception:
--> 203                 raise ValueError(f"Can't find '{CONFIG_NAME}' at '{model_id}'")
    204 
    205         loaded_attributes = cls.from_json_file(config_file)

ValueError: Can't find 'adapter_config.json' at '/content/SaprotHub/adapters/regression/Local/Model-evc-35M'

This goes deep into the importable finetune function. I do not currently have time to fix it and submit a pull request.

Thanks for your work in making PLM zero-shot prediction and finetuning more accessible! Really great work.

Evan

LTEnjoy commented 2 weeks ago

Hello, thank you very much for providing such useful feedback!

For the first error, as you said, it can be easily fixed by moving all tensors from the GPU to the CPU. We have corrected this and updated the notebook, so it should work now.

For the second error, could you provide more information about how you ran the training? It seems the program could not find the adapter_config.json file needed to initialize the LoRA module. If you have finished training and saved the model weights to a folder, that folder should contain files like the following:

[image: screenshot of the expected contents of a saved adapter folder]
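You could also quickly check what was actually written to the save path, for example with something like this (a rough sketch; the path is the save_path from your traceback, and the expected file names follow the usual PEFT layout):

from pathlib import Path

# Path taken from the error message above; adjust if your save_path differs.
adapter_dir = Path("/content/SaprotHub/adapters/regression/Local/Model-evc-35M")

# List what was actually written after training.
print(sorted(p.name for p in adapter_dir.iterdir()))

# A LoRA adapter folder saved with PEFT normally contains at least
# adapter_config.json plus the adapter weights file.
if not (adapter_dir / "adapter_config.json").exists():
    print("adapter_config.json is missing -- the adapter was not saved here")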

EvanKomp commented 1 week ago

@LTEnjoy I was running a simple regression training! I lost the traceback because I did not copy the notebook. I will try to run it again and see if I can give something more informative.

EDIT: here is the full stack trace and log.

Training task type: regression
Dataset: /content/SaprotHub/LMDB/demo
Base Model: westlake-repl/SaProt_35M_AF2
====================================================================================================
{'Trainer': {'accelerator': 'gpu',
             'accumulate_grad_batches': 2,
             'devices': 1,
             'enable_checkpointing': False,
             'limit_test_batches': 1.0,
             'limit_train_batches': 1.0,
             'limit_val_batches': 1.0,
             'log_every_n_steps': 1,
             'logger': False,
             'max_epochs': 2,
             'num_nodes': 1,
             'num_sanity_val_steps': 0,
             'precision': 16,
             'strategy': {},
             'val_check_interval': 0.5},
 'dataset': {'dataloader_kwargs': {'batch_size': 2, 'num_workers': 2},
             'dataset_py_path': 'saprot/saprot_regression_dataset',
             'kwargs': {'plddt_threshold': None,
                        'tokenizer': 'westlake-repl/SaProt_35M_AF2'},
             'test_lmdb': '/content/SaprotHub/LMDB/demo/test',
             'train_lmdb': '/content/SaprotHub/LMDB/demo/train',
             'valid_lmdb': '/content/SaprotHub/LMDB/demo/valid'},
 'model': {'kwargs': {'config_path': 'westlake-repl/SaProt_35M_AF2',
                      'extra_config': {'attention_probs_dropout_prob': 0,
                                       'hidden_dropout_prob': 0},
                      'load_pretrained': True,
                      'lora_kwargs': {'config_list': [], 'num_lora': 1}},
           'lr_scheduler_kwargs': {'class': 'ConstantLRScheduler',
                                   'init_lr': 0.001},
           'model_py_path': 'saprot/saprot_regression_model',
           'save_path': '/content/SaprotHub/adapters/regression/Local/Model-demo-35M'},
 'setting': {'os_environ': {'CUDA_VISIBLE_DEVICES': '0,1,2,3,4,5,6,7',
                            'MASTER_ADDR': 'localhost',
                            'MASTER_PORT': 12315,
                            'NODE_RANK': 0,
                            'WANDB_API_KEY': None,
                            'WANDB_RUN_ID': None,
                            'WORLD_SIZE': 1},
             'run_mode': 'train',
             'seed': 20000812}}
====================================================================================================
/usr/local/lib/python3.10/dist-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
tokenizer_config.json: 100% 40.0/40.0 [00:00<00:00, 857B/s]
vocab.txt: 100% 1.35k/1.35k [00:00<00:00, 22.7kB/s]
special_tokens_map.json: 100% 125/125 [00:00<00:00, 1.75kB/s]
config.json: 100% 645/645 [00:00<00:00, 24.1kB/s]
pytorch_model.bin: 100% 137M/137M [00:01<00:00, 114MB/s]
Some weights of the model checkpoint at westlake-repl/SaProt_35M_AF2 were not used when initializing EsmForSequenceClassification: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing EsmForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EsmForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at westlake-repl/SaProt_35M_AF2 and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `SpearmanCorrcoef` will save all targets and predictions in the buffer. For large datasets, this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
No optimizer_kwargs provided. The default optimizer is AdamW.
Now active LoRA model: default
trainable params: 1,060,801 || all params: 34,759,682 || trainable%: 3.0518144556098066
/usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:558: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
INFO:pytorch_lightning.utilities.rank_zero:Using 16bit Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.utilities.rank_zero:`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
INFO:pytorch_lightning.utilities.rank_zero:`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
INFO:pytorch_lightning.utilities.rank_zero:`Trainer(limit_test_batches=1.0)` was configured so 100% of the batches will be used..
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
INFO:pytorch_lightning.callbacks.model_summary:
   | Name           | Type                               | Params
-----------------------------------------------------------------------
0  | model          | PeftModelForSequenceClassification | 34.8 M
1  | train_loss     | MeanSquaredError                   | 0     
2  | train_spearman | SpearmanCorrCoef                   | 0     
3  | train_R2       | R2Score                            | 0     
4  | train_pearson  | PearsonCorrCoef                    | 0     
5  | valid_loss     | MeanSquaredError                   | 0     
6  | valid_spearman | SpearmanCorrCoef                   | 0     
7  | valid_R2       | R2Score                            | 0     
8  | valid_pearson  | PearsonCorrCoef                    | 0     
9  | test_loss      | MeanSquaredError                   | 0     
10 | test_spearman  | SpearmanCorrCoef                   | 0     
11 | test_R2        | R2Score                            | 0     
12 | test_pearson   | PearsonCorrCoef                    | 0     
-----------------------------------------------------------------------
1.1 M     Trainable params
33.7 M    Non-trainable params
34.8 M    Total params
139.039   Total estimated model params size (MB)
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/data.py:104: Total length of `DataLoader` across ranks is zero. Please make sure this was your intention.
Epoch 1: 100% 118/118 [00:12<00:00, 9.36it/s, loss=38.80]
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=2` reached.
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Some weights of the model checkpoint at westlake-repl/SaProt_35M_AF2 were not used when initializing EsmForSequenceClassification: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing EsmForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EsmForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at westlake-repl/SaProt_35M_AF2 and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `SpearmanCorrcoef` will save all targets and predictions in the buffer. For large datasets, this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
No optimizer_kwargs provided. The default optimizer is AdamW.
---------------------------------------------------------------------------
HFValidationError                         Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/peft/config.py](https://localhost:8080/#) in _get_peft_type(cls, model_id, **hf_hub_download_kwargs)
    196             try:
--> 197                 config_file = hf_hub_download(
    198                     model_id,

9 frames
[/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py](https://localhost:8080/#) in _inner_fn(*args, **kwargs)
    105             if arg_name in ["repo_id", "from_id", "to_id"]:
--> 106                 validate_repo_id(arg_value)
    107 

[/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py](https://localhost:8080/#) in validate_repo_id(repo_id)
    153     if repo_id.count("/") > 1:
--> 154         raise HFValidationError(
    155             "Repo id must be in the form 'repo_name' or 'namespace/repo_name':"

HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/content/SaprotHub/adapters/regression/Local/Model-demo-35M'. Use `repo_type` argument if needed.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
[<ipython-input-3-932a5a0b7f0b>](https://localhost:8080/#) in <cell line: 258>()
    256 
    257 from saprot.scripts.training import finetune
--> 258 finetune(config)
    259 
    260 

[/usr/local/lib/python3.10/dist-packages/saprot/scripts/training.py](https://localhost:8080/#) in finetune(config)
     50                 config.model.kwargs.lora_kwargs.config_list = [{"lora_config_path": model.save_path}]
     51 
---> 52             model = my_load_model(config.model)
     53 
     54         else:

[/usr/local/lib/python3.10/dist-packages/saprot/utils/module_loader.py](https://localhost:8080/#) in my_load_model(config)
     35       if 'num_labels' in model_config: del model_config['num_labels']
     36       from model.saprot.saprot_regression_model import SaprotRegressionModel
---> 37       return SaprotRegressionModel(**model_config)
     38 
     39     if model_type == "saprot/saprot_pair_classification_model":

[/usr/local/lib/python3.10/dist-packages/saprot/model/saprot/saprot_regression_model.py](https://localhost:8080/#) in __init__(self, test_result_path, **kwargs)
     16         """
     17         self.test_result_path = test_result_path
---> 18         super().__init__(task="regression", **kwargs)
     19 
     20     def initialize_metrics(self, stage):

[/usr/local/lib/python3.10/dist-packages/saprot/model/saprot/base.py](https://localhost:8080/#) in __init__(self, task, config_path, extra_config, load_pretrained, freeze_backbone, gradient_checkpointing, lora_kwargs, **kwargs)
     67 
     68             self.lora_kwargs = EasyDict(lora_kwargs)
---> 69             self._init_lora()
     70 
     71         self.valid_metrics_list = {}

[/usr/local/lib/python3.10/dist-packages/saprot/model/saprot/base.py](https://localhost:8080/#) in _init_lora(self)
     94                 if i == 0:
     95                     # If i == 0, initialize a PEFT model
---> 96                     self.model = PeftModelForSequenceClassification.from_pretrained(self.model,
     97                                                                                     lora_config_path,
     98                                                                                     adapter_name=adapter_name,

[/usr/local/lib/python3.10/dist-packages/saprot/model/saprot/self_peft/peft_model.py](https://localhost:8080/#) in from_pretrained(cls, model, model_id, adapter_name, is_trainable, config, **kwargs)
    326         if config is None:
    327             config = PEFT_TYPE_TO_CONFIG_MAPPING[
--> 328                 PeftConfig._get_peft_type(
    329                     model_id,
    330                     subfolder=kwargs.get("subfolder", None),

[/usr/local/lib/python3.10/dist-packages/peft/config.py](https://localhost:8080/#) in _get_peft_type(cls, model_id, **hf_hub_download_kwargs)
    201                 )
    202             except Exception:
--> 203                 raise ValueError(f"Can't find '{CONFIG_NAME}' at '{model_id}'")
    204 
    205         loaded_attributes = cls.from_json_file(config_file)

ValueError: Can't find 'adapter_config.json' at '/content/SaprotHub/adapters/regression/Local/Model-demo-35M'

This goes through too many nested calls for me to debug right now, as I am not familiar with the codebase. The gist, as far as I can see, is that the finetune function, after training, initializes a model wrapper class via SaprotRegressionModel(**model_config). This class expects model_config to contain a path to the adapter weights, but it cannot find them there. The directory was successfully created, but it is indeed empty of weights, indicating that the finetune function is either not saving the weights or saving them in the wrong place.
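For what it's worth, here is a minimal, self-contained sketch using plain PEFT (not the SaprotHub wrappers) of what the reload step appears to expect: calling save_pretrained on the trained PEFT model is what produces adapter_config.json in the save path, and that file is exactly what the directory is missing.

import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Toy stand-in model, just to show which files a saved PEFT adapter contains.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 8)

    def forward(self, x):
        return self.proj(x)

# Wrap the toy model with a LoRA adapter and save it.
peft_model = get_peft_model(Toy(), LoraConfig(target_modules=["proj"]))
peft_model.save_pretrained("/tmp/toy_adapter")
# /tmp/toy_adapter now holds adapter_config.json plus the adapter weights,
# which is the layout PeftModelForSequenceClassification.from_pretrained(...)
# looks for in '/content/SaprotHub/adapters/regression/Local/Model-demo-35M'.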

LTEnjoy commented 1 week ago

Hi, could you try rerunning this with our latest Colab notebook?

https://colab.research.google.com/github/westlake-repl/SaprotHub/blob/main/colab/SaprotHub.ipynb

We have run a lot of tests, so it should no longer run into this error.