Open EvanKomp opened 2 weeks ago
Hello, thank you very much for providing such useful feedback!
For the first error, as you said it can be easily fixed by moving all tensors from GPU to CPU. We have corrected this error and updated the notebook so it should work well now.
For the second error, could you provide more information about how you ran the training? It seems like the program didn't find the `adapter_config.json` file to initialize a LoRA module. If you have finished training and saved the model weights to a folder, it should contain a list of files like:
@LTEnjoy I was running a simple regression training! I lost the traceback because I did not copy the notebook. I will try to run it again and see if I can give something more informative.
EDIT: here is the full output and stack trace.
Training task type: regression
Dataset: /content/SaprotHub/LMDB/demo
Base Model: westlake-repl/SaProt_35M_AF2
====================================================================================================
{'Trainer': {'accelerator': 'gpu',
'accumulate_grad_batches': 2,
'devices': 1,
'enable_checkpointing': False,
'limit_test_batches': 1.0,
'limit_train_batches': 1.0,
'limit_val_batches': 1.0,
'log_every_n_steps': 1,
'logger': False,
'max_epochs': 2,
'num_nodes': 1,
'num_sanity_val_steps': 0,
'precision': 16,
'strategy': {},
'val_check_interval': 0.5},
'dataset': {'dataloader_kwargs': {'batch_size': 2, 'num_workers': 2},
'dataset_py_path': 'saprot/saprot_regression_dataset',
'kwargs': {'plddt_threshold': None,
'tokenizer': 'westlake-repl/SaProt_35M_AF2'},
'test_lmdb': '/content/SaprotHub/LMDB/demo/test',
'train_lmdb': '/content/SaprotHub/LMDB/demo/train',
'valid_lmdb': '/content/SaprotHub/LMDB/demo/valid'},
'model': {'kwargs': {'config_path': 'westlake-repl/SaProt_35M_AF2',
'extra_config': {'attention_probs_dropout_prob': 0,
'hidden_dropout_prob': 0},
'load_pretrained': True,
'lora_kwargs': {'config_list': [], 'num_lora': 1}},
'lr_scheduler_kwargs': {'class': 'ConstantLRScheduler',
'init_lr': 0.001},
'model_py_path': 'saprot/saprot_regression_model',
'save_path': '/content/SaprotHub/adapters/regression/Local/Model-demo-35M'},
'setting': {'os_environ': {'CUDA_VISIBLE_DEVICES': '0,1,2,3,4,5,6,7',
'MASTER_ADDR': 'localhost',
'MASTER_PORT': 12315,
'NODE_RANK': 0,
'WANDB_API_KEY': None,
'WANDB_RUN_ID': None,
'WORLD_SIZE': 1},
'run_mode': 'train',
'seed': 20000812}}
====================================================================================================
/usr/local/lib/python3.10/dist-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
tokenizer_config.json: 100%
40.0/40.0 [00:00<00:00, 857B/s]
vocab.txt: 100%
1.35k/1.35k [00:00<00:00, 22.7kB/s]
special_tokens_map.json: 100%
125/125 [00:00<00:00, 1.75kB/s]
config.json: 100%
645/645 [00:00<00:00, 24.1kB/s]
pytorch_model.bin: 100%
137M/137M [00:01<00:00, 114MB/s]
Some weights of the model checkpoint at westlake-repl/SaProt_35M_AF2 were not used when initializing EsmForSequenceClassification: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing EsmForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EsmForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at westlake-repl/SaProt_35M_AF2 and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `SpearmanCorrcoef` will save all targets and predictions in the buffer. For large datasets, this may lead to large memory footprint.
warnings.warn(*args, **kwargs)
No optimizer_kwargs provided. The default optimizer is AdamW.
Now active LoRA model: default
trainable params: 1,060,801 || all params: 34,759,682 || trainable%: 3.0518144556098066
/usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:558: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
INFO:pytorch_lightning.utilities.rank_zero:Using 16bit Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.utilities.rank_zero:`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
INFO:pytorch_lightning.utilities.rank_zero:`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
INFO:pytorch_lightning.utilities.rank_zero:`Trainer(limit_test_batches=1.0)` was configured so 100% of the batches will be used..
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
INFO:pytorch_lightning.callbacks.model_summary:
| Name | Type | Params
-----------------------------------------------------------------------
0 | model | PeftModelForSequenceClassification | 34.8 M
1 | train_loss | MeanSquaredError | 0
2 | train_spearman | SpearmanCorrCoef | 0
3 | train_R2 | R2Score | 0
4 | train_pearson | PearsonCorrCoef | 0
5 | valid_loss | MeanSquaredError | 0
6 | valid_spearman | SpearmanCorrCoef | 0
7 | valid_R2 | R2Score | 0
8 | valid_pearson | PearsonCorrCoef | 0
9 | test_loss | MeanSquaredError | 0
10 | test_spearman | SpearmanCorrCoef | 0
11 | test_R2 | R2Score | 0
12 | test_pearson | PearsonCorrCoef | 0
-----------------------------------------------------------------------
1.1 M Trainable params
33.7 M Non-trainable params
34.8 M Total params
139.039 Total estimated model params size (MB)
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/data.py:104: Total length of `DataLoader` across ranks is zero. Please make sure this was your intention.
Epoch 1: 100%
118/118 [00:12<00:00, 9.36it/s, loss=38.80]
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=2` reached.
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Some weights of the model checkpoint at westlake-repl/SaProt_35M_AF2 were not used when initializing EsmForSequenceClassification: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing EsmForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EsmForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at westlake-repl/SaProt_35M_AF2 and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `SpearmanCorrcoef` will save all targets and predictions in the buffer. For large datasets, this may lead to large memory footprint.
warnings.warn(*args, **kwargs)
No optimizer_kwargs provided. The default optimizer is AdamW.
---------------------------------------------------------------------------
HFValidationError Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/peft/config.py](https://localhost:8080/#) in _get_peft_type(cls, model_id, **hf_hub_download_kwargs)
196 try:
--> 197 config_file = hf_hub_download(
198 model_id,
9 frames
[/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py](https://localhost:8080/#) in _inner_fn(*args, **kwargs)
105 if arg_name in ["repo_id", "from_id", "to_id"]:
--> 106 validate_repo_id(arg_value)
107
[/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py](https://localhost:8080/#) in validate_repo_id(repo_id)
153 if repo_id.count("/") > 1:
--> 154 raise HFValidationError(
155 "Repo id must be in the form 'repo_name' or 'namespace/repo_name':"
HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/content/SaprotHub/adapters/regression/Local/Model-demo-35M'. Use `repo_type` argument if needed.
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
[<ipython-input-3-932a5a0b7f0b>](https://localhost:8080/#) in <cell line: 258>()
256
257 from saprot.scripts.training import finetune
--> 258 finetune(config)
259
260
[/usr/local/lib/python3.10/dist-packages/saprot/scripts/training.py](https://localhost:8080/#) in finetune(config)
50 config.model.kwargs.lora_kwargs.config_list = [{"lora_config_path": model.save_path}]
51
---> 52 model = my_load_model(config.model)
53
54 else:
[/usr/local/lib/python3.10/dist-packages/saprot/utils/module_loader.py](https://localhost:8080/#) in my_load_model(config)
35 if 'num_labels' in model_config: del model_config['num_labels']
36 from model.saprot.saprot_regression_model import SaprotRegressionModel
---> 37 return SaprotRegressionModel(**model_config)
38
39 if model_type == "saprot/saprot_pair_classification_model":
[/usr/local/lib/python3.10/dist-packages/saprot/model/saprot/saprot_regression_model.py](https://localhost:8080/#) in __init__(self, test_result_path, **kwargs)
16 """
17 self.test_result_path = test_result_path
---> 18 super().__init__(task="regression", **kwargs)
19
20 def initialize_metrics(self, stage):
[/usr/local/lib/python3.10/dist-packages/saprot/model/saprot/base.py](https://localhost:8080/#) in __init__(self, task, config_path, extra_config, load_pretrained, freeze_backbone, gradient_checkpointing, lora_kwargs, **kwargs)
67
68 self.lora_kwargs = EasyDict(lora_kwargs)
---> 69 self._init_lora()
70
71 self.valid_metrics_list = {}
[/usr/local/lib/python3.10/dist-packages/saprot/model/saprot/base.py](https://localhost:8080/#) in _init_lora(self)
94 if i == 0:
95 # If i == 0, initialize a PEFT model
---> 96 self.model = PeftModelForSequenceClassification.from_pretrained(self.model,
97 lora_config_path,
98 adapter_name=adapter_name,
[/usr/local/lib/python3.10/dist-packages/saprot/model/saprot/self_peft/peft_model.py](https://localhost:8080/#) in from_pretrained(cls, model, model_id, adapter_name, is_trainable, config, **kwargs)
326 if config is None:
327 config = PEFT_TYPE_TO_CONFIG_MAPPING[
--> 328 PeftConfig._get_peft_type(
329 model_id,
330 subfolder=kwargs.get("subfolder", None),
[/usr/local/lib/python3.10/dist-packages/peft/config.py](https://localhost:8080/#) in _get_peft_type(cls, model_id, **hf_hub_download_kwargs)
201 )
202 except Exception:
--> 203 raise ValueError(f"Can't find '{CONFIG_NAME}' at '{model_id}'")
204
205 loaded_attributes = cls.from_json_file(config_file)
ValueError: Can't find 'adapter_config.json' at '/content/SaprotHub/adapters/regression/Local/Model-demo-35M'
This has too many nested calls for me to debug right now, as I am not familiar with the codebase. The gist, as far as I can see, is that the `finetune` function, post training, initializes a model wrapper class `SaprotRegressionModel(**model_config)`. This class expects `model_config` to contain a path to adapter weights, but it cannot find them. The directory was successfully created but it is indeed empty of weights, indicating that the `finetune` function is either not saving weights or saving them in the wrong place.
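A quick way to confirm this diagnosis is to inspect the save directory before the reload step. The snippet below is a minimal sketch using only the standard library. `inspect_adapter_dir` is a hypothetical helper, not part of SaprotHub or PEFT. The file name `adapter_config.json` and the "more than one slash" rule are taken from the error messages in the traceback above; the idea that PEFT tried the Hub because no local config file was found is an assumption based on that traceback.

```python
from pathlib import Path

# File name taken from the ValueError in the traceback above.
CONFIG_NAME = "adapter_config.json"


def inspect_adapter_dir(save_path: str) -> dict:
    """Report whether save_path looks like a saved LoRA adapter directory.

    Hypothetical helper for debugging; not part of SaprotHub or PEFT.
    """
    path = Path(save_path)
    files = sorted(p.name for p in path.iterdir()) if path.is_dir() else []
    return {
        "exists": path.is_dir(),
        "files": files,
        "has_adapter_config": CONFIG_NAME in files,
        # The Hub validator in the traceback rejects ids with more than one "/",
        # so a local path like this one can never be resolved as a repo id.
        "valid_repo_id_shape": save_path.count("/") <= 1,
    }


if __name__ == "__main__":
    report = inspect_adapter_dir(
        "/content/SaprotHub/adapters/regression/Local/Model-demo-35M"
    )
    print(report)
```

If `exists` is true but `has_adapter_config` is false, the directory was created and nothing was written into it, which matches what I observed.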
Hi, can you try rerunning this on our latest colab link?
https://colab.research.google.com/github/westlake-repl/SaprotHub/blob/main/colab/SaprotHub.ipynb
We ran a lot of tests, so it should no longer run into this error.
Hello - I am trying to run a finetuning (potentially a useful model for the hub!) and ran into a couple of errors.
`results` is a list of size-1 tensors. This was easily fixed by `results = [r.cpu().item() for r in results]` and then manually downloading.
This goes into the importable `finetune` function. I do not currently have the time to go fix it and open a pull request.
Thanks for your work in making PLMs zero shot and finetuning more accessible! Really great work.
Evan
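For anyone hitting the first error, the workaround above can be sketched as a standalone snippet. It assumes PyTorch is installed. The `results` list here is made-up stand-in data, and `tensors_to_floats` is a hypothetical helper name, not something from the notebook.

```python
import torch


def tensors_to_floats(results):
    """Convert a list of single-element tensors (on any device) to Python floats."""
    # .cpu() copies the tensor to host memory if it lives on the GPU;
    # .item() then extracts the single value as a plain Python number.
    return [r.cpu().item() for r in results]


if __name__ == "__main__":
    # Stand-in for the notebook's predictions: size-1 tensors, on GPU if available.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    results = [torch.tensor([v], device=device) for v in (0.5, 1.25, -2.0)]
    print(tensors_to_floats(results))
```

Once the values are plain floats, they can be written to a file or a table without any device-related errors.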