Could you please give the complete error information, so I can figure out what is wrong with the code or your environment? If you create a conda environment following the README, it should run normally.
I have git cloned SaProt again:

    cd SaProt
    export PYTHONPATH=.    (it looks like this is the fix)

copied SaProt_650M_AF2 to weights/PLMs, copied Thermostability to ./LMDB, and ran:

    python scripts/training.py -c config/Thermostability/saprot.yaml
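For context on the PYTHONPATH fix: it puts the repository root on Python's module search path so repo-local packages resolve. A minimal sketch of the equivalent from inside Python, assuming the working directory is the SaProt checkout:

    import sys

    # Same effect as `export PYTHONPATH=.` run from the SaProt root: make
    # repo-local imports such as `from model.abstract_model import ...` resolve.
    sys.path.insert(0, ".")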
The last error now is:

TypeError: optimizer_step() takes from 4 to 8 positional arguments but 9 were given
1,305.768 Total estimated model params size (MB)
Epoch 0: 0%| | 0/3166 [00:00<?, ?it/s]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py:363: LightningDeprecationWarning: The NVIDIA/apex AMP implementation has been deprecated upstream. Consequently, its integration inside PyTorch Lightning has been deprecated in v1.9.0 and will be removed in v2.0.0.
The `EsmRegressionModel.optimizer_step()` hook is overridden, including the `using_native_amp` argument. Removing this argument will avoid this message, you can expect it to return True.
  "The NVIDIA/apex AMP implementation has been deprecated upstream. Consequently, its integration inside"
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/tolstoy/SaProt/scripts/training.py:73 in ...                                               │
│ ...                                                                                               │
│   248 │   │   │   # batch_idx is optional with inter-batch parallelism                            │
│ ❱ 249 │   │   │   self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure    │
│ 250 │ │ │
│ 251 │ │ result = closure.consume_result() │
│ 252 │
│ │
│ /opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py:379 │
│ in _optimizer_step                                                                                 │
│ │
│ 376 │ │ │ train_step_and_backward_closure, │
│ 377 │ │ │ on_tpu=isinstance(self.trainer.accelerator, TPUAccelerator), │
│ 378 │ │ │ kwargs, # type: ignore[arg-type] │
│ ❱ 379 │ │ │ using_lbfgs=is_lbfgs, │
│ 380 │ │ ) │
│ 381 │ │ │
│ 382 │ │ if not should_accumulate: │
│ │
│ /opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py:1356 in │
│ _call_lightning_module_hook │
│ │
│ 1353 │ │ pl_module._current_fx_name = hook_name │
│ 1354 │ │ │
│ 1355 │ │ with self.profiler.profile(f"[LightningModule]{pl_module.class.name}.{ho │
│ ❱ 1356 │ │ │ output = fn(*args, **kwargs) │
│ 1357 │ │ │
│ 1358 │ │ # restore current_fx when nested context │
│ 1359 │ │ pl_module._current_fx_name = prev_fx_name │
│ │
│ /home/tolstoy/SaProt/model/abstract_model.py:135 in optimizer_step │
│ │
│ 132 │ │ using_lbfgs: bool = False, │
│ 133 │ ) -> None: │
│ 134 │ │ super().optimizer_step( │
│ ❱ 135 │ │ │ epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, on_tpu, using │
│ 136 │ │ ) │
│ 137 │ │ self.step += 1 │
│ 138 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: optimizer_step() takes from 4 to 8 positional arguments but 9 were given
Epoch 0: 0%| | 0/3166 [00:03<?, ?it/s]
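The version mismatch behind this TypeError: the optimizer_step() override in model/abstract_model.py forwards the Lightning 1.8-era argument list, while the installed 1.9.x base hook no longer accepts using_native_amp. A minimal sketch of that 1.8-style override, reconstructed from the traceback above (not a verbatim copy of SaProt's code):

    import pytorch_lightning as pl

    class AbstractModel(pl.LightningModule):
        # 1.8-era hook signature, as visible at model/abstract_model.py:135.
        def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                           optimizer_closure, on_tpu=False,
                           using_native_amp=False, using_lbfgs=False):
            # Under pytorch-lightning 1.8.x the base hook accepts all of these;
            # 1.9.x dropped `using_native_amp`, so this super() call passes
            # 9 positional arguments (incl. self) where only 4 to 8 are accepted.
            super().optimizer_step(epoch, batch_idx, optimizer, optimizer_idx,
                                   optimizer_closure, on_tpu, using_native_amp,
                                   using_lbfgs)
            self.step += 1

Pinning pytorch-lightning to a 1.8.x release keeps this override and the base class in agreement, which is exactly the fix reported below.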
I think the issue can be closed. I have installed the correct version of pytorch-lightning (pip install pytorch-lightning==1.8.3), and now I have a problem which is out of your scope:

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.78 GiB total capacity; 14.55 GiB already allocated; 16.44 MiB free; 14.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0: 0%| | 1/3166 [00:14<12:29:38, 14.21s/it, loss=0.167]
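As the OOM message itself suggests, one cheap thing to try before shrinking the model is tuning the CUDA caching allocator. A sketch of that knob; the value 128 is an arbitrary example, not a SaProt recommendation:

    import os

    # Must be set before the first CUDA allocation (e.g. at the very top of
    # scripts/training.py): caps the allocator's split size to reduce the
    # fragmentation the OOM message warns about.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"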
Sure! You can change the hyper-parameters in the config file.
If you have any questions, feel free to raise an issue again!
I have set all parameters to their possible minimums. It looks like two 16 GB V100s are not enough for your model.
Our model has 650M parameters, so I guess even 16 GB is not enough for full fine-tuning. Maybe you can try freezing the earlier part of the backbone.
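What "freezing the backbone" means in plain PyTorch, sketched generically (the attribute name backbone_attr is a placeholder, not SaProt's actual module layout):

    import torch.nn as nn

    def freeze_backbone(model: nn.Module, backbone_attr: str = "backbone") -> None:
        # Disable gradients for the pretrained backbone so no gradient buffers
        # or optimizer state are kept for it; only the task head is trained.
        for param in getattr(model, backbone_attr).parameters():
            param.requires_grad = False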
I have added freeze_backbone=True in an attempt to reduce memory consumption, as you suggested. Now it reports:

RuntimeError: Predictions and targets are expected to have the same shape
Epoch 0: 0%| | 0/3166 [00:04<?, ?it/s]
    class EsmRegressionModel(EsmBaseModel):
        def __init__(self, test_result_path: str = None, **kwargs):
            """
            Args:
                test_result_path: path to save test result
                **kwargs: other arguments for EsmBaseModel
            """
            self.test_result_path = test_result_path
            super().__init__(task="regression", freeze_backbone=True, **kwargs)
Can you please provide more info about freezing the backbone? How can I do it?
You can try it again and it should be solved now. There was a bug in the forward method and I fixed it.
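For readers hitting the same "Predictions and targets are expected to have the same shape" error: it typically means a regression head emits shape (batch, 1) while the labels have shape (batch,). A generic sketch of the usual one-line fix (not the actual SaProt commit):

    import torch

    preds = torch.randn(8, 1)   # e.g. raw output of a regression head
    targets = torch.randn(8)    # labels stored as a flat vector

    # Metric/loss shape checks require identical shapes, so drop the trailing
    # singleton dimension before comparing.
    preds = preds.squeeze(-1)
    assert preds.shape == targets.shape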
It is working now.
Next error:
│
│ 78 │ if n_obs < 2: │
│ ❱ 79 │ │ raise ValueError("Needs at least two samples to calculate r2 score.") │
│ 80 │ │
│ 81 │ mean_obs = sum_obs / n_obs │
│ 82 │ tss = sum_squared_obs - sum_obs * mean_obs │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Needs at least two samples to calculate r2 score.
Epoch 0: 90%|████████▉ | 2845/3166 [05:58<00:40, 7.94it/s, loss=0.0267]
When I changed the batch size from 2 to 3, the error disappeared.
The error is caused by calculating the metric with only 1 sample in a batch. I have already fixed it by skipping the calculation when a batch contains only 1 sample.
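The guard described above can be sketched like this (names are illustrative, not SaProt's actual code):

    import torch
    from torchmetrics import R2Score

    r2 = R2Score()

    def update_r2(preds: torch.Tensor, targets: torch.Tensor) -> None:
        # R^2 is undefined for fewer than two observations, so skip the
        # degenerate last batch instead of raising.
        if targets.numel() < 2:
            return
        r2.update(preds, targets)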
For some reason, in my Python environment the class method does not see a global variable. Any advice on how this can be fixed without changing the code would be appreciated.
I have rewritten it using a more straightforward "hack":

    import importlib
    import os

    class ModelInterface:
        @classmethod
        def init_model(cls, model_py_path: str, **kwargs):
            sub_dirs = model_py_path.split(os.sep)
            module_name = '.'.join(sub_dirs[:])
            module = importlib.import_module(module_name)
            objs = dir(module)
            model_cls = getattr(module, objs[1])  # objs[1] happens to be "EsmRegressionModel"