westlake-repl / SaProt

[ICLR'24 spotlight] Saprot: Protein Language Model with Structural Alphabet
MIT License

dynamic model/dataset selection is not working #3

Closed igortru closed 8 months ago

igortru commented 8 months ago

For some reason, in my Python environment the classmethod doesn't see the global variable. Any advice on how this can be fixed without changing the code would be appreciated.

I have rewritten it using a more straightforward "hack":

```python
import importlib
import os


class ModelInterface:
    @classmethod
    def init_model(cls, model_py_path: str, **kwargs):
        sub_dirs = model_py_path.split(os.sep)
        module_name = '.'.join(sub_dirs[:])
        module = importlib.import_module(module_name)

        # Pick the model class out of the imported module
        # (e.g. "EsmRegressionModel")
        objs = dir(module)
        model_cls = getattr(module, objs[1])

        return model_cls(**kwargs)
```
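As a side note, here is a sketch of a slightly more defensive variant that selects the class by an explicit name instead of by its position in `dir(module)` (whose ordering is alphabetical and therefore fragile). The example path and class name below are illustrative, not SaProt's actual layout:

```python
import importlib
import os


def load_model_class(model_py_path: str, class_name: str):
    """Import the module at `model_py_path` (an os.sep-separated path without
    the .py suffix) and return the class named `class_name` from it."""
    module_name = ".".join(model_py_path.split(os.sep))
    module = importlib.import_module(module_name)
    # Explicit lookup by name avoids depending on dir() ordering.
    return getattr(module, class_name)


# Hypothetical usage:
# model_cls = load_model_class(os.path.join("model", "esm", "esm_regression_model"),
#                              "EsmRegressionModel")
# model = model_cls(**kwargs)
```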
LTEnjoy commented 8 months ago

Could you please give the complete error information? Then I can figure out what is wrong with the code or your environment. If you create a conda environment following the README, it should run normally.

igortru commented 8 months ago

I have cloned SaProt again:

```
cd SaProt
export PYTHONPATH=.   # it looks like this is the FIX
```

copied SaProt_650M_AF2 to weights/PLMs, copied Thermostability to ./LMDB,

and ran:

```
python scripts/training.py -c config/Thermostability/saprot.yaml
```
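For what it's worth, the `export PYTHONPATH=.` step matters because the dynamic `importlib.import_module(...)` call resolves dotted module names relative to the repository root. A rough in-script equivalent, assuming you launch the scripts from outside the checkout, would be:

```python
import os
import sys

# Mirror `export PYTHONPATH=.` run from inside the SaProt checkout: put the
# repository root on sys.path so importlib.import_module() can resolve the
# dynamically selected model/dataset modules. The path is an example value.
repo_root = os.path.abspath(".")  # or the absolute path to your SaProt clone
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)
```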

The last error now is `TypeError: optimizer_step() takes from 4 to 8 positional arguments but 9 were given`:

```
1,305.768 Total estimated model params size (MB)
Epoch 0:   0%|          | 0/3166 [00:00<?, ?it/s]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py:363: LightningDeprecationWarning: The NVIDIA/apex AMP implementation has been deprecated upstream. Consequently, its integration inside PyTorch Lightning has been deprecated in v1.9.0 and will be removed in v2.0.0. The `EsmRegressionModel.optimizer_step()` hook is overridden, including the `using_native_amp` argument. Removing this argument will avoid this message, you can expect it to return True.

Traceback (most recent call last):
  File "/home/tolstoy/SaProt/scripts/training.py", line 73, in <module>
    main(get_args())
  File "/home/tolstoy/SaProt/scripts/training.py", line 69, in main
    run(config)
  File "/home/tolstoy/SaProt/scripts/training.py", line 28, in run
    trainer.fit(model=model, datamodule=data_module)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 609, in fit
    self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule,
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer,
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.opt
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 379, in _optimizer_step
    using_lbfgs=is_lbfgs,
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/tolstoy/SaProt/model/abstract_model.py", line 135, in optimizer_step
    epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, on_tpu, using
TypeError: optimizer_step() takes from 4 to 8 positional arguments but 9 were given
Epoch 0:   0%|          | 0/3166 [00:03<?, ?it/s]
```

igortru commented 8 months ago

I think the issue can be closed. I have installed the correct version of pytorch-lightning:

```
pip install pytorch-lightning==1.8.3
```
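For context on the TypeError above: newer pytorch-lightning releases appear to have dropped the `using_native_amp` argument from the `optimizer_step()` hook, while the override in `model/abstract_model.py` still forwards it positionally, so pinning 1.8.x restores a compatible hook. Below is a rough sketch of the 1.8-era hook shape (approximate, not verbatim library or SaProt code, and the class name is made up):

```python
from pytorch_lightning import LightningModule


class RegressionModule(LightningModule):
    # Around pytorch-lightning 1.8.x the hook accepted 8 arguments besides self,
    # including `using_native_amp`. Forwarding all of them positionally (as the
    # SaProt abstract model does) is what produces
    # "optimizer_step() takes from 4 to 8 positional arguments but 9 were given"
    # once a newer release removes `using_native_amp` from the parent hook.
    def optimizer_step(
        self,
        epoch,
        batch_idx,
        optimizer,
        optimizer_idx=0,
        optimizer_closure=None,
        on_tpu=False,
        using_native_amp=False,
        using_lbfgs=False,
    ):
        super().optimizer_step(
            epoch, batch_idx, optimizer, optimizer_idx,
            optimizer_closure, on_tpu, using_native_amp, using_lbfgs,
        )
```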

Now I have a problem which is out of your scope:

```
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.78 GiB total capacity;
14.55 GiB already allocated; 16.44 MiB free; 14.72 GiB reserved in total by PyTorch) If reserved
memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See
documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0:   0%|          | 1/3166 [00:14<12:29:38, 14.21s/it, loss=0.167]
```
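As an aside, the allocator hint mentioned in the message can be set before CUDA is initialized; it only mitigates fragmentation rather than total memory pressure, so it may not be enough on its own. A minimal sketch (the 128 MiB value is just an example):

```python
import os

# Must be set before the first CUDA allocation, i.e. before torch touches the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the env var on purpose

print(torch.cuda.is_available())
```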

LTEnjoy commented 8 months ago

Sure! You can change the hyper-parameters in the config file:

[screenshot of the config file showing the hyper-parameters]

If you have any questions, feel free to raise an issue again!

igortru commented 8 months ago

I have set all parameters to their possible minimums. It looks like two 16 GB V100s are not enough for your model.

LTEnjoy commented 8 months ago

Our model has 650M parameters, so I guess even 16GB is not enough for full fine-tuning. Maybe you can try freezing the earlier part of the backbone.
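Here is a generic sketch of what freezing a backbone looks like in PyTorch; the attribute name `model.backbone` is an assumption, not necessarily how SaProt exposes it:

```python
import torch


def freeze_backbone(model: torch.nn.Module) -> None:
    """Disable gradients for the (assumed) `backbone` submodule so only the
    task head is updated, which sharply reduces optimizer/gradient memory."""
    for param in model.backbone.parameters():
        param.requires_grad = False
    # Optional: keep the frozen part in eval mode so dropout behaves as at
    # inference time.
    model.backbone.eval()
```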

igortru commented 8 months ago

I have added freeze_backbone=True in an attempt to reduce memory consumption, as you suggested. Now it reports:

```
RuntimeError: Predictions and targets are expected to have the same shape
Epoch 0:   0%|          | 0/3166 [00:04<?, ?it/s]
```

```python
class EsmRegressionModel(EsmBaseModel):
    def __init__(self, test_result_path: str = None, **kwargs):
        """
        Args:
            test_result_path: path to save test result
            kwargs: other arguments for EsmBaseModel
        """
        self.test_result_path = test_result_path
        super().__init__(task="regression", freeze_backbone=True, **kwargs)
```

It was previously `super().__init__(task="regression", **kwargs)`.

Can you please provide more info about backbone freezing? How can I do it?

LTEnjoy commented 8 months ago

You can try it again and it should be solved now. There was a bug in the forward method and I fixed it.
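For readers hitting the same "Predictions and targets are expected to have the same shape" error elsewhere: it is commonly a regression head emitting `(batch, 1)` outputs against `(batch,)` targets, and squeezing the trailing dimension aligns them. This is a generic illustration, not necessarily the exact fix that was applied here:

```python
import torch

preds = torch.randn(4, 1)   # e.g. raw output of a regression head: shape (batch, 1)
targets = torch.randn(4)    # labels: shape (batch,)

# Align shapes before passing both tensors to a metric such as MSE or R^2.
preds = preds.squeeze(-1)   # -> shape (batch,)
assert preds.shape == targets.shape
```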

igortru commented 8 months ago

it is working now.

igortru commented 8 months ago

Next error:

```
   78     if n_obs < 2:
 ❱ 79         raise ValueError("Needs at least two samples to calculate r2 score.")
   80
   81     mean_obs = sum_obs / n_obs
   82     tss = sum_squared_obs - sum_obs * mean_obs

ValueError: Needs at least two samples to calculate r2 score.
Epoch 0:  90%|████████▉ | 2845/3166 [05:58<00:40, 7.94it/s, loss=0.0267]
```

When I changed the batch size from 2 to 3, the error disappeared.

LTEnjoy commented 8 months ago

The error occurs when the metric is calculated on a batch that contains only 1 sample. I have fixed it by skipping the calculation when a batch contains only a single sample.
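A generic sketch of that kind of guard, using torchmetrics' `R2Score`; the helper name is illustrative rather than SaProt's actual code:

```python
import torch
from torchmetrics import R2Score

r2 = R2Score()


def update_r2(preds: torch.Tensor, targets: torch.Tensor) -> None:
    # R^2 is undefined for fewer than two observations, so skip updating the
    # metric on degenerate single-sample batches instead of raising.
    if preds.numel() < 2:
        return
    r2.update(preds.flatten(), targets.flatten())
```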