zubair-irshad / CenterSnap

Pytorch code for ICRA'22 paper: "Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation"
https://zubair-irshad.github.io/projects/CenterSnap.html

train command for CAMERA dataset #24

Closed skarapanahalli closed 1 year ago

skarapanahalli commented 1 year ago

Hi, excellent work, Irshad & team! I am trying to follow the instructions on GitHub to train. I have set up a Windows environment with GPU support and am running the commands from a Jupyter notebook (I have downloaded both the code and the datasets). However, when I run

    %run net_train.py "@configs/net_config.txt"

it results in a division-by-zero error:

    ZeroDivisionError                         Traceback (most recent call last)
    D:\aiexpts\6dof\gcasp\CenterSnap\net_train.py in <module>
         59 steps = hparams.max_steps
         60 steps_per_epoch = samples_per_epoch // samples_per_step
    ---> 61 epochs = int(np.ceil(steps / steps_per_epoch))
         62 actual_steps = epochs * steps_per_epoch
         63 print('Samples per epoch', samples_per_epoch)

When I tried to debug, I found that the simnet code is looking for pickle files (self.dataset_path.glob('*.pickle.zstd')), but the CAMERA & Real datasets do not contain these files, so no samples are found and steps_per_epoch ends up as zero.

Am I missing something?

zubair-irshad commented 1 year ago

Hello @skarapanahalli, thanks for finding our work interesting. For the Jupyter notebook demos, you don't need the downloaded data or pickle files, but for training our model you need to download or create the data locally, as we mention here.
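
As a quick sanity check before training (just a rough snippet, not part of the repo; adjust the path to wherever you extracted or generated the data), you can verify that the directory you pass to the dataloader actually contains the files it globs for:

    from pathlib import Path

    # Hypothetical path -- point this at the dataset split you pass to net_train.py.
    dataset_path = Path("data/CAMERA/train")

    # The simnet dataloader globs for zstd-compressed pickle files.
    files = sorted(dataset_path.glob("*.pickle.zstd"))
    print(f"Found {len(files)} preprocessed samples in {dataset_path}")
    if not files:
        print("No *.pickle.zstd files found -- samples_per_epoch will be 0, "
              "which is exactly what triggers the ZeroDivisionError in net_train.py")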

skarapanahalli commented 1 year ago

Thanks @zubair-irshad, I have now downloaded all the required data. Training is now able to proceed; however, I hit another issue, a KeyError, during training. Copying the logs here:

    Samples per epoch 1028
    Steps per epoch 32
    Target steps: 380000
    Actual steps: 380000
    Epochs: 11875
    Using model class from: D:\aiexpts\6dof\gcasp\CenterSnap\simnet\lib\net\models\panoptic_net.py
    INFO:lightning:Running in fast_dev_run mode: will run a full train, val and test loop using a single batch
    INFO:lightning:GPU available: True, used: True
    INFO:lightning:CUDA_VISIBLE_DEVICES: [0]
    INFO:lightning:
        | Name                                        | Type        | Params
    0   | model                                       | PanopticNet | 5 M
    1   | model.backbone                              | FPN         | 4 M
    2   | model.backbone.fpn_lateral2                 | Conv2d      | 4 K
    3   | model.backbone.fpn_lateral2.norm            | BatchNorm2d | 128
    .... (rows 4-352 removed for brevity) ....
    353 | model.keypoint_head.heatmap_head.p5.5       | Upsample    | 0
    354 | model.keypoint_head.heatmap_head.predictor  | Conv2d      | 33
    355 | model.keypoint_head.activation              | Sigmoid     | 0

    KeyError                                  Traceback (most recent call last)
    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\traitlets\traitlets.py in get(self, obj, cls)
        534         try:
    --> 535             value = obj._trait_values[self.name]
        536         except KeyError:

    KeyError: 'layout'

    During handling of the above exception, another exception occurred:

    NotImplementedError                       Traceback (most recent call last)
    D:\aiexpts\6dof\gcasp\CenterSnap\net_train.py in <module>
         99 )
        100
    --> 101 trainer.fit(model)

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\pytorch_lightning\trainer\trainer.py in fit(self, model, train_dataloader, val_dataloaders)
        763
        764         elif self.single_gpu:
    --> 765             self.single_gpu_train(model)
        766
        767         elif self.use_tpu:  # pragma: no-cover

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\pytorch_lightning\trainer\distrib_parts.py in single_gpu_train(self, model)
        490         self.optimizers = optimizers
        491
    --> 492         self.run_pretrain_routine(model)
        493
        494     def tpu_train(self, tpu_core_idx, model):

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\pytorch_lightning\trainer\trainer.py in run_pretrain_routine(self, model)
        911
        912         # CORE TRAINING LOOP
    --> 913         self.train()
        914
        915     def test(self, model: Optional[LightningModule] = None, test_dataloaders: Optional[DataLoader] = None):

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\pytorch_lightning\trainer\training_loop.py in train(self)
        312         with self.profiler.profile('on_train_start'):
        313             # callbacks
    --> 314             self.on_train_start()
        315         # initialize early stop callback
        316         if self.early_stop_callback is not None:

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\pytorch_lightning\trainer\callback_hook.py in on_train_start(self)
         46         """Called when the train begins."""
         47         for callback in self.callbacks:
    ---> 48             callback.on_train_start(self, self.get_model())
         49
         50     def on_train_end(self):

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\pytorch_lightning\callbacks\progress.py in on_train_start(self, trainer, pl_module)
        305     def on_train_start(self, trainer, pl_module):
        306         super().on_train_start(trainer, pl_module)
    --> 307         self.main_progress_bar = self.init_train_tqdm()
        308
        309     def on_epoch_start(self, trainer, pl_module):

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\pytorch_lightning\callbacks\progress.py in init_train_tqdm(self)
        256     def init_train_tqdm(self) -> tqdm:
        257         """ Override this to customize the tqdm bar for training. """
    --> 258         bar = tqdm(
        259             desc='Training',
        260             initial=self.train_batch_idx,

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\tqdm\notebook.py in __init__(self, *args, **kwargs)
        237         unit_scale = 1 if self.unit_scale is True else self.unit_scale or 1
        238         total = self.total * unit_scale if self.total else self.total
    --> 239         self.container = self.status_printer(self.fp, total, self.desc, self.ncols)
        240         self.container.pbar = self
        241         self.displayed = False

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\tqdm\notebook.py in status_printer(_, total, desc, ncols)
        117             pbar = IProgress(min=0, max=total)
        118         else:  # No total? Show info style bar with no progress tqdm status
    --> 119             pbar = IProgress(min=0, max=1)
        120             pbar.value = 1
        121             pbar.bar_style = 'info'

    ~\AppData\Roaming\Python\Python38\site-packages\ipywidgets\widgets\widget_float.py in __init__(self, value, **kwargs)
         24         if value is not None:
         25             kwargs['value'] = value
    ---> 26         super().__init__(**kwargs)
         27
         28

    ~\AppData\Roaming\Python\Python38\site-packages\ipywidgets\widgets\widget_description.py in __init__(self, *args, **kwargs)
         33             kwargs.setdefault('tooltip', kwargs['description_tooltip'])
         34             del kwargs['description_tooltip']
    ---> 35         super().__init__(*args, **kwargs)
         36
         37     def _repr_keys(self):

    ~\AppData\Roaming\Python\Python38\site-packages\ipywidgets\widgets\widget.py in __init__(self, **kwargs)
        502
        503         Widget._call_widget_constructed(self)
    --> 504         self.open()
        505
        506     def __del__(self):

    ~\AppData\Roaming\Python\Python38\site-packages\ipywidgets\widgets\widget.py in open(self)
        515         """Open a comm to the frontend if one isn't already open."""
        516         if self.comm is None:
    --> 517             state, buffer_paths, buffers = _remove_buffers(self.get_state())
        518
        519             args = dict(target_name='jupyter.widget',

    ~\AppData\Roaming\Python\Python38\site-packages\ipywidgets\widgets\widget.py in get_state(self, key, drop_defaults)
        613         for k in keys:
        614             to_json = self.trait_metadata(k, 'to_json', self._trait_to_json)
    --> 615             value = to_json(getattr(self, k), self)
        616             if not drop_defaults or not self._compare(value, traits[k].default_value):
        617                 state[k] = value

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\traitlets\traitlets.py in __get__(self, obj, cls)
        573             return self
        574         else:
    --> 575             return self.get(obj, cls)
        576
        577     def __set__(self, obj, value):

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\traitlets\traitlets.py in get(self, obj, cls)
        536         except KeyError:
        537             # Check for a dynamic initializer.
    --> 538             default = obj.trait_defaults(self.name)
        539             if default is Undefined:
        540                 warn(

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\traitlets\traitlets.py in trait_defaults(self, *names, **metadata)
       1576
       1577         if len(names) == 1 and len(metadata) == 0:
    -> 1578             return self._get_trait_default_generator(names[0])(self)
       1579
       1580         trait_names = self.trait_names(**metadata)

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\traitlets\traitlets.py in default(self, obj)
        509             return self.default_value
        510         elif hasattr(self, 'make_dynamic_default'):
    --> 511             return self.make_dynamic_default()
        512         else:
        513             # Undefined will raise in TraitType.get

    ~\AppData\Roaming\Python\Python38\site-packages\ipywidgets\widgets\trait_types.py in make_dynamic_default(self)
        363
        364     def make_dynamic_default(self):
    --> 365         return self.klass(*(self.default_args or ()),
        366                           **(self.default_kwargs or {}))
        367

    ~\AppData\Roaming\Python\Python38\site-packages\ipywidgets\widgets\widget_layout.py in __init__(self, **kwargs)
         84             kwargs.setdefault(f'border_{side}', border)
         85
    ---> 86         super().__init__(**kwargs)
         87
         88     def _get_border(self):

    ~\AppData\Roaming\Python\Python38\site-packages\ipywidgets\widgets\widget.py in __init__(self, **kwargs)
        502
        503         Widget._call_widget_constructed(self)
    --> 504         self.open()
        505
        506     def __del__(self):

    ~\AppData\Roaming\Python\Python38\site-packages\ipywidgets\widgets\widget.py in open(self)
        533             return Comm(**kwargs)
        534
    --> 535         self.comm = create_comm(**args)
        536
        537     @observe('comm')

    C:\ProgramData\miniconda3\envs\centersnap3\lib\site-packages\comm\__init__.py in _create_comm(*args, **kwargs)
         25     This method is intended to be replaced, so that it returns your Comm instance.
         26     """
    ---> 27     raise NotImplementedError("Cannot ")
         28
         29

    NotImplementedError: Cannot

zubair-irshad commented 1 year ago

@skarapanahalli We haven't seen this error before, but it looks like it might be a conda environment package mismatch. Since the trace does not point anywhere in our code itself, it is hard to tell which line triggered the error. Could you please share the full error trace in a more readable (code) format?

skarapanahalli commented 1 year ago

@zubair-irshad, the error is due to a wandb issue. I am not able to perform wandb.init(); it complains about a proxy issue even though I have everything set correctly. I can proceed further if I comment out the logger line below in net_train.py, but it fails later, perhaps for log-related reasons.

trainer = pl.Trainer(
    max_nb_epochs=epochs,
    early_stop_callback=None,
    gpus=[_GPU_TO_USE],
    checkpoint_callback=model_checkpoint,
    val_check_interval=1.0,
    logger=wandb_logger,  # <-- the logger line I comment out
    fast_dev_run=True,
    default_save_path=hparams.output,
    use_amp=False,
    print_nan_grads=True,

Is it possible to run the code without the wandb logger?

zubair-irshad commented 1 year ago

Did you run wandb.login() on the command line to set up wandb? Yes, it is quite easy to change it to any other logger, for instance TensorBoard. You can follow PyTorch Lightning's online documentation to read up on that. Hope it resolves your issue.

skarapanahalli commented 1 year ago

Yes, I did wandb.login() and it results in the same error. After much googling, I found that I need to update wandb; however, when I try to upgrade, it pulls in other dependencies that are not easy to resolve. The issue is not the code, but the handling of proxies by urllib3, as described here: https://stackoverflow.com/questions/66642705/why-requests-raise-this-exception-check-hostname-requires-server-hostname

zubair-irshad commented 1 year ago

I see. In that case, it might be worth switching to a TensorBoard logger, which is very easy to swap in with PyTorch Lightning using just a single line of code and a few import lines: https://www.pytorchlightning.ai/blog/tensorboard-with-pytorch-lightning

Please note that you might have to change a few image-logging functions inside our code as well to make them work with TensorBoard logging, but hopefully the link above will help you with this.
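
Roughly, the swap in net_train.py would look something like the sketch below (untested; hparams.output, epochs, _GPU_TO_USE, model_checkpoint and model are the names already used in the script, and the exact TensorBoardLogger arguments may differ slightly across Lightning versions):

    import pytorch_lightning as pl
    from pytorch_lightning.loggers import TensorBoardLogger

    # TensorBoard logger writing event files under the existing output directory;
    # "centersnap" is just an arbitrary run name.
    tb_logger = TensorBoardLogger(save_dir=hparams.output, name="centersnap")

    trainer = pl.Trainer(
        max_nb_epochs=epochs,
        gpus=[_GPU_TO_USE],
        checkpoint_callback=model_checkpoint,
        val_check_interval=1.0,
        logger=tb_logger,  # previously: logger=wandb_logger
        default_save_path=hparams.output,
    )
    trainer.fit(model)

You can then point tensorboard --logdir at the output directory to inspect the scalar logs.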

skarapanahalli commented 1 year ago

Thanks @zubair-irshad, I am finally able to run the training code. I switched to TensorBoard, but the TB visualizations won't work locally due to compatibility issues with the older packages. I have found a workaround: I copy the tb_logs to my desktop, upload them to TensorBoard.dev (online), and view them there! When you get time, I request you to update the code to run on the latest packages, or maybe create a Docker image which we can use directly. Nevertheless, I appreciate your quick responses!
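
(For anyone hitting the same visualization issue: the upload step is just something along the lines of `tensorboard dev upload --logdir <path to tb_logs>`, after which the run can be viewed in the browser.)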

zubair-irshad commented 1 year ago

Thanks for keeping us posted. I will close the issue now. Please feel free to open a PR if you feel this might help other people with logging issues.