Open ferreirafabio opened 1 year ago
Hi @ferreirafabio, you probably need to edit some of the fields in the wandb
section of the config you're using. Try updating wandb.entity
to your personal Weights & Biases username or team.
Thx for the hint, that solved my issue. Off-topic: just wanna let you know that the download of your checkpoints is slow AF... downloading http://learn2learn.eecs.berkeley.edu/checkpoint_datasets/mnist.zip takes roughly 1 day at 800kb/s.
OK, update on the wandb issues. Training with CIFAR10 works, testing MNIST doesn't. I get a wandb timeout. So running this command
python main.py --config-path configs/test --config-name mnist_loss.yaml num_gpus=4
ends with this error:
...
WARNING: Isaac Gym not imported
Traceback (most recent call last):
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1140, in init
wi.setup(kwargs)
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 171, in setup
self._wl = wandb_setup.setup(settings=setup_settings)
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 327, in setup
ret = _setup(settings=settings)
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 320, in _setup
wl = _WandbSetup(settings=settings)
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 303, in __init__
_WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 114, in __init__
self._setup()
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 250, in _setup
self._setup_manager()
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 277, in _setup_manager
self._manager = wandb_manager._Manager(settings=self._settings)
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/wandb_manager.py", line 146, in __init__
self._service.start()
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 199, in start
self._launch_server()
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 193, in _launch_server
sentry_reraise(e, delay=True)
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/util.py", line 214, in sentry_reraise
raise exc.with_traceback(sys.exc_info()[2])
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 191, in _launch_server
self._wait_for_ports(fname, proc=internal_proc)
File "/home/ferreira/.miniconda/envs/ltft/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 141, in _wait_for_ports
raise ServiceStartTimeoutError(
wandb.sdk.service.service.ServiceStartTimeoutError: Timed out waiting for wandb service to start after 30.0 seconds. Try increasing the timeout with the `_service_wait` setting.
wandb: ERROR Abnormal program exit
Any idea what might go wrong here?
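The error message itself suggests raising the `_service_wait` setting. A minimal sketch of one way to try that, assuming the installed wandb version reads the `WANDB__SERVICE_WAIT` environment variable (in seconds) before it launches its service process. The helper name `set_wandb_service_wait` and the value 300 are made up for illustration and are not part of this repo:

```python
import os


def set_wandb_service_wait(seconds: int = 300) -> str:
    """Export a longer wandb service start timeout for this process.

    Assumption: wandb picks up WANDB__SERVICE_WAIT (note the double
    underscore) when it starts its background service.
    """
    value = str(int(seconds))
    os.environ["WANDB__SERVICE_WAIT"] = value
    return value


# Must run before wandb.init() is called anywhere in the process.
set_wandb_service_wait(300)
```

This only buys the service more time to start. If the timeout comes from several processes launching wandb at once, a longer wait may not be enough on its own.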
OK, the issue does not seem to arise when setting num_gpus=1
(despite having 4 GPUs in my machine, setting num_gpus=4
does not seem to work). Probably the ranks are not properly set for DDP?
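If the multi-GPU failure really is every DDP process starting its own wandb service, a common pattern is to initialize wandb on rank 0 only. A rough sketch, assuming the processes are launched in a way that exports the `RANK` environment variable (torchrun does this; this repo may spawn workers differently, in which case the rank has to come from elsewhere). `is_main_process` and `maybe_init_wandb` are hypothetical helpers, not functions from this codebase:

```python
import os


def is_main_process() -> bool:
    # Assumption: the launcher exports RANK; if it is missing we treat
    # the run as single-process and therefore as the main process.
    return int(os.environ.get("RANK", "0")) == 0


def maybe_init_wandb(init_fn):
    """Call init_fn only on rank 0 and return None on all other ranks."""
    if is_main_process():
        return init_fn()
    return None
```

Usage would look like `run = maybe_init_wandb(lambda: wandb.init(project="my-project"))`, with the project name as a placeholder. Any logging calls then need the same guard, since `run` is None on the other ranks.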
Hi, after setting my API key, installing the requirements, and executing the following command
I get the following error: