Unable to perform "Run a grid search on a Slurm cluster"

pearlmary commented 1 year ago

Hi Zong, I'm using Docker and created a virtual environment (with the prerequisites installed) to work with MEDFAIR on the PAPILA dataset. I could not resolve the error for 'sbatch'. Is there any prerequisite we should install for the Slurm environment? Alternatively, it would be great if you could tell me the steps to run the sweep without a Slurm cluster, using plain Python. Is that possible?

ys-zong commented 1 year ago

Hi, Slurm is a cluster workload manager that comes preinstalled on clusters that use it. If you don't have a cluster, or your cluster uses a different management tool, you can also run the sweep with the regular Python script.

This line is calling sbatch xxx.sh, where sweep_count.sh contains some Slurm-specific commands such as the time allocation. You can remove those and keep only the code after this line. Also replace sbatch with the regular command for executing a bash file.
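The swap described above can be sketched as follows. This is a minimal illustration, not the actual code in sweep_batch.py: the helper name build_launch_command and its use_slurm parameter are assumptions made for the example, and the script path and sweep id are just sample values.

```python
import shlex
import shutil


def build_launch_command(script_path, sweep_id, use_slurm=None):
    """Return the argv list that starts one sweep worker.

    Hypothetical helper: falls back to plain bash when sbatch is not
    available, which is the case on a machine without Slurm.
    """
    if use_slurm is None:
        # shutil.which returns None when the executable is not on PATH.
        use_slurm = shutil.which("sbatch") is not None
    launcher = "sbatch" if use_slurm else "bash"
    return [launcher, script_path, "--sweep_id", sweep_id]


cmd = build_launch_command(
    "sweep/train-sweep/sweep_count.sh", "eafkn0hh", use_slurm=False
)
print("command is", shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

When the script is run with bash, any #SBATCH directives are treated as ordinary comments, so the Slurm-specific lines in the .sh file are the ones that invoke cluster-only tools.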

pearlmary commented 1 year ago

Thank you so much. Sure, let me check it.

pearlmary commented 1 year ago

Hi Zong, in sweep_batch.py I just replaced sbatch with bash and, as you mentioned, commented out the lines before line 11 in sweep_count.sh. It ran for just two different learning rates and then threw errors.

command is bash /workspace/MEDFAIR/sweep/train-sweep/sweep_count.sh --sweep_id eafkn0hh
Traceback (most recent call last):
  File "/workspace/fairmed/bin/wandb", line 8, in <module>
    sys.exit(cli())
  File "/workspace/fairmed/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/workspace/fairmed/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/workspace/fairmed/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/workspace/fairmed/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/workspace/fairmed/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/workspace/fairmed/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/cli/cli.py", line 102, in wrapper
    return func(*args, **kwargs)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/cli/cli.py", line 1375, in agent
    api = _get_cling_api()
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/cli/cli.py", line 127, in _get_cling_api
    wandb.setup(settings=dict(_cli_only_mode=True))
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 307, in setup
    ret = _setup(settings=settings)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 302, in _setup
    wl = _WandbSetup(settings=settings)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 288, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 106, in __init__
    self._setup()
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 234, in _setup
    self._setup_manager()
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 262, in _setup_manager
    self._manager = wandb_manager._Manager(settings=self._settings)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_manager.py", line 129, in __init__
    svc_iface._svc_connect(port=port)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/service/service_sock.py", line 30, in _svc_connect
    self._sock_client.connect(port=port)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 102, in connect
    s.connect(("localhost", port))
ConnectionRefusedError: [Errno 111] Connection refused
output eafkn0hh done

error None resampling
wandb: WARNING Changes to your wandb environment variables will be ignored because your wandb session has already started. For more information on how to modify your settings with wandb.init() arguments, please refer to https://wandb.me/wandb-init.
Problem at: sweep/train-sweep/sweep_batch.py 35
Traceback (most recent call last):
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1133, in init
    run = wi.init()
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 585, in init
    tel.feature.init_return_run = True
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/telemetry.py", line 42, in __exit__
    self._run._telemetry_callback(self._obj)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 693, in _telemetry_callback
    self._telemetry_flush()
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 704, in _telemetry_flush
    self._backend.interface._publish_telemetry(self._telemetry_obj)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 101, in _publish_telemetry
    self._publish(rec)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1133, in init
    run = wi.init()
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 585, in init
    tel.feature.init_return_run = True
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/telemetry.py", line 42, in __exit__
    self._run._telemetry_callback(self._obj)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 693, in _telemetry_callback
    self._telemetry_flush()
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 704, in _telemetry_flush
    self._backend.interface._publish_telemetry(self._telemetry_obj)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 101, in _publish_telemetry
    self._publish(rec)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "sweep/train-sweep/sweep_batch.py", line 35, in <module>
    wandb.init(project=project_name)
  File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1170, in init
    raise Exception("problem") from error_seen
Exception: problem
wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe

Since I'm new to handling .sh files, can you help me with what else needs to be done in the .sh files?

ys-zong commented 1 year ago

I haven't faced this error before, but it looks like an error from the wandb library rather than from bash. The bash script should be right, since you can already run experiments. It may be related to this. Can you try running wandb in offline mode (wandb offline) and see if it works?
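Besides the wandb offline command, offline mode can also be requested per process through the WANDB_MODE environment variable. A minimal sketch, assuming the worker is started from Python; the helper name offline_env is made up for this example, and whether the sweep agent itself works in this mode is a separate question that the sketch does not settle.

```python
import os


def offline_env(base_env=None):
    """Return a copy of the environment with wandb switched to offline mode."""
    env = dict(os.environ if base_env is None else base_env)
    # WANDB_MODE=offline makes wandb write runs locally instead of syncing.
    env["WANDB_MODE"] = "offline"
    return env


env = offline_env()
print(env["WANDB_MODE"])
# Pass it to a child process, e.g.:
# subprocess.run(["bash", "sweep/train-sweep/sweep_count.sh"], env=env)
```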

pearlmary commented 1 year ago

Thanks for the reply. I tried the offline option as well, but it gives the same error. It seems that sweeps can't run in offline mode. Since you can run it without errors, I think the problem is with the Docker container's ports.

ys-zong commented 1 year ago

Yes, it seems like the issue is with the network/ports rather than the code. Closing for now.

pearlmary commented 1 year ago

Hi Zong, is there a way to run a sweep without wandb using the current code?

Can you suggest one?

ys-zong commented 1 year ago

You can write your own script for the sweep. E.g., define the hyperparameter space and loop over it, passing the hyperparameters for each iteration as arguments when calling main.py.
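A loop of that kind could look roughly like the sketch below. The flag names --lr and --weight_decay and their values are placeholders chosen for illustration, not confirmed arguments of main.py; replace them with the options the script actually accepts. By default the sketch only prints the commands (dry run).

```python
import itertools
import shlex
import subprocess
import sys

# Hypothetical search space: names and values are placeholders.
SEARCH_SPACE = {
    "lr": [1e-4, 1e-3],
    "weight_decay": [0.0, 1e-4],
}


def grid_commands(space, script="main.py"):
    """Build one command line per point of the hyperparameter grid."""
    names = list(space)
    commands = []
    for values in itertools.product(*(space[name] for name in names)):
        cmd = [sys.executable, script]
        for name, value in zip(names, values):
            cmd += [f"--{name}", str(value)]
        commands.append(cmd)
    return commands


def run_sweep(space, script="main.py", dry_run=True):
    """Print each command and, unless dry_run is set, run it sequentially."""
    for cmd in grid_commands(space, script):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the sweep at the first failing run.
            subprocess.run(cmd, check=True)


run_sweep(SEARCH_SPACE)  # dry run: prints the four commands
```

Running the configurations one after another keeps the sketch simple; on a machine with several GPUs the same command list could be distributed across workers instead.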

ys-zong / MEDFAIR

Unable to perform "Run a grid search on a Slurm cluster" #4