zenml-io / zenml

ZenML 🙏: The bridge between ML and Ops. https://zenml.io.

RuntimeError: Tensor for argument #2 'mat1' is on CPU, but expected it to be on GPU (while checking arguments for addmm) #90

Closed: SKRohit closed this issue 3 years ago

SKRohit commented 3 years ago

Describe the bug
I am new to ZenML and planning to use it in one of our projects. I tried to run the PyTorch example mentioned here. Please let me know what the issue is. I am confused because the pipeline trains the model without any CPU/GPU tensor mismatch, but after training I get this error, and I cannot find an option in the APIs to specify whether or not to use the GPU. Please let me know if you need any other details.
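For readers less familiar with this PyTorch error, here is a minimal, ZenML-independent sketch of how the same mismatch arises (the names are purely illustrative and not taken from the ZenML example): a model whose weights live on the GPU is called with a batch that was left on the CPU.

```python
import torch
import torch.nn as nn

# Illustrative only: weights on the GPU, inputs on the CPU.
# The exact wording of the RuntimeError depends on the PyTorch version.
if torch.cuda.is_available():
    model = nn.Linear(8, 2).cuda()   # parameters moved to the GPU
    x_batch = torch.randn(4, 8)      # batch left on the CPU
    model(x_batch)                   # raises the device-mismatch RuntimeError
```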

To Reproduce
Steps to reproduce the behavior:

  1. pip install zenml[pytorch]
  2. zenml example pull pytorch
  3. cd zenml_examples/pytorch
  4. git init
  5. zenml init
  6. Run the pipeline locally from the example (training_pipeline.run())

Stack Trace

RuntimeError                              Traceback (most recent call last)
<ipython-input> in <module>
      1 # Run the pipeline locally
----> 2 training_pipeline.run()

~/miniconda3/envs/zenml/lib/python3.7/site-packages/zenml/utils/analytics_utils.py in inner_func(*args, **kwargs)
    175     def inner_func(*args, **kwargs):
    176         track_event(event, metadata=metadata)
--> 177         result = func(*args, **kwargs)
    178         return result
    179

~/miniconda3/envs/zenml/lib/python3.7/site-packages/zenml/pipelines/base_pipeline.py in run(self, backend, metadata_store, artifact_store)
    455         self.register_pipeline(config)
    456
--> 457         self.run_config(config)
    458
    459         # After running, pipeline is immutable

~/miniconda3/envs/zenml/lib/python3.7/site-packages/zenml/pipelines/base_pipeline.py in run_config(self, config)
    376         """
    377         assert issubclass(self.backend.__class__, OrchestratorBaseBackend)
--> 378         self.backend.run(config)
    379
    380     @track(event=RUN_PIPELINE)

~/miniconda3/envs/zenml/lib/python3.7/site-packages/zenml/backends/orchestrator/base/orchestrator_base_backend.py in run(self, config)
    107         """
    108         tfx_pipeline = self.get_tfx_pipeline(config)
--> 109         ZenMLLocalDagRunner().run(tfx_pipeline)

~/miniconda3/envs/zenml/lib/python3.7/site-packages/zenml/backends/orchestrator/base/zenml_local_orchestrator.py in run(self, pipeline)
     95                 custom_driver_spec=custom_driver_spec)
     96             logging.info('Component %s is running.', node_id)
---> 97             component_launcher.launch()
     98             logging.info('Component %s is finished.', node_id)

~/miniconda3/envs/zenml/lib/python3.7/site-packages/tfx/orchestration/portable/launcher.py in launch(self)
    429     if is_execution_needed:
    430       try:
--> 431         executor_output = self._run_executor(execution_info)
    432       except Exception as e:  # pylint: disable=broad-except
    433         execution_output = (

~/miniconda3/envs/zenml/lib/python3.7/site-packages/tfx/orchestration/portable/launcher.py in _run_executor(self, execution_info)
    323     outputs_utils.make_output_dirs(execution_info.output_dict)
    324     try:
--> 325       executor_output = self._executor_operator.run_executor(execution_info)
    326       code = executor_output.execution_result.code
    327       if code != 0:

~/miniconda3/envs/zenml/lib/python3.7/site-packages/tfx/orchestration/portable/python_executor_operator.py in run_executor(self, execution_info)
    139         stateful_working_dir=execution_info.stateful_working_dir)
    140     executor = self._executor_cls(context=context)
--> 141     return run_with_executor(execution_info, executor)

~/miniconda3/envs/zenml/lib/python3.7/site-packages/tfx/orchestration/portable/python_executor_operator.py in run_with_executor(execution_info, executor)
     64   output_dict = copy.deepcopy(execution_info.output_dict)
     65   result = executor.Do(execution_info.input_dict, output_dict,
---> 66                        execution_info.exec_properties)
     67   if not result:
     68     # If result is not returned from the Do function, then try to

~/miniconda3/envs/zenml/lib/python3.7/site-packages/tfx/components/trainer/executor.py in Do(self, input_dict, output_dict, exec_properties)
    192     # Train the model
    193     absl.logging.info('Training model.')
--> 194     run_fn(fn_args)
    195
    196     # Note: If trained with multi-node distribution workers, it is the user

~/miniconda3/envs/zenml/lib/python3.7/site-packages/zenml/components/trainer/trainer_module.py in run_fn(fn_args)
     30     # Load the step, parameterize it and run it
     31     c = load_source_path_class(custom_config.pop(StepKeys.SOURCE))
---> 32     return c(**args).run_fn()

~/miniconda3/envs/zenml/lib/python3.7/site-packages/zenml/steps/trainer/pytorch_trainers/torch_ff_trainer.py in run_fn(self)
    211             pattern = self.input_patterns[split]
    212             test_dataset = self.input_fn([pattern])
--> 213             test_results = self.test_fn(model, test_dataset)
    214             utils.save_test_results(test_results, self.output_patterns[split])
    215

~/miniconda3/envs/zenml/lib/python3.7/site-packages/zenml/steps/trainer/pytorch_trainers/torch_ff_trainer.py in test_fn(self, model, dataset)
    130                 # finally, add the output of the model
    131                 x_batch = torch.cat([v for v in x.values()], dim=-1)
--> 132                 p = model(x_batch)
    133
    134                 if isinstance(p, torch.Tensor):

~/miniconda3/envs/zenml/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/miniconda3/envs/zenml/lib/python3.7/site-packages/zenml/steps/trainer/pytorch_trainers/torch_ff_trainer.py in forward(self, inputs)
     42
     43     def forward(self, inputs):
---> 44         x = self.relu(self.layer_1(inputs))
     45         x = self.batchnorm1(x)
     46         x = self.relu(self.layer_2(x))

~/miniconda3/envs/zenml/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/miniconda3/envs/zenml/lib/python3.7/site-packages/torch/nn/modules/linear.py in forward(self, input)
     92
     93     def forward(self, input: Tensor) -> Tensor:
---> 94         return F.linear(input, self.weight, self.bias)
     95
     96     def extra_repr(self) -> str:

~/miniconda3/envs/zenml/lib/python3.7/site-packages/torch/nn/functional.py in linear(input, weight, bias)
   1751     if has_torch_function_variadic(input, weight):
   1752         return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
-> 1753     return torch._C._nn.linear(input, weight, bias)
   1754
   1755

RuntimeError: Tensor for argument #2 'mat1' is on CPU, but expected it to be on GPU (while checking arguments for addmm)

**Context (please complete the following information):**
- OS: Ubuntu 20.04
- Python Version: 3.7.10
- ZenML Version: 0.3.8
htahir1 commented 3 years ago

Indeed it is a bug - we can push a fix in the next release. Thanks for bringing it to our attention.

In the meantime, maybe try changing some of the files in the PyTorch trainer to see if you can change the behavior?
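A hedged sketch of one such local experiment, assuming you are editing test_fn() in torch_ff_trainer.py (the names model and x_batch come from the traceback above): pull the model back onto the CPU so it matches the CPU batches built by the input function. This is not the official fix, just a way to confirm where the mismatch comes from.

```python
# Experiment only, to be placed inside test_fn() of torch_ff_trainer.py:
# evaluate on the CPU so the trained model and the CPU test batches
# end up on the same device.
model = model.cpu()   # the model was moved to the GPU during training
model.eval()          # switch to evaluation mode for testing
p = model(x_batch)    # x_batch is the CPU tensor built a few lines above
```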

SKRohit commented 3 years ago

@htahir1 the bug is caused by self.test_fn() in FeedForwardTrainer (file torch_ff_trainer.py). It is called on line 213 from self.run_fn(), where the model is explicitly placed on the GPU (if one is present) before training. That GPU model is then passed to self.test_fn(), but during testing the inputs should also have been moved to the GPU (see line 131), or the model should have been moved to the CPU.
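For illustration, a minimal sketch of the first option (moving the inputs onto the model's device inside test_fn()); the actual change is in the pull request, so treat the snippet below as a paraphrase rather than the patch itself.

```python
# Sketch only: the names x, x_batch and model come from the traceback above
# (test_fn in torch_ff_trainer.py); the real fix lives in the linked PR.
device = next(model.parameters()).device                         # cuda:0 when trained on GPU
x_batch = torch.cat([v for v in x.values()], dim=-1).to(device)  # move the batch to that device
p = model(x_batch)                                               # model and batch now share a device
```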

I have created a pull request to fix this, which you can find here.

htahir1 commented 3 years ago

Thanks - left a review!

htahir1 commented 3 years ago

Thank you @SKRohit for PR #91! It fixes this issue.