wolfgangmeyers closed 1 year ago
Multigpu error:
Process Loop Error: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper__cudnn_convolution)
Traceback (most recent call last):
File "/app/aibrush-2/worker/images_worker_multigpu.py", line 246, in process_loop
nsfw = model.generate(args)
File "/app/aibrush-2/worker/sd_text2im_model.py", line 191, in generate
init_latent = self.model.get_first_stage_encoding(self.model.encode_first_stage(init_image)) # move to latent space
File "/opt/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/venv/src/latent-diffusion/ldm/models/diffusion/ddpm.py", line 863, in encode_first_stage
return self.first_stage_model.encode(x)
File "/opt/venv/src/latent-diffusion/ldm/models/autoencoder.py", line 325, in encode
h = self.encoder(x)
File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/venv/src/latent-diffusion/ldm/modules/diffusionmodules/model.py", line 439, in forward
hs = [self.conv_in(x)]
File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper__cudnn_convolution)
Exception in thread Thread-8 (process_loop):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/app/aibrush-2/worker/images_worker_multigpu.py", line 173, in process_loop
model.generate(args)
File "/app/aibrush-2/worker/sd_text2im_model.py", line 206, in generate
uc = self.model.get_learned_conditioning(1 * [""])
File "/opt/venv/src/latent-diffusion/ldm/models/diffusion/ddpm.py", line 554, in get_learned_conditioning
c = self.cond_stage_model.encode(c)
File "/opt/venv/src/latent-diffusion/ldm/modules/encoders/modules.py", line 162, in encode
return self(text)
File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/venv/src/latent-diffusion/ldm/modules/encoders/modules.py", line 156, in forward
outputs = self.transformer(input_ids=tokens)
File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 722, in forward
return self.text_model(
File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 632, in forward
hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 165, in forward
inputs_embeds = self.token_embedding(input_ids)
File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 160, in forward
return F.embedding(
File "/opt/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
Fixed by calling torch.cuda.set_device in each subprocess, so every worker's CUDA operations are pinned to its own GPU. A threading mutex ensures that only one subprocess loads a model at a time, to avoid exhausting CPU RAM.
Also, the worker now runs on all available GPUs, so there is no longer any need to pin it to a specific GPU. This will allow a single machine to run multiple workers.
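For reference, a minimal sketch of the pattern described above: each worker pins itself to one GPU with torch.cuda.set_device, and a shared lock serializes model loading so only one checkpoint sits in CPU RAM at a time. The names (run_worker, load_model, process_loop) are illustrative, not the actual worker code, and the torch calls are guarded so the sketch degrades gracefully on a machine without CUDA.

```python
import threading

# Shared mutex: model checkpoints are large, and loading several at once
# can exhaust CPU RAM before the weights are moved to the GPU.
model_load_lock = threading.Lock()

def run_worker(gpu_index, load_model, process_loop):
    # Pin all subsequent CUDA allocations/operations in this worker to one
    # device. Without this, tensors created by different workers can end up
    # on different devices, producing the "Expected all tensors to be on the
    # same device" RuntimeError seen in the tracebacks above.
    try:
        import torch
        torch.cuda.set_device(gpu_index)
    except Exception:
        pass  # no torch / no CUDA on this machine; continue for illustration

    # Serialize model loading across workers.
    with model_load_lock:
        model = load_model(gpu_index)

    # Once loaded, workers process jobs concurrently, one per GPU.
    process_loop(model)

def start_workers(gpu_count, load_model, process_loop):
    # One worker thread per available GPU; a single machine can therefore
    # drive all of its GPUs without pinning the process to one device.
    threads = []
    for i in range(gpu_count):
        t = threading.Thread(target=run_worker, args=(i, load_model, process_loop))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
```

In the actual worker the same idea applies whether the per-GPU unit is a thread or a subprocess; the key points are that set_device runs inside the unit that owns the GPU, and that the load step is mutually exclusive.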