prote376 opened this issue 2 years ago
Thank you for sharing this code!
I am testing your code for the multitask video setting with BART on 24GB GPUs. To fit on 24GB GPUs, I used the command below to enable DDP (batch size: 50 -> 25).
bash scripts/video/single_adapter.sh 2
However, it gave worse results than running on a single 48GB GPU, and the performance got worse as I increased the number of GPUs. Since the model has no BatchNorm layers, I expected the performance to be similar.
Have you tried DDP? Or do you have any intuition about the problem?
I have the same problem. When running this code on 8 x V100 (16GB), I got:
VQA: Epoch 19 Valid Raw 62.55, Topk 62.49; Epoch 19 Best Raw 62.55
GQA: Epoch 19 Valid 51.76; Epoch 18 Best 51.86
NLVR: Epoch 19 Valid 69.21; Epoch 19 Best 69.21
COCO Caption: Epoch 19 Valid CIDEr 111.52; Epoch 19 Best 111.52
Thanks for pointing out the issue. I remember I didn't have this issue when I tried DDP. I will check on this soon.
I just found out that DDP works well with full fine-tuning but works worse with parameter-efficient transfer learning methods. I will further investigate this issue soon.
If DDP does not work well, can I reproduce the results on a single A100 (40GB) by reducing the batch size? From @prote376's results, reducing the batch size also does not seem to work well. How should I address this problem? Thanks!
I think reducing the batch size should work, but the learning rate might need to be reduced accordingly. The performance drop in @prote376's experiments may still come from the multi-GPU problem rather than the batch size.
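For reference, a minimal sketch of the linear scaling heuristic I have in mind when saying "reduce accordingly" (the numbers are hypothetical, not taken from the repo's config; treat this as an assumption, not a prescription):

```python
# Linear scaling heuristic: keep lr / batch_size roughly constant.
# All values below are hypothetical placeholders; plug in your config's values.
base_batch_size = 50    # batch size of the original single-GPU run
base_lr = 3e-4          # hypothetical learning rate used for that run
new_batch_size = 25     # reduced batch size for the smaller GPU

scaled_lr = base_lr * new_batch_size / base_batch_size
print(scaled_lr)        # 0.00015
```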
The problem may come from the DDP setup. The PyTorch DDP notes on the backward pass say that "after the backward pass, the grad field on the same corresponding parameter across different DDP processes should be the same"; however, I printed the grads and found that they were not the same across GPUs.
My guess is that the gradient synchronization is missing (or something else causes the desynchronization), and the way the DDP model is used could be the source: the official usage is ddp_model(input), while in this repo it is ddp_model.module.train_step().
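If anyone wants to reproduce this check, here is a minimal debugging sketch (my own snippet, not from the repo) that gathers one parameter's gradient from every rank after loss.backward() and compares them on rank 0:

```python
import torch
import torch.distributed as dist

def check_grad_sync(model):
    """Compare one parameter's gradient across all ranks (debugging only)."""
    # Pick the first parameter that actually received a gradient.
    name, param = next(
        (n, p) for n, p in model.named_parameters() if p.grad is not None
    )
    world_size = dist.get_world_size()
    grads = [torch.zeros_like(param.grad) for _ in range(world_size)]
    dist.all_gather(grads, param.grad)
    if dist.get_rank() == 0:
        max_diff = max((g - grads[0]).abs().max().item() for g in grads)
        print(f"{name}: max grad difference across ranks = {max_diff:.3e}")
```

If the gradients were being synchronized correctly, the printed difference should be (numerically) zero.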
@ylsung Any update regarding this issue?
You're right, ddp_model.module.train_step() does not synchronize the gradients. The issue can be solved by manually synchronizing them.
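One straightforward way to do that (a minimal sketch, not necessarily what this repo should ship) is to all-reduce and average the gradients yourself between loss.backward() and optimizer.step(), which is what DDP's reducer would normally do when forward() is used:

```python
import torch.distributed as dist

def sync_gradients(model):
    """Average gradients across all DDP processes (call after loss.backward())."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```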
@JieShibo Hi, could you elaborate more on how to manually synchronize the gradient and address the synchronization problem? Thank you so much!
@nbasyl This was with PyTorch 2.0.1:
```python
import logging

import torch
from torch.distributed.algorithms.join import (
    Join,
    Joinable,
    JoinHook,
)
from torch.distributed.utils import (
    _verify_param_shape_across_processes,
    _sync_module_states,
    _to_kwargs,
)
from torch.nn.parallel.distributed import (
    _find_tensors,
    _tree_flatten_with_rref,
    _DDPSink,
    _tree_unflatten_with_rref,
)

# Module-level logger, mirroring torch.nn.parallel.distributed.
logger = logging.getLogger(__name__)


def ddp_forward(self, *inputs, **kwargs):
    with torch.autograd.profiler.record_function(
        "DistributedDataParallel.forward"
    ):
        if torch.is_grad_enabled() and self.require_backward_grad_sync:
            assert self.logger is not None
            self.logger.set_runtime_stats_and_log()
            self.num_iterations += 1
            self.reducer.prepare_for_forward()

        # Notify the join context that this process has not yet joined.
        work = Join.notify_join_context(self)
        if work:
            self.reducer._set_forward_pass_work_handle(
                work, self._divide_by_initial_world_size  # type: ignore[arg-type]
            )

        if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
            logger.info(
                "Reducer buckets have been rebuilt in this iteration."
            )
            self._has_rebuilt_buckets = True

        # Sync buffers before the forward pass if requested.
        if self._check_sync_bufs_pre_fwd():
            self._sync_buffers()

        if self._join_config.enable:
            self._check_global_requires_backward_grad_sync(
                is_joined_rank=False
            )

        module_to_run = (
            self._replicated_tensor_module
            if self._use_replicated_tensor_module
            else self.module
        )

        # The only change from DistributedDataParallel.forward: call
        # train_step() instead of __call__(), so the reducer is still
        # prepared for the backward pass.
        if self.device_ids:
            inputs, kwargs = _to_kwargs(
                inputs,
                kwargs,
                self.device_ids[0],
                self.use_side_stream_for_tensor_copies,
            )
            with self._inside_ddp_forward():
                output = module_to_run.train_step(*inputs[0], **kwargs[0])  # type: ignore[index]
        else:
            with self._inside_ddp_forward():
                output = module_to_run.train_step(*inputs, **kwargs)

        # Sync buffers after the forward pass if requested.
        if self._check_sync_bufs_post_fwd():
            self._sync_buffers()

        if torch.is_grad_enabled() and self.require_backward_grad_sync:
            self.require_forward_param_sync = True
            if self.find_unused_parameters and not self.static_graph:
                # Register output tensors so the reducer can detect
                # unused parameters during the backward pass.
                self.reducer.prepare_for_backward(
                    list(_find_tensors(output))
                )
            else:
                self.reducer.prepare_for_backward([])
        else:
            self.require_forward_param_sync = False

    if (self.find_unused_parameters and not self.static_graph) or (
        self.static_graph and self.num_iterations == 1
    ):
        state_dict = {
            "static_graph": self.static_graph,
            "num_iterations": self.num_iterations,
        }

        (
            output_tensor_list,
            treespec,
            output_is_rref,
        ) = _tree_flatten_with_rref(output)
        output_placeholders = [None for _ in range(len(output_tensor_list))]
        # Do not touch tensors that have no grad_fn.
        for i, output in enumerate(output_tensor_list):
            if torch.is_tensor(output) and output.grad_fn is None:
                output_placeholders[i] = output

        # Route tensors that require grad through _DDPSink so that unused
        # outputs still yield gradients the reducer can handle.
        passthrough_tensor_list = _DDPSink.apply(
            self.reducer,
            state_dict,
            *output_tensor_list,
        )
        for i in range(len(output_placeholders)):
            if output_placeholders[i] is None:
                output_placeholders[i] = passthrough_tensor_list[i]

        # Reconstruct the original output structure.
        output = _tree_unflatten_with_rref(
            output_placeholders, treespec, output_is_rref
        )
    return output
```
https://github.com/ylsung/VL_adapter/blob/545fcbbdbbaec4c442de35567f6ae477ff4e8265/VL-T5/src/multitask.py#L284-L294
Then, in the linked training loop, replace self.model.module.train_step(batch) with ddp_forward(self.model, batch).
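For context, a sketch of how the swapped call fits into the surrounding training step (names assumed from the linked trainer code, and assuming train_step returns a dict containing 'loss'):

```python
# Inside the training loop of the trainer class:
results = ddp_forward(self.model, batch)   # replaces self.model.module.train_step(batch)
loss = results['loss']                     # assumes train_step returns a dict with 'loss'
loss.backward()                            # DDP's reducer now averages grads across ranks
```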