prote376 opened this issue 2 years ago
Thank you for sharing this code!
I am testing your code for the multitask video setting with BART on 24GB GPUs. To fit on 24GB GPUs, I used the command below to enable DDP (batch size: 50 -> 25).
bash scripts/video/single_adapter.sh 2
However, it gave worse results than running on a single 48GB GPU, and the performance got worse as I increased the number of GPUs. Since the model has no BatchNorm layers, I expected the performance to be similar.
Have you tried DDP? Or do you have any intuition about the problem?
I have the same problem. When running this code on 8 x V100 (16GB), I got:
VQA: Epoch 19 Valid Raw 62.55, Topk 62.49; Epoch 19 Best Raw 62.55
GQA: Epoch 19 Valid 51.76; Epoch 18 Best 51.86
NLVR: Epoch 19 Valid 69.21; Epoch 19 Best 69.21
COCO Caption: Epoch 19 Valid CIDEr 111.52; Epoch 19 Best 111.52
Thanks for pointing out the issue. I remember I didn't have this issue when I tried DDP. I will check on this soon.
I just found out that DDP works well with full fine-tuning but works worse with parameter-efficient transfer learning methods. I will further investigate this issue soon.
If DDP does not work well, can I reproduce the results on a single A100 (40GB) by reducing the batch size? From @prote376's results, reducing the batch size also does not seem to work well. How should I address this problem? Thanks!
I think reducing the batch size should work, but the learning rate might need to be reduced accordingly. The performance drop in @prote376's experiments may still come from the multi-GPU problem rather than the batch size.
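For reference, a minimal sketch of the linear scaling heuristic I have in mind when saying "reduce accordingly" (the numbers are hypothetical, not taken from the repo's config; treat this as an assumption, not a prescription):

```python
# Linear scaling heuristic: keep lr / batch_size roughly constant.
# All values below are hypothetical placeholders; plug in your config's values.
base_batch_size = 50    # batch size of the original single-GPU run
base_lr = 3e-4          # hypothetical learning rate used for that run
new_batch_size = 25     # reduced batch size for the smaller GPU

scaled_lr = base_lr * new_batch_size / base_batch_size
print(scaled_lr)        # 0.00015
```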
The problem may come from the DDP setup. The PyTorch DDP notes on the backward pass say that "after the backward pass, the grad field on the same corresponding parameter across different DDP processes should be the same"; however, I printed the grads and found that they were not the same across GPUs.
My guess is that the gradient synchronization is missing (or something else causes the desynchronization), and the way the DDP model is used could be the source: the official usage is ddp_model(input), while in this repo it is ddp_model.module.train_step().
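If anyone wants to reproduce this check, here is a minimal debugging sketch (my own snippet, not from the repo) that gathers one parameter's gradient from every rank after loss.backward() and compares them on rank 0:

```python
import torch
import torch.distributed as dist

def check_grad_sync(model):
    """Compare one parameter's gradient across all ranks (debugging only)."""
    # Pick the first parameter that actually received a gradient.
    name, param = next(
        (n, p) for n, p in model.named_parameters() if p.grad is not None
    )
    world_size = dist.get_world_size()
    grads = [torch.zeros_like(param.grad) for _ in range(world_size)]
    dist.all_gather(grads, param.grad)
    if dist.get_rank() == 0:
        max_diff = max((g - grads[0]).abs().max().item() for g in grads)
        print(f"{name}: max grad difference across ranks = {max_diff:.3e}")
```

If the gradients were being synchronized correctly, the printed difference should be (numerically) zero.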
@ylsung Any update regarding this issue?
You're right, ddp_model.module.train_step() does not synchronize the gradients. The issue can be solved by manually synchronizing them.
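One straightforward way to do that (a minimal sketch, not necessarily what this repo should ship) is to all-reduce and average the gradients yourself between loss.backward() and optimizer.step(), which is what DDP's reducer would normally do when forward() is used:

```python
import torch.distributed as dist

def sync_gradients(model):
    """Average gradients across all DDP processes (call after loss.backward())."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```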
@JieShibo Hi, could you elaborate more on how to manually synchronize the gradient and address the synchronization problem? Thank you so much!
@nbasyl This was with PyTorch 2.0.1:
```python
import logging

import torch
from torch.distributed.algorithms.join import (
    Join,
    Joinable,
    JoinHook,
)
from torch.distributed.utils import (
    _verify_param_shape_across_processes,
    _sync_module_states,
    _to_kwargs,
)
from torch.nn.parallel.distributed import (
    _find_tensors,
    _tree_flatten_with_rref,
    _DDPSink,
    _tree_unflatten_with_rref,
)

# Module-level logger, mirroring torch.nn.parallel.distributed.
logger = logging.getLogger(__name__)


def ddp_forward(self, *inputs, **kwargs):
    with torch.autograd.profiler.record_function(
        "DistributedDataParallel.forward"
    ):
        if torch.is_grad_enabled() and self.require_backward_grad_sync:
            assert self.logger is not None
            self.logger.set_runtime_stats_and_log()
            self.num_iterations += 1
            self.reducer.prepare_for_forward()

        # Notify the join context that this process has not yet joined.
        work = Join.notify_join_context(self)
        if work:
            self.reducer._set_forward_pass_work_handle(
                work, self._divide_by_initial_world_size  # type: ignore[arg-type]
            )

        if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
            logger.info(
                "Reducer buckets have been rebuilt in this iteration."
            )
            self._has_rebuilt_buckets = True

        # Sync buffers before the forward pass if requested.
        if self._check_sync_bufs_pre_fwd():
            self._sync_buffers()

        if self._join_config.enable:
            self._check_global_requires_backward_grad_sync(
                is_joined_rank=False
            )

        module_to_run = (
            self._replicated_tensor_module
            if self._use_replicated_tensor_module
            else self.module
        )

        # The only change from DistributedDataParallel.forward: call
        # train_step() instead of __call__(), so the reducer is still
        # prepared for the backward pass.
        if self.device_ids:
            inputs, kwargs = _to_kwargs(
                inputs,
                kwargs,
                self.device_ids[0],
                self.use_side_stream_for_tensor_copies,
            )
            with self._inside_ddp_forward():
                output = module_to_run.train_step(*inputs[0], **kwargs[0])  # type: ignore[index]
        else:
            with self._inside_ddp_forward():
                output = module_to_run.train_step(*inputs, **kwargs)

        # Sync buffers after the forward pass if requested.
        if self._check_sync_bufs_post_fwd():
            self._sync_buffers()

        if torch.is_grad_enabled() and self.require_backward_grad_sync:
            self.require_forward_param_sync = True
            if self.find_unused_parameters and not self.static_graph:
                # Register output tensors so the reducer can detect
                # unused parameters during the backward pass.
                self.reducer.prepare_for_backward(
                    list(_find_tensors(output))
                )
            else:
                self.reducer.prepare_for_backward([])
        else:
            self.require_forward_param_sync = False

    if (self.find_unused_parameters and not self.static_graph) or (
        self.static_graph and self.num_iterations == 1
    ):
        state_dict = {
            "static_graph": self.static_graph,
            "num_iterations": self.num_iterations,
        }

        (
            output_tensor_list,
            treespec,
            output_is_rref,
        ) = _tree_flatten_with_rref(output)
        output_placeholders = [None for _ in range(len(output_tensor_list))]
        # Do not touch tensors that have no grad_fn.
        for i, output in enumerate(output_tensor_list):
            if torch.is_tensor(output) and output.grad_fn is None:
                output_placeholders[i] = output

        # Route tensors that require grad through _DDPSink so that unused
        # outputs still yield gradients the reducer can handle.
        passthrough_tensor_list = _DDPSink.apply(
            self.reducer,
            state_dict,
            *output_tensor_list,
        )
        for i in range(len(output_placeholders)):
            if output_placeholders[i] is None:
                output_placeholders[i] = passthrough_tensor_list[i]

        # Reconstruct the original output structure.
        output = _tree_unflatten_with_rref(
            output_placeholders, treespec, output_is_rref
        )
    return output
```
https://github.com/ylsung/VL_adapter/blob/545fcbbdbbaec4c442de35567f6ae477ff4e8265/VL-T5/src/multitask.py#L284-L294
Then, in the linked training loop, replace self.model.module.train_step(batch) with ddp_forward(self.model, batch).
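For context, a sketch of how the swapped call fits into the surrounding training step (names assumed from the linked trainer code, and assuming train_step returns a dict containing 'loss'):

```python
# Inside the training loop of the trainer class:
results = ddp_forward(self.model, batch)   # replaces self.model.module.train_step(batch)
loss = results['loss']                     # assumes train_step returns a dict with 'loss'
loss.backward()                            # DDP's reducer now averages grads across ranks
```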