XindiWu closed this issue 1 year ago.
Maybe it's due to a default-configuration difference from ./configs/multigpu.yaml. You can compare the accelerate output in the console log. Mine (./configs/multigpu.yaml, 2 RTX 3090 GPUs) is:
/home/ubuntu/torch19/lib/python3.10/site-packages/accelerate/accelerator.py:231: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
  warnings.warn(
/home/ubuntu/torch19/lib/python3.10/site-packages/accelerate/accelerator.py:231: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
  warnings.warn(
06/27/2023 18:12:14 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: fp16
06/27/2023 18:12:14 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl Num processes: 2 Process index: 0 Local process index: 0 Device: cuda:0 Mixed precision type: fp16
{'prediction_type', 'variance_type'} was not found in config. Values will be initialized to default values.
{'norm_num_groups'} was not found in config. Values will be initialized to default values.
{'use_linear_projection', 'num_class_embeds', 'upcast_attention', 'only_cross_attention', 'resnet_time_scale_shift', 'dual_cross_attention', 'class_embed_type', 'mid_block_type'} was not found in config. Values will be initialized to default values.
{'prediction_type'} was not found in config. Values will be initialized to default values.
{'prediction_type'} was not found in config. Values will be initialized to default values.
06/27/2023 18:12:33 - INFO - main - Running training
06/27/2023 18:12:33 - INFO - main - Num examples = 4243251
06/27/2023 18:12:33 - INFO - main - Num Epochs = 4
06/27/2023 18:12:33 - INFO - main - Instantaneous batch size per device = 4
06/27/2023 18:12:33 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 800
06/27/2023 18:12:33 - INFO - main - Gradient Accumulation steps = 100
06/27/2023 18:12:33 - INFO - main - Total optimization steps = 20000
Resuming from checkpoint checkpoint-12080
06/27/2023 18:12:33 - INFO - accelerate.accelerator - Loading states from ./outputs/makelongvideo/checkpoint-12080
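As a quick sanity check (not part of the original log): the "Total train batch size" that accelerate reports is just the per-device batch size times the number of processes times the gradient-accumulation steps, so the numbers above are consistent with each other:

```python
# Sanity check on the totals reported in the training log above.
per_device_batch = 4    # "Instantaneous batch size per device"
num_processes = 2       # 2 RTX 3090 GPUs
grad_accum_steps = 100  # "Gradient Accumulation steps"

total_batch = per_device_batch * num_processes * grad_accum_steps
print(total_batch)  # 800, matching "Total train batch size ... = 800"
```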
Thank you for the reply!! So here's what it printed:
06/30/2023 22:31:30 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 4
Local process index: 4
Device: cuda:4
Mixed precision type: fp16
06/30/2023 22:31:30 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 6
Local process index: 6
Device: cuda:6
Mixed precision type: fp16
06/30/2023 22:31:30 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 7
Local process index: 7
Device: cuda:7
Mixed precision type: fp16
06/30/2023 22:31:30 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: fp16
06/30/2023 22:31:30 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 2
Local process index: 2
Device: cuda:2
Mixed precision type: fp16
06/30/2023 22:31:30 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 5
Local process index: 5
Device: cuda:5
Mixed precision type: fp16
...........
06/30/2023 22:33:46 - INFO - main - Running training
06/30/2023 22:33:46 - INFO - main - Num examples = 2327084
06/30/2023 22:33:46 - INFO - main - Num Epochs = 55
06/30/2023 22:33:46 - INFO - main - Instantaneous batch size per device = 16
06/30/2023 22:33:46 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 6400
06/30/2023 22:33:46 - INFO - main - Gradient Accumulation steps = 50
06/30/2023 22:33:46 - INFO - main - Total optimization steps = 20000
Steps: 0%| | 0/20000 [00:00<5517, ?it/s]
............
Steps: 0%|▎ | 80/20000 [20:50:17<5090:20:55, 919.94s/it, lr=0.0001, step_loss=
and my config is:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
Any thoughts on why it would take >5,000 hours to train on 8×A100? Thank you!!
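For what it's worth, the progress bar's ETA is internally consistent, so the real question is why each optimization step is so slow (each one covers a large effective batch because of gradient accumulation). A quick back-of-the-envelope check:

```python
# Check the tqdm ETA and effective batch size against the log above.
seconds_per_step = 919.94     # "919.94s/it" at step 80
remaining_steps = 20000 - 80

eta_hours = seconds_per_step * remaining_steps / 3600
print(round(eta_hours))       # ~5090, matching the "<5090:20:55" ETA

# Each optimization step consumes: per-device batch * GPUs * grad accum
effective_batch = 16 * 8 * 50
print(effective_batch)        # 6400, matching the reported total batch size
```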
The config file seems fine. You can check CPU and GPU load with nvidia-smi, htop, and free -h. I guess your A100s are not fully utilized because of data-feeding speed. Change num_workers=8 to num_workers=32 or larger, according to your machine's capability.
# DataLoaders creation:
train_dataloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=train_batch_size, num_workers=8
)
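A minimal, self-contained sketch of the suggested change. The dataset here is a synthetic stand-in for the real video dataset, and the `num_workers`, `pin_memory`, and `persistent_workers` values are illustrative, not from the repo:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the real video dataset; shapes are arbitrary.
train_dataset = TensorDataset(
    torch.randn(64, 3, 8, 8), torch.zeros(64, dtype=torch.long)
)

train_dataloader = DataLoader(
    train_dataset,
    batch_size=16,
    num_workers=2,            # raise to 32+ on a large machine, as suggested above
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)

n_batches = sum(1 for _ in train_dataloader)
print(n_batches)  # 4  (64 samples / batch size 16)
```

If the GPUs still sit idle after raising `num_workers`, the bottleneck is likely disk I/O or video decoding rather than worker count.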
Below is my machine's running state; GPU utilization is 97%.
[Screenshots of nvidia-smi, free -h, and htop output]
Hello, how do I download the dataset?
Refer to https://github.com/m-bain/webvid @Revanthraja
Thanks sir
Awesome thank you!!
Hi,
Thanks a lot for the repo! I was curious: if I use
CUDA_VISIBLE_DEVICES=0,1 python train.py --config configs/makelongvideo.yaml
it seems the training will take roughly 700 hours for 20k steps on 2×A100. However, if I use
accelerate launch --config_file ./configs/multigpu.yaml train.py --config configs/makelongvideo.yaml
on the same 2×A100, it takes >1,000 hours. Do you know why this is the case? Thank you!