XindiWu closed this issue 1 year ago.
Maybe it's due to a default-configuration difference from ./configs/multigpu.yaml. You can compare the accelerate output in the console log. Mine (./configs/multigpu.yaml, 2 RTX 3090 GPUs) is:
/home/ubuntu/torch19/lib/python3.10/site-packages/accelerate/accelerator.py:231: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
  warnings.warn(
/home/ubuntu/torch19/lib/python3.10/site-packages/accelerate/accelerator.py:231: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
  warnings.warn(
06/27/2023 18:12:14 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: fp16
06/27/2023 18:12:14 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl Num processes: 2 Process index: 0 Local process index: 0 Device: cuda:0 Mixed precision type: fp16
{'prediction_type', 'variance_type'} was not found in config. Values will be initialized to default values.
{'norm_num_groups'} was not found in config. Values will be initialized to default values.
{'use_linear_projection', 'num_class_embeds', 'upcast_attention', 'only_cross_attention', 'resnet_time_scale_shift', 'dual_cross_attention', 'class_embed_type', 'mid_block_type'} was not found in config. Values will be initialized to default values.
{'prediction_type'} was not found in config. Values will be initialized to default values.
{'prediction_type'} was not found in config. Values will be initialized to default values.
06/27/2023 18:12:33 - INFO - main - Running training
06/27/2023 18:12:33 - INFO - main - Num examples = 4243251
06/27/2023 18:12:33 - INFO - main - Num Epochs = 4
06/27/2023 18:12:33 - INFO - main - Instantaneous batch size per device = 4
06/27/2023 18:12:33 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 800
06/27/2023 18:12:33 - INFO - main - Gradient Accumulation steps = 100
06/27/2023 18:12:33 - INFO - main - Total optimization steps = 20000
Resuming from checkpoint checkpoint-12080
06/27/2023 18:12:33 - INFO - accelerate.accelerator - Loading states from ./outputs/makelongvideo/checkpoint-12080
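As a quick sanity check (not part of the original log): the "Total train batch size" that accelerate reports is just the per-device batch size times the number of processes times the gradient-accumulation steps, so the numbers above are consistent with each other:

```python
# Sanity check on the totals reported in the training log above.
per_device_batch = 4    # "Instantaneous batch size per device"
num_processes = 2       # 2 RTX 3090 GPUs
grad_accum_steps = 100  # "Gradient Accumulation steps"

total_batch = per_device_batch * num_processes * grad_accum_steps
print(total_batch)  # 800, matching "Total train batch size ... = 800"
```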
Thank you for the reply!! So here's what it printed:
06/30/2023 22:31:30 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 4
Local process index: 4
Device: cuda:4
Mixed precision type: fp16
06/30/2023 22:31:30 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 6
Local process index: 6
Device: cuda:6
Mixed precision type: fp16
06/30/2023 22:31:30 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 7
Local process index: 7
Device: cuda:7
Mixed precision type: fp16
06/30/2023 22:31:30 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: fp16
06/30/2023 22:31:30 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 2
Local process index: 2
Device: cuda:2
Mixed precision type: fp16
06/30/2023 22:31:30 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 5
Local process index: 5
Device: cuda:5
Mixed precision type: fp16
...........
06/30/2023 22:33:46 - INFO - main - Running training
06/30/2023 22:33:46 - INFO - main - Num examples = 2327084
06/30/2023 22:33:46 - INFO - main - Num Epochs = 55
06/30/2023 22:33:46 - INFO - main - Instantaneous batch size per device = 16
06/30/2023 22:33:46 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 6400
06/30/2023 22:33:46 - INFO - main - Gradient Accumulation steps = 50
06/30/2023 22:33:46 - INFO - main - Total optimization steps = 20000
Steps: 0%| | 0/20000 [00:00<5517, ?it/s]
............
Steps: 0%|▎ | 80/20000 [20:50:17<5090:20:55, 919.94s/it, lr=0.0001, step_loss=
and my config is:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
Any thoughts on why it would take >5,000 hours to train on 8×A100? Thank you!!
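For what it's worth, the progress bar's ETA is internally consistent, so the real question is why each optimization step is so slow (each one covers a large effective batch because of gradient accumulation). A quick back-of-the-envelope check:

```python
# Check the tqdm ETA and effective batch size against the log above.
seconds_per_step = 919.94     # "919.94s/it" at step 80
remaining_steps = 20000 - 80

eta_hours = seconds_per_step * remaining_steps / 3600
print(round(eta_hours))       # ~5090, matching the "<5090:20:55" ETA

# Each optimization step consumes: per-device batch * GPUs * grad accum
effective_batch = 16 * 8 * 50
print(effective_batch)        # 6400, matching the reported total batch size
```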
The config file seems fine. You can check CPU and GPU load with nvidia-smi, htop, and free -h. I guess your A100s are not fully utilized because of data-feeding speed. Change num_workers=8 to num_workers=32 or larger, according to your machine's capability.
# DataLoaders creation:
train_dataloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=train_batch_size, num_workers=8
)
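A minimal, self-contained sketch of the suggested change. The dataset here is a synthetic stand-in for the real video dataset, and the `num_workers`, `pin_memory`, and `persistent_workers` values are illustrative, not from the repo:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the real video dataset; shapes are arbitrary.
train_dataset = TensorDataset(
    torch.randn(64, 3, 8, 8), torch.zeros(64, dtype=torch.long)
)

train_dataloader = DataLoader(
    train_dataset,
    batch_size=16,
    num_workers=2,            # raise to 32+ on a large machine, as suggested above
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)

n_batches = sum(1 for _ in train_dataloader)
print(n_batches)  # 4  (64 samples / batch size 16)
```

If the GPUs still sit idle after raising `num_workers`, the bottleneck is likely disk I/O or video decoding rather than worker count.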
Below is my machine's running state; GPU utilization is 97%.
[Screenshots of nvidia-smi, free -h, and htop output]
Hello, how do I download the dataset?
Refer to https://github.com/m-bain/webvid @Revanthraja
Thanks sir
Awesome thank you!!
Hi,
Thanks a lot for the repo! I was curious: if I use
CUDA_VISIBLE_DEVICES=0,1 python train.py --config configs/makelongvideo.yaml
it seems the training will take roughly 700 hours for 20k steps on 2×A100. However, if I use
accelerate launch --config_file ./configs/multigpu.yaml train.py --config configs/makelongvideo.yaml
on the same 2×A100, it takes >1,000 hours. Do you know why this is the case? Thank you!