zjysteven / lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, qwen-vl, phi3-v etc.
Apache License 2.0

Hello, if I want to submit a finetuning job on a Slurm cluster and have already downloaded the llava-interleave-qwen-7b weights to the server, how should I modify the code? #1

Closed: whycantfindaname closed this issue 1 month ago

whycantfindaname commented 1 month ago

I have currently modified example.sh to the following:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=28
#SBATCH --partition=gpu
#SBATCH --exclude=gpu19,gpu3,gpu8,gpu14,gpu4
#SBATCH --job-name=llava_job
#SBATCH --output=/home/u9920230028/lmms-finetune-main/testbug/job_output.txt
#SBATCH --error=/home/u9920230028/lmms-finetune-main/testbug/job_error.txt
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jasonliaonk21@gmail.com

NUM_GPUS=4
DISTRIBUTED_ARGS="
    --nnodes=1 \    
    --nproc_per_node ${NUM_GPUS} \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:0 \
    --master_port 29501
"
#--nnodes=1: number of nodes to use; 1 here.
#--nproc_per_node ${NUM_GPUS}: number of processes per node; set via ${NUM_GPUS}, i.e. one process per GPU.
#--rdzv_backend c10d: use the c10d rendezvous backend for torch.distributed (common for multi-process, multi-GPU training).
#--rdzv_endpoint localhost:0: rendezvous endpoint for distributed training; localhost:0 automatically picks a free port on the local host.

# arguments that are very likely to be changed
# according to your own case
MODEL_ID=llava-interleave-qwen-7b                            # model id; pick one by running `python supported_models.py`
TRAIN_DATA_PATH=./example_data/video.json               # path to the training data json file
EVAL_DATA_PATH=./example_data/video.json                # path to the evaluation data json file
IMAGE_FOLDER=./example_data/images                      # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER=./example_data/videos                      # path to the video root folder; if provided, the video paths in the json should be relative
DEFAULT_NUM_FRAMES=8                                    # if `num_frames` is not specified in dataset entries, this value will be used to sample frames from videos

USE_LORA=True                                           # whether to use lora
Q_LORA=False                                            # whether to use q-lora; only effective when `USE_LORA` is True
LORA_R=64                                               # the lora rank
LORA_ALPHA=16                                           # the lora alpha

RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA}     # a custom run id that determines the checkpoint folder and wandb run name

DS_STAGE=zero3                                          # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=1                                 # batch size per GPU
GRAD_ACCUM=1                                            # gradient accumulation steps
NUM_EPOCHS=1                                            # number of training epochs

LR=1e-4                                                 # learning rate
MODEL_MAX_LEN=2048                                      # maximum input length of the model

srun torchrun $DISTRIBUTED_ARGS train.py \
    --model_id $MODEL_ID \
    --data_path $TRAIN_DATA_PATH \
    --eval_data_path $EVAL_DATA_PATH \
    --image_folder $IMAGE_FOLDER \
    --video_folder $VIDEO_FOLDER \
    --default_num_frames $DEFAULT_NUM_FRAMES \
    --output_dir ./checkpoints/$RUN_ID \
    --report_to wandb \
    --run_name $RUN_ID \
    --deepspeed ./ds_configs/${DS_STAGE}.json \
    --bf16 True \
    --num_train_epochs $NUM_EPOCHS \
    --per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
    --per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
    --gradient_accumulation_steps $GRAD_ACCUM \
    --eval_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --learning_rate ${LR} \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length $MODEL_MAX_LEN \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --use_lora $USE_LORA \
    --q_lora $Q_LORA \
    --lora_r $LORA_R \
    --lora_alpha $LORA_ALPHA
Running it gives the following error:
    /home/u9920230028/miniconda3/envs/lmms-finetune/bin/python: can't open file '/home/u9920230028/lmms-finetune-main/\\': [Errno 2] No such file or directory
E0721 21:53:11.559000 22859032696640 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 2) local_rank: 0 (pid: 18745) of binary: /home/u9920230028/miniconda3/envs/lmms-finetune/bin/python
Traceback (most recent call last):
  File "/home/u9920230028/miniconda3/envs/lmms-finetune/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/u9920230028/miniconda3/envs/lmms-finetune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/u9920230028/miniconda3/envs/lmms-finetune/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/u9920230028/miniconda3/envs/lmms-finetune/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/u9920230028/miniconda3/envs/lmms-finetune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/u9920230028/miniconda3/envs/lmms-finetune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
\ FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-21_21:53:11
  host      : gpu7.example.com
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 18745)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: gpu7: task 0: Exited with exit code 1
zjysteven commented 1 month ago

The script looks fine. The error says /home/u9920230028/miniconda3/envs/lmms-finetune/bin/python: can't open file '/home/u9920230028/lmms-finetune-main/\\': [Errno 2] No such file or directory, but I'm not sure what exactly is causing it. I'll run it with slurm locally today and take a look.

zjysteven commented 1 month ago

I ran the following slurm script locally without any problem. I switched the data to multi_images.json (since llava-interleave only looks for image tokens), but that should be unrelated to the bug you hit.

#!/usr/bin/bash
#SBATCH --job-name=lmms-finetune
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=32g
#SBATCH --time=24:00:00
#SBATCH -e /home/jz288/lmms_train/slurm/%j.err
#SBATCH -o /home/jz288/lmms_train/slurm/%j.out
#SBATCH --partition=athena-genai
#SBATCH --exclude=node5
#SBATCH --account jz288

eval "$(conda shell.bash hook)"
conda activate lmms-finetune

NUM_GPUS=2
DISTRIBUTED_ARGS="
    --nnodes=1 \
    --nproc_per_node ${NUM_GPUS} \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:0
"

# arguments that are very likely to be changed
# according to your own case
MODEL_ID=llava-interleave-qwen-7b                            # model id; pick one by running `python supported_models.py`
TRAIN_DATA_PATH=./example_data/multi_images.json               # path to the training data json file
EVAL_DATA_PATH=./example_data/multi_images.json                # path to the evaluation data json file
IMAGE_FOLDER=./example_data/images                      # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER=./example_data/videos                      # path to the video root folder; if provided, the video paths in the json should be relative
DEFAULT_NUM_FRAMES=8                                    # if `num_frames` is not specified in dataset entries, this value will be used to sample frames from videos

USE_LORA=True                                           # whether to use lora
Q_LORA=False                                            # whether to use q-lora; only effective when `USE_LORA` is True
LORA_R=64                                               # the lora rank
LORA_ALPHA=16                                           # the lora alpha

RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA}     # a custom run id that determines the checkpoint folder and wandb run name

DS_STAGE=zero3                                          # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=1                                 # batch size per GPU
GRAD_ACCUM=1                                            # gradient accumulation steps
NUM_EPOCHS=1                                            # number of training epochs

LR=1e-4                                                 # learning rate
MODEL_MAX_LEN=2048                                      # maximum input length of the model

srun torchrun $DISTRIBUTED_ARGS train.py \
    --model_id $MODEL_ID \
    --data_path $TRAIN_DATA_PATH \
    --eval_data_path $EVAL_DATA_PATH \
    --image_folder $IMAGE_FOLDER \
    --video_folder $VIDEO_FOLDER \
    --default_num_frames $DEFAULT_NUM_FRAMES \
    --output_dir ./checkpoints/$RUN_ID \
    --report_to wandb \
    --run_name $RUN_ID \
    --deepspeed ./ds_configs/${DS_STAGE}.json \
    --bf16 True \
    --num_train_epochs $NUM_EPOCHS \
    --per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
    --per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
    --gradient_accumulation_steps $GRAD_ACCUM \
    --eval_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --learning_rate ${LR} \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length $MODEL_MAX_LEN \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --use_lora $USE_LORA \
    --q_lora $Q_LORA \
    --lora_r $LORA_R \
    --lora_alpha $LORA_ALPHA
whycantfindaname commented 1 month ago

Thank you. Running it locally on two 3090s on the host machine worked fine today. For the second question, I just changed self.model_hf_path in llava_interleave.py to the local path. I'll adjust my script and try it on slurm again.
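
For reference, a minimal sketch of why this works (paths hypothetical, and assuming the downloaded weights are in the Hugging Face format): a local directory can be passed anywhere a hub id is accepted, so pointing self.model_hf_path at the local folder is enough.

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# hypothetical local copy of the HF-format llava-interleave checkpoint
local_path = "/path/to/llava-interleave-qwen-7b"

processor = AutoProcessor.from_pretrained(local_path)
model = LlavaForConditionalGeneration.from_pretrained(local_path, torch_dtype=torch.bfloat16)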

whycantfindaname commented 1 month ago

Slurm works fine now too, thank you! I'd also like to ask: how should I modify llava-interleave's multi-image inference pipeline so that it can be used with our finetuned model?

whycantfindaname commented 1 month ago

Hello, another question: why does my training_args show only one GPU even though I requested 4 GPUs on the slurm cluster?

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=./ds_configs/zero2.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=epoch,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
freeze_multimodal=True,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0001,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./checkpoints/llava-interleave-qwen-0.5b_lora-True_qlora-False/runs/Jul22_21-34-09_gpu2,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
model_max_length=2048,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1.0,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=./checkpoints/llava-interleave-qwen-0.5b_lora-True_qlora-False,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['wandb'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=llava-interleave-qwen-0.5b_lora-True_qlora-False,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=epoch,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=True,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_flash_attn=True,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.0,
)

My slurm script is as follows:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=28
#SBATCH --partition=gpu
#SBATCH --exclude=gpu19,gpu3,gpu8,gpu14
#SBATCH --job-name=lmms-finetune
#SBATCH --output=/home/u9920230028/lmms-finetune-main/train_testbug/job_output.txt
#SBATCH --error=/home/u9920230028/lmms-finetune-main/train_testbug/job_error.txt
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jasonliaonk21@gmail.com

eval "$(conda shell.bash hook)"
conda activate lmms-finetune
NUM_GPUS=4
DISTRIBUTED_ARGS="
    --nnodes=1 \
    --nproc_per_node=${NUM_GPUS} \
    --rdzv_backend=c10d \
    --rdzv_endpoint=localhost:0
"

# according to your own case
MODEL_ID=llava-interleave-qwen-0.5b                            # model id; pick one by running `python supported_models.py`
TRAIN_DATA_PATH=./example_data/multi_images.json               # path to the training data json file
EVAL_DATA_PATH=./example_data/multi_images.json                # path to the evaluation data json file
IMAGE_FOLDER=./example_data/images                      # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER=./example_data/videos                      # path to the video root folder; if provided, the video paths in the json should be relative
DEFAULT_NUM_FRAMES=8                                    # if `num_frames` is not specified in dataset entries, this value will be used to sample frames from videos

USE_LORA=True                                           # whether to use lora
Q_LORA=False                                            # whether to use q-lora; only effective when `USE_LORA` is True
LORA_R=64                                               # the lora rank
LORA_ALPHA=16                                           # the lora alpha

RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA}     # a custom run id that determines the checkpoint folder and wandb run name

DS_STAGE=zero2                                          # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=8                                 # batch size per GPU
GRAD_ACCUM=1                                            # gradient accumulation steps
NUM_EPOCHS=1                                            # number of training epochs

LR=1e-4                                                 # learning rate
MODEL_MAX_LEN=2048                                      # maximum input length of the model

srun torchrun $DISTRIBUTED_ARGS train.py \
    --model_id $MODEL_ID \
    --data_path $TRAIN_DATA_PATH \
    --eval_data_path $EVAL_DATA_PATH \
    --image_folder $IMAGE_FOLDER \
    --video_folder $VIDEO_FOLDER \
    --default_num_frames $DEFAULT_NUM_FRAMES \
    --output_dir ./checkpoints/$RUN_ID \
    --report_to wandb \
    --run_name $RUN_ID \
    --deepspeed ./ds_configs/${DS_STAGE}.json \
    --use_flash_attn True \
    --bf16 True \
    --num_train_epochs $NUM_EPOCHS \
    --per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
    --per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
    --gradient_accumulation_steps $GRAD_ACCUM \
    --eval_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --learning_rate ${LR} \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length $MODEL_MAX_LEN \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --use_lora $USE_LORA \
    --q_lora $Q_LORA \
    --lora_r $LORA_R \
    --lora_alpha $LORA_ALPHA
zjysteven commented 1 month ago

Slurm works fine now too, thank you! I'd also like to ask: how should I modify llava-interleave's multi-image inference pipeline so that it can be used with our finetuned model?

No modification is needed. As long as the finetuned model is loaded properly, everything else (inference, acceleration, and so on) stays the same as with the original model/pipeline. See inference.md for how to load it.
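
inference.md documents the actual loading procedure; as a rough sketch only (the checkpoint paths and the peft-based adapter loading below are assumptions, not necessarily what inference.md prescribes), loading a LoRA-finetuned llava-interleave model for the usual multi-image inference could look like this:

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import PeftModel

base_id = "llava-hf/llava-interleave-qwen-7b-hf"                              # assumed HF base checkpoint
adapter_dir = "./checkpoints/llava-interleave-qwen-7b_lora-True_qlora-False"  # assumed training output dir

processor = AutoProcessor.from_pretrained(base_id)
model = LlavaForConditionalGeneration.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, adapter_dir)  # attach the finetuned LoRA weights
model = model.merge_and_unload().to("cuda").eval()     # optionally merge, then run the unchanged
                                                       # multi-image generate() pipeline as before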

whycantfindaname commented 1 month ago

Running the script above reports an out-of-memory error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.31 GiB. GPU  has a total capacity of 44.56 GiB of which 3.08 GiB is free. Process 33808 has 41.47 GiB memory in use. Of the allocated memory 37.88 GiB is allocated by PyTorch, and 1.49 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.

zjysteven commented 1 month ago

I ran it and _n_gpu in TrainingArguments is indeed 1 on my end too, but checking the GPU status shows multiple cards are in use, so that is fine. If you are running out of memory, lower per_device_batch_size; for llava-interleave-qwen-0.5b with per_device_batch_size=4 and model_max_length=2048, each card uses about 20G here.

Memory usage is indeed on the high side at the moment, which is related to the huggingface model implementation (see here). We will follow up with huggingface to optimize this part.
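
On the _n_gpu question: under torchrun, each spawned process drives a single GPU, so TrainingArguments reports _n_gpu=1 per process; the actual data parallelism is the world size, i.e. the number of processes. A generic way to confirm this from inside the training process (a sketch, not part of train.py):

import os
import torch.distributed as dist

# torchrun exports these variables for every spawned process
print("WORLD_SIZE =", os.environ.get("WORLD_SIZE"),
      "RANK =", os.environ.get("RANK"),
      "LOCAL_RANK =", os.environ.get("LOCAL_RANK"))

# once the process group has been initialized (e.g. by the Trainer/DeepSpeed),
# the same information is available via torch.distributed
if dist.is_available() and dist.is_initialized():
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} processes")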

whycantfindaname commented 1 month ago

Thank you! Could you please check your gmail? I find it rather inconvenient to communicate through the issue.

zjysteven commented 1 month ago

Closing now. Feel free to reach out if there are any other issues.