ganeshkrishnan1 opened this issue 7 months ago
Hello @ganeshkrishnan1,
Could you try loading the model onto the CPU first, by removing `device_map='auto'`, before passing it to the ORPOTrainer?
model_name = "aihello/podcast"
model = AutoModelForCausalLM.from_pretrained(
model_name, torch_dtype=torch_dtype, quantization_config=bnb_config
)
Accelerate usually places the model and the loss on the appropriate GPUs automatically, so let me know whether loading the model onto the CPU first resolves it.
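For reference, a minimal sketch of that suggestion end to end is below; the quantization settings, dataset, and `ORPOConfig` values are illustrative assumptions rather than the original script. The point is simply that no `device_map` is passed, so device placement is left to `accelerate` and the trainer.

```python
# Minimal sketch (assumed 4-bit quantization, the mlabonne/orpo-dpo-mix-40k
# preference dataset, and illustrative ORPOConfig values).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import ORPOConfig, ORPOTrainer

model_name = "aihello/podcast"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Note: no device_map argument here, per the suggestion above.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

orpo_args = ORPOConfig(
    output_dir="results/",
    per_device_train_batch_size=2,
    max_length=1024,
    max_prompt_length=512,
)
trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```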
If I remove the `device_map`, only one GPU is used and I get an out-of-memory error on that device. If I add `device_map`, I get this error: "ValueError: Calculated loss must be on the original device: cuda:0 but device in use is cuda:3".
I also tried replacing the ORPO trainer with the DPO trainer, and it worked without any issues.
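A hedged sketch of that swap, assuming a trl version that provides `DPOConfig` and keeping the same model/tokenizer/dataset loading as in the ORPO sketch earlier in the thread:

```python
# Rough illustration of the ORPO -> DPO swap; model, tokenizer, and
# train_dataset are assumed to be loaded as in the ORPO sketch above.
from trl import DPOConfig, DPOTrainer

dpo_args = DPOConfig(
    output_dir="results/",
    per_device_train_batch_size=2,
    max_length=1024,
    max_prompt_length=512,
)
trainer = DPOTrainer(
    model=model,                  # loaded as in the ORPO sketch
    ref_model=None,               # with a PEFT adapter, the reference model can be omitted
    args=dpo_args,
    train_dataset=train_dataset,  # prompt/chosen/rejected preference data
    tokenizer=tokenizer,
)
trainer.train()
```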
Hi @ganeshkrishnan1, @jiwooya1000, I just faced the same situation and raised an issue in trl here. Any suggestions for fixing this error?
I'm facing the same issue. Does anyone know how to fix the error?
@blaze7451 @huangxinping there is no known fix yet. We have reverted to DPO for now and might revisit this later or try to fix it ourselves.
Hello, @nlee-208 and I are currently using the alignment-handbook and TRL too, but we have not been able to reproduce the issue so far. Could you specify which accelerate setting you are using, @ganeshkrishnan1 @huangxinping @blaze7451 (e.g., FSDP, DS2, --multi-gpu)?
I am using FSDP:
```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
device = accelerator.device
```
@jiwooya1000 I am just using ORPOTrainer and following a blog tutorial, Fine-tune Llama 3 with ORPO.
If I set up only one GPU to fine-tune Llama 3, it trains successfully.
I simply used ORPOTrainer and didn't set any specific accelerate config. Just like @huangxinping, I tried training on a single GPU and it works for me.
Hey @alvarobartt, do you perhaps have any solution/similar experience using the ORPOTrainer from trl? Seems like there are some issues with the device mapping from either bnb or peft for the trl ORPOTrainer.
Thanks for the ping @nlee-208! Indeed, AFAIK `device_map` is only intended to be used for inference; other than that, `accelerate` handles the device placement, and so does the alignment-handbook, so setting `device_map` for training is just not correct AFAIK.
Also, could you guys @huangxinping @ganeshkrishnan1 @blaze7451 clarify what the issue is? Does removing the `device_map` work in a multi-GPU environment, or is it also failing? Also, how many processes are you using? If you could share the `accelerate` config for multi-GPU, that would be great to help debug this.
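As a quick debugging sketch (not part of any fix), printing what `accelerate` has actually resolved from inside the training script can help when sharing the setup; the attributes used below are part of the public `Accelerator` API:

```python
# Debugging sketch: report the distributed setup accelerate resolved for this run.
from accelerate import Accelerator

accelerator = Accelerator()
print(
    f"process {accelerator.process_index}/{accelerator.num_processes} "
    f"on {accelerator.device}, distributed_type={accelerator.distributed_type}, "
    f"mixed_precision={accelerator.mixed_precision}"
)
```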
Could you guys try using the FSDP configuration at https://github.com/huggingface/alignment-handbook/blob/main/recipes/accelerate_configs/fsdp.yaml, running it as `accelerate launch --config_file fsdp.yaml your_script.py` with `num_processes` set to the number of GPUs you want to use, and also removing `device_map="auto"` from `AutoModelForCausalLM.from_pretrained`?
So to replicate Maxime's script via the alignment-handbook, you should use the following configuration, say `config.yaml`:
# Model arguments
model_name_or_path: meta-llama/Meta-Llama-3-8B
torch_dtype: bfloat16
use_flash_attention_2: true
# LoRA arguments
use_peft: true
load_in_4bit: true
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
# Data training arguments
dataset_mixer:
mlabonne/orpo-dpo-mix-40k: 0.1
dataset_splits:
- train
preprocessing_num_workers: 12
# ORPOTrainer arguments
beta: 0.1
do_eval: true
evaluation_strategy: steps
eval_steps: 0.2
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
hub_model_id: llama-3-orpo-qlora
learning_rate: 8.0e-6
log_level: info
logging_steps: 1
lr_scheduler_type: linear
max_length: 1024
max_prompt_length: 512
num_train_epochs: 1
optim: paged_adamw_8bit
output_dir: results/
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
seed: 42
warmup_ratio: 0.1
warmup_steps: 10
Then the following FSDP configuration file (tweaking `num_processes`), say `fsdp.yaml`:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
fsdp_forward_prefetch: true
fsdp_offload_params: false
fsdp_sharding_strategy: FULL_SHARD
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
And then run that as:
ACCELERATE_LOG_LEVEL=info TRANSFORMERS_VERBOSITY=info accelerate launch --config_file fsdp.yaml scripts/run_orpo.py config.yaml
Otherwise, if you prefer to use custom code, you can look at `run_orpo.py` for reference on how to properly initialize `accelerate` to use multiple GPUs for fine-tuning.
Hope that helped you in the meantime 👍🏻
Thanks, I got it working, though there is a bug with bfloat16 on Ampere devices. I created a pull request to fix it.
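The pull request itself isn't linked in this thread, so the exact fix isn't shown; as a general, hedged sketch, a dtype guard for GPUs without native bfloat16 support typically looks like this:

```python
# Hedged sketch (not the actual PR): use bfloat16 only when the current GPU
# reports native support (Ampere or newer), otherwise fall back to float16.
import torch

if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    torch_dtype = torch.bfloat16
else:
    torch_dtype = torch.float16
print(f"Selected dtype: {torch_dtype}")
```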
Hello, I'd like to know whether a configuration like this will show the following warning when loading the model: "You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`."
I hit upon an error in Hugging Face for which there are, strangely, zero Google search results:
"ValueError: Calculated loss must be on the original device". I can see the source of this error in the Hugging Face trainer.py file.
The full error is "ValueError: Calculated loss must be on the original device: cuda:0 but device in use is cuda:3".
This happens when I use multi-GPU training with accelerate with this code.
I can set the device map to a specific GPU to avoid this, but one GPU doesn't have enough memory to support our ORPO training.
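For illustration, the pinning workaround described above looks roughly like this (a hedged sketch; the model name and quantization settings are taken from elsewhere in the thread, not from the actual script):

```python
# Hedged sketch of the single-GPU workaround: pinning every module to cuda:0
# avoids the cross-device loss error, but that one GPU must then hold the whole
# model, which is what runs out of memory below.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "aihello/podcast",        # model name taken from earlier in the thread
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    device_map={"": 0},       # place everything on cuda:0
)
```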
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 14.50 MiB is free. yadda yadda
This is specific to ORPO, as I have no issues with PEFT fine-tuning in a multi-GPU setup.