xfactlab / orpo

Official repository for ORPO

Loss device for ORPOTrainer #18

Open ganeshkrishnan1 opened 3 months ago

ganeshkrishnan1 commented 3 months ago

I hit upon an error in Hugging Face for which there are, strangely, zero Google search results:

"ValueError: Calculated loss must be on the original device"

I can see where this error is raised in the Hugging Face trainer.py source code. The full error is "ValueError: Calculated loss must be on the original device: cuda:0 but device in use is cuda:3".

This happens when I use multiple GPUs via accelerate with this code:

from transformers import AutoModelForCausalLM

model_name = "aihello/podcast"
# device_map='auto' spreads the model layers across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=torch_dtype, quantization_config=bnb_config, device_map='auto'
)

I can set the device map to a specific GPU to avoid this, but one GPU doesn't have enough memory to support our ORPO training:

model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=torch_dtype, quantization_config=bnb_config, device_map={"": 0}, attn_implementation=attn_implementation
)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 14.50 MiB is free. yadda yadda

orpo_config = ORPOConfig(
    output_dir="./output/",
    evaluation_strategy="steps",
    do_eval=True,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=1,
    log_level="debug",
    logging_steps=20,
    learning_rate=8e-6,
    eval_steps=20,
    num_train_epochs=3,
    # max_steps=9000,
    save_steps=20,
    save_strategy='epoch',
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    beta=0.1, #beta is ORPO's lambda
    max_length=1024,
)

trainer = ORPOTrainer(
        model=model,
        train_dataset=dataset[0],
        eval_dataset=dataset[1],
        peft_config=peft_config,
        args=orpo_config,
        tokenizer=tokenizer,
)

trainer.train()

This is specific to ORPO, as I have no issues with PEFT fine-tuning in a multi-GPU setup.

jiwooya1000 commented 3 months ago

Hello @ganeshkrishnan1,

Could you try loading the model on the CPU first, before passing it to the ORPOTrainer, by removing device_map='auto'?

model_name = "aihello/podcast"
model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=torch_dtype, quantization_config=bnb_config
)

Accelerate usually allocates the model and the loss automatically to the appropriate GPUs, so let me know whether loading the model on the CPU first resolves it.

ganeshkrishnan1 commented 3 months ago

If I remove the device_map, only one GPU is used and I get an out-of-memory error. If I add device_map, I get this error: "ValueError: Calculated loss must be on the original device: cuda:0 but device in use is cuda:3".

Also, I tried replacing the ORPO trainer with the DPO trainer, and it worked without any issues.

blaze7451 commented 3 months ago

Hi @ganeshkrishnan1, @jiwooya1000, I just faced the same situation and raised an issue in trl here. Any suggestions for fixing this error?

huangxinping commented 3 months ago

I'm facing the same issue. Does anyone know how to fix the error?

ganeshkrishnan1 commented 3 months ago

@blaze7451 @huangxinping there is no known fix for now. We have reverted to DPO and might revisit this later or try to fix it ourselves.

jiwooya1000 commented 2 months ago

Hello, @nlee-208 and I are currently using the alignment-handbook and TRL too, but we have not been able to reproduce the issue so far. Could you specify which accelerate setting you are using, @ganeshkrishnan1 @huangxinping @blaze7451 (e.g., FSDP, DS2, --multi-gpu)?
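
If you are unsure, running accelerate env prints the currently active accelerate configuration (the default config file lives under ~/.cache/huggingface/accelerate/), so sharing that output would also help:

accelerate env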

ganeshkrishnan1 commented 2 months ago

I am using FSDP:

from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
device = accelerator.device

huangxinping commented 2 months ago

@jiwooya1000 I am just using ORPOTrainer, following the blog tutorial Fine-tune Llama 3 with ORPO.

If I use only one GPU to fine-tune Llama 3, it trains successfully.

blaze7451 commented 2 months ago

I simply used ORPOTrainer and didn't set any specific accelerate config. Just like @huangxinping, I tried training on a single GPU and it works for me.

nlee-208 commented 2 months ago

Hey @alvarobartt, do you perhaps have any solution or similar experience using the ORPOTrainer from trl? It seems like there are some issues with the device mapping from either bnb or peft for the trl ORPOTrainer.

alvarobartt commented 2 months ago

Thanks for the ping @nlee-208! Indeed, AFAIK device_map is only intended to be used for inference; otherwise accelerate handles the device placement, and so does the alignment-handbook, so setting device_map for training is just not correct.

See https://github.com/huggingface/alignment-handbook/blob/70769f9e9ba41c7f08ba6c4ff3725441b68b7ca3/src/alignment/model_utils.py#L33C1-L35C85
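
Roughly, what those lines do (paraphrasing here, not the exact handbook code) is set a device_map only when loading a quantized model, and map the whole model to the local process index instead of "auto":

import torch
from accelerate import Accelerator

def get_kbit_device_map():
    # Sketch of the handbook helper: place the whole quantized model on the GPU
    # owned by this process ({"": local_rank}), which plays nicely with one
    # process per GPU launched via accelerate launch; return None without CUDA.
    return {"": Accelerator().local_process_index} if torch.cuda.is_available() else None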

alvarobartt commented 2 months ago

Also, could you @huangxinping @ganeshkrishnan1 @blaze7451 clarify what the issue is? Does removing the device_map work in a multi-GPU environment, or is it also failing? Also, how many processes are you using? If you could share the accelerate config for multi-GPU, that would be great to help debug this.

alvarobartt commented 2 months ago

Could you try using the FSDP configuration at https://github.com/huggingface/alignment-handbook/blob/main/recipes/accelerate_configs/fsdp.yaml and run it as accelerate launch --config_file fsdp.yaml your_script.py, setting num_processes to the number of GPUs you want to use and removing device_map="auto" from AutoModelForCausalLM.from_pretrained?

alvarobartt commented 2 months ago

So to replicate Maxime's script via the alignment-handbook you should use the following configuration, say config.yaml:

# Model arguments
model_name_or_path: meta-llama/Meta-Llama-3-8B
torch_dtype: bfloat16
use_flash_attention_2: true

# LoRA arguments
use_peft: true
load_in_4bit: true
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj

# Data training arguments
dataset_mixer:
  mlabonne/orpo-dpo-mix-40k: 0.1
dataset_splits:
- train
preprocessing_num_workers: 12

# ORPOTrainer arguments
beta: 0.1
do_eval: true
evaluation_strategy: steps
eval_steps: 0.2
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: llama-3-orpo-qlora
learning_rate: 8.0e-6
log_level: info
logging_steps: 1
lr_scheduler_type: linear 
max_length: 1024
max_prompt_length: 512
num_train_epochs: 1
optim: paged_adamw_8bit
output_dir: results/
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
seed: 42
warmup_ratio: 0.1
warmup_steps: 10

Then the following FSDP configuration file (tweaking the num_processes), say fsdp.yaml:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

And then run that as:

ACCELERATE_LOG_LEVEL=info TRANSFORMERS_VERBOSITY=info accelerate launch --config_file fsdp.yaml scripts/run_orpo.py config.yaml

Otherwise, if you prefer to use custom code, you can look at run_orpo.py for reference on how to properly initialize accelerate to use multiple GPUs for fine-tuning.
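
If it helps, a minimal sketch of that custom-code path could look roughly like the following (not the actual run_orpo.py; it reuses the model and dataset names from the config above, and the key points are that there is no device_map and that you launch the script with accelerate launch so each process owns one GPU):

# sketch_orpo_multigpu.py -- a rough sketch, not the actual run_orpo.py script.
# Launch with: accelerate launch --config_file fsdp.yaml sketch_orpo_multigpu.py
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 3 ships without a pad token

# No device_map: accelerate/FSDP decides where the weights live.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Assumes the dataset already provides prompt/chosen/rejected in the format
# ORPOTrainer expects; otherwise apply your own formatting first.
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

orpo_config = ORPOConfig(
    output_dir="results/",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=8.0e-6,
    beta=0.1,
    max_length=1024,
    max_prompt_length=512,
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()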

Hope that helped you in the meantime 👍🏻

ganeshkrishnan1 commented 2 months ago

Thanks, I got it working, though there is the bug with bfloat16 on Ampere devices. I created a pull request to fix it.
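
For anyone else hitting the bfloat16 issue, a hypothetical workaround (not the actual change in my pull request) is to fall back to float16 when the GPU does not report bfloat16 support:

import torch

# bfloat16 is only supported on Ampere or newer CUDA GPUs; fall back otherwise.
torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16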

hitszxs commented 3 weeks ago

Hello, regarding the config.yaml and fsdp.yaml posted above: I would like to know whether, with this configuration, the following warning appears when loading the model: "You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda')."