Training with multi gpus, increase the batch size, and how to evaluate?

shawnricecake commented 9 months ago

Hi,

1. When I run this command on 8 GPUs:

python3 qalora.py --model_path $llama_7b_4bit_g32

it shows this error:

  File "/home/shawn/anaconda3/envs/qalora/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 830, in forward
    logits = self.lm_head(hidden_states)
  File "/home/shawn/anaconda3/envs/qalora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/shawn/anaconda3/envs/qalora/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

2. When I change the batch size:

per_device_train_batch_size: int = field(default=1, metadata={"help": 'The training batch size per GPU. Increase for better speed.'})

from default=1 to default=16, the training process still shows the following:


  0%|          | 0/10000 [00:00<?, ?it/s]
  0%|          | 1/10000 [07:15<1210:14:35, 435.73s/it]

Do I need to decrease the max number of steps when I increase the batch size?

3. Also, how can we run the evaluation on MMLU and get the results in the paper? (I did not find the MMLU dataset setup instructions...) Is it correct to download the .json files here? https://huggingface.co/datasets/openaccess-ai-collective/mmlu-evals/tree/main

Would you mind helping me with these problems?

Thanks,
Shawn

shawnricecake commented 9 months ago

It seems that the first problem can be solved by adding the following code to the function get_accelerate_model(args, checkpoint_dir):

for name, module in model.named_modules():
    if isinstance(module, LoraLayer):
        if args.bf16:
            module = module.to(torch.bfloat16)
    if 'norm' in name:
        # module = module.to(torch.float32)
        # add to solve the tensor type mismatch
        module = module.to(torch.float16)
    if 'lm_head' in name or 'embed_tokens' in name:
        # add to solve the first problem
        module = module.to(torch.device("cuda:0"))
        if hasattr(module, 'weight'):
            if args.bf16 and module.weight.dtype == torch.float32:
                module = module.to(torch.bfloat16)
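
Before moving modules by hand, it can help to see where each module was actually placed. The sketch below groups the entries of a device map by device. It assumes the loaded model exposes a dict such as hf_device_map (Transformers models loaded with a device_map do); the example_map values here are hypothetical and only illustrate the kind of split that produces the cuda:0 / cuda:7 error.

```python
from collections import defaultdict


def group_by_device(device_map):
    """Group module names by the device they are assigned to."""
    groups = defaultdict(list)
    for module_name, device in device_map.items():
        groups[str(device)].append(module_name)
    return dict(groups)


# Hypothetical placement, for illustration only. With a real model you
# would pass something like model.hf_device_map instead (assumption).
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.31": 7,
    "model.norm": 7,
    "lm_head": 7,
}

for device, names in group_by_device(example_map).items():
    print(device, names)
```

If lm_head and the tensors it receives show up on different devices, that matches the "Expected all tensors to be on the same device" message above.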

StiphyJay commented 9 months ago

I have the same problem as in question 2.

yuhuixu1993 commented 9 months ago

@shawnricecake On the second question: total batch size = gradient_accumulation_steps * per_device_train_batch_size * devices
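
As a rough worked example of the formula above: the sketch below computes the total batch size and the number of steps that would cover the same number of training samples after the batch size changes. The gradient_accumulation_steps value of 16 and the single device are assumptions for illustration, not values taken from this thread, and the helper names are made up here.

```python
def total_batch_size(gradient_accumulation_steps, per_device_train_batch_size, devices):
    """Total batch size per optimizer step, following the formula above."""
    return gradient_accumulation_steps * per_device_train_batch_size * devices


def steps_for_same_samples(old_steps, old_total_batch, new_total_batch):
    """Steps needed to see the same number of samples (rounded up)."""
    samples = old_steps * old_total_batch
    return (samples + new_total_batch - 1) // new_total_batch


# Assumed settings: gradient_accumulation_steps=16, one device.
old_total = total_batch_size(16, 1, 1)    # per-device batch size 1
new_total = total_batch_size(16, 16, 1)   # per-device batch size 16

print(old_total, new_total)
print(steps_for_same_samples(10000, old_total, new_total))
```

Under these assumptions, each step processes 16 times more samples after the change, so the step count shown in the progress bar stays at 10000 unless the max number of steps is lowered to match.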

tellyoung commented 3 months ago

module = module.to(torch.device("cuda:0")) is a big mistake

Eros1on-Aqua commented 3 weeks ago

For question 1: I just changed device_map in AutoGPTQForCausalLM.from_quantized and it worked: device_map='balanced_low_0'
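
A minimal sketch of that change, assuming auto_gptq is installed. The load_quantized wrapper is a hypothetical helper, not code from qalora.py, and the real call in the script probably passes more arguments than shown. In Accelerate, 'balanced_low_0' spreads the model evenly over the other GPUs and puts on GPU 0 only what does not fit elsewhere.

```python
def load_quantized(model_path, device_map="balanced_low_0"):
    """Load a GPTQ-quantized model with the given device_map.

    Hypothetical wrapper for illustration; qalora.py may pass further
    arguments to from_quantized that are omitted here.
    """
    # Imported lazily so this file can be imported without auto_gptq installed.
    from auto_gptq import AutoGPTQForCausalLM

    return AutoGPTQForCausalLM.from_quantized(model_path, device_map=device_map)


# Usage (model_path stands for the same path passed as --model_path above):
# model = load_quantized(model_path)
```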

yuhuixu1993 / qa-lora

Training with multi gpus, increase the batch size, and how to evaluate? #17