Open · shawnricecake opened this issue 9 months ago
Hi,
When I run the command on 8 GPUs:
python3 qalora.py --model_path $llama_7b_4bit_g32
it shows the following error:
File "/home/shawn/anaconda3/envs/qalora/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 830, in forward
    logits = self.lm_head(hidden_states)
File "/home/shawn/anaconda3/envs/qalora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "/home/shawn/anaconda3/envs/qalora/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
And when I change the batch size:
per_device_train_batch_size: int = field(default=1, metadata={"help": 'The training batch size per GPU. Increase for better speed.'})
from default=1 to default=16, the training process still shows the following:
0%| | 0/10000 [00:00<?, ?it/s]
0%| | 1/10000 [07:15<1210:14:35, 435.73s/it]
Do I need to decrease the maximum number of steps when I increase the batch size?
Also, how can we run the MMLU evaluation and reproduce the results in the paper? (I did not find the MMLU dataset setup instructions.) Is it correct to download the .json files from here? https://huggingface.co/datasets/openaccess-ai-collective/mmlu-evals/tree/main
Would you mind helping me with these problems?
Thanks,
Shawn
I have the same problem as problem 2.
@shawnricecake For the second question: total batch size = gradient_accumulation_steps × per_device_train_batch_size × number of devices.
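As a quick sanity check on that formula, here is an illustrative calculation; only the per-device batch size of 16 and the 8 GPUs come from this thread, and the gradient_accumulation_steps value is a made-up placeholder:

# Illustrative values: per_device_train_batch_size=16 and 8 GPUs come from
# the thread; gradient_accumulation_steps=4 is a hypothetical placeholder.
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_devices = 8

total_batch_size = gradient_accumulation_steps * per_device_train_batch_size * num_devices
print(total_batch_size)  # 512 examples consumed per optimizer step

# Going from a per-device batch of 1 to 16 means each step consumes 16x
# more examples, so max_steps can shrink by the same factor to cover
# roughly the same amount of data:
print(10000 // 16)  # 625 steps instead of 10000

So yes: if max_steps stays at 10000 while the per-device batch goes from 1 to 16, each step processes 16× more data and takes correspondingly longer, which is why the per-iteration time and the ETA balloon instead of shrinking.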
It seems that the first problem can be solved by the following code in the function def get_accelerate_model(args, checkpoint_dir):

for name, module in model.named_modules():
    if isinstance(module, LoraLayer):
        if args.bf16:
            module = module.to(torch.bfloat16)
    if 'norm' in name:
        # module = module.to(torch.float32)
        # add to solve the tensor type mismatch
        module = module.to(torch.float16)
    if 'lm_head' in name or 'embed_tokens' in name:
        # add to solve the first problem
        module = module.to(torch.device("cuda:0"))
        if hasattr(module, 'weight'):
            if args.bf16 and module.weight.dtype == torch.float32:
                module = module.to(torch.bfloat16)
module = module.to(torch.device("cuda:0")) is a big mistake: when the model is sharded across GPUs by a device map, manually moving modules conflicts with the placement accelerate has already set up, so it just moves the mismatch around instead of fixing it.
For question 1:
I just changed device_map in AutoGPTQForCausalLM.from_quantized and it worked:
device_map='balanced_low_0'
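For anyone hitting the same error, a minimal sketch of what this fix looks like; only the device_map value comes from this comment, the checkpoint path is a placeholder, and any other arguments you pass to from_quantized are left at their defaults:

from auto_gptq import AutoGPTQForCausalLM

model_path = "llama-7b-4bit-g32"  # placeholder: path to the quantized checkpoint

# 'balanced_low_0' is an accelerate device-map preset: it spreads the model's
# layers evenly across all visible GPUs while keeping GPU 0 as free as
# possible. Letting the device map do the placement avoids the cuda:0 vs
# cuda:7 mismatch that manual module.to(...) calls can cause.
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    device_map='balanced_low_0',
)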