zhiyuanhubj / LongRecipe

LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
https://arxiv.org/abs/2409.00509

Shape mismatch error when batch size is set greater than 1 #6

Open 233function opened 1 month ago

233function commented 1 month ago

Exception type: ValueError
Detail:
Traceback (most recent call last):
  File "/checkpoint/binary/train_package/utils/train.py", line 424, in
    LongRecipe_train.train_with_stage()
  File "/checkpoint/binary/train_package/utils/train.py", line 360, in train_with_stage
    model, accelerator = self.train(stage, model, accelerator, train_data_loader, loss_func, optim, scheduler, progress_bar)
  File "/checkpoint/binary/train_package/utils/train.py", line 235, in train
    for idx, batch in enumerate(train_data_loader):
  File "/root/.local/lib/python3.10/site-packages/accelerate/data_loader.py", line 550, in __iter__
    current_batch = next(dataloader_iter)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/root/.local/lib/python3.10/site-packages/transformers/data/data_collator.py", line 92, in default_data_collator
    return torch_default_data_collator(features)
  File "/root/.local/lib/python3.10/site-packages/transformers/data/data_collator.py", line 158, in torch_default_data_collator
    batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 33921 at dim 1 (got 39205)
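For reference, the failure comes from transformers' default_data_collator, which builds one tensor per key across the batch and cannot stack unpadded sequences of different lengths. A minimal sketch reproducing the same ValueError (toy lengths here; the real run had 33921 vs. 39205 tokens):

```python
from transformers import default_data_collator

# Two unpadded examples whose input_ids differ in length (hypothetical toy data).
features = [
    {"input_ids": [0] * 5},
    {"input_ids": [0] * 7},
]

# default_data_collator calls torch.tensor([...]) across the batch, which raises
# "ValueError: expected sequence of length 5 at dim 1 (got 7)" for ragged inputs.
default_data_collator(features)
```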

zhiyuanhubj commented 1 month ago

Hello, we developed our training scripts based on EasyContext and use sequence parallelism. Usually the batch size is configured as 1, and another hyperparameter, gradient-accumulate-every (as shown in the README), can be set to a large value to achieve an effect equivalent to a larger batch size.

We will also look into supporting a larger batch size in the EasyContext and SP settings, but it does not seem very necessary when changing the gradient-accumulate-every hyperparameter achieves the same effect.
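As a rough illustration, here is a minimal sketch of what that setting amounts to, using Hugging Face Accelerate's gradient-accumulation API; the tiny model, dummy data, and the value 8 are placeholders, not the repo's actual training code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Hypothetical stand-ins just to show the pattern.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(32, 16), torch.randn(32, 1))
train_loader = DataLoader(dataset, batch_size=1)  # per-step batch size stays 1

gradient_accumulate_every = 8  # effective batch size = 1 * 8
accelerator = Accelerator(gradient_accumulation_steps=gradient_accumulate_every)
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

for inputs, targets in train_loader:
    # Gradients accumulate across iterations; the optimizer step only takes
    # effect every `gradient_accumulate_every` batches.
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```

With the per-step batch size kept at 1, the default collator never sees sequences of different lengths, while the accumulated gradients give the same effective batch size as a larger batch would.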