Hello, we developed our training scripts based on EasyContext and use sequence parallelism. The batch size is usually configured as 1, and we can set another hyperparameter, gradient-accumulate-every (as shown in the README), to a large value to get an effect equivalent to a larger batch size.
We may also try using a larger batch size under the EasyContext and SP settings, but that does not seem strictly necessary as long as we can adjust gradient-accumulate-every.
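For illustration, here is a minimal sketch of that equivalence in a plain PyTorch loop (not EasyContext's actual training code; the model, optimizer, and data below are toy placeholders): with a per-device batch size of 1, accumulating gradients over gradient-accumulate-every micro-batches before each optimizer step approximates training with that larger batch size.

```python
import torch
from torch import nn

# Toy setup (placeholders, not EasyContext's real model/data): the point is only
# to show how gradient-accumulate-every emulates a larger batch size.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data_loader = [(torch.randn(1, 16), torch.randn(1, 1)) for _ in range(32)]  # batch size 1

gradient_accumulate_every = 8  # hypothetical value; effective batch size = 1 * 8

optimizer.zero_grad()
for step, (x, y) in enumerate(data_loader):
    loss = nn.functional.mse_loss(model(x), y)
    # Scale the loss so the accumulated gradient equals the average over the
    # micro-batches, i.e. the gradient of one batch of size 8.
    (loss / gradient_accumulate_every).backward()
    if (step + 1) % gradient_accumulate_every == 0:
        optimizer.step()       # one optimizer update per 8 micro-batches
        optimizer.zero_grad()
```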
Exception type: `ValueError`. Detail:

```
Traceback (most recent call last):
  File "/checkpoint/binary/train_package/utils/train.py", line 424, in <module>
    LongRecipe_train.train_with_stage()
  File "/checkpoint/binary/train_package/utils/train.py", line 360, in train_with_stage
    model, accelerator = self.train(stage, model, accelerator, train_data_loader, loss_func, optim, scheduler, progress_bar)
  File "/checkpoint/binary/train_package/utils/train.py", line 235, in train
    for idx, batch in enumerate(train_data_loader):
  File "/root/.local/lib/python3.10/site-packages/accelerate/data_loader.py", line 550, in __iter__
    current_batch = next(dataloader_iter)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/root/.local/lib/python3.10/site-packages/transformers/data/data_collator.py", line 92, in default_data_collator
    return torch_default_data_collator(features)
  File "/root/.local/lib/python3.10/site-packages/transformers/data/data_collator.py", line 158, in torch_default_data_collator
    batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 33921 at dim 1 (got 39205)
```
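For context on the last frames: the ValueError comes from `torch_default_data_collator` calling `torch.tensor` on a list of per-example `input_ids` of different lengths (33921 vs. 39205), which can only happen when the dataloader batch size is greater than 1 and the examples are not padded to a common length. Below is a minimal reproduction and one possible workaround sketch (an assumption, not the project's official fix; `pad_token_id=0` is a placeholder).

```python
import torch
from transformers.data.data_collator import default_data_collator

# Reproduction: default_data_collator calls torch.tensor() on the raw lists,
# which requires every example in the batch to have the same length.
features = [
    {"input_ids": list(range(33921))},   # lengths taken from the traceback above
    {"input_ids": list(range(39205))},
]
# default_data_collator(features)  # -> ValueError: expected sequence of length 33921 at dim 1 (got 39205)

# Possible workaround sketch: pad each example in the batch to the longest
# sequence before stacking into a single tensor.
def pad_collate(features, pad_token_id=0):  # pad_token_id is an assumed placeholder
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids = torch.full((len(features), max_len), pad_token_id, dtype=torch.long)
    for i, f in enumerate(features):
        input_ids[i, : len(f["input_ids"])] = torch.tensor(f["input_ids"], dtype=torch.long)
    return {"input_ids": input_ids}

batch = pad_collate(features)
print(batch["input_ids"].shape)  # torch.Size([2, 39205])
```

With batch size 1 (plus gradient accumulation, as discussed above), this collation issue does not arise, since there is never more than one sequence length per batch.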