wenet-e2e / wespeaker

Research and Production Oriented Speaker Verification, Recognition and Diarization Toolkit
Apache License 2.0

fix a bug in executor.py #147

Closed lsrami closed 1 year ago

lsrami commented 1 year ago

The following error occurs when batch_size = 1 or drop_last = False: ValueError: Expected input batch_size (1) to match target batch_size (0).
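
For reference, F.cross_entropy expects logits of shape (N, C) and integer class targets of shape (N,); when the target tensor loses its batch dimension, the call fails with a batch-size mismatch like the one above. A minimal sketch of this shape contract (the tensor sizes here are only illustrative, not taken from the repository):

    import torch
    import torch.nn.functional as F

    logits = torch.randn(1, 5)     # model output of shape (N, C) with N = 1
    labels = torch.tensor([2])     # target of shape (N,); shapes match, so this works
    loss = F.cross_entropy(logits, labels)

    # If the target instead has shape () or (0,), the batch sizes no longer
    # match and cross_entropy raises an error like the one reported above.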

cdliang11 commented 1 year ago

Hi, I did not reproduce this bug.

cdliang11 commented 1 year ago

Could you provide more information?

lsrami commented 1 year ago

Could you provide more information?

This bug can be triggered in two cases. When batch_size = 1, the model output has one more dimension than the label, so the shapes do not match when computing the loss. When drop_last = False and the dataset length is not an integral multiple of batch_size, the last batch can end up with only one sample, which reduces to the batch_size = 1 case above. Of course, we don't usually set batch_size to 1 for training, but we do when debugging the code, so you can set batch_size to 1 to reproduce the bug. Here is a fragment of my error message:

Traceback (most recent call last):
  File "/data/lisirui/wespeaker/examples/cnceleb/v5_0327/wespeaker/bin/train.py", line 235, in <module>
    fire.Fire(train)
  File "/data/lisirui/anaconda3/envs/wespeaker/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/data/lisirui/anaconda3/envs/wespeaker/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/data/lisirui/anaconda3/envs/wespeaker/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/data/lisirui/wespeaker/examples/cnceleb/v5_0327/wespeaker/bin/train.py", line 207, in train
    run_epoch(train_dataloader,
  File "/data/lisirui/wespeaker/examples/cnceleb/v5_0327/wespeaker/utils/executor.py", line 65, in run_epoch
    loss = criterion(outputs, targets)
  File "/data/lisirui/anaconda3/envs/wespeaker/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/lisirui/anaconda3/envs/wespeaker/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 1164, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/data/lisirui/anaconda3/envs/wespeaker/lib/python3.9/site-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (1) to match target batch_size (0).
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 112747 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 112749 closing signal SIGTERM
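
One common way such a mismatch arises with a batch of one (this is an assumption about the mechanism, since the PR diff is not shown in this thread) is squeezing the label tensor: squeeze() with no argument also removes the batch dimension when it equals 1, whereas a dimension-specific squeeze or a reshape keeps it. A small illustration:

    import torch

    targets = torch.tensor([[3]])       # a batch of one label, shape (1, 1)

    print(targets.squeeze().shape)      # torch.Size([])  -> batch dimension is gone
    print(targets.squeeze(1).shape)     # torch.Size([1]) -> batch dimension preserved
    print(targets.reshape(-1).shape)    # torch.Size([1]) -> batch dimension preserved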

JiJiJiang commented 1 year ago

Sorry, I can't reproduce this bug either by simply setting batch_size=1 in our voxceleb2 recipe with the resnet34 model. Are there any other experimental setups you modified that differ from the default config? Could you provide more details to help us reproduce this bug? Thanks.

lsrami commented 1 year ago

This problem was caused by my own modification of the dataloader; it was resolved after pulling the latest version.