yabufarha / ms-tcn


the variable "mask" in Trainer (line 71 in model.py) #4

Closed cmhungsteve closed 5 years ago

cmhungsteve commented 5 years ago

I wonder what "mask" is used for in the following code.

    batch_input, batch_target, mask = batch_gen.next_batch(batch_size)
    batch_input, batch_target, mask = batch_input.to(device), batch_target.to(device), mask.to(device)
    optimizer.zero_grad()
    predictions = self.model(batch_input)  # one prediction tensor per stage

    loss = 0
    for p in predictions:
        # frame-wise classification loss
        loss += self.ce(p.transpose(2, 1).contiguous().view(-1, self.num_classes), batch_target.view(-1))
        # truncated MSE smoothing loss between adjacent frames, weighted by the mask
        loss += 0.15*torch.mean(torch.clamp(self.mse(F.log_softmax(p[:, :, 1:], dim=1), F.log_softmax(p.detach()[:, :, :-1], dim=1)), min=0, max=16)*mask[:, :, 1:])

    epoch_loss += loss.item()
    loss.backward()
    optimizer.step()

    _, predicted = torch.max(predictions[-1].data, 1)  # predicted class indices
    correct += ((predicted == batch_target).float()*mask[:, 0, :].squeeze(1)).sum().item()
    total += torch.sum(mask[:, 0, :]).item()

It doesn't seem to do anything and always appears to be all ones.

yabufarha commented 5 years ago

The mask tensor is needed if you have videos of variable lengths in your batch. It marks the valid outputs that are relevant for computing the loss and masks out the non-relevant outputs that are generated because of the padding. Nevertheless, at the default settings the batch size is one, which means the mask is not needed in this case because no padding is required. I hope this helps.

Best, Yazan
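
To make this concrete, here is a minimal, hypothetical sketch (my illustration, not the repo's batch_gen code) of what such a mask looks like for a batch of two videos of different lengths, assuming the shape (bz, num_classes, T) implied by the snippet in the question:

    import torch

    num_classes = 4
    lengths = [5, 3]                       # frame counts of the two videos
    T = max(lengths)                       # pad everything to the longest video

    # 1 marks real frames, 0 marks padded ones
    mask = torch.zeros(len(lengths), num_classes, T)
    for i, t in enumerate(lengths):
        mask[i, :, :t] = 1                 # only the first t frames are valid

    print(mask[0, 0, :])   # tensor([1., 1., 1., 1., 1.])  <- no padding needed
    print(mask[1, 0, :])   # tensor([1., 1., 1., 0., 0.])  <- last 2 frames padded

    # With the default batch size of 1, T equals the single video's length,
    # so the mask is all ones and multiplying by it changes nothing.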

cmhungsteve commented 5 years ago

Thank you for your explanation, but I am still not sure what "padding" means here. It would be great if you could explain more. Thank you.

yabufarha commented 5 years ago

The input of the model is a tensor of size (bz, d, T), where bz is the batch size, d is the dimension of the features, and T is the length (number of frames) of the longest video in the batch. So if your batch size is greater than one, you have to pad the shorter videos with zeros to make sure that all videos in the batch have the same length T.
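
As a hypothetical illustration of that padding step (this is not the repo's actual batch_gen; the names are mine):

    import torch

    # Collate per-video feature tensors of shape (d, T_i) into one
    # (bz, d, T) batch by zero-padding up to the longest video.
    d = 2048                                            # feature dimension
    videos = [torch.randn(d, 70), torch.randn(d, 50)]   # T_1=70, T_2=50

    T = max(v.shape[1] for v in videos)
    batch_input = torch.zeros(len(videos), d, T)
    for i, v in enumerate(videos):
        batch_input[i, :, :v.shape[1]] = v              # trailing frames stay zero

    print(batch_input.shape)                            # torch.Size([2, 2048, 70])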

cmhungsteve commented 5 years ago

I see. Thank you for your detailed explanation. Just to double-check: if I set the batch size larger than 1, padding will happen and the mask will select the relevant outputs for evaluation. Is that correct?

yabufarha commented 5 years ago

Yes, that's correct ;)

cmhungsteve commented 5 years ago

Thank you so much. It's pretty clear to me now.

anisrashidov commented 3 years ago

> The input of the model is a tensor of size (bz, d, T), where bz is the batch size, d is the dimension of the features, and T is the length (number of frames) of the longest video in the batch. So if your batch size is greater than one, you have to pad the shorter videos with zeros to make sure that all videos in the batch have the same length T.

May I ask you to be more specific about the variable length? Doesn't the model rely on Conv1d layers, which do not care about the input length? Thank you in advance.
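
For what it's worth, a short sketch of the distinction (my illustration): a Conv1d layer is indeed length-agnostic for a single video, but videos of different lengths cannot be stacked into one batch tensor without padding.

    import torch
    import torch.nn as nn

    conv = nn.Conv1d(in_channels=2048, out_channels=64, kernel_size=3, padding=1)

    # Per sample, Conv1d accepts any temporal length T:
    print(conv(torch.randn(1, 2048, 70)).shape)   # torch.Size([1, 64, 70])
    print(conv(torch.randn(1, 2048, 50)).shape)   # torch.Size([1, 64, 50])

    # But a batch must be a single rectangular tensor, so different lengths
    # cannot be stacked without padding first:
    # torch.stack([torch.randn(2048, 70), torch.randn(2048, 50)])  # RuntimeError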