
yezhen17 / 3DIoUMatch-PVRCNN

[CVPR 2021] PyTorch implementation of 3DIoUMatch: Leveraging IoU Prediction for Semi-Supervised 3D Object Detection.


Training error #2

Open chenxyyy opened 3 years ago

chenxyyy commented 3 years ago

Hello @THU17cyz, thank you for open-sourcing your codebase.

I have successfully run the pretraining phase on KITTI, but I ran into a problem in the training phase.

When I use a batch size of 1, the following code fails:

# pcdet/models/detectors/pv_rcnn_ssl.py  line:257
def update_global_step(self):
    self.global_step += 1
    alpha = 0.999
    # Use the true average until the exponential average is more correct
    alpha = min(1 - 1 / (self.global_step + 1), alpha)
    for ema_param, param in zip(self.pv_rcnn_ema.parameters(), self.pv_rcnn.parameters()):
        ema_param.data.mul_(alpha).add_(1 - alpha,  param.data)

PyTorch reports that add_ cannot take 2 positional arguments, so I changed ema_param.data.mul_(alpha).add_(1 - alpha, param.data) to ema_param.data.mul_(alpha).add_((1 - alpha) * param.data).

After that, training ran successfully.

But when I change the batch size to 2, 4, or another value, I get an index error in the code below.

# pcdet/models/detectors/pv_rcnn_ssl.py  line:67
for ind in unlabeled_mask:
    pseudo_score = pred_dicts[ind]['pred_scores']
    pseudo_box = pred_dicts[ind]['pred_boxes']
    pseudo_label = pred_dicts[ind]['pred_labels']
    pseudo_sem_score = pred_dicts[ind]['pred_sem_scores']

    if len(pseudo_label) == 0:
        pseudo_boxes.append(pseudo_label.new_zeros((0, 8)).float())
        continue

The indices in unlabeled_mask go beyond the range of pred_dicts. For example, unlabeled_mask is [2, 3, 6, 7] but pred_dicts has only 4 entries, so indices 6 and 7 are out of range.

I would like to know whether my change is correct, and how to fix the error when the batch size is larger.

Looking forward to your reply.

yezhen17 commented 3 years ago

Hi @chenxyyy, I guess your PyTorch version is newer than the one I used, so the add_ function throws this error. Your modification should be correct (compare PyTorch 1.1, https://pytorch.org/docs/1.1.0/torch.html#torch.add, with PyTorch 1.5, https://pytorch.org/docs/1.5.0/torch.html#torch.add).
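
For reference, the same update can also be written with the keyword form of add_, which the newer PyTorch documentation describes as add_(other, alpha=...). The sketch below is only an illustration under that assumption: ema_update, max_alpha, and the teacher/student tensors are hypothetical names, not part of the repository.

```python
import torch


def ema_update(ema_params, params, global_step, max_alpha=0.999):
    """Blend `params` into `ema_params` in place and return the alpha used."""
    # Same warm-up rule as the snippet in this thread: behave like a running
    # mean early on, then cap the decay at max_alpha.
    alpha = min(1 - 1 / (global_step + 1), max_alpha)
    with torch.no_grad():
        for ema_param, param in zip(ema_params, params):
            # Keyword form: ema = alpha * ema + (1 - alpha) * param
            ema_param.mul_(alpha).add_(param, alpha=1 - alpha)
    return alpha


# Tiny demonstration with hypothetical tensors standing in for model weights.
teacher = [torch.zeros(3)]
student = [torch.ones(3)]
used_alpha = ema_update(teacher, student, global_step=1)
```

With global_step=1 the warm-up rule gives alpha = 0.5, so the teacher tensor moves halfway from 0 to 1. The (1 - alpha) * param.data form used above computes the same result; it just allocates a temporary tensor first.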

yezhen17 commented 3 years ago

As noted in the README, batch_size=2 (labeled + unlabeled per GPU) is currently hardcoded. I'm sorry for the inconvenience; for now the solution is to change the hardcoded value or to improve the code yourself. I may improve this myself in the future.

Please search for batch_size = 2 in the code and change 2 to your batch size (labeled + unlabeled per GPU).

chenxyyy commented 3 years ago

Hi, I found another error.

# /data/3DIoUMatch-PVRCNN/pcdet/datasets/kitti/kitti_dataset.py  line:46
if self.training:
    all_train = len(self.kitti_infos)
    self.unlabeled_index_list = list(set(list(range(all_train))) - set(self.sample_index_list))  # float()!!!
    # print(self.unlabeled_index_list)
    self.unlabeled_kitti_infos = []

The elements of set(list(range(all_train))) are int, but the elements of set(self.sample_index_list) are str, so the set difference in self.unlabeled_index_list = list(set(list(range(all_train))) - set(self.sample_index_list)) removes nothing.

I changed it to self.unlabeled_index_list = list(set(list(range(all_train))) - set([int(i) for i in self.sample_index_list]))

I am not sure whether I understand this correctly.

yezhen17 commented 3 years ago

Oh yes, your understanding is correct. This reminds me that I noticed it before but forgot to add a comment about it. Including the labeled samples in the unlabeled set is OK and makes no difference to performance. You can leave the code as is or apply your modification.
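
A small illustration of why the original set difference has no effect, assuming (as reported above) that the labeled indices are stored as strings. The values of all_train and sample_index_list here are made up for the example.

```python
# Hypothetical stand-ins: 6 training frames, labeled indices stored as strings.
all_train = 6
sample_index_list = ["0", "2", "5"]

# Mixed types: no str compares equal to an int, so nothing is removed.
unchanged = sorted(set(range(all_train)) - set(sample_index_list))

# Cast to int first: the labeled indices are now excluded.
unlabeled_index_list = sorted(
    set(range(all_train)) - {int(i) for i in sample_index_list}
)
```

Here unchanged still contains all six indices, while unlabeled_index_list is left with 1, 3, and 4.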

chenxyyy commented 3 years ago

Hi @THU17cyz, I made the following changes and finally ran the training code successfully.

pcdet/models/detectors/pv_rcnn_ssl.py line:38

Modify batch_dict['mask']: add batch_dict['mask'] = batch_dict['mask'][:, :1] under if self.training:

def forward(self, batch_dict):
    if self.training:
        batch_dict['mask'] = batch_dict['mask'][:, :1]  # modify by chenxyyyy
        mask = batch_dict['mask'].view(-1)

pcdet/models/detectors/pv_rcnn_ssl.py line:155

Because pseudo_sem_score can be empty while unzero_inds is not, the index can go out of bounds, so I added a check after line 155:

for i, ind in enumerate(unlabeled_mask):
    # statistics
    anchor_by_gt_overlap = iou3d_nms_utils.boxes_iou3d_gpu(
        batch_dict['gt_boxes'][ind, ...][:, 0:7],
        ori_unlabeled_boxes[i, :, 0:7])
    cls_pseudo = batch_dict['gt_boxes'][ind, ...][:, 7]
    unzero_inds = torch.nonzero(cls_pseudo).squeeze(1).long()
    cls_pseudo = cls_pseudo[unzero_inds]
    if len(unzero_inds) > 0 and len(pseudo_sem_score) > len(unzero_inds):  # modify by chenxyyyy
        iou_max, asgn = anchor_by_gt_overlap[unzero_inds, :].max(dim=1)

Batch_size

You hardcode the batch size to 2 in
- pcdet/models/dense_heads/anchor_head_template.py line 104
- pcdet/models/dense_heads/anchor_head_template.py line 178
- pcdet/models/dense_heads/point_head_template.py line 143
- pcdet/models/roi_heads/roi_head_template.py lines 234 and 246

While debugging the project I found that the batch size defined in train.py loads double the data (labeled and unlabeled): if the batch size defined in train.py is 2, the batch size used when calculating the loss should be 4.

So I doubled the hardcoded batch size in anchor_head_template.py, point_head_template.py, and roi_head_template.py.

After that, the training program runs normally.

I would like to know whether my changes are correct, and how to fix the error when the batch size is larger.

Looking forward to your reply.

yezhen17 commented 3 years ago

Why add this?

batch_dict['mask'] = batch_dict['mask'][:, :1]

Also, I forgot to mention: if you want labeled_data_batch_size != unlabeled_data_batch_size, this line should be modified too: https://github.com/THU17cyz/3DIoUMatch-PVRCNN/blob/1aa469fb7b0bdc22fc030f660f741e59a666160c/pcdet/datasets/kitti/kitti_dataset_ssl.py#L398 (it is also hardcoded that the labeled and unlabeled batch sizes are the same).

yezhen17 commented 3 years ago

And what happens if you do not make the second modification? I have not run into that situation.

chenxyyy commented 3 years ago

When I use batch size 2:

Without batch_dict['mask'] = batch_dict['mask'][:, :1], batch_dict['mask'] is [[1, 1], [0, 0], [1, 1], [0, 0]]:

def forward(self, batch_dict):
    if self.training:
        # batch_dict['mask'] is [[1, 1], [0, 0], [1, 1], [0, 0]]
        mask = batch_dict['mask'].view(-1)

        labeled_mask = torch.nonzero(mask).squeeze(1).long()   # labeled_mask will be [0, 1, 4, 5]
        unlabeled_mask = torch.nonzero(1-mask).squeeze(1).long()  # unlabeled_mask will be [2, 3, 6, 7]

unlabeled_mask will be [2, 3, 6, 7] when execution reaches the following loop:

for ind in unlabeled_mask:
    pseudo_score = pred_dicts[ind]['pred_scores']
    pseudo_box = pred_dicts[ind]['pred_boxes']
    pseudo_label = pred_dicts[ind]['pred_labels']
    pseudo_sem_score = self.new_method(pred_dicts, ind)
    ...

len(pred_dicts) is 4, so indices 6 and 7 are out of the range of pred_dicts.
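
A small sketch of the flattening issue, using the mask values reported in this comment. It only reproduces the index arithmetic; num_samples stands in for the number of entries in pred_dicts, and the variable names are chosen for the example.

```python
import torch

# Mask as reported in the thread: 4 samples, 2 columns per sample (1 = labeled).
mask_2d = torch.tensor([[1, 1], [0, 0], [1, 1], [0, 0]])
num_samples = mask_2d.shape[0]  # pred_dicts would have this many entries

# Flattening both columns yields 8 entries, so indices can reach 7.
flat = mask_2d.view(-1)
bad_unlabeled = torch.nonzero(1 - flat).squeeze(1).tolist()

# Keeping a single column yields exactly one entry per sample.
per_sample = mask_2d[:, :1].view(-1)
labeled = torch.nonzero(per_sample).squeeze(1).tolist()
unlabeled = torch.nonzero(1 - per_sample).squeeze(1).tolist()
```

With the two-column mask the unlabeled indices come out as 2, 3, 6, 7, and the last two exceed num_samples. After slicing to one column they become 1 and 3, which are valid positions in a list of 4 predictions.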

chenxyyy commented 3 years ago

About the second modification, I have found something else.

I noticed that after this loop finishes, the variable pseudo_sem_score holds the scores of the last element of pred_dicts:

for ind in unlabeled_mask:
    pseudo_score = pred_dicts[ind]['pred_scores']
    pseudo_box = pred_dicts[ind]['pred_boxes']
    pseudo_label = pred_dicts[ind]['pred_labels']
    pseudo_sem_score = pred_dicts[ind]['pred_sem_scores']

    if len(pseudo_label) == 0:
        pseudo_boxes.append(pseudo_label.new_zeros((0, 8)).float())
        continue

    conf_thresh = torch.tensor(self.thresh, device=pseudo_label.device).unsqueeze(
        0).repeat(len(pseudo_label), 1).gather(dim=1, index=(pseudo_label-1).unsqueeze(-1))

    valid_inds = pseudo_score > conf_thresh.squeeze()

    valid_inds = valid_inds * (pseudo_sem_score > self.sem_thresh[0])

    pseudo_sem_score = pseudo_sem_score[valid_inds]
    pseudo_box = pseudo_box[valid_inds]
    pseudo_label = pseudo_label[valid_inds]

    # if len(valid_inds) > max_box_num:
    #     _, inds = torch.sort(pseudo_score, descending=True)
    #     inds = inds[:max_box_num]
    #     pseudo_box = pseudo_box[inds]
    #     pseudo_label = pseudo_label[inds]

    pseudo_boxes.append(torch.cat([pseudo_box, pseudo_label.view(-1, 1).float()], dim=1))
    if pseudo_box.shape[0] > max_pseudo_box_num:
        max_pseudo_box_num = pseudo_box.shape[0]
    # pseudo_scores.append(pseudo_score)
    # pseudo_labels.append(pseudo_label)

So when the following code runs, pseudo_sem_score is always that same last value, regardless of i.

for i, ind in enumerate(unlabeled_mask):
  # statistics
  anchor_by_gt_overlap = iou3d_nms_utils.boxes_iou3d_gpu(
      batch_dict['gt_boxes'][ind, ...][:, 0:7],
      ori_unlabeled_boxes[i, :, 0:7])
  cls_pseudo = batch_dict['gt_boxes'][ind, ...][:, 7]
  unzero_inds = torch.nonzero(cls_pseudo).squeeze(1).long()
  cls_pseudo = cls_pseudo[unzero_inds]
  if len(unzero_inds) > 0:
      iou_max, asgn = anchor_by_gt_overlap[unzero_inds, :].max(dim=1)
      pseudo_ious.append(iou_max.unsqueeze(0))
      acc = (ori_unlabeled_boxes[i][:, 7].gather(dim=0, index=asgn) == cls_pseudo).float().mean()
      pseudo_accs.append(acc.unsqueeze(0))
      fg = (iou_max > 0.5).float().sum(dim=0, keepdim=True) / len(unzero_inds)

      sem_score_fg = (pseudo_sem_score[unzero_inds] * (iou_max > 0.5).float()).sum(dim=0, keepdim=True) \
                     / torch.clamp((iou_max > 0.5).float().sum(dim=0, keepdim=True), min=1.0)
      sem_score_bg = (pseudo_sem_score[unzero_inds] * (iou_max < 0.5).float()).sum(dim=0, keepdim=True) \
                     / torch.clamp((iou_max < 0.5).float().sum(dim=0, keepdim=True), min=1.0)

So I uncommented these lines in your code,

pseudo_scores.append(pseudo_score)
pseudo_labels.append(pseudo_label)

and added pseudo_sem_score = pseudo_sem_scores[i] before the line that computes sem_score_fg.

yezhen17 commented 3 years ago

Yes, if unlabeled batch size > 1, your modification is necessary.
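
The underlying pattern is a loop variable that outlives its loop. The sketch below uses plain Python lists with made-up scores instead of the real tensors. The thread does not show where pseudo_sem_scores is filled, so the append in the first loop is an assumption, and all names here are hypothetical.

```python
# Hypothetical per-sample scores standing in for pred_dicts[ind]['pred_sem_scores'].
per_sample_scores = {0: [0.9, 0.8], 1: [0.4], 2: [0.7, 0.6, 0.5]}
unlabeled_indices = [0, 1, 2]

# First loop: keep every sample's result instead of only the loop variable.
sem_scores_per_sample = []
for ind in unlabeled_indices:
    sem_score = per_sample_scores[ind]
    sem_scores_per_sample.append(sem_score)

# After the loop, `sem_score` still refers to the last sample only.
stale = sem_score

# Second loop: look the value up by position rather than reusing `sem_score`.
lengths = []
for i, ind in enumerate(unlabeled_indices):
    current = sem_scores_per_sample[i]
    lengths.append(len(current))
```

If the first loop skips a sample with continue, a placeholder still has to be appended for that sample, otherwise position i in the second loop no longer matches the sample it belongs to.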

yezhen17 commented 3 years ago

I see. I think you're right about batch_dict['mask']. You can also modify the collate_batch function here: https://github.com/THU17cyz/3DIoUMatch-PVRCNN/blob/1aa469fb7b0bdc22fc030f660f741e59a666160c/pcdet/datasets/kitti/kitti_dataset_ssl.py#L405.

I'll soon update the codebase to support arbitrary batch_size. Thank you very much for pointing these out!

JiangZhenW commented 2 years ago

Hello @chenxyyy, when I try to run the pretraining phase on KITTI, I get this error: scripts/slurm_pretrain.sh: line 26: srun: command not found. Could you give me some advice?

Looking forward to your reply.

yezhen17 commented 2 years ago

Are you sure you are running this script on a machine with a Slurm environment? For example, on GCP/AWS machines, or on clusters without Slurm installed, this script won't work.
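
One quick way to check this before launching is to see whether srun is on PATH. The helper below is a hypothetical sketch using only the Python standard library; has_slurm is not part of the repository.

```python
import shutil


def has_slurm():
    """Return True if the `srun` launcher can be found on PATH."""
    return shutil.which("srun") is not None


if has_slurm():
    print("srun found: the Slurm launch scripts should be usable here")
else:
    print("srun not found: this machine has no Slurm client on PATH")
```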