multi gpus - Githubissues

panyuetj commented 6 years ago

I tried to train with --gpus=0,1,2,3, but got this error:

    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator rois: Shape inconsistent, Provided=(1,3), inferred shape=(4,3)

How should I fix the parameters?

solin319 commented 6 years ago

In the file maskrcnn_train_end2end.py, change the max shape code to:

    max_data_shape = [('data', (1, 3, max([v[0] for v in config.SCALES]), max([v[1] for v in config.SCALES])))]
    max_data_shape, max_label_shape = train_data.infer_shape(max_data_shape)
    max_data_shape.append(('gt_boxes', (1, 100, 5)))
    max_data_shape.append(('gt_masks', (1, 100, max([v[0] for v in config.SCALES]), max([v[1] for v in config.SCALES]))))
    max_data_shape.append(('im_info', (1, train_data.provide_data_single[1][1][1])))
    logger.info('providing maximum shape %s %s' % (max_data_shape, max_label_shape))
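
As a rough illustration of what the max(...) expressions in these lines compute, the sketch below evaluates them on a made-up SCALES list. The SCALES values and the build_max_shapes helper are assumptions for illustration only, not taken from the repository's config, and the im_info entry is left out because it depends on the data loader.

```python
# Hypothetical stand-in for config.SCALES: a list of (size, size) pairs.
SCALES = [(600, 1000), (800, 1333)]


def build_max_shapes(scales, max_gt=100):
    """Mirror the max-shape expressions from the snippet above.

    The leading 1 in each shape is the batch dimension used in that snippet.
    """
    max_0 = max(v[0] for v in scales)  # largest first entry over all scales
    max_1 = max(v[1] for v in scales)  # largest second entry over all scales
    return [
        ('data', (1, 3, max_0, max_1)),
        ('gt_boxes', (1, max_gt, 5)),
        ('gt_masks', (1, max_gt, max_0, max_1)),
    ]


shapes = build_max_shapes(SCALES)
for name, shape in shapes:
    print(name, shape)
```

Note that the maxima are taken independently per dimension, so the resulting shape is large enough for every entry in the list.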

panyuetj commented 6 years ago

@solin319 new problem:

  File "maskrcnn_train_end2end.py", line 203, in <module>
    main()
  File "maskrcnn_train_end2end.py", line 200, in main
    lr=args.lr, lr_step=args.lr_step)
  File "maskrcnn_train_end2end.py", line 162, in train_net
    arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
  File "/home/pp/code/maskrcnn.mxnet/rcnn/core/module.py", line 955, in fit
    self.update()
  File "/home/pp/code/maskrcnn.mxnet/rcnn/core/module.py", line 1037, in update
    self._curr_module.update()
  File "/home/pp/code/maskrcnn.mxnet/rcnn/core/module.py", line 573, in update
    self._kvstore)
TypeError: _update_params_on_kvstore() takes exactly 4 arguments (3 given)

solin319 commented 6 years ago

The interface of _update_params_on_kvstore changed in MXNet 0.11. Pass self._exec_group.param_names as an additional argument when calling it:

    def update(self):
        """Updates parameters according to the installed optimizer and the gradients computed
        in the previous forward-backward batch.
        See Also
        ----------
        :meth:`BaseModule.update`.
        """
        assert self.binded and self.params_initialized and self.optimizer_initialized

        self._params_dirty = True
        if self._update_on_kvstore:
            _update_params_on_kvstore(self._exec_group.param_arrays,
                                      self._exec_group.grad_arrays,
                                      self._kvstore, self._exec_group.param_names)
        else:
            _update_params(self._exec_group.param_arrays,
                           self._exec_group.grad_arrays,
                           updater=self._updater,
                           num_device=len(self._context),
                           kvstore=self._kvstore,
                           param_names=self._exec_group.param_names)
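
If the same module.py has to run against both the older and the newer interface, one possible approach is to inspect the function's signature before calling it. This is only a sketch: call_update_on_kvstore, old_update and new_update are hypothetical stand-ins, not MXNet functions, and inspect.signature requires Python 3.

```python
import inspect


def call_update_on_kvstore(update_fn, param_arrays, grad_arrays, kvstore, param_names):
    """Call update_fn with param_names only if its signature accepts a 4th argument."""
    n_params = len(inspect.signature(update_fn).parameters)
    if n_params >= 4:
        return update_fn(param_arrays, grad_arrays, kvstore, param_names)
    return update_fn(param_arrays, grad_arrays, kvstore)


# Hypothetical stand-ins for the 3-argument and 4-argument interfaces.
def old_update(param_arrays, grad_arrays, kvstore):
    return 'old'


def new_update(param_arrays, grad_arrays, kvstore, param_names):
    return 'new'


print(call_update_on_kvstore(old_update, [], [], None, []))
print(call_update_on_kvstore(new_update, [], [], None, []))
```

When only one MXNet version is targeted, the direct call shown above is simpler.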

panyuetj commented 6 years ago

@solin319 Thanks!

I got another error:

  File "/home/pp/training_scripts/maskrcnn.mxnet/rcnn/io/rpn.py", line 149, in assign_anchor
    gt_argmax_overlaps = overlaps.argmax(axis=0)
ValueError: attempt to get argmax of an empty sequence

It seems that, for images of unusual size, the anchors all fall outside the image.
I was unable to remove the unusual images because I lack the information in instances_train2014.json. Do you have any ideas about this? Can we modify the training code to avoid the problem?

solin319 commented 6 years ago

I ran into the same problem and have no solution at this time.

panyuetj commented 6 years ago

@solin319 In the file rcnn/io/rpn.py:

    # only keep anchors inside the image
    inds_inside = np.where((all_anchors[:, 0] >= -allowed_border) &
                           (all_anchors[:, 1] >= -allowed_border) &
                           (all_anchors[:, 2] < im_info[1] + allowed_border) &
                           (all_anchors[:, 3] < im_info[0] + allowed_border))[0]

The default value of allowed_border is zero.
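
To see how this filter can lead to the argmax error, here is a small NumPy sketch with toy anchors. The anchors_inside helper, the anchor coordinates and the image size are made up for illustration. The sketch assumes anchors are stored as (x1, y1, x2, y2) and that im_info[0] is the image height and im_info[1] the width, as the indexing in the snippet above suggests.

```python
import numpy as np


def anchors_inside(all_anchors, im_height, im_width, allowed_border=0):
    """Indices of anchors that lie within the image extended by allowed_border."""
    return np.where((all_anchors[:, 0] >= -allowed_border) &
                    (all_anchors[:, 1] >= -allowed_border) &
                    (all_anchors[:, 2] < im_width + allowed_border) &
                    (all_anchors[:, 3] < im_height + allowed_border))[0]


# Toy anchors for a hypothetical narrow image (height 100, width 40):
# every anchor sticks out past the left edge.
all_anchors = np.array([[-20, 10, 59, 89],
                        [-30, -10, 69, 109]], dtype=np.float32)

strict = anchors_inside(all_anchors, 100, 40, allowed_border=0)
relaxed = anchors_inside(all_anchors, 100, 40, allowed_border=50)
print(len(strict), len(relaxed))

# With no anchors kept, the overlap matrix has zero rows and argmax fails.
overlaps = np.zeros((len(strict), 1))  # shape: (anchors kept, ground-truth boxes)
raised_on_empty = False
try:
    overlaps.argmax(axis=0)
except ValueError as err:
    raised_on_empty = True
    print('ValueError:', err)
```

A larger allowed_border keeps anchors that extend beyond the image, so the overlap matrix is no longer empty. Whether such anchors affect training quality is not covered in this thread.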

I added the allowed_border parameter to the AnchorLoader call in maskrcnn_train_end2end.py:

    train_data = AnchorLoader(feat_sym, sdsdb, batch_size=input_batch_size, shuffle=not args.no_shuffle,
                              ctx=ctx, work_load_list=args.work_load_list,
                              feat_stride=config.RPN_FEAT_STRIDE, anchor_scales=config.ANCHOR_SCALES,
                              anchor_ratios=config.ANCHOR_RATIOS,
                              aspect_grouping=config.TRAIN.ASPECT_GROUPING, allowed_border=50)

It works well so far.

xilaili / maskrcnn.mxnet

multi gpus #3