Question about NUM_CLASSES

bbangbin2780 commented 3 years ago

I have a question about training on a Korean dataset.

I followed these steps:

Write the config file
Register the dataset (my dataset name is AISL dataset)

Then I started training with the command below:

$ python tools/train_net.py --num-gpus 4 --config-file

Below is the config file (I only changed the dataset name from the Total-Text config file):

_BASE_: "./Base-RCNN-FPN.yaml"
MODEL:
  MASK_ON: True
  TEXTFUSENET_MUTIL_PATH_FUSE_ON: True
  WEIGHTS: "./out_dir_r101/totaltext_model/model_tt_r101.pth"
  PIXEL_STD: [57.375, 57.120, 58.395]
  RESNETS:
    STRIDE_IN_1X1: False  # this is a C2 model
    NUM_GROUPS: 32
    WIDTH_PER_GROUP: 8
    DEPTH: 101
  ROI_HEADS:
    NMS_THRESH_TEST: 0.4
  TEXTFUSENET_SEG_HEAD:
    FPN_FEATURES_FUSED_LEVEL: 1
    POOLER_SCALES: (0.125,)

DATASETS:
  TRAIN: ("AISLText",)
  TEST: ("AISLText",)
SOLVER:
  IMS_PER_BATCH: 8
  BASE_LR: 0.001
  STEPS: (40000,80000,)
  MAX_ITER: 120000
  CHECKPOINT_PERIOD: 2500

INPUT:
  MIN_SIZE_TRAIN: (800,1000,1200)
  MAX_SIZE_TRAIN: 1500
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1333

OUTPUT_DIR: "./out_dir_r101/at_model/"

I registered the dataset with register_coco_instances in detectron2/data/datasets/builtin.py:

image_path = "/home/ensa/JYB/TextFuseNet/datasets/AISLText/train_images"
json_path = "/home/ensa/JYB/TextFuseNet/datasets/AISLText/trainval.json"
register_coco_instances("AISLText", {},json_path, image_path)

An error occurs during training:

[01/19 18:35:50 d2.data.datasets.coco]: Loaded 3 images in COCO format from /home/ensa/JYB/TextFuseNet/datasets/AISLText/trainval.json
[01/19 18:35:50 d2.data.build]: Removed 0 images with no usable annotations. 3 images left.
[01/19 18:35:50 d2.data.build]: Distribution of training instances among all 31 categories:
|  category  | #instances   |  category  | #instances   |  category  | #instances   |
|:----------:|:-------------|:----------:|:-------------|:----------:|:-------------|
|     -      | 2            |     0      | 2            |     1      | 2            |
|     3      | 3            |     5      | 1            |     7      | 2            |
|     A      | 2            |     B      | 2            |     E      | 4            |
|     K      | 2            |     L      | 2            |     R      | 1            |
|     a      | 1            |     b      | 1            |     c      | 1            |
|     e      | 2            |     i      | 1            |     m      | 1            |
|     o      | 2            |     r      | 3            |     t      | 1            |
|    text    | 7            |     u      | 1            |     y      | 1            |
|     강      | 1            |     료      | 1            |     실      | 3            |
|     의      | 1            |     자      | 1            |     장      | 1            |
|     화      | 1            |            |              |            |              |
|   total    | 56           |            |              |            |              |
[01/19 18:35:50 d2.data.detection_utils]: TransformGens used in training: [ResizeShortestEdge(short_edge_length=(800, 1000, 1200), max_size=1500, sample_style='choice'), RandomFlip(), RandomContrast(intensity_min=0.5, intensity_max=1.5), RandomBrightness(intensity_min=0.5, intensity_max=1.5), RandomSaturation(intensity_min=0.5, intensity_max=1.5), RandomLighting(scale=1.1931034212737668)]
[01/19 18:35:50 d2.data.build]: Using training sampler TrainingSampler
[01/19 18:35:51 fvcore.common.checkpoint]: Loading checkpoint from ./out_dir_r101/totaltext_model/model_tt_r101.pth
[01/19 18:35:51 d2.engine.train_loop]: Starting training from iteration 0
[01/19 18:35:53 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
Traceback (most recent call last):
  File "tools/train_net.py", line 161, in <module>
    args=(args,),
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/launch.py", line 84, in _distributed_worker
    main_func(*args)
  File "/home/ensa/JYB/TextFuseNet/tools/train_net.py", line 149, in main
    return trainer.train()
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/defaults.py", line 356, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/meta_arch/rcnn.py", line 88, in forward
    _, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/roi_heads.py", line 584, in forward
    losses.update(self._forward_mask(features_list, proposals, targets))
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/roi_heads.py", line 684, in _forward_mask
    mask_features = self.mutil_path_fuse_module(mask_features, global_context, proposals)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/mutil_path_fuse_module.py", line 110, in forward
    feature_fuse = char_context + x + global_context
RuntimeError: The size of tensor a (19) must match the size of tensor b (145) at non-singleton dimension 0

To check whether training works at all, I tested with only 3 images, and this error occurred.

I compared your sample COCO-format annotations with mine, and the format is the same.

I need to train on at least 1000 characters. Is this error related to the number of characters, or to the input size?

Thank you for reading. Please help.

Real-YeJ commented 3 years ago

@bbangbin2780 It seems that the sizes of char_context, x, and global_context are not equal. This implementation only trains with a batch size of 4 on 4 GPUs. Our 64 classes are text, 0-9, a-z, A-Z, and background.
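
The class count in this reply can be checked with a few lines of Python. This is only an illustrative sketch: the list name FOREGROUND_CLASSES and the ordering of its entries are assumptions and are not taken from the TextFuseNet source.

```python
import string

# Foreground classes as described in the reply: the word-level "text" class
# plus digits, lowercase letters, and uppercase letters.
# The ordering here is an assumption made for illustration.
FOREGROUND_CLASSES = (
    ["text"]
    + list(string.digits)
    + list(string.ascii_lowercase)
    + list(string.ascii_uppercase)
)

num_foreground = len(FOREGROUND_CLASSES)   # 1 + 10 + 26 + 26
num_with_background = num_foreground + 1   # plus the background class

print(num_foreground, num_with_background)  # prints "63 64"
```

Note that the AISLText log earlier in the thread lists 31 categories, including "-" and Korean syllables, which are not part of this 63-class set.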

bbangbin2780 commented 3 years ago

Thank you for your answer.

I modified my config file (batch size 4).

Then the error below occurred:

Traceback (most recent call last):
  File "tools/train_net.py", line 161, in <module>
    args=(args,),
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/launch.py", line 84, in _distributed_worker
    main_func(*args)
  File "/home/ensa/JYB/TextFuseNet/tools/train_net.py", line 149, in main
    return trainer.train()
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/defaults.py", line 356, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/meta_arch/rcnn.py", line 88, in forward
    _, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/roi_heads.py", line 581, in forward
    losses = self._forward_box(features_list, proposals)
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/roi_heads.py", line 650, in _forward_box
    return outputs.losses()
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/fast_rcnn.py", line 267, in losses
    "loss_box_reg": self.smooth_l1_loss(),
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/fast_rcnn.py", line 209, in smooth_l1_loss
    self.proposals.tensor, self.gt_boxes.tensor
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/box_regression.py", line 66, in get_deltas
    assert (src_widths > 0).all().item(), "Input boxes to Box2BoxTransform are not valid!"
RuntimeError: CUDA error: device-side assert triggered

Does this error occur when the number of classes is greater than 64?

I want to train on at least 1000 characters.

Thanks
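
CUDA errors are reported asynchronously, so the Python line shown in a device-side assert traceback is not always where the problem started. One hedged way to narrow it down is to check the annotation file for two common culprits: boxes with non-positive width or height, and annotations whose category_id is not declared under categories. The sketch below assumes the standard COCO layout (bbox as [x, y, width, height]); the function name find_bad_annotations and the sample data are hypothetical and are not part of TextFuseNet or detectron2.

```python
def find_bad_annotations(coco):
    """Return (annotation id, reason) pairs for suspicious COCO annotations."""
    known_ids = {cat["id"] for cat in coco.get("categories", [])}
    problems = []
    for ann in coco.get("annotations", []):
        # COCO boxes are stored as [x, y, width, height].
        _, _, w, h = ann["bbox"]
        if w <= 0 or h <= 0:
            problems.append((ann["id"], "non-positive box size"))
        if ann["category_id"] not in known_ids:
            problems.append((ann["id"], "unknown category_id"))
    return problems


# Hypothetical sample data standing in for a real annotation file.
sample = {
    "categories": [{"id": 1, "name": "text"}],
    "annotations": [
        {"id": 10, "category_id": 1, "bbox": [5, 5, 20, 10]},   # fine
        {"id": 11, "category_id": 1, "bbox": [5, 5, 0, 10]},    # zero width
        {"id": 12, "category_id": 99, "bbox": [0, 0, 8, 8]},    # undeclared class
    ],
}

for ann_id, reason in find_bad_annotations(sample):
    print(ann_id, reason)
```

Setting the environment variable CUDA_LAUNCH_BLOCKING=1 before launching training makes kernel launches synchronous, which usually moves the traceback closer to the operation that actually failed.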

Real-YeJ commented 3 years ago

@bbangbin2780 IMS_PER_BATCH should be set to 4 when using 4 GPUs. If you set more classes, the pred_branches in our model will be skipped when training on your custom dataset.
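
Assuming the stock detectron2 meaning of SOLVER.IMS_PER_BATCH (the total number of images per iteration across all GPUs), the difference between the two settings comes down to images per GPU. The helper below is a hypothetical illustration of that arithmetic, not code from the repository.

```python
def images_per_gpu(ims_per_batch, num_gpus):
    """Split a total batch size evenly across GPUs."""
    if ims_per_batch % num_gpus != 0:
        raise ValueError("IMS_PER_BATCH must be divisible by the number of GPUs")
    return ims_per_batch // num_gpus


print(images_per_gpu(8, 4))  # original config: 2 images per GPU
print(images_per_gpu(4, 4))  # suggested config: 1 image per GPU
```

Under that assumption, the suggested setting puts exactly one image on each GPU, whereas the original config put two.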

bbangbin2780 commented 3 years ago

I have four GPUs (4 TITAN RTX).

That error occurs when the number of classes is over 64.

Real-YeJ commented 3 years ago

@bbangbin2780 If you change the number of classes, several configs in detectron2/data/datasets/builtin.py need to be modified as well.
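
Before changing any class-related settings, it can help to confirm how many categories the annotation file actually declares. The following is a minimal sketch using only the standard library and the standard COCO categories field; the helper name summarize_categories and the inline example are hypothetical. In stock detectron2 the box head's class count is set by MODEL.ROI_HEADS.NUM_CLASSES; which further settings the TextFuseNet fork needs is not confirmed here.

```python
import json


def summarize_categories(coco):
    """Return (number of categories, sorted category ids) for a COCO-style dict."""
    ids = sorted(cat["id"] for cat in coco.get("categories", []))
    return len(ids), ids


# Tiny inline example standing in for a real annotation file.
example = json.loads("""
{
  "categories": [
    {"id": 1, "name": "text"},
    {"id": 2, "name": "0"},
    {"id": 3, "name": "강"}
  ],
  "images": [],
  "annotations": []
}
""")

count, ids = summarize_categories(example)
print(count, ids)

# For a real file (the path is a placeholder):
# with open("path/to/trainval.json", encoding="utf-8") as f:
#     print(summarize_categories(json.load(f)))
```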

bbangbin2780 commented 3 years ago

> @bbangbin2780 If you change the number of classes, several configs in detectron2/data/datasets/builtin.py need to be modified as well.

Why is pred_branches in your model skipped if I set the number of classes to more than 63?

I read your paper again, but I don't understand why pred_branches is skipped.

ducthinh14091999 commented 2 years ago

@Real-YeJ I have the same problem. I have updated detectron2/data/datasets/builtin.py, and this is my config:

_BASE_: "./Base-RCNN-FPN.yaml"
MODEL:
  MASK_ON: True
  TEXTFUSENET_MUTIL_PATH_FUSE_ON: True
  WEIGHTS: ""
  PIXEL_STD: [57.375, 57.120, 58.395]
  RESNETS:
    STRIDE_IN_1X1: False  # this is a C2 model
    NUM_GROUPS: 32
    WIDTH_PER_GROUP: 8
    DEPTH: 50
  ROI_HEADS:
    NMS_THRESH_TEST: 0.3
  TEXTFUSENET_SEG_HEAD:
    FPN_FEATURES_FUSED_LEVEL: 2
    POOLER_SCALES: (0.0625,)
DATASETS:
  TRAIN: ("mydataset",)
  TEST: ("mydataset",)
SOLVER:
  IMS_PER_BATCH: 1
  BASE_LR: 0.001
  STEPS: (40000,80000,)
  MAX_ITER: 120000
  CHECKPOINT_PERIOD: 2500
INPUT:
  MIN_SIZE_TRAIN: (800,1000,1200)
  MAX_SIZE_TRAIN: 1500
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1500

OUTPUT_DIR: "./out_dir_r101/icdar2013_model/"

My command line is:

python train_net.py --num-gpus 1 --config-file configs/ocr/icdar2013_101_FPN.yaml

In detectron2/data/datasets/builtin.py, I added one more key to the dict PREDEFINED_SPLITS_COCO["coco"]:

"mydataset":("F:/project_2/New_folder/data/downloads", "F:/project_2/New_folder/data/downloads/train.json")

But I still get the error below:

  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/box_regression.py", line 66, in get_deltas
    assert (src_widths > 0).all().item(), "Input boxes to Box2BoxTransform are not valid!"
RuntimeError: CUDA error: device-side assert triggered
