This family of models is the fastest: https://github.com/zetyquickly/DensePoseFnL/blob/master/configs/s0_bv2_bifpn_f64_s3x.yaml
You can measure speed without training. If you want to reproduce the results of the paper, you need to train it on DensePose COCO first.
Thank you @zetyquickly. I will compare the inference speed.
Hello @zetyquickly
I am using s0_bv2_bifpn_f64_s3x.yaml to kick-start training on the COCO dataset. But when I start training, it jumps directly to the evaluation step.
[03/17 03:44:00 d2.data.common]: Serialized dataset takes 444.87 MiB
[03/17 03:44:00 d2.data.build]: Using training sampler TrainingSampler
[03/17 03:44:02 fvcore.common.checkpoint]: No checkpoint found. Initializing model from scratch
[03/17 03:44:02 d2.engine.train_loop]: Starting training from iteration 0
[03/17 03:44:04 fvcore.common.checkpoint]: Saving checkpoint to ./output/model_final.pth
[03/17 03:44:05 d2.data.datasets.coco]: Loaded 1508 images in COCO format from datasets/coco/annotations/densepose_minival2014.json
[03/17 03:44:05 d2.data.build]: Distribution of instances among all 1 categories:

| category | #instances |
|:--------:|:----------:|
| person   | 5581       |

[03/17 03:44:05 d2.data.common]: Serializing 1508 elements to byte tensors and concatenating them all ...
[03/17 03:44:05 d2.data.common]: Serialized dataset takes 21.13 MiB
[03/17 03:44:07 d2.evaluation.evaluator]: Start inference on 1508 images
[03/17 03:44:59 d2.evaluation.evaluator]: Inference done 11/1508. 0.0421 s / img. ETA=1:50:37
[03/17 03:45:08 d2.evaluation.evaluator]: Inference done 13/1508. 0.0427 s / img. ETA=1:49:55
[03/17 03:45:14 d2.evaluation.evaluator]: Inference done 14/1508. 0.0428 s / img. ETA=1:54:23
Is there a bug in the s0_bv2_bifpn_f64_s3x.yaml config?
I tried using other configs, e.g. https://github.com/zetyquickly/DensePoseFnL/blob/master/configs/densepose_rcnn_mobilenetv3_rw_FPN_s1x.yaml, and training worked fine. I suspect that something is different in s0_bv2_bifpn_f64_s3x.yaml which is causing this issue.
I took a look at s0_bv2_bifpn_f64_s3x.yaml. MAX_ITER is set to 10, which looks like a typo/bug to me.
_BASE_: "s0_bv2_bifpn.yaml"
MODEL:
  FPN:
    OUT_CHANNELS: 64
  ROI_BOX_HEAD:
    CONV_DIM: 64
  ROI_SHARED_BLOCK:
    ASPP_DIM: 64
    CONV_HEAD_DIM: 64
SOLVER:
  MAX_ITER: 10
  STEPS: (330000, 370000)
TEST:
  EVAL_PERIOD: 390000
Changing MAX_ITER to 390000 fixed the issue.
@abhaydoke09 Yes, this option controls when training ends, and detectron2 by default runs the evaluation once training finishes, which here took only 10 iterations.
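If it helps, here is a minimal sanity-check sketch (an assumption on my side: PyYAML is installed and the path matches your checkout; it only inspects the child config, not the _BASE_ file) that flags a MAX_ITER smaller than the LR decay steps:

```python
# Minimal sanity check for a detectron2-style YAML config (a sketch, not part of
# the repo): warn when SOLVER.MAX_ITER is smaller than the LR decay steps.
import ast
import yaml  # PyYAML

with open("configs/s0_bv2_bifpn_f64_s3x.yaml") as f:
    cfg = yaml.safe_load(f)

max_iter = cfg["SOLVER"]["MAX_ITER"]
# STEPS is written tuple-style, e.g. "(330000, 370000)", so parse it explicitly
steps = ast.literal_eval(str(cfg["SOLVER"]["STEPS"]))

if max_iter < max(steps):
    print(f"Suspicious: SOLVER.MAX_ITER={max_iter} is below the LR steps {steps}")
```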
@zetyquickly Can you please let us know how you measured the FPS of s0_bv2_bifpn_f64_s3x.yaml and densepose_rcnn_R_50_FPN_s1x.yaml models? In your paper, s0_bv2_bifpn_f64_s3x.yaml achieves 23.55 FPS as compared to densepose_rcnn_R_50_FPN_s1x.yaml model's 13.16 FPS.
I am looking for a way to accurately measure and compare FPS of these 2 models. When I ran the evaluation on COCO dataset on Tesla V100 GPU, densepose_rcnn_R_50_FPN_s1x.yaml inference time was better than the s0_bv2_bifpn_f64_s3x.yaml.
Can you please let me know how you exactly computed/measured FPS of these models?
Thank you in advance.
The results were obtained on a 1080 Ti. FPS is measured by dividing 1 by the pure compute time that the detectron2 evaluator reports when run on the whole evaluation set.
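For concreteness, the computation is just this (a sketch; the 0.0425 s/img below is an illustrative value, take the real one from the evaluator's "Total inference pure compute time" line):

```python
# FPS from the detectron2 evaluator's reported pure compute time
# (a sketch; 0.0425 s/img is an illustrative value, not a measured result).
pure_compute_time_per_img = 0.0425  # seconds per image, from the evaluator log
fps = 1.0 / pure_compute_time_per_img
print(f"{fps:.2f} FPS")  # -> 23.53
```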
Can you please add the script which does this evaluation? That will provide a deterministic way of reproducing FPS results from the paper.
I believe there might be ambiguities in the exact results, but you should obtain results close to ours on your machine with a 1080 Ti if you follow all the hints listed in the paper. I should note that there's no need to modify anything except the configs.
You need to use the pure compute time, as mentioned above. Also refer to Section 4.4 of our paper and run the evaluation using the following command (with the additional flags):
python train_net.py --config-file configs/s0_bv2_bifpn_f64_s3x.yaml --eval-only MODEL.WEIGHTS model.pth MODEL.RPN.POST_NMS_TOPK_TEST 100 MODEL.ROI_HEADS.NMS_THRESH_TEST 0.3
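In case it is unclear, here is a rough sketch of what those trailing flags mean for a detectron2 config (shown on detectron2's default config for illustration; loading the fork's YAML itself also requires its custom config keys):

```python
# What the extra command-line overrides correspond to; train_net.py presumably
# applies them the usual detectron2 way, i.e. cfg.merge_from_list(args.opts).
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_list([
    "MODEL.RPN.POST_NMS_TOPK_TEST", "100",     # keep only 100 proposals after RPN NMS
    "MODEL.ROI_HEADS.NMS_THRESH_TEST", "0.3",  # stricter box NMS at test time
])
print(cfg.MODEL.RPN.POST_NMS_TOPK_TEST, cfg.MODEL.ROI_HEADS.NMS_THRESH_TEST)
```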
Feel free to ask further questions on that topic
@abhaydoke09 Were you able to train the configs/s0_bv2_bifpn_f64_s3x.yaml model from scratch? Could you let me know the loss values you got for it?
Edit:
@zetyquickly What loss values should I expect during training when starting from scratch on the DensePose COCO dataset? Could you please let me know? I am getting something like this:
eta: 2 days, 5:33:07 iter: 19 total_loss: 25.537 loss_cls: 0.648 loss_box_reg: 0.163 loss_densepose_U: 8.298 loss_densepose_V: 9.309 loss_densepose_I: 0.973 loss_densepose_S: 5.418 loss_rpn_cls: 0.693 loss_rpn_loc: 0.025 time: 0.4841 data_time: 0.0519 lr: 0.000293 max_mem: 10681M
.
.
.
eta: 2 days, 3:59:03 iter: 6599 total_loss: 25.650 loss_cls: 0.651 loss_box_reg: 0.238 loss_densepose_U: 8.273 loss_densepose_V: 8.976 loss_densepose_I: 0.973 loss_densepose_S: 5.418 loss_rpn_cls: 0.693 loss_rpn_loc: 0.022 time: 0.4836 data_time: 0.0247 lr: 0.002500 max_mem: 10684M
I am training on a single GPU (V100) with batch size = 4 and base LR = 0.0025, and the model does not seem to be learning at all. Thanks in advance.
Hi @nihal1294
I am training on a single GPU (V100) with batch size = 4 and base LR = 0.0025 and the model does not seem to be learning at all.
Regarding this, I'd say you've chosen the correct LR relative to the batch size. It might be a problem with the data: have you downloaded COCO + DensePose COCO and created symlinks to them? It might also be a problem with how you call the train script: how do you start training? And it is also likely that 3 hours is simply not enough for the model to learn; I'd propose finding an environment with 4 GPUs, or increasing the LR as an experiment.
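For the LR/batch-size relation, the usual linear scaling rule looks like this (a sketch; the 16-image / 0.01 reference values are the ones mentioned in this thread, not something I re-checked in the config):

```python
# Linear learning-rate scaling rule commonly used with detectron2 (a sketch;
# the 16 images / 0.01 LR reference values come from this thread).
reference_batch_size = 16
reference_base_lr = 0.01

my_batch_size = 4
scaled_lr = reference_base_lr * my_batch_size / reference_batch_size
print(scaled_lr)  # 0.0025, i.e. the LR used with batch size 4 above
```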
@zetyquickly thanks for the response.
I was trying to train with --qat enabled. When I tried quantization-aware training, the model was not learning at all, even after 24 hours. So I am currently training without the --qat flag and the model is learning, but even after 400k iterations it has not learned well. I am using the DensePose COCO dataset itself. This is the network I am trying to train: configs/s0_bv2_bifpn_f64_s3x.yaml
I am currently training with a batch size of 2 and LR = 0.0025, and the total_loss has dropped to 1.8-2.0. I am hoping that with more iterations the model will learn better.
I will not be able to get 4 GPUs to test training with.
Oh, I see. We perform QAT only to fine-tune an already trained network. So I recommend first training the non-quantized version of s0_bv2_bifpn_f64_s3x.
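Roughly, the intended flow is "train the float model first, then use QAT only as a fine-tuning pass". A generic PyTorch sketch of that idea (this is NOT the repo's --qat code path, just an illustration; the model and float_model.pth checkpoint name are placeholders):

```python
# Generic eager-mode PyTorch QAT sketch: float training first, QAT as fine-tuning.
# NOT DensePoseFnL's --qat path; the model and checkpoint name are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)

# 1) Assume the float model has already been trained; load its weights here, e.g.
#    model.load_state_dict(torch.load("float_model.pth"))  # hypothetical checkpoint

# 2) Insert fake-quantization observers and fine-tune for a short schedule
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)
# ... short fine-tuning loop on the training data goes here ...

# 3) Convert to an actually quantized model for deployment
model.eval()
quantized_model = torch.quantization.convert(model)
```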
Oh ok, then I will do that. Also, could you let me know how long it normally takes for the model to train well with batch_size = 16 and LR = 0.01? In my case it should take longer than usual because of the much lower batch size and LR.
I don't remember exactly, but on 4-8 GPUs it took less than 8 hours.
Hello,
I am trying to find the model with the 2x speed improvement. As reported here: https://github.com/zetyquickly/DensePoseFnL/blob/master/doc/OUR_MODEL_ZOO.md, it looks like the ResNet-50 backbone has the best FPS (13.16).
Does that mean changing the backbone to MobileNet or EfficientNet does not improve the inference latency of the DensePose model?