weiliu89 / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

How fast should training be? #593

Closed MSutt closed 5 months ago

MSutt commented 7 years ago

First of all, thanks for this amazing work.

Issue summary

I am training an SSD 512 on my own dataset, and the training time for 10 iterations is about 2 minutes, as shown below. This seems slow compared to training other networks. Is this a normal speed, or is something wrong? (I am training on a Tesla K80.) How fast should training be on the different models (300 / 512) and on different configurations (one / multiple GPUs)?

I0512 17:41:30.676178 14426 solver.cpp:243] Iteration 4440, loss = 1.44454
I0512 17:41:30.676414 14426 solver.cpp:259]     Train net output #0: mbox_loss = 1.08515 (* 1 = 1.08515 loss)
I0512 17:41:30.676455 14426 sgd_solver.cpp:138] Iteration 4440, lr = 0.001
I0512 17:43:20.057090 14426 solver.cpp:243] Iteration 4450, loss = 1.42478
I0512 17:43:20.061616 14426 solver.cpp:259]     Train net output #0: mbox_loss = 1.21771 (* 1 = 1.21771 loss)
I0512 17:43:20.061663 14426 sgd_solver.cpp:138] Iteration 4450, lr = 0.001
I0512 17:45:20.628170 14426 solver.cpp:243] Iteration 4460, loss = 1.38607
I0512 17:45:20.633569 14426 solver.cpp:259]     Train net output #0: mbox_loss = 1.63547 (* 1 = 1.63547 loss)

Steps to reproduce

Data

I created my LMDB using create_data.sh with max_dim=512 because I want my images to keep their aspect ratio.

Network

To train the network I used the finetune_ssd_pascal_512.py file that comes with the 07++12+COCO model. I removed 'mirror': True, and 'mean_value': [104, 117, 123], from both the train and test transform params. I also changed the resize mode to resize_mode: FIT_LARGE_SIZE_AND_PAD,. Of course I adapted all paths to my dataset and model, changed the number of classes, changed gpus = "0,1,2,3" to gpus = "0" because I have only one GPU, and modified the base LR.
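For reference, the edits above amount to something like the following in the training script. This is only a sketch based on the layout of the example SSD scripts (e.g. examples/ssd/ssd_pascal.py); the exact dict keys in finetune_ssd_pascal_512.py may differ.

from caffe import params as P
from caffe.proto import caffe_pb2

# Resize target used by the resize_param below.
resize_height = 512
resize_width = 512

train_transform_param = {
    # 'mirror': True,                  # removed: no horizontal flipping
    # 'mean_value': [104, 117, 123],   # removed: no mean subtraction
    'resize_param': {
        'prob': 1,
        # keep the aspect ratio and pad to 512x512 instead of warping
        'resize_mode': P.Resize.FIT_LARGE_SIZE_AND_PAD,
        'height': resize_height,
        'width': resize_width,
        'interp_mode': [
            P.Resize.LINEAR,
            P.Resize.AREA,
            P.Resize.NEAREST,
            P.Resize.CUBIC,
            P.Resize.LANCZOS4,
        ],
    },
    'emit_constraint': {
        'emit_type': caffe_pb2.EmitConstraint.CENTER,
    },
}

# Train on a single GPU instead of four.
gpus = "0"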

Your system configuration

I am training on a Google Compute Engine server with a Tesla K80.

Operating system: Ubuntu 14.04
Compiler: 4.8.4
CUDA version: release 7.0, V7.0.27
cuDNN version: 4.0.7
BLAS: ATLAS
Python version: 2.7

zgplvyou commented 7 years ago

Hi, I'm a new SSD learner, and I am training on my own data to detect only persons, but I got some errors when testing. Maybe I need to try fine-tuning the model, but I cannot find the finetune_ssd_pascal_512.py file, so I want to ask where it is. Thank you very much!

MSutt commented 7 years ago

You can find finetune_ssd_pascal_512.py in the 07++12+COCO model folder linked at the end of README.md.

weiliu89 commented 7 years ago

You could set debug_info to true in the script and check which layer takes a long time. It could be the annotated data layer as well.
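A minimal sketch of that change, assuming the solver_param dict used by the example SSD scripts:

# The solver settings are collected into a dict that is written out to
# solver.prototxt. With debug_info enabled, Caffe logs per-layer data/diff
# statistics on every iteration; the timestamps between successive log lines
# show which layer is slow. It is verbose, so turn it back off after profiling.
solver_param = {
    # ... keep the existing entries (base_lr, max_iter, snapshot, ...) ...
    'debug_info': True,
}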

weiliu89 commented 7 years ago

The K80 is slow as well, and you are only using one GPU, which could also be the reason. You could reduce the batch size, but you would have to tune the lr to get comparable performance.
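A rough sketch of the batch-size / learning-rate trade-off, using the batch_size / accum_batch_size / iter_size variables from the example SSD scripts; the linear scaling of the lr is a common starting heuristic, not a rule from this repo.

# Option 1: keep the effective batch size by accumulating gradients.
batch_size = 8                                # images loaded per iteration (was 32)
accum_batch_size = 32                         # effective batch size per weight update
iter_size = accum_batch_size // batch_size    # = 4 forward/backward passes per update

# Option 2: genuinely shrink the effective batch size, and as a starting
# heuristic scale the learning rate linearly with it, then tune.
default_batch_size = 32
default_base_lr = 0.001
new_batch_size = 8
base_lr = default_base_lr * new_batch_size / default_batch_size   # 0.00025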

MSutt commented 7 years ago

Thanks for the answer, I will activate debug_info to see which layer takes a long time. Can you give some examples of your training speed?

MSutt commented 7 years ago

After activating debug info, here is what happens during 10 iterations of my training. As before, 10 iterations take about 2 minutes, and the first step, reading the data, takes almost all of that time.
The second most time-consuming step is the mbox_priorbox layer (top blob), which takes about 6 seconds. That is not a big deal compared to the 2 minutes spent reading data.

I0519 11:00:58.664758  7715 solver.cpp:243] Iteration 12000, loss = 0.69167
I0519 11:00:58.664795  7715 solver.cpp:259]     Train net output #0: mbox_loss = 0.977279 (* 1 = 0.977279 loss)
I0519 11:00:58.664825  7715 sgd_solver.cpp:138] Iteration 12000, lr = 0.001
I0519 11:03:00.699906  7715 net.cpp:608]     [Forward] Layer data, top blob data data: 40.1579
I0519 11:03:00.700281  7715 net.cpp:608]     [Forward] Layer data, top blob label data: 3.86338
...
...
...
I0519 11:03:02.182437  7715 net.cpp:608]     [Forward] Layer mbox_conf, top blob mbox_conf data: 2.22728
I0519 11:03:02.185241  7715 net.cpp:608]     [Forward] Layer mbox_priorbox, top blob mbox_priorbox data: 0.326064
I0519 11:03:08.926136  7715 net.cpp:608]     [Forward] Layer mbox_loss, top blob mbox_loss data: 0.66974
I0519 11:03:09.444665  7715 net.cpp:636]     [Backward] Layer mbox_loss, bottom blob mbox_loc diff: 1.10698e-06
...
...
...
I0519 11:03:12.308850  7715 solver.cpp:243] Iteration 12010, loss = 0.479683
I0519 11:03:12.308941  7715 solver.cpp:259]     Train net output #0: mbox_loss = 0.66974 (* 1 = 0.66974 loss)
I0519 11:03:12.309013  7715 sgd_solver.cpp:138] Iteration 12010, lr = 0.001

Is this loading time normal? Can I reduce it?

Here is my data layer (I removed some batch samplers to train on just the original images):

layer {
  name: "data"
  type: "AnnotatedData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    resize_param {
      prob: 1.0
      resize_mode: FIT_LARGE_SIZE_AND_PAD
      height: 512
      width: 512
      interp_mode: LINEAR
      interp_mode: AREA
      interp_mode: NEAREST
      interp_mode: CUBIC
      interp_mode: LANCZOS4
    }
    emit_constraint {
      emit_type: CENTER
    }
    distort_param {
      brightness_prob: 0.5
      brightness_delta: 32.0
      contrast_prob: 0.5
      contrast_lower: 0.5
      contrast_upper: 1.5
      hue_prob: 0.5
      hue_delta: 18.0
      saturation_prob: 0.5
      saturation_lower: 0.5
      saturation_upper: 1.5
      random_order_prob: 0.0
    }
    expand_param {
      prob: 0.5
      max_expand_ratio: 4.0
    }
  }
  data_param {
    source: "examples/mydataset/mydataset_train_lmdb"
    batch_size: 8
    backend: LMDB
  }
  annotated_data_param {
    batch_sampler {
      max_sample: 1
      max_trials: 1
    }
    label_map_file: "data/mydataset/labelmap_voc.prototxt"
  }
}
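To put a number on the data layer by itself, a small pycaffe sketch along these lines can help (the prototxt path is a placeholder; since 'data' is the first layer, net.forward(end='data') runs only the AnnotatedData layer, so calling it in a loop drains the prefetch queue and approximates the real cost of decoding and augmenting one batch):

import time

import caffe

caffe.set_mode_gpu()
caffe.set_device(0)

# Placeholder path; point this at the train.prototxt generated by the script.
net = caffe.Net('train.prototxt', caffe.TRAIN)
net.forward()  # warm-up so lazy allocations do not skew the first measurement


def mean_forward_time(n, **kwargs):
    """Average wall time of n forward passes (kwargs are passed to net.forward)."""
    start = time.time()
    for _ in range(n):
        net.forward(**kwargs)
    return (time.time() - start) / n


# end='data' stops after the AnnotatedData layer, isolating data preparation.
print('data layer only: %.2f s/batch' % mean_forward_time(10, end='data'))
print('full forward   : %.2f s/batch' % mean_forward_time(10))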
mxmxlwlw commented 6 years ago

@weiliu89 Yeah, my training process is too slow as well. It can only run about 14000 epochs in one day. Is there any solution? Thank you!

IEEE-FELLOW commented 6 years ago

@mxmxlwlw Hi, I'm currently training on my custom dataset on a GTX 1060; it takes about 1 min to train 10 iterations. What is your GPU device?

mxmxlwlw commented 6 years ago

@IEEE-FELLOW Hahaha, you are lucky! My GPU is a GTX 1080. You just need to comment out the expand_param in the net prototxt. If your training images are big, it will slow you down!
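If the net prototxt is generated by the Python script rather than edited by hand, the equivalent change is to drop the 'expand_param' entry from train_transform_param. A sketch using the keys from the example SSD scripts, with the other entries elided:

train_transform_param = {
    'mirror': True,
    'mean_value': [104, 117, 123],
    # ... resize_param, distort_param, emit_constraint as before ...
    # 'expand_param': {             # dropped: zoom-out expansion pastes the image
    #     'prob': 0.5,              # onto a canvas up to 4x larger, which is very
    #     'max_expand_ratio': 4.0,  # expensive for large training images
    # },
}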

IEEE-FELLOW commented 6 years ago

@mxmxlwlw Wow! I just changed the code as you said, and it now takes 14 s to train 10 iterations! Currently the loss is around 3. My training images are 1920*1080 and the detection targets are very small. Do you have any idea how to detect small targets? Many thanks!

mxmxlwlw commented 6 years ago

@IEEE-FELLOW To detect small targets, you need to predict the boxes before too many pooling layers have been applied to the features, and the input size should be as big as you can afford. You can also use a kernel of stride 4 and size 7 in the first conv layer to reduce the computation.
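For the first point, a rough sketch of what that looks like in the example SSD scripts: add an earlier, higher-resolution layer to the list of feature maps that feed the multibox heads, and give it a matching smaller anchor scale. Layer names and sizes below are illustrative, not tuned defaults.

# Feature maps used to predict boxes; 'conv3_3' is added in front so that
# anchors also exist at a higher spatial resolution.
mbox_source_layers = ['conv3_3', 'conv4_3', 'fc7', 'conv6_2',
                      'conv7_2', 'conv8_2', 'conv9_2']

# One (min_size, max_size) pair per source layer; the new first entry must be
# small enough to match the tiny objects (values here are illustrative only).
min_sizes = [10, 20, 51, 133, 215, 296, 378]
max_sizes = [20, 51, 133, 215, 296, 378, 460]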

IEEE-FELLOW commented 6 years ago

@mxmxlwlw Thanks, I'll give it a try.