xw-hu / SINet

IEEE Transactions on Intelligent Transportation Systems (TITS), 2019

Did my training converge? #2

Closed skywalker9096 closed 6 years ago

skywalker9096 commented 6 years ago

I ran the training phase on KITTI according to the instructions and got the caffemodel at iteration 75000. The training loss dropped from 6.0 to below 1.0 and then fluctuated between 0 and 1. It's weird that the confidence scores of all the proposals seem nonsensical. What's wrong?

In proposals/SINet_KITTI_result.txt:

1,1180.4,315.91,61.596,59.091,0.023928
1,1139,315.91,103,59.091,0.023928
1,1108,315.91,112.54,59.091,0.023928
1,1076.9,315.91,112.54,59.091,0.023928
1,1045.9,315.91,112.54,59.091,0.023928
1,1014.8,315.91,112.54,59.091,0.023928
1,983.75,315.91,112.54,59.091,0.023928
1,952.7,315.91,112.54,59.091,0.023928
1,921.65,315.91,112.54,59.091,0.023928
1,890.6,315.91,112.54,59.091,0.023928
1,859.55,315.91,112.54,59.091,0.023928
1,828.5,315.91,112.54,59.091,0.023928
1,797.45,315.91,112.54,59.091,0.023928
1,766.4,315.91,112.54,59.091,0.023928
1,735.35,315.91,112.54,59.091,0.023928
1,704.3,315.91,112.54,59.091,0.023928
1,673.25,315.91,112.54,59.091,0.023928
1,642.2,315.91,112.54,59.091,0.023928
1,611.15,315.91,112.54,59.091,0.023928
1,580.1,315.91,112.54,59.091,0.023928
...
3769,18.113,199.22,217.35,54.688,90.192
3769,1073.8,188.8,168.19,54.688,90.192
3769,991.01,188.8,217.35,54.688,90.192
3769,939.26,188.8,217.35,54.688,90.192
3769,887.51,188.8,217.35,54.688,90.192
3769,835.76,188.8,217.35,54.688,90.192
3769,784.01,188.8,217.35,54.688,90.192
3769,732.26,188.8,217.35,54.688,90.192
3769,680.51,188.8,217.35,54.688,90.192
3769,628.76,188.8,217.35,54.688,90.192
3769,577.01,188.8,217.35,54.688,90.192
3769,525.26,188.8,217.35,54.688,90.192
3769,473.51,188.8,217.35,54.688,90.192
3769,421.76,188.8,217.35,54.688,90.192
3769,370.01,188.8,217.35,54.688,90.192
3769,318.26,188.8,217.35,54.688,90.192
3769,266.51,188.8,217.35,54.688,90.192

xw-hu commented 6 years ago

How about the detection results?

skywalker9096 commented 6 years ago

Hello @xw-hu, the detection results are similar. In detections/SINet_KITTI_result_car.txt:

1,1139.4,305.22,102.59,69.783,1.2737e-05
1,1077.3,305.22,120.09,69.783,1.2737e-05
1,1015.2,305.22,120.09,69.783,1.2737e-05
1,953.15,305.22,120.09,69.783,1.2737e-05
1,891.05,305.22,120.09,69.783,1.2737e-05
1,828.95,305.22,120.09,69.783,1.2737e-05
1,766.85,305.22,120.09,69.783,1.2737e-05
1,704.75,305.22,120.09,69.783,1.2737e-05
1,642.65,305.22,120.09,69.783,1.2737e-05
1,580.55,305.22,120.09,69.783,1.2737e-05
1,518.45,305.22,120.09,69.783,1.2737e-05
...
3769,1035.8,185.7,206.2,145.18,1.0966e-21
3769,799.76,185.7,369.53,145.18,1.0966e-21
3769,15.675,187.43,369.53,145.18,1.0966e-21
3769,592.76,175.28,369.53,145.18,1.0966e-21
3769,437.51,175.28,369.53,145.18,1.0966e-21
3769,282.26,175.28,369.53,145.18,1.0966e-21
3769,127.01,175.28,369.53,145.18,1.0966e-21
3769,939.49,156.18,302.51,145.18,1.0966e-21
3769,696.26,154.45,369.53,145.18,1.0966e-21
3769,541.01,144.03,369.53,145.18,1.0966e-21
3769,385.76,144.03,369.53,145.18,1.0966e-21
3769,230.51,144.03,369.53,145.18,1.0966e-21
3769,75.263,144.03,369.53,145.18,1.0966e-21
3769,799.76,133.61,369.53,145.18,1.0966e-21
3769,1042,128.41,200.05,145.18,1.0966e-21
3769,639.34,128.41,369.53,145.18,1.0966e-21

The confidence scores are too low (1e-5 or lower).
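A quick way to confirm the symptom across a whole result file is to check whether every box of an image carries the same score. This is only a minimal sketch: it assumes each row is `image_id,x,y,w,h,score` (the column meaning is inferred from the dumps above, not confirmed by the repo), and the function names are made up for illustration.

```python
from collections import defaultdict

def scores_per_image(lines):
    """Collect the last column (assumed: confidence) per first column (assumed: image id)."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 6:
            continue  # skips blank lines and "..." placeholders
        groups[int(fields[0])].append(float(fields[5]))
    return dict(groups)

def images_with_constant_score(lines):
    """Image ids that have at least two boxes, all with an identical score."""
    return sorted(
        image_id
        for image_id, scores in scores_per_image(lines).items()
        if len(scores) >= 2 and len(set(scores)) == 1
    )

# Rows copied from the detection dump above.
sample = [
    "1,1139.4,305.22,102.59,69.783,1.2737e-05",
    "1,1077.3,305.22,120.09,69.783,1.2737e-05",
    "...",
    "3769,1035.8,185.7,206.2,145.18,1.0966e-21",
    "3769,799.76,185.7,369.53,145.18,1.0966e-21",
]
print(images_with_constant_score(sample))
```

On a real file, pass `open(path)` instead of `sample`. If nearly every image shows up in the result, the scores are probably not coming from the classifier output as intended.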

I replaced field_w and field_h in image_gt_data_param of trainval_1st.prototxt, trainval_2nd.prototxt and trainval_2nd_ini.prototxt:

image_gt_data_param {
    source: "../../../data/kitti/window_files/mscnn_window_file_kitti_vehicle_train.txt"
    batch_size: 4
    coord_num: 4
    resize_width: 1920
    resize_height: 576
    crop_width: 768  #crop around the object random
    crop_height: 576
    min_gt_height: 35 #gt roi<min_gt_height ignore
    downsample_rate: 8 # label downsample's ratio
    downsample_rate: 8
    downsample_rate: 16
    downsample_rate: 16
    downsample_rate: 32
    downsample_rate: 32
    downsample_rate: 64
    field_w: 62 #rpn anchor_w
    field_w: 103
    field_w: 159
    field_w: 229
    field_w: 322
    field_w: 434
    field_w: 590
    field_h: 43 #rpn anchor_h
    field_h: 63
    field_h: 87
    field_h: 120
    field_h: 161
    field_h: 218
    field_h: 291
    fg_threshold: 0.5
    do_multiple_scale: true
    min_scale: 60  #bounding box scale  classify roi into different scales
    max_scale: 480

I didn't change the min_scale and max_scale values. I also changed roi_split_param in trainval_2nd.prototxt:

layer {
  name: "ROISplit" #use ./data/kitti/statistical_size.m to determine the parameter
  type: "ROISplit"
  bottom: "rois"
  top: "roi_num"
  top: "hash_table"
  roi_split_param {
    branch_num: 2
    split_area1: 8214
    fluctuation_range_large: 742
    fluctuation_range_small: 742
  }
}
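When editing repeated fields like these by hand, it is easy to drop or duplicate an entry. The sketch below counts how often each field name appears in a prototxt fragment, so the number of `downsample_rate`, `field_w` and `field_h` entries can be compared. That the three lists must have equal length (one entry per detection branch) is my assumption; the helper name is hypothetical and the parsing is a plain line scan, not a real protobuf parser.

```python
import re
from collections import Counter

def count_repeated_fields(prototxt_text):
    """Count occurrences of each 'name: value' field, ignoring '#' comments."""
    counts = Counter()
    for raw in prototxt_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # naive: assumes no '#' inside quoted strings
        match = re.match(r"^(\w+)\s*:", line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Values copied from the image_gt_data_param block above.
snippet = """
    downsample_rate: 8 # label downsample's ratio
    downsample_rate: 8
    downsample_rate: 16
    downsample_rate: 16
    downsample_rate: 32
    downsample_rate: 32
    downsample_rate: 64
    field_w: 62 #rpn anchor_w
    field_w: 103
    field_w: 159
    field_w: 229
    field_w: 322
    field_w: 434
    field_w: 590
    field_h: 43 #rpn anchor_h
    field_h: 63
    field_h: 87
    field_h: 120
    field_h: 161
    field_h: 218
    field_h: 291
"""
counts = count_repeated_fields(snippet)
print(counts["downsample_rate"], counts["field_w"], counts["field_h"])
```

All three counts come out equal for the block above, which also matches the seven branches (1_5x5 through 4_5x5) reported in the training logs.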

The end of log_1st.txt reads:

I0503 21:24:08.507083 23925 solver.cpp:320] Iteration 10000, loss = 0.144807
I0503 21:24:08.507103 23925 solver.cpp:340] Iteration 10000, Testing net (#0)
I0503 21:24:56.591938 23925 solver.cpp:419]     Test net output #0: accuracy_1_5x5 = 0.984565
I0503 21:24:56.591954 23925 solver.cpp:419]     Test net output #1: accuracy_1_5x5 = 0.819886
I0503 21:24:56.591958 23925 solver.cpp:419]     Test net output #2: accuracy_1_7x7 = 0.967705
I0503 21:24:56.591960 23925 solver.cpp:419]     Test net output #3: accuracy_1_7x7 = 0.850334
I0503 21:24:56.591962 23925 solver.cpp:419]     Test net output #4: accuracy_2_5x5 = 0.967036
I0503 21:24:56.591964 23925 solver.cpp:419]     Test net output #5: accuracy_2_5x5 = 0.857131
I0503 21:24:56.591967 23925 solver.cpp:419]     Test net output #6: accuracy_2_7x7 = 0.963154
I0503 21:24:56.591969 23925 solver.cpp:419]     Test net output #7: accuracy_2_7x7 = 0.861945
I0503 21:24:56.591971 23925 solver.cpp:419]     Test net output #8: accuracy_3_5x5 = 0.958406
I0503 21:24:56.591974 23925 solver.cpp:419]     Test net output #9: accuracy_3_5x5 = 0.884946
I0503 21:24:56.591975 23925 solver.cpp:419]     Test net output #10: accuracy_3_7x7 = 0.95563
I0503 21:24:56.591977 23925 solver.cpp:419]     Test net output #11: accuracy_3_7x7 = 0.857311
I0503 21:24:56.591979 23925 solver.cpp:419]     Test net output #12: accuracy_4_5x5 = 0.951604
I0503 21:24:56.591981 23925 solver.cpp:419]     Test net output #13: accuracy_4_5x5 = 0.785199
I0503 21:24:56.591984 23925 solver.cpp:419]     Test net output #14: boxiou_1_5x5 = 0.706048
I0503 21:24:56.591986 23925 solver.cpp:419]     Test net output #15: boxiou_1_7x7 = 0.692426
I0503 21:24:56.591989 23925 solver.cpp:419]     Test net output #16: boxiou_2_5x5 = 0.688052
I0503 21:24:56.591990 23925 solver.cpp:419]     Test net output #17: boxiou_2_7x7 = 0.633385
I0503 21:24:56.591992 23925 solver.cpp:419]     Test net output #18: boxiou_3_5x5 = 0.660376
I0503 21:24:56.591995 23925 solver.cpp:419]     Test net output #19: boxiou_3_7x7 = 0.562137
I0503 21:24:56.591996 23925 solver.cpp:419]     Test net output #20: boxiou_4_5x5 = 0.623004
I0503 21:24:56.592001 23925 solver.cpp:419]     Test net output #21: loss_1_5x5 = 0.208121 (* 0.9 = 0.187309 loss)
I0503 21:24:56.592005 23925 solver.cpp:419]     Test net output #22: loss_1_5x5 = 0.000375387 (* 0.9 = 0.000337848 loss)
I0503 21:24:56.592007 23925 solver.cpp:419]     Test net output #23: loss_1_7x7 = 0.180292 (* 0.9 = 0.162263 loss)
I0503 21:24:56.592010 23925 solver.cpp:419]     Test net output #24: loss_1_7x7 = 0.000365495 (* 0.9 = 0.000328946 loss)
I0503 21:24:56.592013 23925 solver.cpp:419]     Test net output #25: loss_2_5x5 = 0.155888 (* 1 = 0.155888 loss)
I0503 21:24:56.592016 23925 solver.cpp:419]     Test net output #26: loss_2_5x5 = 0.000271001 (* 1 = 0.000271001 loss)
I0503 21:24:56.592020 23925 solver.cpp:419]     Test net output #27: loss_2_7x7 = 0.134279 (* 1 = 0.134279 loss)
I0503 21:24:56.592023 23925 solver.cpp:419]     Test net output #28: loss_2_7x7 = 0.000303086 (* 1 = 0.000303086 loss)
I0503 21:24:56.592027 23925 solver.cpp:419]     Test net output #29: loss_3_5x5 = 0.0982034 (* 1 = 0.0982034 loss)
I0503 21:24:56.592033 23925 solver.cpp:419]     Test net output #30: loss_3_5x5 = 0.000191059 (* 1 = 0.000191059 loss)
I0503 21:24:56.592038 23925 solver.cpp:419]     Test net output #31: loss_3_7x7 = 0.111578 (* 1 = 0.111578 loss)
I0503 21:24:56.592043 23925 solver.cpp:419]     Test net output #32: loss_3_7x7 = 0.000217731 (* 1 = 0.000217731 loss)
I0503 21:24:56.592049 23925 solver.cpp:419]     Test net output #33: loss_4_5x5 = 0.073229 (* 1 = 0.073229 loss)
I0503 21:24:56.592054 23925 solver.cpp:419]     Test net output #34: loss_4_5x5 = 6.21701e-05 (* 1 = 6.21701e-05 loss)
I0503 21:24:56.592058 23925 solver.cpp:325] Optimization Done.
I0503 21:24:56.592061 23925 caffe.cpp:215] Optimization Done.

The log_2nd.txt is still being updated; the latest entries are:

I0508 11:54:27.499981  7949 solver.cpp:236] Iteration 1050, loss = 2.8295
I0508 11:54:27.500003  7949 solver.cpp:252]     Train net output #0: accuracy_1_5x5 = 0.991976
I0508 11:54:27.500007  7949 solver.cpp:252]     Train net output #1: accuracy_1_5x5 = 1
I0508 11:54:27.500010  7949 solver.cpp:252]     Train net output #2: accuracy_1_7x7 = 0.990082
I0508 11:54:27.500012  7949 solver.cpp:252]     Train net output #3: accuracy_1_7x7 = 0.953488
I0508 11:54:27.500015  7949 solver.cpp:252]     Train net output #4: accuracy_2_5x5 = 0.984035
I0508 11:54:27.500017  7949 solver.cpp:252]     Train net output #5: accuracy_2_5x5 = 0.690909
I0508 11:54:27.500020  7949 solver.cpp:252]     Train net output #6: accuracy_2_7x7 = 0.97393
I0508 11:54:27.500022  7949 solver.cpp:252]     Train net output #7: accuracy_2_7x7 = 0.674157
I0508 11:54:27.500025  7949 solver.cpp:252]     Train net output #8: accuracy_3_5x5 = 0.935972
I0508 11:54:27.500027  7949 solver.cpp:252]     Train net output #9: accuracy_3_5x5 = 0.615385
I0508 11:54:27.500030  7949 solver.cpp:252]     Train net output #10: accuracy_3_7x7 = 0.915509
I0508 11:54:27.500031  7949 solver.cpp:252]     Train net output #11: accuracy_3_7x7 = 0.948718
I0508 11:54:27.500035  7949 solver.cpp:252]     Train net output #12: accuracy_4_5x5 = 0.928241
I0508 11:54:27.500036  7949 solver.cpp:252]     Train net output #13: accuracy_4_5x5 = 0.964286
I0508 11:54:27.500038  7949 solver.cpp:252]     Train net output #14: boxiou_1_5x5 = 0.79634
I0508 11:54:27.500041  7949 solver.cpp:252]     Train net output #15: boxiou_1_7x7 = 0.720175
I0508 11:54:27.500043  7949 solver.cpp:252]     Train net output #16: boxiou_2_5x5 = 0.416827
I0508 11:54:27.500046  7949 solver.cpp:252]     Train net output #17: boxiou_2_7x7 = 0.287881
I0508 11:54:27.500048  7949 solver.cpp:252]     Train net output #18: boxiou_3_5x5 = 0.568348
I0508 11:54:27.500051  7949 solver.cpp:252]     Train net output #19: boxiou_3_7x7 = 0.229502
I0508 11:54:27.500053  7949 solver.cpp:252]     Train net output #20: boxiou_4_5x5 = 0.210682
I0508 11:54:27.500056  7949 solver.cpp:252]     Train net output #21: cls_accuracy_large = 0.879699
I0508 11:54:27.500058  7949 solver.cpp:252]     Train net output #22: cls_accuracy_small = 0.943089
I0508 11:54:27.500062  7949 solver.cpp:252]     Train net output #23: loss_1_5x5 = 0.206021 (* 0.9 = 0.185419 loss)
I0508 11:54:27.500066  7949 solver.cpp:252]     Train net output #24: loss_1_5x5 = 0.00461197 (* 0.9 = 0.00415077 loss)
I0508 11:54:27.500069  7949 solver.cpp:252]     Train net output #25: loss_1_7x7 = 0.171553 (* 0.9 = 0.154398 loss)
I0508 11:54:27.500073  7949 solver.cpp:252]     Train net output #26: loss_1_7x7 = 0.00808075 (* 0.9 = 0.00727267 loss)
I0508 11:54:27.500077  7949 solver.cpp:252]     Train net output #27: loss_2_5x5 = 0.467066 (* 1 = 0.467066 loss)
I0508 11:54:27.500079  7949 solver.cpp:252]     Train net output #28: loss_2_5x5 = 0.051064 (* 1 = 0.051064 loss)
I0508 11:54:27.500082  7949 solver.cpp:252]     Train net output #29: loss_2_7x7 = 0.73152 (* 1 = 0.73152 loss)
I0508 11:54:27.500085  7949 solver.cpp:252]     Train net output #30: loss_2_7x7 = 0.0907826 (* 1 = 0.0907826 loss)
I0508 11:54:27.500089  7949 solver.cpp:252]     Train net output #31: loss_3_5x5 = 0.237203 (* 1 = 0.237203 loss)
I0508 11:54:27.500092  7949 solver.cpp:252]     Train net output #32: loss_3_5x5 = 0.0191655 (* 1 = 0.0191655 loss)
I0508 11:54:27.500095  7949 solver.cpp:252]     Train net output #33: loss_3_7x7 = 0.15761 (* 1 = 0.15761 loss)
I0508 11:54:27.500100  7949 solver.cpp:252]     Train net output #34: loss_3_7x7 = 0.115767 (* 1 = 0.115767 loss)
I0508 11:54:27.500103  7949 solver.cpp:252]     Train net output #35: loss_4_5x5 = 0.139059 (* 1 = 0.139059 loss)
I0508 11:54:27.500107  7949 solver.cpp:252]     Train net output #36: loss_4_5x5 = 0.0791429 (* 1 = 0.0791429 loss)
I0508 11:54:27.500110  7949 solver.cpp:252]     Train net output #37: loss_bbox_large = 0.35103 (* 1 = 0.35103 loss)
I0508 11:54:27.500113  7949 solver.cpp:252]     Train net output #38: loss_bbox_small = 0.203048 (* 1 = 0.203048 loss)
I0508 11:54:27.500118  7949 solver.cpp:252]     Train net output #39: loss_cls_large = 0.366006 (* 1 = 0.366006 loss)
I0508 11:54:27.500120  7949 solver.cpp:252]     Train net output #40: loss_cls_small = 0.168262 (* 1 = 0.168262 loss)
I0508 11:54:27.500123  7949 sgd_solver.cpp:106] Iteration 1050, lr = 0.0005
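To compare values such as the boxiou numbers between the two logs without reading them line by line, the net outputs can be pulled out with a regular expression. This is a rough sketch based only on the line format visible in the logs above; the function name is hypothetical. It returns a list rather than a dict because output names repeat (for example, accuracy_1_5x5 appears twice).

```python
import re

# Matches e.g. "Train net output #16: boxiou_2_5x5 = 0.416827"
OUTPUT_RE = re.compile(r"(Train|Test) net output #(\d+): (\w+) = ([-+0-9.eE]+)")

def parse_net_outputs(log_lines):
    """Return (phase, index, name, value) tuples for every net-output line."""
    rows = []
    for line in log_lines:
        match = OUTPUT_RE.search(line)
        if match:
            phase, index, name, value = match.groups()
            rows.append((phase, int(index), name, float(value)))
    return rows

# Lines copied from log_2nd.txt above.
log_sample = [
    "I0508 11:54:27.500043  7949 solver.cpp:252]     Train net output #16: boxiou_2_5x5 = 0.416827",
    "I0508 11:54:27.500046  7949 solver.cpp:252]     Train net output #17: boxiou_2_7x7 = 0.287881",
    "I0508 11:54:27.500062  7949 solver.cpp:252]     Train net output #23: loss_1_5x5 = 0.206021 (* 0.9 = 0.185419 loss)",
    "I0508 11:54:27.500123  7949 sgd_solver.cpp:106] Iteration 1050, lr = 0.0005",
]
rows = parse_net_outputs(log_sample)
for phase, index, name, value in rows:
    if name.startswith("boxiou"):
        print(phase, index, name, value)
```

For the loss lines only the raw value before the parenthesised weighting is captured; lines that are not net outputs, such as the learning-rate line, are skipped.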

The SINet_kitti_train_1st_iter_10000.caffemodel is 15 MB, and SINet_kitti_train_2nd_iter_75000.caffemodel is 595 MB.

skywalker9096 commented 6 years ago

Problem solved by testing without cuDNN.
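For anyone hitting the same thing: in a Makefile-based Caffe build, cuDNN is switched on by an uncommented `USE_CUDNN := 1` line in Makefile.config, and Caffe has to be rebuilt after changing it. The thread does not say exactly how cuDNN was disabled here, so the check below is only a sketch that assumes the Caffe bundled with SINet follows the stock Makefile.config layout; the function name is made up.

```python
import re

def cudnn_enabled(makefile_config_text):
    """Return True if an uncommented 'USE_CUDNN := 1' line is present."""
    for raw in makefile_config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments, including a fully commented-out line
        if re.fullmatch(r"USE_CUDNN\s*:=\s*1", line):
            return True
    return False

with_cudnn = "# cuDNN acceleration switch\nUSE_CUDNN := 1\n"
without_cudnn = "# cuDNN acceleration switch\n# USE_CUDNN := 1\n"
print(cudnn_enabled(with_cudnn), cudnn_enabled(without_cudnn))
```

To check a real build, pass the contents of Makefile.config, e.g. `cudnn_enabled(open("Makefile.config").read())`.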