yoxu515 / aot-benchmark

An efficient modular implementation of Associating Objects with Transformers for Video Object Segmentation in PyTorch
BSD 3-Clause "New" or "Revised" License

questions about inference fps #9

Closed xinyeCH closed 2 years ago

xinyeCH commented 2 years ago

Thanks for making the code available! I ran into some questions while testing the pretrained model. When evaluating the PRE_YTB_DAV pretrained AOTS model on DAVIS 2017, I only get about 29 FPS, whereas the paper reports around 40 FPS. However, the test J & F-mean matches the result posted in model_zoo (0.820575).

I did not modify the default test config in aots.py except for directories such as the dataset paths. Do I need to modify something in train_eval.sh?

My environment: 2 x Tesla V100 SXM2 32GB, driver version 450.51.06, CUDA 11.0, pytorch==1.7.0, torchvision==0.8.1, spatial-correlation-sampler==0.3.0

Exp alldataset_AOTS:
{
    "DATASETS": [
        "youtubevos",
        "davis2017"
    ],
    "DATA_DAVIS_REPEAT": 5,
    "DATA_DYNAMIC_MERGE_PROB": 0.3,
    "DATA_MAX_CROP_STEPS": 10,
    "DATA_MAX_SCALE_FACTOR": 1.3,
    "DATA_MIN_SCALE_FACTOR": 0.7,
    "DATA_RANDOMCROP": [
        465,
        465
    ],
    "DATA_RANDOMFLIP": 0.5,
    "DATA_RANDOM_GAP_DAVIS": 12,
    "DATA_RANDOM_GAP_YTB": 3,
    "DATA_RANDOM_REVERSE_SEQ": true,
    "DATA_SEQ_LEN": 5,
    "DATA_SHORT_EDGE_LEN": 480,
    "DATA_WORKERS": 8,
    "DIR_CKPT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/ckpt",
    "DIR_DATA": "./datasets",
    "DIR_DAVIS": "/workdir/xiecunhuang/VOS/dataset/DAVIS/2017/trainval",
    "DIR_EMA_CKPT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/ema_ckpt",
    "DIR_EVALUATION": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/eval",
    "DIR_IMG_LOG": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/log/img",
    "DIR_LOG": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/log",
    "DIR_RESULT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV",
    "DIR_ROOT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022",
    "DIR_STATIC": "/yexin/vos_related_source/vos_exper_dataset/unify_pretrain_dataset",
    "DIR_TB_LOG": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/log/tensorboard",
    "DIR_YTB": "/yexin/vos_related_source/vos_exper_dataset/dataset/Youtube",
    "DIST_BACKEND": "nccl",
    "DIST_ENABLE": true,
    "DIST_START_GPU": 0,
    "DIST_URL": "tcp://127.0.0.1:13241",
    "EXP_NAME": "alldataset_AOTS",
    "MODEL_ALIGN_CORNERS": true,
    "MODEL_ATT_HEADS": 8,
    "MODEL_DECODER_INTERMEDIATE_LSTT": true,
    "MODEL_ENCODER": "mobilenetv2",
    "MODEL_ENCODER_DIM": [
        24,
        32,
        96,
        1280
    ],
    "MODEL_ENCODER_EMBEDDING_DIM": 256,
    "MODEL_ENCODER_PRETRAIN": "./pretrain_models/mobilenet_v2-b0353104.pth",
    "MODEL_ENGINE": "aotengine",
    "MODEL_EPSILON": 1e-05,
    "MODEL_FREEZE_BACKBONE": false,
    "MODEL_FREEZE_BN": true,
    "MODEL_LSTT_NUM": 2,
    "MODEL_MAX_OBJ_NUM": 10,
    "MODEL_NAME": "AOTS",
    "MODEL_SELF_HEADS": 8,
    "MODEL_USE_PREV_PROB": false,
    "MODEL_VOS": "aot",
    "PRETRAIN": true,
    "PRETRAIN_FULL": true,
    "PRETRAIN_MODEL": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE/ema_ckpt/save_step_100000.pth",
    "STAGE_NAME": "PRE_YTB_DAV",
    "TEST_CKPT_PATH": "./AOTS_PRE_YTB_DAV.pth",
    "TEST_CKPT_STEP": null,
    "TEST_DATASET": "davis2017",
    "TEST_DATASET_FULL_RESOLUTION": false,
    "TEST_DATASET_SPLIT": "val",
    "TEST_EMA": true,
    "TEST_FLIP": false,
    "TEST_FRAME_LOG": false,
    "TEST_GPU_ID": 0,
    "TEST_GPU_NUM": 2,
    "TEST_LONG_TERM_MEM_GAP": 9999,
    "TEST_MAX_SIZE": 1040.0,
    "TEST_MIN_SIZE": null,
    "TEST_MULTISCALE": [
        1.0
    ],
    "TEST_WORKERS": 4,
    "TRAIN_AUTO_RESUME": true,
    "TRAIN_AUX_LOSS_RATIO": 1.0,
    "TRAIN_AUX_LOSS_WEIGHT": 1.0,
    "TRAIN_BATCH_SIZE": 16,
    "TRAIN_CLIP_GRAD_NORM": 5.0,
    "TRAIN_DATASET_FULL_RESOLUTION": false,
    "TRAIN_EMA_RATIO": 0.1,
    "TRAIN_ENABLE_PREV_FRAME": false,
    "TRAIN_ENCODER_FREEZE_AT": 2,
    "TRAIN_GPUS": 4,
    "TRAIN_HARD_MINING_RATIO": 0.5,
    "TRAIN_IMG_LOG": true,
    "TRAIN_LOG_STEP": 50,
    "TRAIN_LONG_TERM_MEM_GAP": 9999,
    "TRAIN_LR": 0.0002,
    "TRAIN_LR_COSINE_DECAY": false,
    "TRAIN_LR_ENCODER_RATIO": 0.1,
    "TRAIN_LR_MIN": 2e-05,
    "TRAIN_LR_POWER": 0.9,
    "TRAIN_LR_RESTART": 1,
    "TRAIN_LR_UPDATE_STEP": 1,
    "TRAIN_LR_WARM_UP_RATIO": 0.05,
    "TRAIN_LSTT_DROPPATH": 0.1,
    "TRAIN_LSTT_DROPPATH_LST": false,
    "TRAIN_LSTT_DROPPATH_SCALING": false,
    "TRAIN_LSTT_EMB_DROPOUT": 0.0,
    "TRAIN_LSTT_ID_DROPOUT": 0.0,
    "TRAIN_LSTT_LT_DROPOUT": 0.0,
    "TRAIN_LSTT_ST_DROPOUT": 0.0,
    "TRAIN_MAX_KEEP_CKPT": 8,
    "TRAIN_OPT": "adamw",
    "TRAIN_RESUME": false,
    "TRAIN_RESUME_CKPT": null,
    "TRAIN_RESUME_STEP": 0,
    "TRAIN_SAVE_STEP": 1000,
    "TRAIN_SEQ_TRAINING_FREEZE_PARAMS": [
        "patch_wise_id_bank"
    ],
    "TRAIN_SEQ_TRAINING_START_RATIO": 0.5,
    "TRAIN_SGD_MOMENTUM": 0.9,
    "TRAIN_START_STEP": 0,
    "TRAIN_TBLOG": true,
    "TRAIN_TBLOG_STEP": 50,
    "TRAIN_TOP_K_PERCENT_PIXELS": 0.15,
    "TRAIN_TOTAL_STEPS": 100000,
    "TRAIN_WEIGHT_DECAY": 0.07,
    "TRAIN_WEIGHT_DECAY_EXCLUSIVE": {},
    "TRAIN_WEIGHT_DECAY_EXEMPTION": [
        "absolute_pos_embed",
        "relative_position_bias_table",
        "relative_emb_v",
        "conv_out"
    ]
}
Use GPU 0 for evaluating.
Use GPU 1 for evaluating.
Build VOS model.
Load checkpoint from ./AOTS_PRE_YTB_DAV.pth
Process dataset...
/workdir/xiecunhuang/VOS/dataset/DAVIS/2017/trainval/JPEGImages/480p
Eval alldataset_AOTS on davis2017 val:
Done!
/workdir/xiecunhuang/VOS/dataset/DAVIS/2017/trainval/JPEGImages/480p
GPU 0 - Processing Seq bike-packing [1/30]:
GPU 1 - Processing Seq blackswan [2/30]:
GPU 1 - Seq blackswan - FPS: 29.29. All-Frame FPS: 29.29, All-Seq FPS: 29.29, Max Mem: 0.53G
GPU 1 - Processing Seq breakdance [4/30]:
GPU 0 - Seq bike-packing - FPS: 28.99. All-Frame FPS: 28.99, All-Seq FPS: 28.99, Max Mem: 0.58G
GPU 0 - Processing Seq bmx-trees [3/30]:
GPU 1 - Seq breakdance - FPS: 29.57. All-Frame FPS: 29.46, All-Seq FPS: 29.43, Max Mem: 0.53G
GPU 1 - Processing Seq camel [5/30]:
GPU 0 - Seq bmx-trees - FPS: 29.16. All-Frame FPS: 29.08, All-Seq FPS: 29.08, Max Mem: 0.58G
GPU 0 - Processing Seq car-roundabout [6/30]:
GPU 1 - Seq camel - FPS: 30.71. All-Frame FPS: 29.95, All-Seq FPS: 29.84, Max Mem: 0.53G
GPU 1 - Processing Seq car-shadow [7/30]:
GPU 0 - Seq car-roundabout - FPS: 29.29. All-Frame FPS: 29.15, All-Seq FPS: 29.15, Max Mem: 0.58G
GPU 0 - Processing Seq cows [8/30]:
GPU 1 - Seq car-shadow - FPS: 30.62. All-Frame FPS: 30.05, All-Seq FPS: 30.03, Max Mem: 0.53G
GPU 1 - Processing Seq dance-twirl [9/30]:
GPU 0 - Seq cows - FPS: 27.67. All-Frame FPS: 28.67, All-Seq FPS: 28.76, Max Mem: 0.58G
GPU 0 - Processing Seq dog [10/30]:
GPU 1 - Seq dance-twirl - FPS: 25.66. All-Frame FPS: 28.80, All-Seq FPS: 29.04, Max Mem: 0.53G
GPU 1 - Processing Seq dogs-jump [11/30]:
GPU 0 - Seq dog - FPS: 28.28. All-Frame FPS: 28.60, All-Seq FPS: 28.67, Max Mem: 0.58G
GPU 0 - Processing Seq drift-chicane [12/30]:
GPU 1 - Seq dogs-jump - FPS: 27.15. All-Frame FPS: 28.52, All-Seq FPS: 28.71, Max Mem: 0.53G
GPU 1 - Processing Seq drift-straight [13/30]:
GPU 0 - Seq drift-chicane - FPS: 28.43. All-Frame FPS: 28.58, All-Seq FPS: 28.63, Max Mem: 0.58G
GPU 0 - Processing Seq goat [14/30]:
GPU 1 - Seq drift-straight - FPS: 29.75. All-Frame FPS: 28.65, All-Seq FPS: 28.85, Max Mem: 0.53G
GPU 1 - Processing Seq gold-fish [15/30]:
GPU 0 - Seq goat - FPS: 27.72. All-Frame FPS: 28.43, All-Seq FPS: 28.49, Max Mem: 0.58G
GPU 0 - Processing Seq horsejump-high [16/30]:
GPU 1 - Seq gold-fish - FPS: 28.61. All-Frame FPS: 28.64, All-Seq FPS: 28.82, Max Mem: 0.53G
GPU 1 - Processing Seq india [17/30]:
GPU 0 - Seq horsejump-high - FPS: 28.93. All-Frame FPS: 28.47, All-Seq FPS: 28.55, Max Mem: 0.58G
GPU 0 - Processing Seq judo [18/30]:
GPU 0 - Seq judo - FPS: 31.24. All-Frame FPS: 28.61, All-Seq FPS: 28.82, Max Mem: 0.58G
GPU 0 - Processing Seq lab-coat [20/30]:
GPU 1 - Seq india - FPS: 28.42. All-Frame FPS: 28.61, All-Seq FPS: 28.78, Max Mem: 0.53G
GPU 1 - Processing Seq kite-surf [19/30]:
GPU 0 - Seq lab-coat - FPS: 29.81. All-Frame FPS: 28.69, All-Seq FPS: 28.92, Max Mem: 0.58G
GPU 0 - Processing Seq libby [21/30]:
GPU 1 - Seq kite-surf - FPS: 30.69. All-Frame FPS: 28.76, All-Seq FPS: 28.96, Max Mem: 0.53G
GPU 1 - Processing Seq loading [22/30]:
GPU 0 - Seq libby - FPS: 31.08. All-Frame FPS: 28.85, All-Seq FPS: 29.10, Max Mem: 0.58G
GPU 0 - Processing Seq mbike-trick [23/30]:
GPU 1 - Seq loading - FPS: 31.06. All-Frame FPS: 28.90, All-Seq FPS: 29.14, Max Mem: 0.53G
GPU 1 - Processing Seq motocross-jump [24/30]:
GPU 1 - Seq motocross-jump - FPS: 31.09. All-Frame FPS: 29.01, All-Seq FPS: 29.29, Max Mem: 0.53G
GPU 1 - Processing Seq parkour [26/30]:
GPU 0 - Seq mbike-trick - FPS: 27.85. All-Frame FPS: 28.74, All-Seq FPS: 28.99, Max Mem: 0.58G
GPU 0 - Processing Seq paragliding-launch [25/30]:
GPU 1 - Seq parkour - FPS: 28.09. All-Frame FPS: 28.90, All-Seq FPS: 29.20, Max Mem: 0.53G
GPU 1 - Processing Seq pigs [27/30]:
GPU 0 - Seq paragliding-launch - FPS: 29.78. All-Frame FPS: 28.84, All-Seq FPS: 29.05, Max Mem: 0.58G
GPU 0 - Processing Seq scooter-black [28/30]:
GPU 0 - Seq scooter-black - FPS: 30.13. All-Frame FPS: 28.89, All-Seq FPS: 29.13, Max Mem: 0.58G
GPU 0 - Processing Seq soapbox [30/30]:
GPU 1 - Seq pigs - FPS: 30.28. All-Frame FPS: 29.01, All-Seq FPS: 29.27, Max Mem: 0.53G
GPU 1 - Processing Seq shooting [29/30]:
GPU 1 - Seq shooting - FPS: 28.25. All-Frame FPS: 28.98, All-Seq FPS: 29.20, Max Mem: 0.65G
Finished the evaluation on GPU 1.
GPU 0 - Seq soapbox - FPS: 29.63. All-Frame FPS: 28.96, All-Seq FPS: 29.16, Max Mem: 0.58G
Finished the evaluation on GPU 0.
GPU [0, 1] - All-Frame FPS: 28.97, All-Seq FPS: 29.18, Max Mem: 0.65G
xinyeCH commented 2 years ago

Sorry, but I am still confused about the speed. In the model zoo, AOTS is listed at around 40 FPS and AOTT at around 50 FPS on DAVIS VOS, as shown in the screenshot below.

[Screenshot: model zoo speed table, 2022-02-14]
z-x-yang commented 2 years ago

@xinyeCH I updated the result-saving code. Results are now saved after the inference of each sequence instead of after each frame.

The speed should be faster on your device.
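
A rough sketch of what per-sequence result saving could look like (the function and variable names below are placeholders for illustration, not the repo's actual evaluator API):

```python
# Rough sketch (not the repo's actual evaluator code): masks are buffered in
# memory while the sequence is segmented and written to disk only once the
# sequence is done, so file I/O no longer sits inside the per-frame loop.
import os
import numpy as np
from PIL import Image

def evaluate_sequence(frames, segment_frame, output_dir):
    # `frames` and `segment_frame` are placeholders for the dataloader output
    # and the per-frame inference call.
    pending = []
    for name, frame in frames:
        mask = segment_frame(frame)            # returns an HxW uint8 label map
        pending.append((name, mask))
    os.makedirs(output_dir, exist_ok=True)
    for name, mask in pending:                 # flush results after inference
        Image.fromarray(np.asarray(mask, dtype=np.uint8)).save(
            os.path.join(output_dir, name + ".png"))
```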

z-x-yang commented 2 years ago

@xinyeCH One more update. The evaluator now calls torch.cuda.synchronize() only once per sequence: after the inference of each sequence, all the frame timers (start and end) are synchronized. The influence of a slow CPU and slow storage devices should be negligible now.

I evaluated the new code on a V100 machine with slow storage, probably similar to your setup, and the measured speed is now slightly faster than the reported results (AOTS: 41 FPS now, 33 FPS before).
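
A minimal sketch of the timing scheme described above, assuming one CUDA event pair per frame and a single torch.cuda.synchronize() per sequence (run_one_frame and frames are placeholders, not the repo's actual API):

```python
# Minimal sketch, not the repo's evaluator: record start/end CUDA events for
# every frame and synchronize once per sequence before reading the timings.
import torch

def time_sequence(frames, run_one_frame):
    timers = []
    for frame in frames:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        run_one_frame(frame)        # launches the GPU work for this frame
        end.record()
        timers.append((start, end))
    # Synchronize once for the whole sequence; afterwards every recorded event
    # has completed and the per-frame GPU times can be read out.
    torch.cuda.synchronize()
    total_ms = sum(s.elapsed_time(e) for s, e in timers)
    return len(frames) * 1000.0 / total_ms     # FPS based on GPU time only
```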

xinyeCH commented 2 years ago

@z-x-yang Thanks for your reply.

I updated evaluator.py according to your newest commit and modified some dataloader hyperparameters, such as the number of workers. AOTS now runs at about 34 FPS and AOTT at about 45 FPS.

I think the remaining difference between my results and the paper probably comes from the hardware and software environment.
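
For reference, the dataloader settings mentioned here correspond to the DATA_WORKERS / TEST_WORKERS fields visible in the config dump above; where exactly they are set depends on the repo version, so treat this as illustrative only:

```python
# Illustrative only: raise the dataloader worker counts shown in the config
# dump above so data loading overlaps better with GPU inference on machines
# with slow storage. The exact config object/file depends on the repo version.
cfg.DATA_WORKERS = 8   # training dataloader workers (dump above shows 8)
cfg.TEST_WORKERS = 8   # evaluation dataloader workers (dump above shows 4)
```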

z-x-yang commented 2 years ago

@xinyeCH Could you try using a single GPU for inference?

xinyeCH commented 2 years ago

@z-x-yang Sorry for the late reply. Yes, I use only one V100 during inference. Using 2 GPUs gives worse performance.

z-x-yang commented 2 years ago

OK. If your CPU is strong enough, using multiple GPUs should not make the speed slower.
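
For completeness, the single- versus multi-GPU choice corresponds to the TEST_GPU_ID / TEST_GPU_NUM fields in the config dump above; a sketch of the single-GPU setting, assuming the same config object:

```python
# Illustrative only: evaluate on a single GPU by adjusting the fields that
# appear in the config dump above (TEST_GPU_ID / TEST_GPU_NUM).
cfg.TEST_GPU_ID = 0    # index of the GPU to use
cfg.TEST_GPU_NUM = 1   # one GPU instead of the two used in the log above
```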