yeliudev / R2-Tuning

🌀 R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding (ECCV 2024)
http://arxiv.org/abs/2404.00801
BSD 3-Clause "New" or "Revised" License
60 stars · 1 fork

Inquiry on Performance Reproduction #18

Open HenryHZY opened 1 week ago

HenryHZY commented 1 week ago

Hi @yeliudev , thanks for your great project!

First, I evaluated your provided checkpoint and got the same result as the provided log.

Then, I reproduced the training on TVSum-PK:

bash tools/prepare_data.sh
python tools/launch.py configs/tvsum/r2_tuning_tvsum_pk.py

However, my training log shows a significant performance gap compared to the provided log (https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_pk.log):

[2024-10-12 01:52:06 INFO]: Environment info:
----------------------  -------------------------------------------------------------------------------------------------
System                  Linux
Python                  3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:27:36) [GCC 11.2.0]
CPU                     Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
CUDA_HOME               /mnt/petrelfs/share/cuda-11.8
NVCC                    Build cuda_11.8.r11.8/compiler.31833905_0
GPU 0                   NVIDIA A100-SXM4-80GB
PyTorch                 2.2.1+cu118 @ /mnt/petrelfs/huziyuan/miniconda3/envs/tg/lib/python3.12/site-packages/torch
PyTorch debug build     False
torchvision             0.17.1+cu118 @ /mnt/petrelfs/huziyuan/miniconda3/envs/tg/lib/python3.12/site-packages/torchvision
torchvision arch flags  sm_35, sm_50, sm_60, sm_70, sm_75, sm_80, sm_86
nncore                  0.4.4
numpy                   1.26.3
PIL                     10.2.0
cv2                     4.10.0
.......
[2024-10-12 01:52:08 INFO]: Learnable Parameters: 1.761M (100.0%)
[2024-10-12 01:52:08 INFO]: Auto Scale Batch Size: 1 GPU(s) * 4 Samples
[2024-10-12 01:52:08 INFO]: Distributed: False, AMP: fp16, Debug: False
[2024-10-12 01:52:08 INFO]: Launch engine, host: huziyuan@SH-IDC1-10-140-1-47, work_dir: work_dirs/r2_tuning_tvsum_pk
[2024-10-12 01:52:08 INFO]: Stage: 1, epochs: 500, optimizer: AdamW(lr: 0.0005, weight_decay: 0.0001)
[2024-10-12 01:52:12 INFO]: Epoch [1][1/1] lr: 5e-07, eta: 0:27:38, time: 2.853, data_time: 0.470, memory: 1213, loss_cls: 0.6610, loss_sal: 0.4547, loss_video_cal: 0.1470, loss_layer_cal: 0.1715, loss: 1.4342, grad_norm: 4.1796, scale: 65536.0000
[2024-10-12 01:52:12 INFO]: Saving checkpoint to work_dirs/r2_tuning_tvsum_pk/epoch_1.pth...
[2024-10-12 01:52:12 INFO]: Validating...
[2024-10-12 01:52:13 INFO]: Epoch (val) [1][1] loss_cls: 0.6457, loss_sal: 0.5442, loss_video_cal: 0.0000, loss_layer_cal: 0.1925, loss: 1.3824, mAP: 0.3966, best_mAP: 0.3966
[2024-10-12 01:52:14 INFO]: Epoch [2][1/1] lr: 1.049e-05, eta: 0:21:08, time: 0.090, data_time: 0.650, memory: 1237, loss_cls: 0.6673, loss_sal: 0.4626, loss_video_cal: 0.1621, loss_layer_cal: 0.1532, loss: 1.4453, grad_norm: 4.4095, scale: 65536.0000
[2024-10-12 01:52:14 INFO]: Saving checkpoint to work_dirs/r2_tuning_tvsum_pk/epoch_2.pth...
[2024-10-12 01:52:14 INFO]: Validating...
[2024-10-12 01:52:14 INFO]: Epoch (val) [2][1] loss_cls: 0.6402, loss_sal: 0.5446, loss_video_cal: 0.0000, loss_layer_cal: 0.1915, loss: 1.3763, mAP: 0.3966, best_mAP: 0.3966
......
[2024-10-12 02:04:38 INFO]: Epoch [499][1/1] lr: 0.0005, eta: 0:00:01, time: 0.081, data_time: 0.671, memory: 1237, loss_cls: 0.5472, loss_sal: 0.0262, loss_video_cal: 0.0034, loss_layer_cal: 0.0021, loss: 0.5789, grad_norm: 1.2453, scale: 16384.0000
[2024-10-12 02:04:38 INFO]: Saving checkpoint to work_dirs/r2_tuning_tvsum_pk/epoch_499.pth...
[2024-10-12 02:04:38 INFO]: Validating...
[2024-10-12 02:04:39 INFO]: Epoch (val) [499][1] loss_cls: 0.5335, loss_sal: 0.4638, loss_video_cal: 0.0000, loss_layer_cal: 0.0000, loss: 0.9973, mAP: 0.3252, best_mAP: 0.4442
[2024-10-12 02:04:40 INFO]: Epoch [500][1/1] lr: 0.0005, eta: 0:00:00, time: 0.081, data_time: 0.673, memory: 1237, loss_cls: 0.5457, loss_sal: 0.0398, loss_video_cal: 0.0035, loss_layer_cal: 0.0043, loss: 0.5932, grad_norm: 0.9142, scale: 16384.0000
[2024-10-12 02:04:40 INFO]: Saving checkpoint to work_dirs/r2_tuning_tvsum_pk/epoch_500.pth...
[2024-10-12 02:04:40 INFO]: Validating...
[2024-10-12 02:04:41 INFO]: Epoch (val) [500][1] loss_cls: 0.5342, loss_sal: 0.4752, loss_video_cal: 0.0000, loss_layer_cal: 0.0000, loss: 1.0095, mAP: 0.3530, best_mAP: 0.4442
[2024-10-12 02:04:41 INFO]: Overall training speed: 500 iterations in 0:00:42 (0.0840 s/it)

Apart from the random seed, I think the main difference is the LR schedule. For example, my LR in epoch 1 is 5e-07, while in the provided log the LR in epoch 1 is 0.0.

Do you know how to fix this training issue so that the results can be reproduced normally? Thank you!
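For reference, the epoch-1 value 5e-07 is consistent with an mmcv/nncore-style linear LR warmup. A minimal sketch follows; note that `warmup_iters=50` and `warmup_ratio=1e-3` are inferred from the logged LR values above, not confirmed from the actual config:

```python
def warmup_lr(cur_iter, base_lr=5e-4, warmup_iters=50, warmup_ratio=1e-3):
    """Sketch of a linear LR warmup (mmcv/nncore style).

    warmup_iters and warmup_ratio are guesses inferred from the logs.
    """
    if cur_iter >= warmup_iters:
        return base_lr
    # LR ramps linearly from base_lr * warmup_ratio up to base_lr.
    k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
    return base_lr * (1 - k)

print(f"{warmup_lr(0):.3e}")  # 5.000e-07, matching "lr: 5e-07" at epoch 1
print(f"{warmup_lr(1):.3e}")  # 1.049e-05, matching "lr: 1.049e-05" at epoch 2
```

With these assumed values, the first two logged LRs are reproduced exactly, which suggests the schedules in both runs are actually identical and only the log formatting differs.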

HenryHZY commented 1 week ago

By the way, the performance listed in the repo is different from that shown in the paper. Did you re-train the models?

yeliudev commented 1 week ago

Hi @HenryHZY, thanks for your interest in our work!

May I know if your training always produces poor results, even when running multiple times? The released configs should be strictly aligned with the settings used for our checkpoints & the results in the paper, and the 0.0 LR for epoch 1 in our log was caused by the limited precision of the logging. We observed that the performance (of almost all models) on TVSum is extremely unstable, as we only use 4 videos for training and 1 video for evaluation. The results on QVHighlights (test split) should be much more reliable for benchmarking.
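To illustrate the logging-precision point: a warmup LR of 5e-07 collapses to 0.0 when rounded to four decimal places, so a logger that rounds before printing would show exactly what the provided log shows. A minimal illustration (the actual log formatting may differ):

```python
lr = 5e-07
# Rounding to 4 decimal places collapses the tiny warmup LR to zero,
# so the log line reads "lr: 0.0" even though the LR is nonzero.
print(f"lr: {round(lr, 4)}")  # prints "lr: 0.0"
```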

yeliudev commented 1 week ago

> By the way, the performance listed in the repo is different from that shown in the paper. Did you re-train the models?

Yes. We re-trained all the models (with the same settings as in the paper) before releasing the code, so the numbers are slightly different.

HenryHZY commented 1 week ago

> Hi @HenryHZY, thanks for your interest in our work!
>
> May I know if your training always produces poor results, even when running multiple times? The released configs should be strictly aligned with the settings used for our checkpoints & the results in the paper, and the 0.0 LR for epoch 1 in our log was caused by the limited precision of the logging. We observed that the performance (of almost all models) on TVSum is extremely unstable, as we only use 4 videos for training and 1 video for evaluation. The results on QVHighlights (test split) should be much more reliable for benchmarking.

Thanks for your quick response! I will try it later.