Question about inference resolution

During MOT training the input resolution is set to 1280x1280, while the test size is 1560 (longer edge). This means that the training inputs are square and lower-resolution, whereas the test frames are rectangular and higher-resolution. I tried testing on videos with the same resolution and aspect ratio as in training (1280x1280), but the performance was worse.
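To make the mismatch concrete, here is a minimal sketch of the two input shapes. It is only an illustration: the helper `resize_longer_edge` is hypothetical (not a function from the repository), the 1920x1080 source frame is an assumed example, and it assumes the test-time resize simply scales the longer edge to 1560 while preserving the aspect ratio, which may not match the actual transform.

```python
def resize_longer_edge(width, height, target_long):
    """Scale (width, height) so the longer edge equals target_long.

    Hypothetical helper for illustration; assumes the aspect ratio is kept.
    """
    scale = target_long / max(width, height)
    # Round half up to the nearest integer pixel size.
    return int(width * scale + 0.5), int(height * scale + 0.5)


# Training input as described above: a fixed square.
train_w, train_h = 1280, 1280

# Assumed 1920x1080 source frame, resized so the longer edge is 1560.
test_w, test_h = resize_longer_edge(1920, 1080, 1560)

print("train:", (train_w, train_h), "aspect", train_w / train_h)
print("test: ", (test_w, test_h), "aspect", round(test_w / test_h, 3))
```

Under these assumptions the test frame comes out as 1560x878, so the two settings differ in both aspect ratio and the scale at which objects appear.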

My question is: how can performance get worse when the test input keeps the same aspect ratio and resolution as in training? Shouldn't the network perform better in that situation? If not, what is the reason (maybe I am missing some property of the detector/transformer module)?

xingyizhou / GTR
