I'm testing the inference demo code with the default setting (r2_tuning_qvhighlights). When printing the value of output['_out']['saliency'], I noticed that it contains negative values. Is this expected?
Also, if I want to use this model for a video retrieval task, should I use the average saliency as the clip score?
```
output['_out']['saliency']
tensor([-0.0343, -0.0360, -0.0387, -0.0397, -0.0320, -0.0345, -0.0370, -0.0371, -0.0331, -0.0292, -0.0297, -0.0313, -0.0316, -0.0310, -0.0315, -0.0331, -0.0314, -0.0298, -0.0303, -0.0318, -0.0339, -0.0335, -0.0323, -0.0289, -0.0274, -0.0262, -0.0271, -0.0275, -0.0330, -0.0315, -0.0349, -0.0313, -0.0492, -0.0390, -0.0345, -0.0321, -0.0230, -0.0220, -0.0190, -0.0137, -0.0236, -0.0219, 0.0140, 0.0164, 0.0221, 0.0272, 0.0253, 0.0258, -0.0045, 0.0361, 0.0355, 0.0158, 0.0122, 0.0146, 0.0366, 0.0380, 0.0433, 0.0294, 0.0306, 0.0315, 0.0392, 0.0407, 0.0425, 0.0346, 0.0381, 0.0396, 0.0455, 0.0410, 0.0397, 0.0432, 0.0465, 0.0310, 0.0261, 0.0397, 0.0332, 0.0335, 0.0299, 0.0208, 0.0195, 0.0373, 0.0143, 0.0156, 0.0156, 0.0274, 0.0172, 0.0033, 0.0055, -0.0062, -0.0117, -0.0114, -0.0148, -0.0246, -0.0299, -0.0326, -0.0399, -0.0376, -0.0376, -0.0347, -0.0523, -0.0589, -0.0879, -0.0853, -0.0893, -0.1123, -0.0485, -0.0520, -0.0483, -0.0528, -0.0507, -0.0532, -0.0555, -0.0542, -0.0322, -0.0324, -0.0365, -0.0354, -0.0324, -0.0307, -0.0342, -0.0301, -0.0334, -0.0333, -0.0315, -0.0329, -0.0342, -0.0314, -0.0303, -0.0315, -0.0285, -0.0311, -0.0360, -0.0343, -0.0298, -0.0315, -0.0357, -0.0285, -0.0302, -0.0330, -0.0343, -0.0321, -0.0286, -0.0274, -0.0325, -0.0319, -0.0391, -0.0388, -0.0364, -0.0412], device='cuda:0')
```
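To make the second question concrete, here is a minimal sketch of what I mean by "average saliency as the clip score". This is only my guess at how one might pool the per-clip saliency tensor into a single video-level score for retrieval; the `saliency` tensor here is a toy subset of the values above, and both the mean and max pooling variants are just assumptions on my part, not something from the repo:

```python
import torch

# Toy subset of the saliency values printed above (illustration only).
saliency = torch.tensor([-0.0343, -0.0360, 0.0140, 0.0433])

# Option 1 (my assumption): mean-pool the per-clip saliency scores
# into one video-level relevance score.
clip_score_mean = saliency.mean().item()

# Option 2 (also an assumption): max-pool instead, so a single highly
# relevant clip can dominate the video-level score.
clip_score_max = saliency.max().item()
```

Should I rank videos by something like `clip_score_mean`, or is max-pooling (or another reduction) more appropriate given how the saliency head is trained?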