zju3dv / EfficientLoFTR


Issues with the evaluation #1

Open sarlinpe opened 3 months ago

sarlinpe commented 3 months ago

Hi,

Before everyone gets too excited, I need to point out some obvious issues in the evaluation described in the paper.

In Figure 1, the inference time of the semi-dense approaches is largely underestimated because it is computed at a much lower resolution than the pose accuracy (on MegaDepth). This is evidenced by Table 8: the 56.4 AUC@5° and 40.1 ms reported in Table 1 actually correspond to resolutions of 1184×1184 and 640×640, respectively. In reality, the proposed approach is much slower: the inference time at this resolution is 139 ms (compare this to LightGlue's 30 ms). For the reported inference time, the proposed approach is actually not more accurate than LightGlue (and most likely less accurate).
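
For readers less familiar with the metric: the AUC@5°/10°/20° numbers in this thread are pose-error AUCs computed over the test pairs. A minimal numpy sketch in the spirit of the SuperGlue/LoFTR evaluation utilities (a paraphrase, not their exact code):

```python
import numpy as np

def pose_auc(errors, thresholds=(5, 10, 20)):
    """AUC of the recall curve of per-pair pose errors (degrees).

    `errors` is usually max(rotation error, translation angular error);
    a failed pose estimate is set to infinity.
    """
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.r_[0.0, errors]
    recall = np.r_[0.0, recall]
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        # close the curve at the threshold with the last reached recall
        e = np.r_[errors[:last], t]
        r = np.r_[recall[:last], recall[last - 1]]
        aucs.append(np.trapz(r, x=e) / t)
    return aucs  # e.g. AUC@5/10/20, each in [0, 1]
```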

The same story goes for the other semi-dense matchers: for LoFTR, the time should be much higher than 66 ms, closer to 180 ms (LightGlue paper, Table 2). Even at this resolution, the accuracy gap might completely vanish when using a modern implementation of RANSAC, as found in PoseLib; evidence of this can also be found in the LightGlue paper, Table 2 (LO-RANSAC). This can be easily evaluated in glue-factory, so this omission is surprising.
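
For reference, this is roughly what a PoseLib-based LO-RANSAC pose estimation looks like. A hedged sketch from memory of the poselib Python bindings; option names such as `max_epipolar_error` and the camera-dict format may differ slightly between versions:

```python
import numpy as np
import poselib  # https://github.com/PoseLib/PoseLib (pip install poselib)

def estimate_pose_loransac(kpts0, kpts1, K0, K1, size0, size1, th_px=2.0):
    """Relative pose with PoseLib's LO-RANSAC; the threshold is in pixels."""
    def to_camera_dict(K, size):
        w, h = size
        return {"model": "PINHOLE", "width": w, "height": h,
                "params": [K[0, 0], K[1, 1], K[0, 2], K[1, 2]]}

    pose, info = poselib.estimate_relative_pose(
        np.asarray(kpts0, dtype=np.float64),
        np.asarray(kpts1, dtype=np.float64),
        to_camera_dict(K0, size0), to_camera_dict(K1, size1),
        {"max_epipolar_error": th_px},  # RANSAC options
        {},                             # local-optimization / refinement options
    )
    return pose.R, pose.t, np.array(info["inliers"])
```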

We'd appreciate having the authors comment on this - @wyf2020 @hxy-123 @Cuistiano - thank you! cc @Phil26AT

wyf2020 commented 3 months ago

Hi Sarlin,

Thank you for your interest in our work. We address your questions as follows.

1) Firstly, we clarify that all statistics in Fig. 1 are from Tab. 1, where we evaluate the running time of all methods at a unified 640×480 image resolution on ScanNet, as pointed out in the Tab. 1 caption and Sec. 4.2 (Evaluation protocol). The reason for not using MegaDepth for the time evaluation is the significantly varied image resolutions used in the baselines' original papers, as summarized in the table below; the running times of our method at different MegaDepth resolutions are shown in Tab. 8.

Summary of 13 papers from top conferences; the values in parentheses are from the official code, the rest are from the paper and appendix.

| Method | MegaDepth | ScanNet | RANSAC / thr | Conference |
| --- | --- | --- | --- | --- |
| SP+SG | 1600, max_keypoints=2048, nms=3 (keypoint_threshold=0.005) | 640×480, max_keypoints=1024, nms=4 (keypoint_threshold=0.005) | OpenCV RANSAC / 0.5 px | CVPR 2020 |
| LoFTR | 840 | 640×480 | OpenCV RANSAC / 0.5 px | CVPR 2021 |
| QuadTree | (832) | 640×480 | OpenCV RANSAC / 0.5 px | ICLR 2022 |
| ASpanFormer | 1152 | 640×480 | OpenCV RANSAC / 0.5 px | ECCV 2022 |
| MatchFormer | 840 | 640×480 | OpenCV RANSAC / 0.5 px | ACCV 2022 |
| TopicFM | 1200 | 640×480 | OpenCV RANSAC / 0.5 px | AAAI 2023 |
| DKM | (880×660) | (640×480) | OpenCV RANSAC / 0.5 px | CVPR 2023 |
| ASTR | 1216 | 640×480 | OpenCV RANSAC / 0.5 px | CVPR 2023 |
| PATS | None | (640×480) | OpenCV RANSAC / 0.5 px | CVPR 2023 |
| CasMTR | 1152 | 640×480 | OpenCV RANSAC / 0.5 px | ICCV 2023 |
| SP+LG | 1600, max_keypoints=2048, (nms=4, keypoint_threshold=0.0005) [LG] & (nms=3, keypoint_threshold=0) [gluefactory] | Not evaluated | OpenCV RANSAC or PoseLib (LO-RANSAC) / self-tuned | ICCV 2023 |
| RoMa | 672 | (560×560) | OpenCV RANSAC / 0.5 px | CVPR 2024 |
| Ours | 1152 | 640×480 | OpenCV RANSAC / 0.5 px | CVPR 2024 |

Notably, we could also use the ScanNet AUC for Fig. 1's accuracy comparison, and the gap between ours and LG would still exist: compared to the SP+LG AUC (49.9, 67.0, 80.1) on MegaDepth, our AUC (56.4, 72.2, 83.5) is higher by (13%, 8%, 4%); on ScanNet's generalization results, compared to the SP+LG AUC (14.8, 30.8, 47.5), our AUC (19.2, 37.0, 53.6) is higher by (30%, 20%, 13%). We show both figures for a comprehensive understanding:

[Figures: Fig. 1 drawn with MegaDepth AUC (left) and with ScanNet AUC (right)]

However, we didn't use ScanNet's AUC in Fig. 1, mainly because our experiments found that the quality of MegaDepth is better than that of ScanNet. (Perhaps this is also the reason why LightGlue didn't perform experiments on ScanNet.)

2) As for the accuracy and efficiency comparison with LightGlue on MegaDepth: the strongest model from the LightGlue paper (SP+LG, 1600 input image size with 2048 keypoints, carefully tuned RANSAC threshold) reaches an AUC@5/10/20 of 49.9 / 67.0 / 80.1 with a total running time of SP (46.1 ms) + LG (26.8 ms) = 72.9 ms. Our method at a 640×640 image resolution, without changing the RANSAC method and threshold (kept the same as LoFTR and many other baselines), already achieves generally better accuracy (51.0, 67.4, 79.8) and a faster end-to-end inference speed (41.7 ms), as shown in Tab. 8. Our optimized model can even achieve an AUC of (50.5, 67.1, 79.6) with the fixed RANSAC threshold and (51.9, 68.0, 80.0) with a tuned RANSAC threshold in just 34.1 ms.

By the way, we kindly point out that the running time of feature extraction & keypoint detection (SuperPoint or DISK) is missing in LightGlue's Tab. 2, where only the matching time (from sparse-feature input to match output) is compared against dense methods that are end-to-end (from image input to match output).

3) As for the RANSAC setting, we follow the setting used by most (12/13) of the previous methods, which use the same fixed RANSAC method (OpenCV) and threshold (0.5 px), as summarized in the table above. We also evaluate under LG's setting, which changes the RANSAC method and tunes the inlier thresholds. As shown below, the accuracy gap between ours and LG still exists with the more advanced LO-RANSAC.

| Method (AUC@5/10/20) | RANSAC (0.5 px, following 12/13 papers) | RANSAC (tuned by LG, unknown px) | RANSAC (tuned by us, 0.3 px) | LO-RANSAC (following LG in gluefactory, 2.0 px) | LO-RANSAC (tuned by us, 1.5 px) |
| --- | --- | --- | --- | --- | --- |
| Ours | 56.4 / 72.2 / 83.5 | None | 58.4 / 73.4 / 84.2 | 69.3 / 80.7 / 88.5 | 69.5 / 80.9 / 88.8 |
| SP+LG | None | 49.9 / 67.0 / 80.1 | None | 66.8 / 79.3 / 87.9 | None |
| ASpanFormer | 55.3 / 71.5 / 83.1 | 55.3 / 71.5 / 83.1 | 58.3 / 73.3 / 84.2 | 69.4 / 81.1 / 88.9 (tuned by LG, unknown px) | None |

Moreover, we observe that the AUCs of ASpanFormer and the other dense methods in LightGlue's Tab. 2 (RANSAC column) are identical to those in their original papers (i.e., without changed RANSAC thresholds), whereas the LightGlue paper states that all methods were carefully tuned (for example, "tuning the RANSAC threshold yields +7% AUC@5 on SuperGlue" in its supplementary).

Therefore, we also tried to tune the RANSAC threshold for ASpanFormer (finally set to 0.3 px, the same as ours), where the AUC goes from 55.3 / 71.5 / 83.1 (reported in LightGlue) to 58.3 / 73.3 / 84.2. We think this reflects that finding the best parameters for each method may require a large tuning range and dense sampling steps, and may potentially overfit to a specific dataset.
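
For clarity, the threshold being tuned here enters the standard OpenCV-RANSAC evaluation as follows. This is a paraphrased sketch of the usual LoFTR/SuperGlue-style pose estimation, not the exact evaluation script; the pixel threshold is divided by the mean focal length because the keypoints are normalized first:

```python
import cv2
import numpy as np

def estimate_pose_opencv(kpts0, kpts1, K0, K1, thr_px=0.5, conf=0.99999):
    """Essential-matrix RANSAC with a pixel threshold (LoFTR/SuperGlue style)."""
    if len(kpts0) < 5:
        return None
    # move keypoints to normalized camera coordinates
    kpts0 = (kpts0 - K0[[0, 1], [2, 2]][None]) / K0[[0, 1], [0, 1]][None]
    kpts1 = (kpts1 - K1[[0, 1], [2, 2]][None]) / K1[[0, 1], [0, 1]][None]
    # express the pixel threshold (e.g. 0.5 px vs. 0.3 px) in normalized coords
    thr = thr_px / np.mean([K0[0, 0], K0[1, 1], K1[0, 0], K1[1, 1]])

    E, mask = cv2.findEssentialMat(
        kpts0, kpts1, np.eye(3), method=cv2.RANSAC, prob=conf, threshold=thr)
    if E is None:
        return None
    # findEssentialMat may return several stacked 3x3 candidates; keep the best
    best = None
    for E_i in np.split(E, len(E) // 3):
        n, R, t, _ = cv2.recoverPose(E_i, kpts0, kpts1, np.eye(3), mask=mask.copy())
        if best is None or n > best[0]:
            best = (n, R, t.ravel(), mask.ravel() > 0)
    return best[1:]  # (R, t, inlier mask)
```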

I hope these responses answer your questions, and any discussion is welcome :).

Master-cai commented 3 months ago

I think what you're trying to convey with Figure 1 is that you can achieve an AUC@5 of 56 at a speed of 40 ms, but that's not actually the case. This could lead to misunderstandings and confusion.

wyf2020 commented 3 months ago

> I think what you're trying to convey with Figure 1 is that you can achieve an AUC@5 of 56 at a speed of 40 ms, but that's not actually the case. This could lead to misunderstandings and confusion.

Thank you for the reminder and suggestions. We have changed Figure 1 in the camera-ready version to the right-hand image shown above to dispel the potential misunderstanding.

sarlinpe commented 3 months ago

Thank you very much for the extensive reply.

1) Figure 1: Thank you for updating the figure. As @Master-cai mentions, reporting accuracy and speed from two different datasets was confusing.

After some investigation, it seems that there might be a problem with your evaluation of LightGlue on ScanNet. We actually have a fully reproducible evaluation available at https://github.com/cvg/glue-factory/pull/25. Running it with 2k keypoints on full-resolution images (1296x968) with the basic RANSAC, we obtain an AUC of 17.7 / 34.6 / 51.2, while you report 14.8 / 30.8 / 47.5. Running this evaluation with only 1k keypoints or at 640×480 does not make much sense given that dense approaches can leverage a lot more correspondences.

As for the speed, running LightGlue on 2k keypoints takes less than 20 ms on an RTX 3090 (see our benchmark), while you report >30 ms. Your paper does not mention whether this time includes the feature extraction by SuperPoint (~10 ms).

> By the way, we kindly point out that the running time of feature extraction & keypoint detection (SuperPoint or DISK) is missing in LightGlue's Tab. 2, where only the matching time (from sparse-feature input to match output) is compared against dense methods that are end-to-end (from image input to match output).

Indeed, the speed reported in the LightGlue paper does not include the time of feature extraction. In most applications, such as SfM, features are extracted once per image and subsequently cached. Since each image is matched against multiple others, the matching time dominates by far. For sparse matching, I therefore think that including the extraction time is not fair and does not reflect realistic operating conditions. For dense matching, it should probably be included because caching the dense features is infeasible due to their high memory footprint.

To summarize, I suggest that a more appropriate Figure 1 would look like the following:

[Figure: suggested revised Figure 1]

Alternatively, you could use the MegaDepth evaluation and report the average matching time per image. The image resolution is a great way to control the speed vs accuracy trade-off for dense approaches, so you could even report multiple values per approach.

3) RANSAC threshold and variant: Thank you for providing these numbers, this is very insightful. It is nice to see that better tuning of the inlier threshold and switching to LO-RANSAC can both increase the accuracy of your approach. I think that using LO-RANSAC should become the standard in future evaluations because it is much less sensitive to the choice of inlier threshold, making the evaluation a lot more reliable and closer to downstream applications like SfM.

wyf2020 commented 3 months ago

Thank you very much for the detailed and insightful reply.

1) Whether to Include SP Time:

> Indeed, the speed reported in the LightGlue paper does not include the time of feature extraction. In most applications, such as SfM, features are extracted once per image and subsequently cached. Since each image is matched against multiple others, the matching time dominates by far. For sparse matching, I therefore think that including the extraction time is not fair and does not reflect realistic operating conditions. For dense matching, it should probably be included because caching the dense features is infeasible due to their high memory footprint.

Indeed, in the SfM application the sparse features can be extracted once and stored. However, for applications such as visual localization, which is more latency-sensitive than offline SfM, we believe that reporting only the matching time without the sparse detection and description time (i.e., the SP time) does not reflect practical operating conditions:

1) For visual localization based on SfM models (and pre-stored features), the running time of feature detection and description for each query image cannot be eliminated. Reporting only the matching time does not cover this scenario.

2) As argued in MeshLoc, storing the sparse features of an entire database map is expensive, especially for large-scale scenes ("E.g., storing SuperPoint features for the Aachen Day-Night v1.1 dataset requires more than 25 GB…"). Moreover, loading pre-stored database features instead of re-extracting them at localization time also adds data-loading time. Therefore, MeshLoc proposes to store a mesh instead of SfM features. For matching methods used in this practical setting, we think it would be more appropriate to report the end-to-end matching time (SP + LG).

Therefore, we think it would be better to report both the (SP+LG) and (LG only) times to cover both the SfM and visual localization settings for a more comprehensive understanding.

2) The experimental setting of SP+LG:

> As for the speed, running LightGlue on 2k keypoints takes less than 20 ms on an RTX 3090 (see our benchmark), while you report >30 ms. Your paper does not mention whether this time includes the feature extraction by SuperPoint (~10 ms).

We first clarify that we calculate the average running time of SuperPoint+LightGlue across the 1500 pairs from ScanNet. This differs from the LightGlue benchmark in two main respects. First, since the results in the LightGlue paper did not use flash attention when comparing with other methods (as stated in the issue), we opted for a fair comparison by not using flash attention for any method. Second, the LightGlue benchmark repeats a single pair, while ours averages over 1500 pairs. We use a forked gluefactory and the LightGlue benchmark to illustrate the differences in timing measurement, as in the following table. All times are measured in gluefactory's environment (torch 2.2.1) on an RTX 3090:

| Timing protocol ((SP time) + (LG time) = total, ms) | LG no_prune [-1/-1] | LG prune [0.95/0.95] | LG prune [0.95/0.99] |
| --- | --- | --- | --- |
| **Without flash attention** | | | |
| repeat 1 (easy) pair [2048 kpts] [noflash] | SP + 36.9 | SP + 15.5 | SP + 16.9 |
| repeat 1 (hard) pair [2048 kpts] [noflash] | SP + 36.9 | SP + 20.2 | SP + 23.4 |
| average 1500 pairs [1296/2048/0/3] [noflash] | 32.8 + 36.6 = 69.4 | 32.2 + 26.6 = 58.8 (corresponds to the figure below) | 32.2 + 28.8 = 61.0 |
| average 1500 pairs [640/2048/0/3] [noflash] | 10.8 + 37.0 = 47.8 | 10.8 + 30.8 = 41.6 | 10.9 + 33.4 = 44.3 |
| average 1500 pairs [640/2048/0.0005/4] [noflash] | 11.1 + 20.2 = 31.3 | 11.2 + 20.4 = 31.6 (corresponds to our paper) | 11.1 + 20.7 = 31.8 |
| **With flash attention** | | | |
| repeat 1 (easy) pair [2048 kpts] [flash] | SP + 18.5 | SP + 10.8 | SP + 10.9 |
| repeat 1 (hard) pair [2048 kpts] [flash] | SP + 18.5 | SP + 16.6 | SP + 16.8 |
| average 1500 pairs [1296/2048/0/3] [flash] | 32.2 + 19.4 = 51.6 | 32.2 + 18.5 = 50.7 | 32.2 + 19.2 = 51.4 |
| average 1500 pairs [640/2048/0/3] [flash] | 10.9 + 20.8 = 31.7 | 11.0 + 20.7 = 31.7 | 10.9 + 21.4 = 32.3 |
| average 1500 pairs [640/2048/0.0005/4] [flash] | 10.9 + 19.9 = 30.8 | 10.9 + 19.2 = 30.1 | 11.0 + 19.6 = 30.6 |
| **With fp16 + flash attention** | | | |
| average 1500 pairs [1296/2048/0/3] [fp16+flash] | 32.6 + 17.7 = 50.3 | 32.6 + 18.0 = 50.6 (corresponds to the figure below) | 32.7 + 18.2 = 50.9 |
| average 1500 pairs [640/2048/0/3] [fp16+flash] | 11.0 + 18.0 = 29.0 | 11.2 + 20.9 = 32.1 | 11.1 + 21.1 = 32.2 |
| average 1500 pairs [640/2048/0.0005/4] [fp16+flash] | 11.0 + 19.0 = 30.0 | 11.0 + 18.7 = 29.7 | 11.1 + 19.0 = 30.1 |
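
For completeness, the "average 1500 pairs" protocol roughly follows the sketch below. This is a sketch only: `model` and `pairs` stand in for the actual matcher and preprocessed ScanNet batches, and the `sdp_kernel` context manager is the torch 2.2-era API for toggling flash attention:

```python
import torch

@torch.no_grad()
def average_latency_ms(model, pairs, use_flash=False, warmup=5):
    """Average per-pair forward latency in milliseconds over many pairs."""
    # restrict scaled_dot_product_attention backends (torch 2.2-style API)
    sdp = torch.backends.cuda.sdp_kernel(
        enable_flash=use_flash, enable_math=True, enable_mem_efficient=False)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with sdp:
        for batch in pairs[:warmup]:          # warm-up, not timed
            model(batch)
        torch.cuda.synchronize()
        total = 0.0
        for batch in pairs:                   # average over all pairs,
            start.record()                    # not a single repeated one
            model(batch)
            end.record()
            torch.cuda.synchronize()
            total += start.elapsed_time(end)  # milliseconds
    return total / len(pairs)
```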

> After some investigation, it seems that there might be a problem with your evaluation of LightGlue on ScanNet. We actually have a fully reproducible evaluation available at https://github.com/cvg/glue-factory/pull/25. Running it with 2k keypoints on full-resolution images (1296x968) with the basic RANSAC, we obtain an AUC of 17.7 / 34.6 / 51.2, while you report 14.8 / 30.8 / 47.5. Running this evaluation with only 1k keypoints or at 640×480 does not make much sense given that dense approaches can leverage a lot more correspondences.

Regarding the accuracy we reported, we first clarify that the AUC in our paper corresponds to SP[640/2048/0.0005/4] + LG prune[0.95/0.95], where 640 follows the ScanNet setting of SuperGlue (and 9/10 following papers) and (2048/0.0005/4) uses the default parameters of LightGlue. Using a 1296×968 image as input to SuperPoint not only greatly increases the SuperPoint time (even exceeding the total time of our full model), but the extraction of higher-quality features may also make LightGlue's pruning-based matching easier (running faster) and better (+2~3 AUC@5). In this context, especially considering the real-time feature-extraction needs of downstream localization tasks, comparing only the time taken by LightGlue against other end-to-end methods may not be entirely fair.

What’s more, the AUC of 17.7/34.6/51.2 at cvg/glue-factory#25 corresponds to LG no_prune, which requires flash attention to decrease the time from 36.6 ms to 19.4 ms, as shown in the table above. The accuracy of all settings is reproduced with a forked gluefactory, using gluefactory's environment (torch 2.2.1) on an RTX 3090:

| SP[img_size/max_kpts/thr/nms] vs. LG[depth/width_confidence] | LG no_prune [-1/-1] | LG prune [0.95/0.95] | LG prune [0.95/0.99] |
| --- | --- | --- | --- |
| **Without flash attention** | | | |
| SP[1296/2048/0/3] | 18.13 / 34.9 / 51.55 | 17.84 / 34.45 / 50.3 (corresponds to the figure below) | 17.53 / 34.51 / 50.35 |
| SP[640/2048/0/3] | 15.77 / 33.09 / 50.85 | 14.94 / 32.03 / 49.57 | 16.03 / 33.34 / 50.49 |
| SP[640/2048/0.0005/4] | 15.6 / 31.77 / 48.01 | 14.65 / 30.36 / 47.42 (corresponds to our paper) | 14.9 / 30.42 / 47.16 |
| **With flash attention** | | | |
| SP[1296/2048/0/3] | 17.89 / 34.72 / 51.3 | 17.47 / 33.99 / 49.72 | 17.99 / 34.84 / 50.84 |
| SP[640/2048/0/3] | 16.33 / 33.57 / 50.92 | 15.2 / 32.54 / 50.21 | 16.47 / 33.57 / 50.4 |
| SP[640/2048/0.0005/4] | 15.33 / 31.31 / 47.57 | 14.2 / 29.79 / 46.84 | 14.21 / 29.8 / 46.82 |
| **With fp16 + flash attention** | | | |
| SP[1296/2048/0/3] | 18 / 34.89 / 51.01 | 17.26 / 34.2 / 49.92 (corresponds to the figure below) | 18.02 / 34.8 / 50.42 |
| SP[640/2048/0/3] | 15.42 / 32.43 / 50.21 | 14.87 / 32.13 / 49.98 | 15.69 / 32.86 / 50.56 |
| SP[640/2048/0.0005/4] | 15.74 / 31.78 / 47.91 | 14.58 / 30.4 / 47.15 | 14.61 / 30.37 / 47.11 |
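
For clarity, the bracketed settings above roughly correspond to the following configuration of the standalone lightglue package (a sketch based on its README; the image paths are hypothetical, the resizing to 640 is assumed to happen beforehand, and gluefactory uses equivalent config keys):

```python
import torch
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

# SP[640/2048/0.0005/4]: 640-px images, 2048 keypoints, thr 0.0005, nms 4
extractor = SuperPoint(max_num_keypoints=2048,
                       detection_threshold=0.0005,
                       nms_radius=4).eval().cuda()
# LG prune[0.95/0.95]: depth/width confidence 0.95, no flash attention
matcher = LightGlue(features="superpoint",
                    depth_confidence=0.95,
                    width_confidence=0.95,
                    flash=False).eval().cuda()

image0 = load_image("scene0_pair0.jpg").cuda()  # hypothetical paths; images
image1 = load_image("scene0_pair1.jpg").cuda()  # assumed pre-resized to 640
feats0 = extractor.extract(image0)
feats1 = extractor.extract(image1)
matches01 = rbd(matcher({"image0": feats0, "image1": feats1}))
matches = matches01["matches"]  # indices of matched keypoints
```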

If we don't follow SuperGlue's ScanNet setting, as shown in the table below, we can achieve similar accuracy at a lower resolution with faster speeds. We also provide reproduction scripts.

| AUC@[5/10/20] (time) | 640×480 | 512×384 | 384×288 |
| --- | --- | --- | --- |
| Ours (Full, fp32, noflash) | 19.8 / 37.8 / 54.3 (40.1 ms) | 19.5 / 37.3 / 53.9 (30.1 ms) | 17.7 / 34.9 / 51.5 (24.2 ms) |
| Ours (Opt., mp, noflash) | 18.5 / 35.8 / 52.2 (27.0 ms) | 18.0 / 35.3 / 51.9 (25.5 ms) | 16.7 / 33.1 / 49.1 (23.5 ms) |
| Ours (Opt., fp16, flash) | 18.4 / 35.6 / 52.0 (24.2 ms) | 18.1 / 35.6 / 52.3 (20.8 ms) | 16.4 / 32.6 / 48.6 (19.4 ms) |

In summary, if we don't follow SuperGlue's 640×480 ScanNet setting, perhaps a more appropriate Figure 1 would look like this:

[Figure: revised Figure 1]

Note that these hardware-specific accelerations are not available on all hardware: flash attention requires Turing (sm_75) or newer architectures, while FP16 requires Volta (sm_70) or newer. That is, V100 GPUs can't use flash attention, and GPUs older than the V100 can use neither flash attention nor FP16.
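
A quick way to check what a given GPU supports (the sm thresholds are the ones from the note above):

```python
import torch

major, minor = torch.cuda.get_device_capability()  # e.g. (8, 6) on an RTX 3090
sm = major * 10 + minor
can_use_fp16 = sm >= 70   # Volta (sm_70) or newer
can_use_flash = sm >= 75  # Turing (sm_75) or newer
print(f"sm_{sm}: fp16={can_use_fp16}, flash attention={can_use_flash}")
```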

3) RANSAC threshold and variant:

> I think that using LO-RANSAC should become the standard in future evaluations because it is much less sensitive to the choice of inlier threshold, making the evaluation a lot more reliable and closer to downstream applications like SfM.

We agree that LO-RANSAC is a better method than OpenCV RANSAC for practical use, as it is more robust and appears to be less sensitive to the choice of inlier threshold. We also conducted more experiments with LO-RANSAC on ScanNet and found that the accuracy gap still exists.

| Method | ScanNet AUC@5/10/20, LO-RANSAC (2.0 px) |
| --- | --- |
| SP[1296/2048/0/3]+SG | 21.4 / 38.1 / 53.1 |
| SP[1296/2048/0/3]+LG (adaptive) | 21.2 / 38.4 / 54.1 |
| SP[1296/2048/0/3]+LG | 21.3 / 39.1 / 55.2 |
| LoFTR [640] | 23.1 / 40.8 / 56.6 |
| Ours (Optimized) [640] | 24.8 / 43.7 / 59.9 |
| Ours [640] | 25.5 / 44.3 / 60.1 |
| DKM [640] | 28.4 / 49.0 / 65.9 |
| RoMa [560] | 31.4 / 53.0 / 70.4 |
hit2sjtu commented 3 months ago

I am afraid that "image pairs per second" cannot simply be obtained as the inverse of the "processing time per pair", unless one can make sure the GPU is 100% utilized (i.e., with batched inference). I don't think this is guaranteed by the discussion above or in the LightGlue repo. Correct me if I am wrong @sarlinpe @Phil26AT.

I guess the community should report both latency and throughput, as argued in the "THE EFFICIENCY MISNOMER" paper. One should measure them carefully and rigorously.
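
To make the distinction concrete, a rough sketch of how the two would be measured separately (`matcher`, `pair`, and `batched_pairs` are placeholders for the actual model and data):

```python
import time
import torch

@torch.no_grad()
def latency_ms(matcher, pair):
    """Single-pair latency: one forward pass, GPU fully synchronized."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    matcher(pair)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1e3

@torch.no_grad()
def throughput_pairs_per_s(matcher, batched_pairs, n_pairs):
    """Throughput: pairs/second when the GPU is kept busy with batched input."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for batch in batched_pairs:
        matcher(batch)
    torch.cuda.synchronize()
    return n_pairs / (time.perf_counter() - t0)
```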

wyf2020 commented 3 months ago

We agree that "fps cannot simply be obtained as the inverse of latency" and thank you for the comment. Perhaps our Figure 1 would have been more accurately described as "forward passes per second (1/latency)" to avoid confusion with throughput.

As one of LightGlue's core novelties, adaptive pruning significantly reduces LightGlue's latency with a negligible effect on accuracy. Therefore, we follow the LightGlue paper's setting of using 1/latency for the running-time comparison in the teaser figure.

Furthermore, as emphasized in the "THE EFFICIENCY MISNOMER" paper you mentioned ("we argue that cost indicators that one cares about strongly depend on the applications and setup in which the models are supposed to be used"), we believe that for the image matching task, latency is more crucial for online applications like SLAM and visual localization, which are more latency-sensitive than offline SfM.