zju3dv / EfficientLoFTR


Issues with the evaluation #1

Open sarlinpe opened 3 months ago

sarlinpe commented 3 months ago

Hi,

Before everyone gets too excited, I need to point out some obvious issues in the evaluation described in the paper.

In Figure 1, the inference time of the semi-dense approaches is largely underestimated because it is computed at a much lower resolution than the pose accuracy (on MegaDepth). This is evidenced by Table 8: the 56.4 AUC@5° and 40.1 ms reported in Table 1 actually correspond to resolutions of 1184×1184 and 640×640, respectively. In reality, the proposed approach is much slower: the inference time at this resolution is 139 ms (compare this to LightGlue's 30 ms). For the reported inference time, the proposed approach is actually not more accurate than LightGlue (and most likely less accurate).
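
For readers less familiar with the metric: the AUC@5°/10°/20° numbers in this thread are pose-error AUCs computed over the test pairs. A minimal numpy sketch in the spirit of the SuperGlue/LoFTR evaluation utilities (a paraphrase, not their exact code):

```python
import numpy as np

def pose_auc(errors, thresholds=(5, 10, 20)):
    """AUC of the recall curve of per-pair pose errors (degrees).

    `errors` is usually max(rotation error, translation angular error);
    a failed pose estimate is set to infinity.
    """
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.r_[0.0, errors]
    recall = np.r_[0.0, recall]
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        # close the curve at the threshold with the last reached recall
        e = np.r_[errors[:last], t]
        r = np.r_[recall[:last], recall[last - 1]]
        aucs.append(np.trapz(r, x=e) / t)
    return aucs  # e.g. AUC@5/10/20, each in [0, 1]
```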

The same story goes for the other semi-dense matchers: for LoFTR, the time should be much higher than 66 ms, closer to 180 ms (LightGlue paper, Table 2). Even at this resolution, the accuracy gap might completely vanish when using a modern implementation of RANSAC, as found in PoseLib; evidence of this can also be found in the LightGlue paper, Table 2 (LO-RANSAC). This can be easily evaluated in glue-factory, so this omission is surprising.
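
For reference, this is roughly what a PoseLib-based LO-RANSAC pose estimation looks like. A hedged sketch from memory of the poselib Python bindings; option names such as `max_epipolar_error` and the camera-dict format may differ slightly between versions:

```python
import numpy as np
import poselib  # https://github.com/PoseLib/PoseLib (pip install poselib)

def estimate_pose_loransac(kpts0, kpts1, K0, K1, size0, size1, th_px=2.0):
    """Relative pose with PoseLib's LO-RANSAC; the threshold is in pixels."""
    def to_camera_dict(K, size):
        w, h = size
        return {"model": "PINHOLE", "width": w, "height": h,
                "params": [K[0, 0], K[1, 1], K[0, 2], K[1, 2]]}

    pose, info = poselib.estimate_relative_pose(
        np.asarray(kpts0, dtype=np.float64),
        np.asarray(kpts1, dtype=np.float64),
        to_camera_dict(K0, size0), to_camera_dict(K1, size1),
        {"max_epipolar_error": th_px},  # RANSAC options
        {},                             # local-optimization / refinement options
    )
    return pose.R, pose.t, np.array(info["inliers"])
```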

We'd appreciate having the authors comment on this - @wyf2020 @hxy-123 @Cuistiano - thank you! cc @Phil26AT

wyf2020 commented 3 months ago

Hi Sarlin,

Thank you for your interest in our work. We address your questions as follows.

1) Firstly, we clarify that all statistics in Fig. 1 are from Tab. 1, where we evaluate the running time of all methods at a unified 640×480 image resolution on ScanNet, as pointed out in the Tab. 1 caption and Sec. 4.2 (Evaluation protocol). The reason for not using MegaDepth for the time evaluation is the significantly varied image resolutions used in the baselines' original papers, as summarized in the table below; the running times of our method at different MegaDepth resolutions are shown in Tab. 8.

Summary of 13 papers from top conferences; the values in parentheses are from the official code, the rest are from the paper and appendix.

| Method | MegaDepth | ScanNet | RANSAC / thr | Conference |
| --- | --- | --- | --- | --- |
| SP+SG | 1600, max_keypoints=2048, nms=3 (keypoint_threshold=0.005) | 640×480, max_keypoints=1024, nms=4 (keypoint_threshold=0.005) | OpenCV RANSAC / 0.5 px | CVPR 2020 |
| LoFTR | 840 | 640×480 | OpenCV RANSAC / 0.5 px | CVPR 2021 |
| QuadTree | (832) | 640×480 | OpenCV RANSAC / 0.5 px | ICLR 2022 |
| ASpanFormer | 1152 | 640×480 | OpenCV RANSAC / 0.5 px | ECCV 2022 |
| MatchFormer | 840 | 640×480 | OpenCV RANSAC / 0.5 px | ACCV 2022 |
| TopicFM | 1200 | 640×480 | OpenCV RANSAC / 0.5 px | AAAI 2023 |
| DKM | (880×660) | (640×480) | OpenCV RANSAC / 0.5 px | CVPR 2023 |
| ASTR | 1216 | 640×480 | OpenCV RANSAC / 0.5 px | CVPR 2023 |
| PATS | None | (640×480) | OpenCV RANSAC / 0.5 px | CVPR 2023 |
| CasMTR | 1152 | 640×480 | OpenCV RANSAC / 0.5 px | ICCV 2023 |
| SP+LG | 1600, max_keypoints=2048, (nms=4, keypoint_threshold=0.0005) [LG] & (nms=3, keypoint_threshold=0) [gluefactory] | Not evaluated | OpenCV RANSAC or PoseLib (LO-RANSAC) / self-tuned | ICCV 2023 |
| RoMa | 672 | (560×560) | OpenCV RANSAC / 0.5 px | CVPR 2024 |
| Ours | 1152 | 640×480 | OpenCV RANSAC / 0.5 px | CVPR 2024 |

Notably, we could also use the ScanNet AUC for Fig. 1's accuracy comparison, and the gap between ours and LG would still exist: compared to the SP+LG AUC (49.9, 67.0, 80.1) on MegaDepth, our AUC (56.4, 72.2, 83.5) is higher by (13%, 8%, 4%); on ScanNet's generalization results, compared to the SP+LG AUC (14.8, 30.8, 47.5), our AUC (19.2, 37.0, 53.6) is higher by (30%, 20%, 13%). We show both figures for a comprehensive understanding:

[Figures: Fig. 1 drawn with MegaDepth AUC (left) and with ScanNet AUC (right)]

However, we didn't use ScanNet's AUC in Fig. 1, mainly because our experiments found that the quality of MegaDepth is better than that of ScanNet. (Perhaps this is also the reason why LightGlue didn't perform experiments on ScanNet.)

2) As for the accuracy and efficiency comparison with LightGlue on MegaDepth: the strongest model from the LightGlue paper (SP+LG, 1600 input image size with 2048 keypoints, carefully tuned RANSAC threshold) reaches an AUC@5/10/20 of 49.9 / 67.0 / 80.1 with a total running time of SP (46.1 ms) + LG (26.8 ms) = 72.9 ms. Our method at a 640×640 image resolution, without changing the RANSAC method and threshold (kept the same as LoFTR and many other baselines), already achieves generally better accuracy (51.0, 67.4, 79.8) and a faster end-to-end inference speed (41.7 ms), as shown in Tab. 8. Our optimized model can even achieve an AUC of (50.5, 67.1, 79.6) with the fixed RANSAC threshold and (51.9, 68.0, 80.0) with a tuned RANSAC threshold in just 34.1 ms.

By the way, we kindly point out that the running time of feature extraction & keypoint detection (SuperPoint or DISK) is missing in LightGlue's Tab. 2, where only the matching time (from sparse-feature input to match output) is compared against dense methods that are end-to-end (from image input to match output).

3) As for the RANSAC setting, we follow the setting used by most (12/13) of the previous methods, which use the same fixed RANSAC method (OpenCV) and threshold (0.5 px), as summarized in the table above. We also evaluate under LG's setting, which changes the RANSAC method and tunes the inlier thresholds. As shown below, the accuracy gap between ours and LG still exists with the more advanced LO-RANSAC.

| Method (AUC@5/10/20) | RANSAC (0.5 px, following 12/13 papers) | RANSAC (tuned by LG, unknown px) | RANSAC (tuned by us, 0.3 px) | LO-RANSAC (following LG in gluefactory, 2.0 px) | LO-RANSAC (tuned by us, 1.5 px) |
| --- | --- | --- | --- | --- | --- |
| Ours | 56.4 / 72.2 / 83.5 | None | 58.4 / 73.4 / 84.2 | 69.3 / 80.7 / 88.5 | 69.5 / 80.9 / 88.8 |
| SP+LG | None | 49.9 / 67.0 / 80.1 | None | 66.8 / 79.3 / 87.9 | None |
| ASpanFormer | 55.3 / 71.5 / 83.1 | 55.3 / 71.5 / 83.1 | 58.3 / 73.3 / 84.2 | 69.4 / 81.1 / 88.9 (tuned by LG, unknown px) | None |

Moreover, we observe that the AUCs of ASpanFormer and the other dense methods in LightGlue's Tab. 2 (RANSAC column) are identical to those in their original papers (i.e., without changed RANSAC thresholds), whereas the LightGlue paper states that all methods were carefully tuned (for example, "tuning the RANSAC threshold yields +7% AUC@5 on SuperGlue" in its supplementary).

Therefore, we also tried to tune the RANSAC threshold for ASpanFormer (finally set to 0.3 px, the same as ours), where the AUC goes from 55.3 / 71.5 / 83.1 (reported in LightGlue) to 58.3 / 73.3 / 84.2. We think this reflects that finding the best parameters for each method may require a large tuning range and dense sampling steps, and may potentially overfit to a specific dataset.
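
For clarity, the threshold being tuned here enters the standard OpenCV-RANSAC evaluation as follows. This is a paraphrased sketch of the usual LoFTR/SuperGlue-style pose estimation, not the exact evaluation script; the pixel threshold is divided by the mean focal length because the keypoints are normalized first:

```python
import cv2
import numpy as np

def estimate_pose_opencv(kpts0, kpts1, K0, K1, thr_px=0.5, conf=0.99999):
    """Essential-matrix RANSAC with a pixel threshold (LoFTR/SuperGlue style)."""
    if len(kpts0) < 5:
        return None
    # move keypoints to normalized camera coordinates
    kpts0 = (kpts0 - K0[[0, 1], [2, 2]][None]) / K0[[0, 1], [0, 1]][None]
    kpts1 = (kpts1 - K1[[0, 1], [2, 2]][None]) / K1[[0, 1], [0, 1]][None]
    # express the pixel threshold (e.g. 0.5 px vs. 0.3 px) in normalized coords
    thr = thr_px / np.mean([K0[0, 0], K0[1, 1], K1[0, 0], K1[1, 1]])

    E, mask = cv2.findEssentialMat(
        kpts0, kpts1, np.eye(3), method=cv2.RANSAC, prob=conf, threshold=thr)
    if E is None:
        return None
    # findEssentialMat may return several stacked 3x3 candidates; keep the best
    best = None
    for E_i in np.split(E, len(E) // 3):
        n, R, t, _ = cv2.recoverPose(E_i, kpts0, kpts1, np.eye(3), mask=mask.copy())
        if best is None or n > best[0]:
            best = (n, R, t.ravel(), mask.ravel() > 0)
    return best[1:]  # (R, t, inlier mask)
```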

I hope these responses answer your questions, and any discussion is welcome :).

Master-cai commented 3 months ago

I think what you're trying to convey with Figure 1 is that you can achieve an AUC@5 of 56 at a speed of 40 ms, but that's not actually the case. This could lead to misunderstandings and confusion.

wyf2020 commented 3 months ago

> I think what you're trying to convey with Figure 1 is that you can achieve an AUC@5 of 56 at a speed of 40 ms, but that's not actually the case. This could lead to misunderstandings and confusion.

Thank you for the reminder and suggestions. We have changed Figure 1 in the camera-ready version to the right-hand image shown above to dispel the potential misunderstanding.

sarlinpe commented 3 months ago

Thank you very much for the extensive reply.

1) Figure 1: Thank you for updating the figure. As @Master-cai mentions, reporting accuracy and speed from two different datasets was confusing.

After some investigation, it seems that there might be a problem with your evaluation of LightGlue on ScanNet. We actually have a fully reproducible evaluation available at https://github.com/cvg/glue-factory/pull/25. Running it with 2k keypoints on full-resolution images (1296x968) with the basic RANSAC, we obtain an AUC of 17.7 / 34.6 / 51.2, while you report 14.8 / 30.8 / 47.5. Running this evaluation with only 1k keypoints or at 640×480 does not make much sense given that dense approaches can leverage a lot more correspondences.

As for the speed, running LightGlue on 2k keypoints takes less than 20 ms on an RTX 3090 (see our benchmark), while you report >30 ms. Your paper does not mention whether this time includes the feature extraction by SuperPoint (~10 ms).

> By the way, we kindly point out that the running time of feature extraction & keypoint detection (SuperPoint or DISK) is missing in LightGlue's Tab. 2, where only the matching time (from sparse-feature input to match output) is compared against dense methods that are end-to-end (from image input to match output).

Indeed, the speed reported in the LightGlue paper does not include the time of feature extraction. In most applications, such as SfM, features are extracted once per image and subsequently cached. Since each image is matched against multiple others, the matching time dominates by far. For sparse matching, I therefore think that including the extraction time is not fair and does not reflect realistic operating conditions. For dense matching, it should probably be included because caching the dense features is infeasible due to their high memory footprint.

To summarize, I suggest that a more appropriate Figure 1 would look like the following:

[Figure: suggested revised Figure 1]

Alternatively, you could use the MegaDepth evaluation and report the average matching time per image. The image resolution is a great way to control the speed vs accuracy trade-off for dense approaches, so you could even report multiple values per approach.

3) RANSAC threshold and variant: Thank you for providing these numbers, this is very insightful. It is nice to see that better tuning of the inlier threshold and switching to LO-RANSAC can both increase the accuracy of your approach. I think that using LO-RANSAC should become the standard in future evaluations because it is much less sensitive to the choice of inlier threshold, making the evaluation a lot more reliable and closer to downstream applications like SfM.

wyf2020 commented 3 months ago

Thank you very much for the detailed and insightful reply.

1) Whether to Include SP Time:

> Indeed, the speed reported in the LightGlue paper does not include the time of feature extraction. In most applications, such as SfM, features are extracted once per image and subsequently cached. Since each image is matched against multiple others, the matching time dominates by far. For sparse matching, I therefore think that including the extraction time is not fair and does not reflect realistic operating conditions. For dense matching, it should probably be included because caching the dense features is infeasible due to their high memory footprint.

Indeed, in the SfM application the sparse features can be extracted once and stored. However, for applications such as visual localization, which is more latency-sensitive than offline SfM, we believe that reporting only the matching time without the sparse detection and description time (i.e., the SP time) does not reflect practical operating conditions:

1) For visual localization based on SfM models (and pre-stored features), the running time of feature detection and description for each query image cannot be eliminated. Reporting only the matching time does not cover this scenario.

2) As argued in MeshLoc, storing the sparse features of an entire database map is expensive, especially for large-scale scenes ("E.g., storing SuperPoint features for the Aachen Day-Night v1.1 dataset requires more than 25 GB…"). Moreover, loading pre-stored database features instead of re-extracting them at localization time also adds data-loading time. Therefore, MeshLoc proposes to store a mesh instead of SfM features. For matching methods used in this practical setting, we think it would be more appropriate to report the end-to-end matching time (SP + LG).

Therefore, we think it would be better to report both the (SP+LG) and (LG only) times to cover both the SfM and visual localization settings for a more comprehensive understanding.

2) The experimental setting of SP+LG:

> As for the speed, running LightGlue on 2k keypoints takes less than 20 ms on an RTX 3090 (see our benchmark), while you report >30 ms. Your paper does not mention whether this time includes the feature extraction by SuperPoint (~10 ms).

We first clarify that we calculate the average running time of SuperPoint+LightGlue across the 1500 pairs from ScanNet. This differs from the LightGlue benchmark in two main respects. First, since the results in the LightGlue paper did not use flash attention when comparing with other methods (as stated in the issue), we opted for a fair comparison by not using flash attention for any method. Second, the LightGlue benchmark repeats a single pair, while ours averages over 1500 pairs. We use a forked gluefactory and the LightGlue benchmark to illustrate the differences in timing measurement, as in the following table. All times are measured in gluefactory's environment (torch 2.2.1) on an RTX 3090:

| Timing protocol ((SP time) + (LG time) = total, ms) | LG no_prune [-1/-1] | LG prune [0.95/0.95] | LG prune [0.95/0.99] |
| --- | --- | --- | --- |
| **Without flash attention** | | | |
| repeat 1 (easy) pair [2048 kpts] [noflash] | SP + 36.9 | SP + 15.5 | SP + 16.9 |
| repeat 1 (hard) pair [2048 kpts] [noflash] | SP + 36.9 | SP + 20.2 | SP + 23.4 |
| average 1500 pairs [1296/2048/0/3] [noflash] | 32.8 + 36.6 = 69.4 | 32.2 + 26.6 = 58.8 (corresponds to the figure below) | 32.2 + 28.8 = 61.0 |
| average 1500 pairs [640/2048/0/3] [noflash] | 10.8 + 37.0 = 47.8 | 10.8 + 30.8 = 41.6 | 10.9 + 33.4 = 44.3 |
| average 1500 pairs [640/2048/0.0005/4] [noflash] | 11.1 + 20.2 = 31.3 | 11.2 + 20.4 = 31.6 (corresponds to our paper) | 11.1 + 20.7 = 31.8 |
| **With flash attention** | | | |
| repeat 1 (easy) pair [2048 kpts] [flash] | SP + 18.5 | SP + 10.8 | SP + 10.9 |
| repeat 1 (hard) pair [2048 kpts] [flash] | SP + 18.5 | SP + 16.6 | SP + 16.8 |
| average 1500 pairs [1296/2048/0/3] [flash] | 32.2 + 19.4 = 51.6 | 32.2 + 18.5 = 50.7 | 32.2 + 19.2 = 51.4 |
| average 1500 pairs [640/2048/0/3] [flash] | 10.9 + 20.8 = 31.7 | 11.0 + 20.7 = 31.7 | 10.9 + 21.4 = 32.3 |
| average 1500 pairs [640/2048/0.0005/4] [flash] | 10.9 + 19.9 = 30.8 | 10.9 + 19.2 = 30.1 | 11.0 + 19.6 = 30.6 |
| **With fp16 + flash attention** | | | |
| average 1500 pairs [1296/2048/0/3] [fp16+flash] | 32.6 + 17.7 = 50.3 | 32.6 + 18.0 = 50.6 (corresponds to the figure below) | 32.7 + 18.2 = 50.9 |
| average 1500 pairs [640/2048/0/3] [fp16+flash] | 11.0 + 18.0 = 29.0 | 11.2 + 20.9 = 32.1 | 11.1 + 21.1 = 32.2 |
| average 1500 pairs [640/2048/0.0005/4] [fp16+flash] | 11.0 + 19.0 = 30.0 | 11.0 + 18.7 = 29.7 | 11.1 + 19.0 = 30.1 |
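
For completeness, the "average 1500 pairs" protocol roughly follows the sketch below. This is a sketch only: `model` and `pairs` stand in for the actual matcher and preprocessed ScanNet batches, and the `sdp_kernel` context manager is the torch 2.2-era API for toggling flash attention:

```python
import torch

@torch.no_grad()
def average_latency_ms(model, pairs, use_flash=False, warmup=5):
    """Average per-pair forward latency in milliseconds over many pairs."""
    # restrict scaled_dot_product_attention backends (torch 2.2-style API)
    sdp = torch.backends.cuda.sdp_kernel(
        enable_flash=use_flash, enable_math=True, enable_mem_efficient=False)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with sdp:
        for batch in pairs[:warmup]:          # warm-up, not timed
            model(batch)
        torch.cuda.synchronize()
        total = 0.0
        for batch in pairs:                   # average over all pairs,
            start.record()                    # not a single repeated one
            model(batch)
            end.record()
            torch.cuda.synchronize()
            total += start.elapsed_time(end)  # milliseconds
    return total / len(pairs)
```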

> After some investigation, it seems that there might be a problem with your evaluation of LightGlue on ScanNet. We actually have a fully reproducible evaluation available at https://github.com/cvg/glue-factory/pull/25. Running it with 2k keypoints on full-resolution images (1296x968) with the basic RANSAC, we obtain an AUC of 17.7 / 34.6 / 51.2, while you report 14.8 / 30.8 / 47.5. Running this evaluation with only 1k keypoints or at 640×480 does not make much sense given that dense approaches can leverage a lot more correspondences.

Regarding the accuracy we reported, we first clarify that the AUC in our paper corresponds to SP[640/2048/0.0005/4] + LG prune[0.95/0.95], where 640 follows the ScanNet setting of SuperGlue (and 9/10 following papers) and (2048/0.0005/4) uses the default parameters of LightGlue. Using a 1296×968 image as input to SuperPoint not only greatly increases the SuperPoint time (even exceeding the total time of our full model), but the extraction of higher-quality features may also make LightGlue's pruning-based matching easier (running faster) and better (+2~3 AUC@5). In this context, especially considering the real-time feature-extraction needs of downstream localization tasks, comparing only the time taken by LightGlue against other end-to-end methods may not be entirely fair.

What’s more, the AUC of 17.7/34.6/51.2 at cvg/glue-factory#25 corresponds to LG no_prune, which requires flash attention to decrease the time from 36.6 ms to 19.4 ms, as shown in the table above. The accuracy of all settings is reproduced with a forked gluefactory, using gluefactory's environment (torch 2.2.1) on an RTX 3090:

| SP[img_size/max_kpts/thr/nms] vs. LG[depth/width_confidence] | LG no_prune [-1/-1] | LG prune [0.95/0.95] | LG prune [0.95/0.99] |
| --- | --- | --- | --- |
| **Without flash attention** | | | |
| SP[1296/2048/0/3] | 18.13 / 34.9 / 51.55 | 17.84 / 34.45 / 50.3 (corresponds to the figure below) | 17.53 / 34.51 / 50.35 |
| SP[640/2048/0/3] | 15.77 / 33.09 / 50.85 | 14.94 / 32.03 / 49.57 | 16.03 / 33.34 / 50.49 |
| SP[640/2048/0.0005/4] | 15.6 / 31.77 / 48.01 | 14.65 / 30.36 / 47.42 (corresponds to our paper) | 14.9 / 30.42 / 47.16 |
| **With flash attention** | | | |
| SP[1296/2048/0/3] | 17.89 / 34.72 / 51.3 | 17.47 / 33.99 / 49.72 | 17.99 / 34.84 / 50.84 |
| SP[640/2048/0/3] | 16.33 / 33.57 / 50.92 | 15.2 / 32.54 / 50.21 | 16.47 / 33.57 / 50.4 |
| SP[640/2048/0.0005/4] | 15.33 / 31.31 / 47.57 | 14.2 / 29.79 / 46.84 | 14.21 / 29.8 / 46.82 |
| **With fp16 + flash attention** | | | |
| SP[1296/2048/0/3] | 18 / 34.89 / 51.01 | 17.26 / 34.2 / 49.92 (corresponds to the figure below) | 18.02 / 34.8 / 50.42 |
| SP[640/2048/0/3] | 15.42 / 32.43 / 50.21 | 14.87 / 32.13 / 49.98 | 15.69 / 32.86 / 50.56 |
| SP[640/2048/0.0005/4] | 15.74 / 31.78 / 47.91 | 14.58 / 30.4 / 47.15 | 14.61 / 30.37 / 47.11 |
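
For clarity, the bracketed settings above roughly correspond to the following configuration of the standalone lightglue package (a sketch based on its README; the image paths are hypothetical, the resizing to 640 is assumed to happen beforehand, and gluefactory uses equivalent config keys):

```python
import torch
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

# SP[640/2048/0.0005/4]: 640-px images, 2048 keypoints, thr 0.0005, nms 4
extractor = SuperPoint(max_num_keypoints=2048,
                       detection_threshold=0.0005,
                       nms_radius=4).eval().cuda()
# LG prune[0.95/0.95]: depth/width confidence 0.95, no flash attention
matcher = LightGlue(features="superpoint",
                    depth_confidence=0.95,
                    width_confidence=0.95,
                    flash=False).eval().cuda()

image0 = load_image("scene0_pair0.jpg").cuda()  # hypothetical paths; images
image1 = load_image("scene0_pair1.jpg").cuda()  # assumed pre-resized to 640
feats0 = extractor.extract(image0)
feats1 = extractor.extract(image1)
matches01 = rbd(matcher({"image0": feats0, "image1": feats1}))
matches = matches01["matches"]  # indices of matched keypoints
```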

If we don't follow SuperGlue's ScanNet setting, as shown in the table below, we can achieve similar accuracy at a lower resolution with faster speeds. We also provide reproduction scripts.

| AUC@[5/10/20] (time) | 640×480 | 512×384 | 384×288 |
| --- | --- | --- | --- |
| Ours (Full, fp32, noflash) | 19.8 / 37.8 / 54.3 (40.1 ms) | 19.5 / 37.3 / 53.9 (30.1 ms) | 17.7 / 34.9 / 51.5 (24.2 ms) |
| Ours (Opt., mp, noflash) | 18.5 / 35.8 / 52.2 (27.0 ms) | 18.0 / 35.3 / 51.9 (25.5 ms) | 16.7 / 33.1 / 49.1 (23.5 ms) |
| Ours (Opt., fp16, flash) | 18.4 / 35.6 / 52.0 (24.2 ms) | 18.1 / 35.6 / 52.3 (20.8 ms) | 16.4 / 32.6 / 48.6 (19.4 ms) |

In summary, if we don't follow SuperGlue's 640×480 ScanNet setting, perhaps a more appropriate Figure 1 would look like this:

[Figure: revised Figure 1]

Note that these hardware-specific accelerations are not available on all hardware: flash attention requires Turing (sm_75) or newer architectures, while FP16 requires Volta (sm_70) or newer. That is, V100 GPUs can't use flash attention, and GPUs older than the V100 can use neither flash attention nor FP16.
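
A quick way to check what a given GPU supports (the sm thresholds are the ones from the note above):

```python
import torch

major, minor = torch.cuda.get_device_capability()  # e.g. (8, 6) on an RTX 3090
sm = major * 10 + minor
can_use_fp16 = sm >= 70   # Volta (sm_70) or newer
can_use_flash = sm >= 75  # Turing (sm_75) or newer
print(f"sm_{sm}: fp16={can_use_fp16}, flash attention={can_use_flash}")
```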

3) RANSAC threshold and variant:

> I think that using LO-RANSAC should become the standard in future evaluations because it is much less sensitive to the choice of inlier threshold, making the evaluation a lot more reliable and closer to downstream applications like SfM.

We agree that LO-RANSAC is a better method than OpenCV RANSAC for practical use, as it is more robust and appears to be less sensitive to the choice of inlier threshold. We also conducted more experiments with LO-RANSAC on ScanNet and found that the accuracy gap still exists.

| Method | ScanNet AUC@5/10/20, LO-RANSAC (2.0 px) |
| --- | --- |
| SP[1296/2048/0/3]+SG | 21.4 / 38.1 / 53.1 |
| SP[1296/2048/0/3]+LG (adaptive) | 21.2 / 38.4 / 54.1 |
| SP[1296/2048/0/3]+LG | 21.3 / 39.1 / 55.2 |
| LoFTR [640] | 23.1 / 40.8 / 56.6 |
| Ours (Optimized) [640] | 24.8 / 43.7 / 59.9 |
| Ours [640] | 25.5 / 44.3 / 60.1 |
| DKM [640] | 28.4 / 49.0 / 65.9 |
| RoMa [560] | 31.4 / 53.0 / 70.4 |
hit2sjtu commented 3 months ago

I am afraid that "image pairs per second" cannot simply be obtained as the inverse of the "processing time per pair", unless one can make sure the GPU is 100% utilized (i.e., with batched inference). I don't think this is guaranteed by the discussion above or in the LightGlue repo. Correct me if I am wrong @sarlinpe @Phil26AT.

I guess the community should report both latency and throughput, as argued in the "THE EFFICIENCY MISNOMER" paper. One should measure them carefully and rigorously.
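
To make the distinction concrete, a rough sketch of how the two would be measured separately (`matcher`, `pair`, and `batched_pairs` are placeholders for the actual model and data):

```python
import time
import torch

@torch.no_grad()
def latency_ms(matcher, pair):
    """Single-pair latency: one forward pass, GPU fully synchronized."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    matcher(pair)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1e3

@torch.no_grad()
def throughput_pairs_per_s(matcher, batched_pairs, n_pairs):
    """Throughput: pairs/second when the GPU is kept busy with batched input."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for batch in batched_pairs:
        matcher(batch)
    torch.cuda.synchronize()
    return n_pairs / (time.perf_counter() - t0)
```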

wyf2020 commented 3 months ago

We agree that "fps cannot simply be obtained as the inverse of latency" and thank you for the comment. Perhaps our Figure 1 would have been more accurately described as "forward passes per second (1/latency)" to avoid confusion with throughput.

As one of LightGlue's core novelties, adaptive pruning significantly reduces LightGlue's latency with a negligible effect on accuracy. Therefore, we follow the LightGlue paper's setting of using 1/latency for the running-time comparison in the teaser figure.

Furthermore, as emphasized in the "THE EFFICIENCY MISNOMER" paper you mentioned ("we argue that cost indicators that one cares about strongly depend on the applications and setup in which the models are supposed to be used"), we believe that for the image matching task, latency is more crucial for online applications like SLAM and visual localization, which are more latency-sensitive than offline SfM.