zju3dv / LoFTR

Code for "LoFTR: Detector-Free Local Feature Matching with Transformers", CVPR 2021, T-PAMI 2022
https://zju3dv.github.io/loftr/
Apache License 2.0

Reproducing the training results on the MegaDepth dataset #253

Open FlyFish-space opened 1 year ago

FlyFish-space commented 1 year ago

Thank you very much for your excellent work. I recently reproduced the training on 4 RTX 3090 GPUs for 30 epochs following the README, with a batch size of 2 per GPU. I trained and tested on the D2-Net-undistorted MegaDepth dataset, and the results are as follows: auc@5: 44.1, auc@10: 60.28, auc@20: 72.93. I also saw that a previous issue recommended setting the image size of both val and test to 640, but the results did not improve. What is the reason for this drop in accuracy?
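
Concretely, the val/test resolution change I tried looks roughly like the sketch below. The key names (MGDPT_IMG_RESIZE, MGDPT_IMG_PAD) are from memory and may not match the current configs exactly, so please check configs/data/base.py and src/config/default.py in your checkout:

```python
# Sketch of a val/test data config (not the full file).
# Assumption: the MegaDepth resize is controlled by DATASET.MGDPT_IMG_RESIZE;
# verify the exact key names in src/config/default.py of your checkout.
from configs.data.base import cfg

cfg.DATASET.TEST_DATA_SOURCE = "MegaDepth"
cfg.DATASET.MGDPT_IMG_RESIZE = 640  # resize val/test images to 640 instead of 840
cfg.DATASET.MGDPT_IMG_PAD = True    # keep padding behaviour consistent with training
```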

benjaminkelenyi commented 1 year ago

Hello, I'm facing the same issue. The loss fluctuates a lot. [screenshot of the training loss curve, 2023-04-26]

chicleee commented 1 year ago

Hi, Have you made any progress on this issue?

benjaminkelenyi commented 1 year ago

Hello,

Thanks for your reply. Yes, I fixed the issue!

Thank you!

chen9run commented 1 year ago

@benjaminkelenyi Hi, have you found the reason?

Mysophobias commented 1 year ago

Hello, my results are similar to yours. Have you tried changing TRAIN_IMG_SIZE to 840?

Master-cai commented 1 year ago

I'm training outdoor_ds with the default settings (image size 640), also on 4 RTX 3090 GPUs. I use the original MegaDepth data for training, since the undistorted images are not accessible anymore.

After 11 epochs of training, I got the following validation results: auc@5: 45.6, auc@10: 62.4, auc@20: 75.1.

The metrics do not seem to improve any further. I will train for the full 30 epochs and test the model on the test set (which may take another two days).

Has anyone else already reproduced the results with a similar setting? Would setting TRAIN_IMG_SIZE to 840 help?

Master-cai commented 1 year ago

After 30 epochs of training, I ran the test on MegaDepth and got: auc@5: 0.4983, auc@10: 0.6677, auc@20: 0.7953, prec@5e-04: 0.9550.

That is about 3 points lower than the reported accuracy.

Mysophobias commented 1 year ago

@Master-cai Hello, these are the training results I obtained with TRAIN_IMG_SIZE set to 640 (see the attached screenshot). Your results are much better than mine. Have you tried setting TRAIN_IMG_SIZE to 840?

Master-cai commented 1 year ago

@Mysophobias No, I used the default settings. Your results are very similar to mine after 11 epochs of training. What hardware did you use, and how long did you train?

Mysophobias commented 1 year ago

@Master-cai I also used 4 NVIDIA RTX 3090 GPUs and trained for approximately 100 hours. I did try using D2-Net to process the dataset, and these are the validation results I saved during training (see the attached screenshot). I am really eager to know whether setting TRAIN_IMG_SIZE to 840 would improve the accuracy.

Master-cai commented 1 year ago

@Mysophobias I didn't process MegaDepth via D2-Net, and your checkpoints seem similar to mine, so I have no idea why your test results are worse. I just used the default reproduce_test/outdoor_ds.sh script for testing.

As for image size 840, I think it might help, since it is the officially recommended setting after all. A 3090 is enough to train at 840, so you can try it.

Mysophobias commented 1 year ago

@Master-cai The code comments in configs/data/megadepth_trainval_840.py indicate that 32 GB of GPU memory is required for training. I did attempt training on four 24 GB 3090 GPUs, but it was not successful. I will try again later. Anyway, thank you.

Master-cai commented 1 year ago

@Mysophobias A 3090 can train at 840 with a physical batch size of 1. I use gradient accumulation of 2 to get an effective batch size of 1 x 2 x 4 = 8, which is what the author suggests. I have trained it for one epoch, but I don't have access to the GPUs right now 💔. I hope my experience helps, and it would be nice if you could share your final results.
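
To be concrete, the accumulation I mean is just the standard PyTorch Lightning option (LoFTR's train.py builds the Trainer from argparse, so the same effect comes from passing --accumulate_grad_batches=2 on the command line). A minimal sketch; the argument names follow the older pytorch-lightning API that the repo pins, and everything besides accumulate_grad_batches is illustrative:

```python
# Sketch: effective batch size = per-GPU bs (1) x accumulate_grad_batches (2) x num GPUs (4) = 8.
# Argument names follow the older pytorch-lightning API (gpus=, accelerator="ddp");
# newer versions use devices= and strategy= instead.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=4,                     # 4x RTX 3090, 24 GB each
    accelerator="ddp",          # one process per GPU
    accumulate_grad_batches=2,  # accumulate gradients over 2 steps before each optimizer step
    max_epochs=30,
)
# trainer.fit(model, datamodule=data_module)  # model/datamodule come from LoFTR's lightning code
```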

xmlyqing00 commented 1 year ago

May I ask how you trained the model on MegaDepth? I got stuck on getting the training images from D2-Net. I noticed the LoFTR authors say the differences are subtle, but I don't know how to create the symbolic links. Do I need to download the MegaDepth SfM dataset?

Best, yq

Master-cai commented 1 year ago

@xmlyqing00 I think this issue can help.
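
Roughly, the workaround amounts to pointing the directories the loader expects at the original MegaDepth download instead of the D2-Net undistorted output. A sketch with os.symlink below; both the source and destination layouts are assumptions, so follow the linked issue and docs/TRAINING.md for the authoritative paths:

```python
# Sketch: link original MegaDepth scene folders into the layout the LoFTR loader expects.
# Both paths below are assumptions about the layout; adjust them to your download and
# to whatever the linked issue / docs/TRAINING.md actually specify.
import glob
import os

MEGADEPTH_V1 = "data/megadepth/train/phoenix/S6/zl548/MegaDepth_v1"  # original images/depths
UNDISTORTED = "data/megadepth/train/Undistorted_SfM"                 # where the loader looks

os.makedirs(UNDISTORTED, exist_ok=True)
for scene_dir in sorted(glob.glob(os.path.join(MEGADEPTH_V1, "*"))):
    scene_id = os.path.basename(scene_dir)
    link_path = os.path.join(UNDISTORTED, scene_id)
    if not os.path.exists(link_path):
        os.symlink(os.path.abspath(scene_dir), link_path)
```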

xmlyqing00 commented 1 year ago

Thanks, I just fixed the training on MegaDepth

RunyuZhu commented 8 months ago

@Master-cai Can I ask about your machine's memory capacity? I train LoFTR on a single 3090 Ti (24 GB) with a 13th-gen i7 CPU and 128 GB of RAM, with batch size 1, n_gpus_per_node=1, and num_workers=0, but the process gets killed while training in epoch 2. I found that LoFTR nearly runs out of host memory (swap is full and 125/126 GB of main memory is used). So may I ask about your hardware, and have you ever met this issue? It would be very nice if you could give me some tips. Thanks. zhu

Master-cai commented 8 months ago

@RunyuZhu That's weird. I used 4 3090 Ti GPUs and 128 GB of RAM (8 GB swap) to get those results, with num_workers=4. Memory consumption does indeed increase over time, but I never met this bug, so I'm sorry I can't help you directly. I suggest looking at the system log to make sure the process was killed due to OOM, and checking whether other processes are occupying a large amount of memory.
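
If it helps, one way to confirm it is the training process itself that keeps growing is to log its resident memory periodically. A minimal sketch using psutil (a third-party package, not part of LoFTR):

```python
# Sketch: periodically log the resident memory (RSS) of the current training process.
# psutil is a third-party dependency (pip install psutil); this is not part of the LoFTR code.
import os
import threading
import time

import psutil


def log_memory(interval_s: float = 60.0) -> None:
    """Print this process's RSS every interval_s seconds."""
    proc = psutil.Process(os.getpid())
    while True:
        rss_gib = proc.memory_info().rss / 1024 ** 3
        print(f"[mem-monitor] RSS = {rss_gib:.2f} GiB")
        time.sleep(interval_s)


# Start as a daemon thread before calling trainer.fit(...).
threading.Thread(target=log_memory, daemon=True).start()
```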

RunyuZhu commented 8 months ago

Thanks for your reply and the valuable suggestions! I will run it again with a larger num_workers or batch size, and log the info to locate the issue. Thanks again! zhu

WJJLBJ commented 6 months ago

Hello, how did you fix the problem at line 47 of LoFTR/src/datasets/megadepth.py? Line 47 in the official code is self.scene_info = np.load(npz_path, allow_pickle=True), which is different from what that issue shows.

WJJLBJ commented 6 months ago

May I ask how you solved the problem of not being able to download the D2-Net preprocessed data for LoFTR? Did the approach in that issue help? I see that line 47 of LoFTR/src/datasets/megadepth.py is not what it shows there, but rather self.scene_info = np.load(npz_path, allow_pickle=True). How did you modify this file? Thanks!

Master-cai commented 6 months ago

@WJJLBJ Just use the original images directly and process them following the approach given in that issue.
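
A rough sketch of that kind of change: convert the loaded NpzFile into a dict so its entries can be rewritten, then remap the image paths to the original MegaDepth images. The 'image_paths' key and the path substitution below are assumptions for illustration, not a verbatim copy of megadepth.py or of the linked issue:

```python
# Sketch of the change around line 47 of src/datasets/megadepth.py.
# dict(...) copies the NpzFile entries into a mutable dict so the paths can be rewritten.
# The 'image_paths' key and the substitution pattern are assumptions; adapt to your data layout.
import numpy as np


def load_scene_info(npz_path: str) -> dict:
    scene_info = dict(np.load(npz_path, allow_pickle=True))
    scene_info["image_paths"] = [
        None if p is None else p.replace("Undistorted_SfM", "phoenix/S6/zl548/MegaDepth_v1")
        for p in scene_info["image_paths"]
    ]
    return scene_info


# Inside MegaDepthDataset.__init__ it would then be:
# self.scene_info = load_scene_info(npz_path)
```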

WJJLBJ commented 6 months ago

Thanks a lot, the problem is solved.