woodfrog / maptracker

Code for paper "MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping", ECCV 2024 (Oral)
https://map-tracker.github.io/

av2 stage3 crash during infer #3

Closed billbliss3 closed 4 months ago

billbliss3 commented 4 months ago

This is awesome work.

But recently I hit a crash (av2 stage3 crashes during inference):

```
2024-04-20 07:13:02,507 - mmdet - INFO - Iter [3400/34040] lr: 4.878e-05, eta: 2 days, 21:50:59, time: 7.116, data_time: 0.236, memory: 18159, cls: 0.3527, reg: 0.7917, d0.cls: 0.4578, d0.reg: 1.4255, d1.cls: 0.3876, d1.reg: 1.0831, d2.cls: 0.3705, d2.reg: 0.9497, d3.cls: 0.3587, d3.reg: 0.8702, d4.cls: 0.3446, d4.reg: 0.8254, seg: 0.4308, seg_dice: 0.1130, cls_t0: 0.3730, reg_t0: 0.8893, d0.cls_t0: 0.8766, d0.reg_t0: 2.3198, d1.cls_t0: 0.6213, d1.reg_t0: 1.4020, d2.cls_t0: 0.4872, d2.reg_t0: 1.1538, d3.cls_t0: 0.4217, d3.reg_t0: 1.0486, d4.cls_t0: 0.3839, d4.reg_t0: 0.9481, seg_t0: 0.6679, seg_dice_t0: 0.1954, cls_t1: 0.3272, reg_t1: 0.7694, d0.cls_t1: 0.4739, d0.reg_t1: 1.4077, d1.cls_t1: 0.4143, d1.reg_t1: 1.1017, d2.cls_t1: 0.3704, d2.reg_t1: 0.9676, d3.cls_t1: 0.3430, d3.reg_t1: 0.8944, d4.cls_t1: 0.3325, d4.reg_t1: 0.8050, seg_t1: 0.4923, seg_dice_t1: 0.1315, cls_t2: 0.2955, reg_t2: 0.7101, d0.cls_t2: 0.4093, d0.reg_t2: 1.2690, d1.cls_t2: 0.3406, d1.reg_t2: 0.9811, d2.cls_t2: 0.3195, d2.reg_t2: 0.8782, d3.cls_t2: 0.3069, d3.reg_t2: 0.7965, d4.cls_t2: 0.2930, d4.reg_t2: 0.7422, seg_t2: 0.4511, seg_dice_t2: 0.1157, cls_t3: 0.3088, reg_t3: 0.8126, d0.cls_t3: 0.4122, d0.reg_t3: 1.3771, d1.cls_t3: 0.3490, d1.reg_t3: 1.0567, d2.cls_t3: 0.3231, d2.reg_t3: 0.9688, d3.cls_t3: 0.3187, d3.reg_t3: 0.8932, d4.cls_t3: 0.3129, d4.reg_t3: 0.8366, seg_t3: 0.4328, seg_dice_t3: 0.1138, total_t0: 11.7887, total_t1: 8.8308, total_t2: 7.9087, total_t3: 8.5163, total_t4: 8.7612, f_trans_t0: 0.1254, b_trans_t0: 0.0976, f_trans_t1: 0.1138, b_trans_t1: 0.0859, f_trans_t2: 0.1246, b_trans_t2: 0.0827, f_trans_t3: 0.1257, b_trans_t3: 0.0897, total: 46.6509, grad_norm: 145.7686
2024-04-20 07:13:23,480 - mmdet - INFO - Saving checkpoint at 3404 iterations
[                              ] 0/23519, elapsed: 0s, ETA:
Traceback (most recent call last):
  File "tools/train.py", line 280, in <module>
    main()
  File "tools/train.py", line 269, in main
    custom_train_model(
  File "/data/wk/Project/maptracker/plugin/core/apis/train.py", line 30, in custom_train_model
    custom_train_detector(
  File "/data/wk/Project/maptracker/plugin/core/apis/mmdet_train.py", line 228, in custom_train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 138, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/data/wk/Project/maptracker/plugin/core/apis/mmdet_train.py", line 49, in train
    self.call_hook('after_train_iter')
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 262, in after_train_iter
    self._do_evaluate(runner)
  File "/data/wk/Project/maptracker/plugin/core/evaluation/eval_hooks.py", line 78, in _do_evaluate
    results = custom_multi_gpu_test(
  File "/data/wk/Project/maptracker/plugin/core/apis/test.py", line 72, in custom_multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/wk/Project/maptracker/plugin/models/mapers/base_mapper.py", line 95, in forward
    return self.forward_test(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/data/wk/Project/maptracker/plugin/models/mapers/MapTracker.py", line 638, in forward_test
    self.temporal_propagate(bev_feats, img_metas, all_history_curr2prev,
  File "/data/wk/Project/maptracker/plugin/models/mapers/MapTracker.py", line 158, in temporal_propagate
    self.memory_bank.trans_memory_bank(self.query_propagate, b_i, img_metas[b_i])
  File "/data/wk/Project/maptracker/plugin/models/mapers/vector_memory.py", line 236, in trans_memory_bank
    relative_seq_pe = self.cached_pe[relative_seq_idx].to(mem_embeds.device)
IndexError: index 100 is out of bounds for dimension 0 with size 100
```

woodfrog commented 4 months ago

Hi, thanks for your interest. The relative_seq_idx is the relative frame interval between the current frame and a past frame from the memory. In normal cases, this value should never exceed 100. I can try to diagnose the problem if you provide more information:

(1) Are you running the old split or the new split?
(2) Are stage 2's training and inference all good, meaning the losses are all normal and the testing results are reasonable?

billbliss3 commented 4 months ago

@woodfrog I am using the official av2 stage3 old_split train config, and it seems stage2 works well.

woodfrog commented 4 months ago

> @woodfrog I am using the official av2 stage3 old_split train config, and it seems stage2 works well.

Thanks. The log shows `[ ] 0/23519, elapsed: 0s,`, suggesting that you changed some settings (maybe by accident), such as the data interval. With the default setting, the number of test frames is slightly less than 6000 -- AV2 frames are uniformly sub-sampled to keep the same frame rate as nuScenes.

The model should work well with a higher frame rate, but it might trigger some underlying tiny bugs. Can you confirm the test settings you are using? Then I will see if I can reproduce the error and fix it.
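The sub-sampling arithmetic can be sanity-checked against the numbers in this thread. The `interval` value below is a hypothetical factor chosen so the counts line up, not necessarily the config's actual setting:

```python
# Back-of-envelope check of the expected test-set size under uniform
# frame sub-sampling. `interval` is an assumed value for illustration.
total_av2_val_frames = 23519   # the count shown in the broken progress bar
interval = 4                   # hypothetical sub-sampling factor

subsampled = total_av2_val_frames // interval
print(subsampled)  # 5879 -- "slightly less than 6000", as described
```

If the progress bar instead shows the full frame count, the interval is effectively 1, i.e. no sub-sampling happened.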

billbliss3 commented 4 months ago

You are right, and I found the reason: the av2 old split loads maptr_info.pkl, and the samples built at line 65 of argo_dataset.py do not apply self.interval.

https://github.com/woodfrog/maptracker/blob/main/plugin/datasets/argo_dataset.py#L65C17-L65C81
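A minimal sketch of the two loading behaviors being discussed (the function and variable names are illustrative, not the exact code in argo_dataset.py):

```python
# Hypothetical sketch: with the old split, samples come from a hard-coded
# token list (maptr_unique_tokens) and self.interval is never applied,
# so every listed frame reaches inference.
def build_samples(samples, interval, maptr_unique_tokens=None):
    if maptr_unique_tokens is not None:
        # Old-split path: select exactly the MapTR test frames, no sub-sampling.
        token2sample = {s["token"]: s for s in samples}
        return [token2sample[t] for t in maptr_unique_tokens]
    # Default path: uniform sub-sampling by the configured interval.
    return samples[::interval]

all_frames = [{"token": f"t{i}"} for i in range(20)]
print(len(build_samples(all_frames, interval=4)))                  # 5
print(len(build_samples(all_frames, interval=4,
                        maptr_unique_tokens=["t0", "t3", "t7"])))  # 3
```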

woodfrog commented 4 months ago

Yes, for the old split, the test samples are "hard coded" to ensure they are the same as those used in the MapTR codebase (the most popular codebase for this task), so self.interval is not used there.

I'm sorry I forgot to commit the "maptr_info.pkl" file. It contains the metadata directly exported from the MapTR codebase. I just added it, can you try again?

woodfrog commented 4 months ago

For the "frame interval out of index" issue: relative_seq_idx is set to seq_id for invalid memory entries, and those values are never used in memory fusion (they are masked). But when seq_id reaches 100 in a very long sequence, that indexing error occurs.

My original assumption was that all sequences would be shorter than 100 frames, so I set 100 as the length of the pre-computed positional encodings. I will change it to 1000 so inference won't break on longer test sequences.
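The failure mode can be reproduced in miniature. The list below stands in for the cached positional-encoding tensor; this is an illustration, not the repo's actual code:

```python
# The cache covers relative indices 0..99; index 100 (a 101st frame, or a
# masked entry carrying seq_id == 100) falls off the end.
MAX_SEQ_LEN = 100
cached_pe = [f"pe_{i}" for i in range(MAX_SEQ_LEN)]

try:
    _ = cached_pe[MAX_SEQ_LEN]  # same failure as the reported IndexError
except IndexError:
    print("IndexError: index 100 is out of bounds")

# The fix described above: pre-compute more encodings (100 -> 1000) so
# long test sequences never run past the cache.
cached_pe = [f"pe_{i}" for i in range(1000)]
_ = cached_pe[100]  # now fine
```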

billbliss3 commented 4 months ago

> Yes, for the old split, the test samples are "hard coded" to ensure they are the same as those used in the MapTR codebase (the most popular codebase for this task), so self.interval is not used there.
>
> I'm sorry I forgot to commit the "maptr_info.pkl" file. It contains the metadata directly exported from the MapTR codebase. I just added it, can you try again?

It seems my av2 dataset is missing some timestamps:

```
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
    return obj_cls(**args)
  File "/data/wk/Project/maptracker/plugin/datasets/argo_dataset.py", line 31, in __init__
    super().__init__(**kwargs)
  File "/data/wk/Project/maptracker/plugin/datasets/base_dataset.py", line 62, in __init__
    self.load_annotations(self.ann_file)
  File "/data/wk/Project/maptracker/plugin/datasets/argo_dataset.py", line 65, in load_annotations
    samples = [unique_token2samples[x] for x in maptr_unique_tokens]
  File "/data/wk/Project/maptracker/plugin/datasets/argo_dataset.py", line 65, in <listcomp>
    samples = [unique_token2samples[x] for x in maptr_unique_tokens]
KeyError: '15ec0778-826e-3ed7-9775-54fbf66997f4_315970274060083000'
```

woodfrog commented 4 months ago

That's weird. It seems our AV2 datasets are a bit different. Can you check how many MapTR test samples are available in your current AV2 dataset? Something like `samples = [unique_token2samples[x] for x in maptr_unique_tokens if x in unique_token2samples]` -- if the resulting samples list is empty, there are probably some naming inconsistencies.
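The suggested diagnostic can be written as a small helper. This is a sketch: the dict and token names mirror the traceback above, but the helper itself is hypothetical:

```python
def check_tokens(unique_token2samples, maptr_unique_tokens):
    """Split MapTR test tokens into those present locally and those missing."""
    missing = [t for t in maptr_unique_tokens if t not in unique_token2samples]
    samples = [unique_token2samples[t] for t in maptr_unique_tokens
               if t in unique_token2samples]
    return samples, missing

# Toy usage: one of three expected tokens is absent from the local metadata.
local = {"a_1": {"token": "a_1"}, "b_2": {"token": "b_2"}}
samples, missing = check_tokens(local, ["a_1", "b_2", "c_3"])
print(len(samples), missing)  # 2 ['c_3']
```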

billbliss3 commented 4 months ago

I have printed all the missing timestamps. Only one is missing.

The log is as follows:

```
Prepare sequence information for ./datasets/av2/av2_map_infos_train.pkl
15ec0778-826e-3ed7-9775-54fbf66997f4_315970274060083000
```

billbliss3 commented 4 months ago

Total length of val is 23519

woodfrog commented 4 months ago

> Total length of val is 23519

The total length of val in my AV2 is 23522, so the difference comes from the metadata generated by the data converter (this file is borrowed from StreamMapNet's codebase without modification). There is some filtering to discard invalid data; the download probably failed for a few samples, which were then filtered out, leading to different sample counts.

In your case, I can think of two potential solutions:
(1) Skip that single sample, although it slightly changes the test set.
(2) Check the downloaded data and re-download the broken samples.

billbliss3 commented 4 months ago

Thanks for your advice and help!