xiuqhou / Relation-DETR

[ECCV2024 Oral] Official implementation of the paper "Relation DETR: Exploring Explicit Position Relation Prior for Object Detection"

Request for the implementation code of the Relation encoding in the Deformable, DAB, and DN DETR variants #28

Open whuxfx opened 1 week ago

whuxfx commented 1 week ago

Question

Hi, in Section 5.4 "Transferability of position relation" of the paper, the Relation encoding improves the performance of three DETR variants. Could you open-source the implementation code for these three variants? Thanks!

Additional information

No response

xiuqhou commented 1 week ago

Sure, and thanks for your interest in this repository. I will open-source the code for these models in the next couple of days, and if I can still find the weights I will upload them as well.

xiuqhou commented 1 week ago

Hi @whuxfx, the code for these models has been updated. The Deformable weights are from too long ago and can no longer be found, so only the DAB and DN weights were uploaded.

whuxfx commented 5 days ago

❯ CUDA_VISIBLE_DEVICES=2 python tools/benchmark_model.py --model-config configs/deformable_detr_mp/def_detr_pp_resnet_800_1333.py

Using /home//.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home//.cache/torch_extensions/py38_cu121/MultiScaleDeformableAttention/build.ninja...
Building extension module MultiScaleDeformableAttention...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module MultiScaleDeformableAttention...

| module | #parameters or shape | #flops |
|:-------|:---------------------|:-------|
| model | 47.485M | 0.281T |
|  backbone | 23.455M | 87.581G |
|   backbone.conv1 | 9.408K | 2.529G |
|    backbone.conv1.weight | (64, 3, 7, 7) | |
|   backbone.layer1 | 0.213M | 14.313G |
|    backbone.layer1.0 | 73.728K | 4.955G |
|    backbone.layer1.1 | 69.632K | 4.679G |
|    backbone.layer1.2 | 69.632K | 4.679G |
|   backbone.layer2 | 1.212M | 22.02G |
|    backbone.layer2.0 | 0.377M | 7.982G |
|    backbone.layer2.1 | 0.279M | 4.679G |
|    backbone.layer2.2 | 0.279M | 4.679G |
|    backbone.layer2.3 | 0.279M | 4.679G |
|   backbone.layer3 | 7.078M | 31.379G |
|    backbone.layer3.0 | 1.507M | 7.982G |
|    backbone.layer3.1 | 1.114M | 4.679G |
|    backbone.layer3.2 | 1.114M | 4.679G |
|    backbone.layer3.3 | 1.114M | 4.679G |
|    backbone.layer3.4 | 1.114M | 4.679G |
|    backbone.layer3.5 | 1.114M | 4.679G |
|   backbone.layer4 | 14.942M | 17.341G |
|    backbone.layer4.0 | 6.029M | 7.982G |
|    backbone.layer4.1 | 4.456M | 4.679G |
|    backbone.layer4.2 | 4.456M | 4.679G |
|  neck.convs | 5.638M | 5.17G |
|   neck.convs.0 | 0.132M | 2.224G |
|    neck.convs.0.0 | 0.131M | 2.202G |
|    neck.convs.0.1 | 0.512K | 21.504M |
|   neck.convs.1 | 0.263M | 1.106G |
|    neck.convs.1.0 | 0.262M | 1.101G |
|    neck.convs.1.1 | 0.512K | 5.376M |
|   neck.convs.2 | 0.525M | 0.552G |
|    neck.convs.2.0 | 0.524M | 0.551G |
|    neck.convs.2.1 | 0.512K | 1.344M |
|   neck.convs.3 | 4.719M | 1.289G |
|    neck.convs.3.0 | 4.719M | 1.288G |
|    neck.convs.3.1 | 0.512K | 0.349M |
|  transformer | 18.392M | 0.188T |
|   transformer.level_embeds | (4, 256) | |
|   transformer.enc_output | 65.792K | 1.463G |
|    transformer.enc_output.weight | (256, 256) | |
|    transformer.enc_output.bias | (256,) | |
|   transformer.enc_output_norm | 0.512K | 28.573M |
|    transformer.enc_output_norm.weight | (256,) | |
|    transformer.enc_output_norm.bias | (256,) | |
|   transformer.encoder.layers | 7.693M | 0.172T |
|    transformer.encoder.layers.0 | 1.282M | 28.585G |
|    transformer.encoder.layers.1 | 1.282M | 28.585G |
|    transformer.encoder.layers.2 | 1.282M | 28.585G |
|    transformer.encoder.layers.3 | 1.282M | 28.585G |
|    transformer.encoder.layers.4 | 1.282M | 28.585G |
|    transformer.encoder.layers.5 | 1.282M | 28.585G |
|   transformer.decoder | 10.343M | 11.758G |
|    transformer.decoder.layers | 9.275M | 11.439G |
|    transformer.decoder.ref_point_head | 0.132M | 39.706M |
|    transformer.decoder.class_head | 0.14M | 41.933M |
|    transformer.decoder.bbox_head | 0.796M | 0.238G |
|    transformer.decoder.position_relation_embedding.pos_proj.0 | 0.52K | |
|   transformer.encoder_class_head | 23.387K | 0.52G |
|    transformer.encoder_class_head.weight | (91, 256) | |
|    transformer.encoder_class_head.bias | (91,) | |
|   transformer.encoder_bbox_head.layers | 0.133M | 2.949G |
|    transformer.encoder_bbox_head.layers.0 | 65.792K | 1.463G |
|    transformer.encoder_bbox_head.layers.1 | 65.792K | 1.463G |
|    transformer.encoder_bbox_head.layers.2 | 1.028K | 22.859M |
|   transformer.pos_trans | 0.131M | 39.322M |
|    transformer.pos_trans.weight | (256, 512) | |
|    transformer.pos_trans.bias | (256,) | |
|   transformer.pos_trans_norm | 0.512K | 0.384M |
|    transformer.pos_trans_norm.weight | (256,) | |
|    transformer.pos_trans_norm.bias | (256,) | |

Memory allocation 0.20057344436645508 GB
Max memory allocation 3.286261558532715 GB
Model parameters 0.04422363732010126 GB
warm up...
testing inference time...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:02<00:00, 24.93it/s]
avg inference time per image = 0.04077534570499342

whuxfx commented 5 days ago

Hi, I benchmarked with the tool included in this repo. Why do the parameter count and FLOPs of def-detr differ so much from the original? With and without the position relation encoding:

| module | #parameters or shape | #flops |
|:-------|:---------------------|:-------|
| model  | 47.484M              | 0.281T |

The parameter count and FLOPs barely differ between the two, but they differ quite a bit from the numbers of the original def-detr code.

xiuqhou commented 5 days ago

It is probably because some parameters in the def_detr_pp_resnet50_800_1333 config differ from the original implementation, e.g. dim_feedforward=2048 here while the official implementation uses 1024. I will align the config file with the official one when I get a chance.
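
As a quick sanity check (a hedged sketch, not code from this repository): each encoder/decoder layer contains a two-layer FFN, `Linear(d_model, d_ffn)` followed by `Linear(d_ffn, d_model)`, so its parameter count scales directly with `dim_feedforward`. The delta below matches the 0.757M vs 1.282M per-encoder-layer gap between the two tables in this thread.

```python
# Back-of-the-envelope check (illustrative, not the repo's code): parameter cost of the FFN
# inside one transformer layer for dim_feedforward=2048 vs the official 1024.
def ffn_params(d_model: int, d_ffn: int) -> int:
    # Linear(d_model, d_ffn): d_model*d_ffn weights + d_ffn biases
    # Linear(d_ffn, d_model): d_ffn*d_model weights + d_model biases
    return (d_model * d_ffn + d_ffn) + (d_ffn * d_model + d_model)

d_model = 256
delta = ffn_params(d_model, 2048) - ffn_params(d_model, 1024)
print(f"extra parameters per layer: {delta / 1e6:.3f}M")                    # ~0.525M
print(f"extra over 6 encoder + 6 decoder layers: {12 * delta / 1e6:.2f}M")  # ~6.3M, close to 47.485M - 41.181M
```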

xiuqhou commented 5 days ago

Another reason: FLOPs depend on the input image size, so the numbers are only comparable for inputs of the same size. The default input size used here when computing FLOPs is 800*1333; please check what input size the official implementation uses.
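
For illustration (a minimal sketch, not `tools/benchmark_model.py` itself; torchvision's `resnet50` stands in for the detector), this is how fvcore ties a FLOP count to one concrete input resolution, which is why only equal-size inputs are comparable:

```python
# Minimal sketch: FLOPs are computed for one concrete input tensor, so the same network
# reports different totals at different resolutions. resnet50 is just a stand-in model.
import torch
from torchvision.models import resnet50
from fvcore.nn import FlopCountAnalysis

model = resnet50().eval()
for h, w in [(800, 1333), (640, 640)]:
    with torch.no_grad():
        flops = FlopCountAnalysis(model, torch.randn(1, 3, h, w)).total()
    print(f"{h}x{w}: {flops / 1e9:.1f} GFLOPs")
```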

I just tested it: with dim_feedforward changed to 1024 and the input size set to 800*1333 in both cases, the results from this repository match the deformable-detr results from detrex. (Below are the results from this repository and from detrex, respectively; the detrex input image size was fixed to 800*1333.)

(cp311pt211) houxiuquan@amax:/data2/houxiuquan/detection$ python tools/benchmark_model.py --model-config configs/models/deformable_detr/def_detr_resnet50_1024.py 
/data2/houxiuquan/envs/cp311pt211/lib/python3.11/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
  torch.has_cuda,
/data2/houxiuquan/envs/cp311pt211/lib/python3.11/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
  torch.has_cudnn,
/data2/houxiuquan/envs/cp311pt211/lib/python3.11/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
  torch.has_mps,
/data2/houxiuquan/envs/cp311pt211/lib/python3.11/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
  torch.has_mkldnn,
| module                                    | #parameters or shape   | #flops     |
|:------------------------------------------|:-----------------------|:-----------|
| model                                     | 41.181M                | 0.21T      |
|  backbone                                 |  23.455M               |  87.581G   |
|   backbone.conv1                          |   9.408K               |   2.529G   |
|    backbone.conv1.weight                  |    (64, 3, 7, 7)       |            |
|   backbone.layer1                         |   0.213M               |   14.313G  |
|    backbone.layer1.0                      |    73.728K             |    4.955G  |
|    backbone.layer1.1                      |    69.632K             |    4.679G  |
|    backbone.layer1.2                      |    69.632K             |    4.679G  |
|   backbone.layer2                         |   1.212M               |   22.02G   |
|    backbone.layer2.0                      |    0.377M              |    7.982G  |
|    backbone.layer2.1                      |    0.279M              |    4.679G  |
|    backbone.layer2.2                      |    0.279M              |    4.679G  |
|    backbone.layer2.3                      |    0.279M              |    4.679G  |
|   backbone.layer3                         |   7.078M               |   31.379G  |
|    backbone.layer3.0                      |    1.507M              |    7.982G  |
|    backbone.layer3.1                      |    1.114M              |    4.679G  |
|    backbone.layer3.2                      |    1.114M              |    4.679G  |
|    backbone.layer3.3                      |    1.114M              |    4.679G  |
|    backbone.layer3.4                      |    1.114M              |    4.679G  |
|    backbone.layer3.5                      |    1.114M              |    4.679G  |
|   backbone.layer4                         |   14.942M              |   17.341G  |
|    backbone.layer4.0                      |    6.029M              |    7.982G  |
|    backbone.layer4.1                      |    4.456M              |    4.679G  |
|    backbone.layer4.2                      |    4.456M              |    4.679G  |
|  neck.convs                               |  5.638M                |  5.17G     |
|   neck.convs.0                            |   0.132M               |   2.224G   |
|    neck.convs.0.0                         |    0.131M              |    2.202G  |
|    neck.convs.0.1                         |    0.512K              |    21.504M |
|   neck.convs.1                            |   0.263M               |   1.106G   |
|    neck.convs.1.0                         |    0.262M              |    1.101G  |
|    neck.convs.1.1                         |    0.512K              |    5.376M  |
|   neck.convs.2                            |   0.525M               |   0.552G   |
|    neck.convs.2.0                         |    0.524M              |    0.551G  |
|    neck.convs.2.1                         |    0.512K              |    1.344M  |
|   neck.convs.3                            |   4.719M               |   1.289G   |
|    neck.convs.3.0                         |    4.719M              |    1.288G  |
|    neck.convs.3.1                         |    0.512K              |    0.349M  |
|  transformer                              |  12.087M               |  0.117T    |
|   transformer.level_embeds                |   (4, 256)             |            |
|   transformer.enc_output                  |   65.792K              |   1.463G   |
|    transformer.enc_output.weight          |    (256, 256)          |            |
|    transformer.enc_output.bias            |    (256,)              |            |
|   transformer.enc_output_norm             |   0.512K               |   28.573M  |
|    transformer.enc_output_norm.weight     |    (256,)              |            |
|    transformer.enc_output_norm.bias       |    (256,)              |            |
|   transformer.encoder.layers              |   4.541M               |   0.101T   |
|    transformer.encoder.layers.0           |    0.757M              |    16.881G |
|    transformer.encoder.layers.1           |    0.757M              |    16.881G |
|    transformer.encoder.layers.2           |    0.757M              |    16.881G |
|    transformer.encoder.layers.3           |    0.757M              |    16.881G |
|    transformer.encoder.layers.4           |    0.757M              |    16.881G |
|    transformer.encoder.layers.5           |    0.757M              |    16.881G |
|   transformer.decoder                     |   7.191M               |   10.815G  |
|    transformer.decoder.layers             |    6.123M              |    10.495G |
|    transformer.decoder.ref_point_head     |    0.132M              |    39.706M |
|    transformer.decoder.class_head         |    0.14M               |    41.933M |
|    transformer.decoder.bbox_head          |    0.796M              |    0.238G  |
|   transformer.encoder_class_head          |   23.387K              |   0.52G    |
|    transformer.encoder_class_head.weight  |    (91, 256)           |            |
|    transformer.encoder_class_head.bias    |    (91,)               |            |
|   transformer.encoder_bbox_head.layers    |   0.133M               |   2.949G   |
|    transformer.encoder_bbox_head.layers.0 |    65.792K             |    1.463G  |
|    transformer.encoder_bbox_head.layers.1 |    65.792K             |    1.463G  |
|    transformer.encoder_bbox_head.layers.2 |    1.028K              |    22.859M |
|   transformer.pos_trans                   |   0.131M               |   39.322M  |
|    transformer.pos_trans.weight           |    (256, 512)          |            |
|    transformer.pos_trans.bias             |    (256,)              |            |
|   transformer.pos_trans_norm              |   0.512K               |   0.384M   |
|    transformer.pos_trans_norm.weight      |    (256,)              |            |
|    transformer.pos_trans_norm.bias        |    (256,)              |            |
Memory allocation 0.1750655174255371 GB
Max memory allocation 2.697573661804199 GB
Model parameters 0.038352333940565586 GB
warm up...
testing inference time...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:04<00:00, 10.71it/s]
avg inference time per image = 0.09424854909157267
(cp311pt211) houxiuquan@amax:/data2/houxiuquan/detection$ 
WARNING [11/21 18:03:56 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
  stem.fc.{bias, weight}
  0%|                                                                                                        | 0/2 [00:00<?, ?it/s]/data2/houxiuquan/envs/detrex/lib/python3.8/site-packages/torch/nn/functional.py:2498: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  _verify_batch_size([input.size(0) * input.size(1) // num_groups, num_groups] + list(input.size()[2:]))
/data2/houxiuquan/envs/detrex/lib/python3.8/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::cumsum encountered 9 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::pow encountered 5 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::sin encountered 9 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::cos encountered 9 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::prod encountered 1 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::sum encountered 16 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::linspace encountered 16 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator prim::PythonOp.MultiScaleDeformableAttnFunction encountered 12 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::ones_like encountered 4 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::lt encountered 1 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::all encountered 1 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::log encountered 13 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::topk encountered 2 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: Unsupported operator aten::repeat encountered 2 time(s)
WARNING [11/21 18:04:00 fvcore.nn.jit_analysis]: The following submodules of the model were never called during the trace of the graph. They may be unused, or they were accessed by direct calls to .forward() or via other python methods. In the latter case they will have zeros for statistics, though their statistics will still contribute to their parent calling module.
model.criterion, model.criterion.matcher, model.transformer.decoder.layers.0.attentions.0.attn.out_proj, model.transformer.decoder.layers.1.attentions.0.attn.out_proj, model.transformer.decoder.layers.2.attentions.0.attn.out_proj, model.transformer.decoder.layers.3.attentions.0.attn.out_proj, model.transformer.decoder.layers.4.attentions.0.attn.out_proj, model.transformer.decoder.layers.5.attentions.0.attn.out_proj
 50%|████████████████████████████████████████████████                                                | 1/2 [00:03<00:03,  3.99s/it]/data2/houxiuquan/envs/detrex/lib/python3.8/site-packages/torch/nn/functional.py:2498: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  _verify_batch_size([input.size(0) * input.size(1) // num_groups, num_groups] + list(input.size()[2:]))
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.68s/it]
[11/21 18:04:02 detectron2]: Flops table computed from only one input sample:
| module                                | #parameters or shape   | #flops     |
|:--------------------------------------|:-----------------------|:-----------|
| model                                 | 41.162M                | 0.21T      |
|  backbone                             |  23.455M               |  87.807G   |
|   backbone.stem.conv1                 |   9.408K               |   2.544G   |
|    backbone.stem.conv1.weight         |    (64, 3, 7, 7)       |            |
|    backbone.stem.conv1.norm           |                        |    34.15M  |
|   backbone.res2                       |   0.213M               |   14.416G  |
|    backbone.res2.0                    |    73.728K             |    5.011G  |
|    backbone.res2.1                    |    69.632K             |    4.703G  |
|    backbone.res2.2                    |    69.632K             |    4.703G  |
|   backbone.res3                       |   1.212M               |   22.022G  |
|    backbone.res3.0                    |    0.377M              |    7.99G   |
|    backbone.res3.1                    |    0.279M              |    4.677G  |
|    backbone.res3.2                    |    0.279M              |    4.677G  |
|    backbone.res3.3                    |    0.279M              |    4.677G  |
|   backbone.res4                       |   7.078M               |   31.458G  |
|    backbone.res4.0                    |    1.507M              |    7.997G  |
|    backbone.res4.1                    |    1.114M              |    4.692G  |
|    backbone.res4.2                    |    1.114M              |    4.692G  |
|    backbone.res4.3                    |    1.114M              |    4.692G  |
|    backbone.res4.4                    |    1.114M              |    4.692G  |
|    backbone.res4.5                    |    1.114M              |    4.692G  |
|   backbone.res5                       |   14.942M              |   17.368G  |
|    backbone.res5.0                    |    6.029M              |    7.996G  |
|    backbone.res5.1                    |    4.456M              |    4.686G  |
|    backbone.res5.2                    |    4.456M              |    4.686G  |
|  neck                                 |  5.639M                |  5.157G    |
|   neck.convs                          |   0.92M                |   3.869G   |
|    neck.convs.0                       |    0.132M              |    2.21G   |
|    neck.convs.1                       |    0.263M              |    1.106G  |
|    neck.convs.2                       |    0.525M              |    0.552G  |
|   neck.extra_convs.0                  |   4.719M               |   1.289G   |
|    neck.extra_convs.0.conv            |    4.719M              |    1.288G  |
|    neck.extra_convs.0.norm            |    0.512K              |    0.349M  |
|  transformer                          |  12.068M               |  0.117T    |
|   transformer.level_embeds            |   (4, 256)             |            |
|   transformer.encoder.layers          |   4.541M               |   0.101T   |
|    transformer.encoder.layers.0       |    0.757M              |    16.806G |
|    transformer.encoder.layers.1       |    0.757M              |    16.806G |
|    transformer.encoder.layers.2       |    0.757M              |    16.806G |
|    transformer.encoder.layers.3       |    0.757M              |    16.806G |
|    transformer.encoder.layers.4       |    0.757M              |    16.806G |
|    transformer.encoder.layers.5       |    0.757M              |    16.806G |
|   transformer.decoder                 |   7.195M               |   14.635G  |
|    transformer.decoder.layers         |    6.123M              |    10.732G |
|    transformer.decoder.bbox_embed     |    0.928M              |    3.411G  |
|    transformer.decoder.class_embed    |    0.144M              |    0.492G  |
|   transformer.enc_output              |   65.792K              |   1.456G   |
|    transformer.enc_output.weight      |    (256, 256)          |            |
|    transformer.enc_output.bias        |    (256,)              |            |
|   transformer.enc_output_norm         |   0.512K               |   28.445M  |
|    transformer.enc_output_norm.weight |    (256,)              |            |
|    transformer.enc_output_norm.bias   |    (256,)              |            |
|   transformer.pos_trans               |   0.263M               |   78.643M  |
|    transformer.pos_trans.weight       |    (512, 512)          |            |
|    transformer.pos_trans.bias         |    (512,)              |            |
|   transformer.pos_trans_norm          |   1.024K               |   0.768M   |
|    transformer.pos_trans_norm.weight  |    (512,)              |            |
|    transformer.pos_trans_norm.bias    |    (512,)              |            |
[11/21 18:04:02 detectron2]: Average GFlops for each type of operators:
[('conv', 92.461884416), ('batch_norm', 0.4740864), ('group_norm', 0.02844544), ('upsample_nearest2d', 2.2223e-05), ('linear', 116.379134976), ('layer_norm', 0.37747072), ('bmm', 0.27648)]
[11/21 18:04:02 detectron2]: Total GFlops: 210.0±0.0
(detrex) houxiuquan@amax:/data1/houxiuquan/detrex$ 
whuxfx commented 5 days ago

OK, after reducing dim_feedforward I got results similar to yours.