yjh0410 / CenterNet-plus

A Simple Baseline for Object Detection

Extending the CenterNet model #13

Closed YashRunwal closed 3 years ago

YashRunwal commented 3 years ago

@yjh0410,

I am working on an interesting project that I will describe here. For privacy reasons I cannot upload the images, but I will explain everything in detail. More than half of the work is done: I used this repo (with some changes, of course) for object detection. Now comes the next part.

As you know, I am training the model on grayscale images with a ResNet-18 backbone, a Dilated Encoder as the neck, and the decoder. The encoder (ResNet-18) produces C2, C3, C4, and C5 outputs with 64, 128, 256, and 512 channels respectively.
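
For reference, here is a minimal sketch (assuming the standard torchvision ResNet-18 with a 3-channel stem, which differs from my modified grayscale/depth stem) that extracts those four feature levels and confirms the channel counts at my input size:

    import torch
    import torchvision

    # Standard torchvision ResNet-18; the stem here expects 3-channel input.
    resnet = torchvision.models.resnet18()

    x = torch.randn(1, 3, 512, 1536)
    x = resnet.maxpool(resnet.relu(resnet.bn1(resnet.conv1(x))))

    c2 = resnet.layer1(x)   # (1, 64,  128, 384)
    c3 = resnet.layer2(c2)  # (1, 128, 64,  192)
    c4 = resnet.layer3(c3)  # (1, 256, 32,  96)
    c5 = resnet.layer4(c4)  # (1, 512, 16,  48)
    print(c2.shape, c3.shape, c4.shape, c5.shape)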

I also have another type of image (a depth image) that I want to fuse with the grayscale image and then train the model. The depth image has shape (128, 384); I used Conv layers to increase its number of channels, giving (128, 384, 32). I also changed the out_channels of the ResNet backbone's first layer to 32, so after concatenating the grayscale and depth features, C2 has shape (128, 384, 32 + 32) = (128, 384, 64). A sketch of this fusion follows.
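
Here is a minimal sketch of that fusion; the module and layer names are my own for illustration, and plain nn.Conv2d blocks stand in for whatever stems are actually used:

    import torch
    import torch.nn as nn

    class TwoStreamStem(nn.Module):
        """Illustrative stem: a 32-channel grayscale branch (downsampled 4x)
        and a 32-channel depth branch, concatenated into the 64-channel C2."""
        def __init__(self):
            super().__init__()
            # Grayscale stem: 1 -> 32 channels, (512, 1536) -> (128, 384)
            self.gray_stem = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=7, stride=2, padding=3),
                nn.BatchNorm2d(32),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            )
            # Depth stem: 1 -> 32 channels, input is already at (128, 384)
            self.depth_stem = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1),
                nn.BatchNorm2d(32),
                nn.ReLU(inplace=True),
            )

        def forward(self, gray, depth):
            g = self.gray_stem(gray)         # (B, 32, 128, 384)
            d = self.depth_stem(depth)       # (B, 32, 128, 384)
            return torch.cat([g, d], dim=1)  # (B, 64, 128, 384)

    stem = TwoStreamStem()
    gray = torch.randn(2, 1, 512, 1536)
    depth = torch.randn(2, 1, 128, 384)
    print(stem(gray, depth).shape)  # torch.Size([2, 64, 128, 384])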

What I want to do: my dataset looks like [[x1, y1, x2, y2, class_id, depth]]. The depth value in the dataset is calculated by an algorithm I wrote. I want to predict this depth value using the depth image mentioned earlier. My questions are as follows:

  1. For this, I need to add the depth value to the gt_tensor variable, i.e. change the shape of the gt_tensor from (128, 384, 17) to (128, 384, 17 + 1), since I am training with images of size (512, 1536) and 8 classes (including background). Is this correct? (See the first sketch after this list.)

  2. I need to add another head, a depth prediction head, with a new loss that regresses to the target depth value. My question is how to use the depth image inside the network for this regression. The depth prediction head could look like the following (a loss sketch follows this list):

    self.depth_pred = nn.Sequential(
        Conv(p2, 64, k=3, p=1, act=act),  # the repo's Conv wrapper (conv + activation)
        nn.Conv2d(64, 1, kernel_size=1)   # single-channel depth output
    )
  3. All the prediction heads have output_channels = 64. Is this because the output_channels of C2 in the encoder is 64? If so, in my case C2 also has 64 channels, but they are split: 32 from the grayscale image and 32 from the depth image. In this case, should the heads have 32 output channels instead?
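
For question 1, here is a minimal sketch of the target extension, assuming the existing gt_tensor already has shape (128, 384, 17) at stride 4 and that the depth value is written only at each object's center cell (the helper name and layout are my assumptions, not the repo's code):

    import numpy as np

    def append_depth_channel(gt_tensor, boxes, stride=4):
        """Append a depth channel to an existing (H, W, 17) gt_tensor.

        boxes: list of [x1, y1, x2, y2, class_id, depth] in input-image pixels.
        Returns an (H, W, 18) array; the layout is an assumption for illustration.
        """
        h, w, _ = gt_tensor.shape
        depth_map = np.zeros((h, w, 1), dtype=gt_tensor.dtype)
        for x1, y1, x2, y2, cls_id, depth in boxes:
            # Object center mapped onto the stride-4 feature grid
            cx = int((x1 + x2) / 2 / stride)
            cy = int((y1 + y2) / 2 / stride)
            if 0 <= cx < w and 0 <= cy < h:
                depth_map[cy, cx, 0] = depth
        return np.concatenate([gt_tensor, depth_map], axis=-1)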
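
For question 2, one way the new head's loss could work is an L1 regression supervised only at object-center cells; the masking and loss choice below are my assumptions:

    import torch
    import torch.nn.functional as F

    def depth_loss(pred_depth, gt_depth):
        """Masked L1 loss for the depth head.

        pred_depth: (B, 1, H, W) output of self.depth_pred.
        gt_depth:   (B, 1, H, W) depth channel of the extended gt_tensor,
                    assumed zero everywhere except at object centers.
        """
        mask = (gt_depth > 0).float()  # supervise only cells with a target depth
        num_pos = mask.sum().clamp(min=1.0)
        loss = F.l1_loss(pred_depth * mask, gt_depth * mask, reduction='sum')
        return loss / num_pos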

Is this approach correct, or do you suggest any changes?

What else do you think needs to change to adapt your network into what I want? I would be happy to discuss this further privately, @yjh0410.