Closed YashRunwal closed 3 years ago
"My idea is to divide the 128 channel values by 512 and 384 channel values by 1536..."
I am a little confused about this, sorry~
In this model, txtytwth_pred's size is [B, H, W, 4], where H is img_h / 4 and W is img_w / 4. I then reshape it to [B, HxW, 4] (for example, your 49152 equals 128x384).
I reshape it purely for convenience when decoding boxes and computing IoU between predicted and target boxes. You can also keep [B, H, W, 4] instead of [B, HW, 4] for the subsequent processing. That is OK.
In CenterNet, one bounding box is predicted at each pixel location in P2, so it outputs HW bounding boxes, but only the top-k bounding boxes are kept as the final predictions.
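The reshape-then-top-k step described above can be sketched as follows (a minimal illustration with dummy tensors; the variable names are assumptions, not the repo's exact code):

```python
import torch

# Minimal sketch (dummy tensors, assumed names): reshape per-pixel box
# predictions from [B, H, W, 4] to [B, H*W, 4], then keep the top-k boxes
# ranked by a per-pixel heatmap score, as CenterNet does.
B, H, W, topk = 1, 128, 384, 100          # H = img_h // 4, W = img_w // 4
txtytwth_pred = torch.randn(B, H, W, 4)   # one box per P2 pixel
heatmap_score = torch.rand(B, H, W)       # peak score per pixel

boxes = txtytwth_pred.view(B, H * W, 4)   # [B, HW, 4]; 128 * 384 = 49152
scores = heatmap_score.view(B, H * W)     # [B, HW]

top_scores, top_idx = scores.topk(topk, dim=1)
top_boxes = torch.gather(boxes, 1, top_idx.unsqueeze(-1).expand(-1, -1, 4))
print(top_boxes.shape)                    # torch.Size([1, 100, 4])
```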
@yjh0410 Hi, Please bear with me.
So I have changed the code to use [1, 128, 384, channels], but I am now stuck at the following line of code:
bbox_pred = torch.clamp((self.decode_boxes(txtytwth_pred) / self.input_size)[0], 0., 1.)
For me, the output of decode_boxes has the shape [1, 128, 384, 4]. Above, you divide decode_boxes(txtytwth_pred) by input_size. However, my input_size is (512, 1536), so which value should I divide by?
Edit: @yjh0410: Consider the following example with real targets (bboxes normalized between 0 and 1 and class_id)
targets: [[0.00911458 0.26171875 0.53190104 0.89648438 4. ]]
c_x: 415.5, c_y: 296.5, box_w: 802.9999999999999, box_h: 325.0
tx, ty, tw, th: 0.875, 0.125, 5.302060352826871, 4.3975308212098465
All the variables above come from the generate_txtytwth() function. The stride used is 4. I presume this stride is meant for an image of size (512, 512), but if the image size is (512, 1536), do you think I should use a stride of (12, 4), so that the following becomes:
box_w_s = box_w / 12 # Do you think this logic is correct? I don't think it is, as the heatmap is of size H/4, W/4.
box_h_s = box_h / 4
Then using the decode_boxes() function the output is
output = self.decode_boxes(txtytwth_pred)
But output.max() is 1532.76 and output.min() is -1.1088.
This takes me back to the same question: Do I divide this output by 512 or 1536?
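For what it's worth, the example values quoted above already check out with a single stride of 4 in both dimensions, since the heatmap is H/4 x W/4 along each axis. A quick reproduction (a sketch only; the internals of generate_txtytwth() are inferred here from the printed numbers):

```python
import math

# Sanity check: reproduce the tx/ty/tw/th values quoted above using a single
# stride of 4 in both dimensions (a sketch of what generate_txtytwth()
# presumably computes; the log-encoding is inferred from the numbers).
stride = 4
c_x, c_y = 415.5, 296.5
box_w, box_h = 803.0, 325.0

c_x_s, c_y_s = c_x / stride, c_y / stride      # center on the stride-4 grid
grid_x, grid_y = int(c_x_s), int(c_y_s)
tx, ty = c_x_s - grid_x, c_y_s - grid_y        # sub-cell offsets
tw, th = math.log(box_w / stride), math.log(box_h / stride)  # log-scaled sizes

print(tx, ty)   # 0.875 0.125  -> matches the example above
print(tw, th)   # ~5.302060 ~4.397531
```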
@yjh0410 Any ideas? Need some help.
The size of the P2 feature map is [B, C, H/4, W/4], so the size of the heatmap you create should also be [B, C, H/4, W/4], not [B, C, H/4, H/4]. The 512 x 512 size I used is just an example; we can also use other sizes like 640x640 or 800x1024.
When I normalize the pred boxes, I just divide by input_size 512, because I set H and W to the same size, 512. In your case, you should divide the height of the pred boxes by H (512) and the width of the pred boxes by W (1536).
@yjh0410 Yes, my P2 feature map size is [B, C, 128, 384], i.e. the size of self.decode_boxes(txtytwth_pred) is [B, C, 128, 384]. But how do I divide the height of the feature map by 512 and the width by 1536? I have tried a lot but can't get anywhere.
@YashRunwal The size of box_pred = self.decode_boxes(txtytwth_pred) is [B, 4, H//4, W//4], so you can divide box_pred[B, 0::2, :, :] by the width and box_pred[B, 1::2, :, :] by the height.
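That per-coordinate normalization can be sketched like this (dummy tensor, assumed names; channels taken to be x1, y1, x2, y2):

```python
import torch

# Minimal sketch of the per-coordinate normalization above (shapes and
# names assumed): box_pred is [B, 4, H//4, W//4], channels (x1, y1, x2, y2).
H_img, W_img = 512, 1536
box_pred = torch.rand(1, 4, 128, 384)
box_pred[:, 0::2] *= W_img          # fake x coords in [0, W] for the demo
box_pred[:, 1::2] *= H_img          # fake y coords in [0, H] for the demo

box_pred[:, 0::2, :, :] /= W_img    # channels 0 and 2 are x coordinates
box_pred[:, 1::2, :, :] /= H_img    # channels 1 and 3 are y coordinates
# all coordinates are now normalized to [0, 1]
```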
@yjh0410 Thanks. There is a slight mistake: it should be box_pred[B, 1::3, :, :] by the height.
Thanks. Will close this issue now.
Hi,
I am editing the script for my custom images of shape (512, 1536). So I wrote my own custom resnet18 backbone (I had to make some changes, no big deal) and am now editing the CenterNet_plus script, and I have some doubts about it:
So in the forward function at this line:
x1y1x2y2_pred = (self.decode_boxes(txtytwth_pred) / self.input_size).view(-1, 4)
In the function decode_boxes, the comments say that the input shape of pred is [delta_x, delta_y, sqrt(w), sqrt(h)]; however, the size of the pred variable is torch.Size([1, 49152, 4]). How is this the case? To me this looks like [B, H*W, 4]. Are the variables under the `self.trainable` line correctly permuted?
Also, as you can see, I have made changes to the height and width variables in the create_grid function, as the input size is non-square. But in the forward function, where the decode_boxes function is used, you divide by input_size, which in your case is 512 (or any other square size). What do I do here? Since my image is (512, 1536), I cannot divide by input_size directly, as it is a tuple and will throw an error.
You have used the COCO dataset for training and evaluation and can therefore use the COCOeval class from pycocotools directly. However, my annotation format is different: I have a .txt file for each image, like YOLO. How do I use the COCOeval class in this case? Any ideas?
Thanks.
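On the COCOeval question above, one common route (a hedged sketch, not the repo's code; the label layout and the 512 x 1536 image size are assumptions) is to convert each YOLO-style line into a COCO annotation dict and build a ground-truth json that pycocotools can load:

```python
# Hedged sketch: convert YOLO-style label lines ("class cx cy w h", all
# normalized to [0, 1]) into COCO-format annotation dicts. The exact label
# layout and the 512 x 1536 image size are assumptions for illustration.
def yolo_to_coco_anns(label_lines, img_id, img_w, img_h, ann_start_id=1):
    anns = []
    for i, line in enumerate(label_lines):
        cls, cx, cy, w, h = map(float, line.split())
        bw, bh = w * img_w, h * img_h
        x1, y1 = cx * img_w - bw / 2, cy * img_h - bh / 2  # COCO uses top-left xywh
        anns.append({
            "id": ann_start_id + i,
            "image_id": img_id,
            "category_id": int(cls),
            "bbox": [x1, y1, bw, bh],
            "area": bw * bh,
            "iscrowd": 0,
        })
    return anns

# Roughly the example target from this thread, rewritten as a YOLO line:
anns = yolo_to_coco_anns(["4 0.270508 0.579102 0.522786 0.634766"],
                         img_id=0, img_w=1536, img_h=512)
print(anns[0]["bbox"])   # ~[14.0, 134.0, 803.0, 325.0]
```

With an `images` entry (e.g. {"id": 0, "width": 1536, "height": 512, ...}) and a `categories` list added, the resulting dict can be dumped to json and loaded via pycocotools' COCO class; detections then go through COCO.loadRes and COCOeval exactly as with the real COCO dataset.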
Edit 1: @yjh0410 So I printed the shapes of the variables before the decode_boxes call, using the input shape (512, 1536), and got [1, 49152, 4], which comes from [1, 128, 384, 4]. My idea is to divide the 128 channel values by 512 and 384 channel values by 1536. What do you think? How can I do this, though, and does it make any sense? @developer0hye What do you think? Sorry for tagging you here, but I am also trying to extend the functionality of this model.
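One way to do that division in the flattened [1, HW, 4] layout (a sketch with assumed names; this takes decode_boxes to output pixel-space x1, y1, x2, y2) is to broadcast a per-coordinate scale over the last dimension instead of slicing the spatial axes:

```python
import torch

# Sketch (assumed names): normalize flattened [B, HW, 4] boxes, where each
# row is (x1, y1, x2, y2) in pixels, for a non-square 512 (H) x 1536 (W) input.
H_img, W_img = 512, 1536
boxes = torch.rand(1, 128 * 384, 4)
boxes[..., 0::2] *= W_img                      # fake x coords in [0, W]
boxes[..., 1::2] *= H_img                      # fake y coords in [0, H]

scale = torch.tensor([W_img, H_img, W_img, H_img], dtype=torch.float32)
boxes_norm = torch.clamp(boxes / scale, 0., 1.)  # broadcasts over the last dim
print(boxes_norm.shape)                        # torch.Size([1, 49152, 4])
```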