zc-alexfan / digit-interacting

[3DV 2021 Oral] Estimating two strongly interacting hand poses using Probabilistic Per-pixel Part Segmentation.

segmentation probability distribution #3

Closed ZhengdiYu closed 2 years ago

ZhengdiYu commented 2 years ago

Hi,

Q1. I am trying to understand this part and Fig. 6c of your paper. What is the segmentation probability distribution? Do you mean the (B, class_num, H, W) tensor before argmax reduces it to (B, 1, H, W)?

You mention that you trained two different pipelines using Fig. 6c, but I couldn't find the correspondence in Fig. 6c. Do you mean that you replace Segm. (S) with (B, class_num, H, W) and (B, 1, H, W) separately?

Q2. Which class does each part-segmentation label correspond to? Is the order the same as the original MANO joints without fingertips, or as InterHand2.6M's order (i.e., is the wrist 1 or 16)?

Q3. https://github.com/zc-alexfan/digit-interacting/blob/36e5110ac91b0e6ac2151d43d5b1cc5712037080/src/dataset/dataset_utils.py#L133-L149 What is img_segm_swapped[1], img_segm_swapped[2] = img_segm_swapped[2].clone(), img_segm_swapped[1].clone() used for? I think this operation acts on the channel dimension, and I'm not sure what it is for. The 3 channels should be the same.

Q4. Why don't you ignore the background label when training the segmentation?

zc-alexfan commented 2 years ago

Q1. Yes, it is basically the logits, which preserve a lot of information for downstream use.

Fig. 6c means you don't use any image features for pose estimation. DIGIT (see Fig. 2) uses both image features and segmentation features, whereas Fig. 6c uses only segmentation features.

Q2. The segmentation labels are defined on the MANO faces, so they are not related to the joints. DIGIT predicts the 21 joints from InterHand, not from the MANO skeleton.

Q3. When you flip an image during data augmentation, the left hand becomes the right hand, for example, so you need to swap the segmentation classes between hands as well. Yes, the 3 channels are the same. I use 3 channels because I store the labels as PNG files, whose lossless compression reduces file size.

Q4. Every background pixel needs to be labeled with some class, so it is easier to just have a class for the background. Furthermore, you want the model to tell whether a pixel belongs to the background or to a hand.
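
To make the two representations discussed in Q1 concrete, here is a minimal sketch of the difference between a per-pixel class distribution of shape (B, class_num, H, W) and the hard label map of shape (B, 1, H, W) obtained by argmax. It uses NumPy as a stand-in for PyTorch tensors; the sizes and the random logits are assumptions for illustration and are not taken from the DIGIT code.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, H, W = 2, 5, 4, 4  # hypothetical sizes; C plays the role of class_num

# Per-pixel class scores, standing in for the output of a segmentation head.
logits = rng.normal(size=(B, C, H, W))

# Softmax over the class axis gives one probability distribution per pixel.
shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
exp = np.exp(shifted)
probs = exp / exp.sum(axis=1, keepdims=True)  # shape (B, C, H, W)

# Argmax collapses each distribution to a single hard label per pixel.
labels = probs.argmax(axis=1)[:, None]  # shape (B, 1, H, W)

print(probs.shape, labels.shape)
```

The hard label map keeps only the winning class, while the distribution also records how confident the model is and which other classes are plausible at each pixel.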
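
As an illustration of the label swap described in Q3, the sketch below mirrors a label map horizontally and exchanges left-hand and right-hand class ids through a lookup table. The label layout (0 for background, 1 to 16 for the right hand, 17 to 32 for the left hand) and the helper name flip_segm are assumptions made for this example, not the repository's actual convention.

```python
import numpy as np

NUM_PARTS = 16  # hypothetical number of part classes per hand
# Hypothetical layout: 0 = background, 1..16 = right hand, 17..32 = left hand.


def flip_segm(segm):
    """Mirror an (H, W) integer label map and swap left/right hand labels."""
    lut = np.arange(2 * NUM_PARTS + 1)
    lut[1:NUM_PARTS + 1] = np.arange(NUM_PARTS + 1, 2 * NUM_PARTS + 1)  # right -> left
    lut[NUM_PARTS + 1:] = np.arange(1, NUM_PARTS + 1)  # left -> right
    flipped = segm[:, ::-1]  # horizontal mirror of the pixel grid
    return lut[flipped]  # remap every pixel's class id


segm = np.array([[0, 1, 17],
                 [2, 0, 18]])
out = flip_segm(segm)
print(out)
```

Applying flip_segm twice returns the original map, which is a convenient sanity check for this kind of augmentation.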

ZhengdiYu commented 2 years ago

Thanks for your reply!

Regarding Q3: yes, I understand that I need to flip the labels. I just don't know why img_segm_swapped[1], img_segm_swapped[2] = img_segm_swapped[2].clone(), img_segm_swapped[1].clone() is needed. This line swaps the channel order within the first dimension (e.g. of a (3, 256, 256) tensor), which doesn't make sense, because every channel in that dimension is the same.

Since the 3 channels are the same, what is the purpose of swapping two identical channels? Sorry if I misunderstood.

zc-alexfan commented 2 years ago

Ah. I see what you mean now.

I would suggest adding an assertion like:

assert img_segm_swapped[1].sum() == img_segm_swapped[2].sum()

If they are identical, you can ignore this line. I think that line is left over from a previous version of my code that needed the flipping. However, it does not matter if the operation does nothing here. Sorry for the inconvenience.
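
A possible way to run that check is sketched below, with NumPy standing in for PyTorch and a made-up 2x2 label map. Equal sums do not guarantee equal channels, so the sketch compares element-wise with np.array_equal (torch.equal is the PyTorch counterpart) and then confirms that swapping two identical channels leaves the array unchanged.

```python
import numpy as np

# Hypothetical label map stored as 3 identical channels: shape (3, H, W).
single = np.array([[0, 1],
                   [17, 2]])
img_segm = np.stack([single, single, single])

# Element-wise comparison is stricter than comparing sums.
channels_identical = np.array_equal(img_segm[1], img_segm[2])

# Swapping two identical channels is a no-op.
swapped = img_segm.copy()
swapped[1], swapped[2] = img_segm[2].copy(), img_segm[1].copy()
swap_is_noop = np.array_equal(swapped, img_segm)

# Counter-example: same sum, different contents.
a = np.array([[1, 2]])
b = np.array([[2, 1]])
same_sum_only = (a.sum() == b.sum()) and not np.array_equal(a, b)

print(channels_identical, swap_is_noop, same_sum_only)
```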