Question about Keypoint Detector

Hi, thank you for sharing wonderful codes.

I have a question about the Keypoint Detector. Since, the paper says Since we need to extract 3D keypoints, we project the encoded features to 3D through a 1 x1 convolution, I thought there needs to be a five 1 x1 convolution projection layors to make like a U-Net structure.

So I can see your code projects the last feature of Hourglass Encoder (1024 -> 16384), but why the intermediate features are not projected? (Actually, the implementation is not like a U-Net structure)

Since I also implemented the papers, I projected all the intermediate layers and concat to the decoder. (64 -> 1024 , 128 -> 2048, 256 -> 4096, 512 -> 8129, 1024 -> 16384) Was there any concerns in this part? I'm giving this question because the only difference between my implementation and yours are this. So I'm curious whether this point could be some problem.

By the way, could you tell me how you made the dataset? Ex) Picking random 2 frames in one video

zhanglonghao1992 / One-Shot_Free-View_Neural_Talking_Head_Synthesis

Question about Keypoint Detector #5