zhanglonghao1992 / One-Shot_Free-View_Neural_Talking_Head_Synthesis

Pytorch implementation of paper "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing"

Question about Keypoint Detector #5

Closed myoons closed 2 years ago

myoons commented 2 years ago

Hi, thank you for sharing this wonderful code.

I have a question about the Keypoint Detector. The paper says, "Since we need to extract 3D keypoints, we project the encoded features to 3D through a 1×1 convolution," so I thought there would need to be five 1×1 convolution projection layers to form a U-Net-like structure.

I can see that your code projects only the last feature of the Hourglass encoder (1024 -> 16384), so why are the intermediate features not projected? (As written, the implementation is not actually a U-Net structure.)
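For reference, here is a minimal sketch of how I read that single last-layer projection (class and variable names are my own, not the repo's actual identifiers; the shapes follow the numbers above):

```python
import torch
import torch.nn as nn

class LiftTo3D(nn.Module):
    """Hypothetical sketch: lift the final 2D encoder feature to a 3D volume.

    A single 1x1 conv expands 1024 channels to 1024 * 16 = 16384, which is
    then reshaped into a 3D feature volume with depth 16 for the 3D decoder.
    """
    def __init__(self, in_channels=1024, depth=16):
        super().__init__()
        self.depth = depth
        self.project = nn.Conv2d(in_channels, in_channels * depth, kernel_size=1)

    def forward(self, feat_2d):                 # (B, 1024, H, W)
        out = self.project(feat_2d)             # (B, 16384, H, W)
        b, c, h, w = out.shape
        return out.view(b, c // self.depth, self.depth, h, w)  # (B, 1024, 16, H, W)

x = torch.randn(2, 1024, 4, 4)
print(LiftTo3D()(x).shape)  # torch.Size([2, 1024, 16, 4, 4])
```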

Since I also implemented the paper, I projected all of the intermediate features and concatenated them to the decoder (64 -> 1024, 128 -> 2048, 256 -> 4096, 512 -> 8192, 1024 -> 16384). Were there any concerns with this part? I ask because this is the only difference between my implementation and yours, so I'm curious whether it could cause a problem.
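A sketch of that multi-scale variant, again with purely hypothetical names; each 2D encoder scale gets its own 1×1 projection before being concatenated as a 3D skip connection in the decoder:

```python
import torch.nn as nn

class MultiScaleLift(nn.Module):
    """Hypothetical sketch: lift every encoder scale to 3D (U-Net style)."""
    def __init__(self, channels=(64, 128, 256, 512, 1024), depth=16):
        super().__init__()
        self.depth = depth
        self.projects = nn.ModuleList(
            nn.Conv2d(c, c * depth, kernel_size=1) for c in channels
        )

    def forward(self, feats_2d):  # list of (B, c_i, H_i, W_i) encoder features
        feats_3d = []
        for proj, f in zip(self.projects, feats_2d):
            out = proj(f)
            b, c, h, w = out.shape
            feats_3d.append(out.view(b, c // self.depth, self.depth, h, w))
        return feats_3d  # each entry is concatenated in the 3D decoder
```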

By the way, could you tell me how you built the dataset? For example, by picking two random frames from each video?

zhanglonghao1992 commented 2 years ago

@myoons I don't think there should be any problem. I made the dataset following FOMM.
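FOMM's training loader samples two random frames from the same video as the (source, driving) pair. Roughly, the sampling looks like this (a hypothetical helper, not the actual loader code):

```python
import numpy as np

def sample_pair(num_frames, rng=None):
    """Pick two frame indices (with replacement) from one video, sorted so
    the earlier frame serves as the source and the later one as the driving
    frame."""
    rng = rng or np.random.default_rng()
    idx = np.sort(rng.choice(num_frames, size=2, replace=True))
    return int(idx[0]), int(idx[1])

source_idx, driving_idx = sample_pair(num_frames=120)
```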