wty-ustc / HairCLIP

[CVPR 2022] HairCLIP: Design Your Hair by Text and Reference Image
GNU Lesser General Public License v2.1

About training details #4

Closed janchen0611 closed 2 years ago

janchen0611 commented 2 years ago

Hi, I am trying to re-implement your paper but cannot get good results on either the image or the text path, so I would like to verify some implementation details:

  1. Below is my implementation of the Modulation Module inside the Mapper (in PyTorch):
    
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from models.stylegan2.model import EqualLinear

    class MapperBlock(nn.Module):
        def __init__(self, channels=512):
            super(MapperBlock, self).__init__()
            self.fc = EqualLinear(channels, channels)
            self.f_gamma = nn.Sequential(
                EqualLinear(channels, channels),
                nn.LayerNorm(channels),
                nn.LeakyReLU(0.2),
                EqualLinear(channels, channels)
            )
            self.f_beta = nn.Sequential(
                EqualLinear(channels, channels),
                nn.LayerNorm(channels),
                nn.LeakyReLU(0.2),
                EqualLinear(channels, channels)
            )
            self.act = nn.LeakyReLU(0.2)

        def modulation(self, x, e):
            gamma = self.f_gamma(e)
            beta = self.f_beta(e)

            # normalize x
            x = F.layer_norm(x, (x.shape[-1],))

            # modulation
            return (1.0 + gamma) * x + beta

        def forward(self, x, e):
            x = self.fc(x)
            x = self.modulation(x, e)
            return self.act(x)

Is it correct?

2. According to your paper, the reference condition is randomly set to an image or a text description during training. My understanding is that the image/text manipulation loss is only calculated when the corresponding image/text reference is used, but then the range of the total loss varies between conditions. Do the loss weights stay the same in all conditions, or do they need to be adjusted per condition? (See the sketch after this list for how I currently gate and weight these terms.)

3. Your paper says: "we also generated several edited images using our text-guided hair editing method to augment the diversity of the reference image set." Could you elaborate on how this is done, or point to another reference paper?
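
To make question 2 concrete, this is roughly how I gate and weight the manipulation losses at the moment; the helper losses and the weight values below are placeholders for illustration, not the settings from your paper:

```python
import torch.nn.functional as F

# Placeholder CLIP-space losses, only to show the gating structure.
def text_manipulation_loss(img_emb, text_emb):
    return 1.0 - F.cosine_similarity(img_emb, text_emb).mean()

def image_manipulation_loss(img_emb, ref_emb):
    return 1.0 - F.cosine_similarity(img_emb, ref_emb).mean()

def manipulation_loss(img_emb, text_emb=None, ref_emb=None,
                      lambda_text=1.0, lambda_img=1.0):  # placeholder weights
    loss = img_emb.new_zeros(())
    if text_emb is not None:   # this sample was conditioned on a text description
        loss = loss + lambda_text * text_manipulation_loss(img_emb, text_emb)
    if ref_emb is not None:    # this sample was conditioned on a reference image
        loss = loss + lambda_img * image_manipulation_loss(img_emb, ref_emb)
    return loss
```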

Thanks for your help.
wty-ustc commented 2 years ago
  1. The normalized_shape of the LayerNorm in your modulation module should be [layer_nums, 512], and elementwise_affine needs to be set to False (see the sketch after this list). Also note that the modulation module in the corresponding mapper is not applied if the user does not provide a hairstyle or hair color description.
  2. The loss weights for image and text manipulation are different; the loss weight settings are described in detail in the paper.
  3. We first generate some edited results using our proposed text-based hair editing method, and then use these results as part of the reference images to retrain the final network.
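
A minimal sketch of the fix from point 1 applied to the snippet above. The shapes are illustrative (layer_nums depends on which sub-mapper the block belongs to), and "not applied when no description is given" is shown here as a simple pass-through:

```python
import torch
import torch.nn as nn

class ModulationSketch(nn.Module):
    """Illustrative only, not the released implementation."""
    def __init__(self, layer_nums, channels=512):
        super().__init__()
        # LayerNorm over [layer_nums, 512] with no learnable affine parameters.
        self.norm = nn.LayerNorm([layer_nums, channels], elementwise_affine=False)

    def forward(self, x, gamma=None, beta=None):
        # x: (batch, layer_nums, channels) chunk of the W+ latent for one sub-mapper
        if gamma is None or beta is None:
            # No hairstyle / hair color description given -> modulation not applied
            return x
        return (1.0 + gamma) * self.norm(x) + beta

# Shape check with illustrative sizes (layer_nums=4 is only a placeholder).
mod = ModulationSketch(layer_nums=4)
x = torch.randn(2, 4, 512)
gamma, beta = torch.randn(2, 4, 512), torch.randn(2, 4, 512)
print(mod(x, gamma, beta).shape)  # torch.Size([2, 4, 512])
```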

We will release the code after the paper is accepted, so please be patient.