Closed: agave233 closed this issue 4 years ago
Well, thanks for your attention to our work, and you're right. W_T, W_C, and W_F can be either three different weight matrices or a single weight matrix shared across the three channels, and we chose the latter in our implementation. You may have read an old version of our paper; we have already changed this in the new version attached in our GitHub repository (here).
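For readers following along, here is a minimal sketch of what a shared-weight version of the Equation 8 attention could look like. This is not the authors' exact code; the module and variable names (ChannelAttention, z_t, z_c, z_f, hidden_dim) are illustrative assumptions, and only the shapes are meant to match the discussion above.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Eq. 8-style attention where a single weight matrix W and a single
    attention vector q are shared by all three channel embeddings
    (i.e., W_T = W_C = W_F)."""

    def __init__(self, in_dim, hidden_dim=16):
        super().__init__()
        self.W = nn.Linear(in_dim, hidden_dim)          # shared transformation
        self.q = nn.Linear(hidden_dim, 1, bias=False)   # shared attention vector

    def forward(self, z_t, z_c, z_f):
        # Stack the three channel embeddings: (N, 3, in_dim)
        z = torch.stack([z_t, z_c, z_f], dim=1)
        # Per-node, per-channel attention scores: (N, 3, 1)
        scores = self.q(torch.tanh(self.W(z)))
        # Normalize across the three channels
        alpha = torch.softmax(scores, dim=1)
        # Attention-weighted combination of the channels: (N, in_dim)
        return (alpha * z).sum(dim=1), alpha


# Usage sketch: emb1, emb2, Xcom would be the three channel embeddings
# mentioned in the thread, each of shape (N, in_dim).
# att = ChannelAttention(in_dim=64)
# Z, alpha = att(emb1, emb2, Xcom)
```

For the alternative discussed here (three separate matrices), one would simply instantiate three distinct nn.Linear transformations, one per channel, before computing the scores.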
Oh, I see. Thanks! By the way, does the latter (using a shared weight matrix) perform better?
I'm sorry, we didn't specifically compare the performance of these two mechanisms, so I'm not sure about that. Maybe you could try both when you design your attention mechanism. :)
Hi, thanks for your excellent work. I found that the implementation of attention in your code is slightly different from Equation 8. In the paper, Z_T, Z_C, and Z_F use three different transformation weight matrices, but it seems that Z_T, Z_C, and Z_F (e.g., emb1, emb2, Xcom) share the same transformation weight matrix in the code. I am confused about this.