Open xuanxh1 opened 1 year ago
In the attention paper, W = QK^T, right? However, in this implementation, W = Q^TK. Is there something wrong?
The first image above is the original code in this implementation, and the second one is what I thought was the correct way.
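One thing that might account for the difference is the tensor layout. Below is a minimal sketch in plain Python, assuming (this is not confirmed from the code in the images) that the implementation stores Q and K with one token per column (shape d x N) rather than one token per row (shape N x d) as in the paper. The helpers `transpose` and `matmul` and all sample values are hypothetical, made up for illustration only.

```python
# Hypothetical helpers and data for illustration; not taken from the implementation.

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Row layout (one token per row, shape N x d), as in the attention paper.
Q_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # N = 3 tokens, d = 2 features
K_rows = [[0.5, 1.0], [1.5, 2.0], [2.5, 3.0]]

# Column layout (one token per column, shape d x N): an ASSUMED alternative layout.
Q_cols = transpose(Q_rows)
K_cols = transpose(K_rows)

scores_paper = matmul(Q_rows, transpose(K_rows))   # Q K^T on row layout, N x N
scores_alt = matmul(transpose(Q_cols), K_cols)     # Q^T K on column layout, N x N

# Q^T K applied to row-layout data mixes features instead of tokens (d x d).
feature_scores = matmul(transpose(Q_rows), K_rows)

same_scores = scores_paper == scores_alt
print(same_scores)
print(len(feature_scores), len(feature_scores[0]))
```

If the layout assumption holds, the two forms give the same N x N token-to-token scores and nothing is wrong. If Q and K are actually stored one token per row, then Q^T K produces a d x d matrix over features, which is a different quantity from the paper's attention weights.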