Closed speedcell4 closed 2 years ago
Hi @speedcell4
My implementation references this repo. In practice, I found that the default initialization strategy and orthogonal initialization also work very well.
So there is no significant performance difference among these three initialization strategies, i.e., zeros, uniform, and orthogonal. Honestly, using all zeros seems strange to me: doesn't it map every input vector to a zero vector and accumulate zero gradients?
@speedcell4 Sorry, it's been a long time and I don't remember the exact details. But empirically, zero init has always performed well for me.
Okay, I got it. Thanks for your replies~
@speedcell4
doesn't it map every input vector to a zero vector, and accumulate zero gradients?
I don't think this leads to zero gradients: the gold dependencies back-propagate nonzero gradients to the Biaffine layers, which helps the weights move away from the zero init quickly. However, it does have some potential problems, and in some cases zero init is difficult to train. In practice, I found that normal init performs much better than zero init on Chinese constituency parsing.
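To see why a zero-initialized biaffine weight still receives nonzero gradients, note that for a bilinear score s = xᵀWy, the gradient ∂s/∂W = xyᵀ does not depend on W at all. A minimal NumPy sketch (hypothetical shapes, not the repo's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)   # e.g., a head representation
y = rng.normal(size=3)   # e.g., a dependent representation

W = np.zeros((3, 3))     # zero init: the score itself starts at 0 ...
s = x @ W @ y
grad_W = np.outer(x, y)  # ... but ds/dW = x y^T is generally nonzero

print(s)                         # 0.0
print(np.abs(grad_W).max() > 0)  # True
```

So the scores all start at zero, but the very first backward pass already pushes W away from zero.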
I understand it now, thank you for your explanation~
Hi~
Why do you initialize the `weight` to be all zeros?
https://github.com/yzhangcs/parser/blob/16ad39534957bc4ee7af6ca8874de79332e8e8a2/supar/modules/affine.py#L54-L55
As I remember, PyTorch initializes weights differently by default. Could you please explain why you made a different choice?
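For reference, the three strategies compared in this thread can be sketched in NumPy (hypothetical shapes; this is an illustration, not the repo's actual code):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 4

# 1. Zero init (the repo's choice): all biaffine scores start at 0.
W_zeros = np.zeros((n, n))

# 2. Uniform init with a fan-in bound, in the spirit of PyTorch's
#    default for linear layers.
bound = 1 / np.sqrt(n)
W_uniform = rng.uniform(-bound, bound, (n, n))

# 3. Orthogonal init via the QR decomposition of a random Gaussian matrix.
W_orth, _ = np.linalg.qr(rng.normal(size=(n, n)))

print(np.allclose(W_orth.T @ W_orth, np.eye(n)))  # True: orthonormal columns
```

In PyTorch itself these correspond roughly to `nn.init.zeros_`, `nn.init.uniform_`, and `nn.init.orthogonal_` applied to the weight.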