Closed speedcell4 closed 2 years ago
Hi @speedcell4
My implementation references this repo. In practice, I found that the default initialization strategy and orthogonal initialization also work very well.
So there is no significant performance difference among these three initialization strategies, i.e., zeros, uniform, and orthogonal. Honestly, using all zeros seems strange to me: doesn't it map every input vector to a zero vector and accumulate zero gradients?
@speedcell4 Sorry, it's been a long time and I don't remember the exact details. But empirically, zero init has always performed well for me.
Okay, I got it. Thanks for your replies~
@speedcell4
doesn't it map every input vector to a zero vector, and accumulate zero gradients?
I don't think this leads to zero gradients: the gold dependencies back-propagate nonzero gradients to the Biaffine layers, which helps the weights move away from the zero init quickly. However, it does have some potential problems, and in some cases zero init is difficult to train. In practice, I found that normal init performs much better than zero init on Chinese constituency parsing.
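To see why a zero-initialized biaffine weight still receives nonzero gradients, note that for a bilinear score s = xᵀWy, the gradient ∂s/∂W = xyᵀ does not depend on W at all. A minimal NumPy sketch (hypothetical shapes, not the repo's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)   # e.g., a head representation
y = rng.normal(size=3)   # e.g., a dependent representation

W = np.zeros((3, 3))     # zero init: the score itself starts at 0 ...
s = x @ W @ y
grad_W = np.outer(x, y)  # ... but ds/dW = x y^T is generally nonzero

print(s)                         # 0.0
print(np.abs(grad_W).max() > 0)  # True
```

So the scores all start at zero, but the very first backward pass already pushes W away from zero.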
I understand it now, thank you for your explanation~
Hi~
Why do you initialize the `weight` to be all zeros?
https://github.com/yzhangcs/parser/blob/16ad39534957bc4ee7af6ca8874de79332e8e8a2/supar/modules/affine.py#L54-L55
As I remember, PyTorch initializes weights differently by default. Could you please explain why you made a different choice?
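For reference, the three strategies compared in this thread can be sketched in NumPy (hypothetical shapes; this is an illustration, not the repo's actual code):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 4

# 1. Zero init (the repo's choice): all biaffine scores start at 0.
W_zeros = np.zeros((n, n))

# 2. Uniform init with a fan-in bound, in the spirit of PyTorch's
#    default for linear layers.
bound = 1 / np.sqrt(n)
W_uniform = rng.uniform(-bound, bound, (n, n))

# 3. Orthogonal init via the QR decomposition of a random Gaussian matrix.
W_orth, _ = np.linalg.qr(rng.normal(size=(n, n)))

print(np.allclose(W_orth.T @ W_orth, np.eye(n)))  # True: orthonormal columns
```

In PyTorch itself these correspond roughly to `nn.init.zeros_`, `nn.init.uniform_`, and `nn.init.orthogonal_` applied to the weight.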