yzhangcs / crfpar

[ACL'20, IJCAI'20] Code for "Efficient Second-Order TreeCRF for Neural Dependency Parsing" and "Fast and Accurate Neural CRF Constituency Parsing".
https://www.aclweb.org/anthology/2020.acl-main.302
MIT License

bug: dim[1] unmatch on DataParallel #6

Closed xsthunder closed 3 years ago

xsthunder commented 3 years ago

background

running with `--feat char` on two GPUs

error

  File "/data/user/jupyter-ws/sematic/crfpar/parser/model.py", line 98, in forward
    word_embed, feat_embed = self.embed_dropout(word_embed, feat_embed)
  File "/home/user/anaconda3/envs/crfpar/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/user/jupyter-ws/sematic/crfpar/parser/modules/dropout.py", line 56, in forward
    total = sum(masks)
RuntimeError: The size of tensor a (126) must match the size of tensor b (125) at non-singleton dimension 1

analyse

https://github.com/yzhangcs/crfpar/blob/8abb95f177e5cbf4d7ebc494bcaf9ca15af3e3da/parser/cmds/train.py#L73

see DataParallel — PyTorch 1.7.0 documentation

https://github.com/yzhangcs/crfpar/blob/8abb95f177e5cbf4d7ebc494bcaf9ca15af3e3da/parser/model.py#L77

DataParallel scatters inputs across GPUs, which means a scattered chunk's true max length may no longer match `words.shape[1]` (each replica keeps the original padded width). Thus, calculating `lens` from the mask leads to a mismatched max sequence length.
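The mismatch can be reproduced without any GPUs. A minimal sketch (the tensor values are made up for illustration): `DataParallel` scatters along dim 0, so a replica's chunk keeps the full padded width while its longest sentence may be shorter.

```python
import torch

# Hypothetical padded batch of word ids (0 = pad index), width 5.
words = torch.tensor([
    [1, 2, 3, 4, 5],   # length 5
    [1, 2, 3, 0, 0],   # length 3
    [1, 2, 0, 0, 0],   # length 2
    [1, 0, 0, 0, 0],   # length 1
])

# DataParallel scatters along dim 0; with two GPUs each replica
# receives half the batch but keeps the original padded width.
chunk_a, chunk_b = words.chunk(2, dim=0)

mask_b = chunk_b.ne(0)        # mask of non-pad positions
lens_b = mask_b.sum(dim=1)    # per-sentence lengths from the mask

# The second replica's true max length no longer matches the width
# its tensors were padded to, so anything rebuilt from lens_b will
# have a shorter dim 1 than chunk_b itself.
print(chunk_b.shape[1])      # 5
print(lens_b.max().item())   # 2
```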

https://github.com/yzhangcs/crfpar/blob/8abb95f177e5cbf4d7ebc494bcaf9ca15af3e3da/parser/model.py#L88

Rebuilding `feat_embed` from `lens` can then produce a mismatch at dim 1, namely the max-sequence-length dimension of `feat_embed`, resulting in the error above.
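This is a sketch of how the error surfaces (the shapes 126/125 mirror the traceback; the tensors and hidden size are hypothetical): `pad_packed_sequence` pads only up to `lens.max()`, so the rebuilt tensor is one timestep shorter than the replica's padded width.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Hypothetical shapes mirroring the traceback: one replica receives a
# chunk padded to width 126 whose longest sentence has only 125 tokens.
batch, width, hidden = 8, 126, 100
lens = torch.full((batch,), 125)

x = torch.randn(batch, width, hidden)
packed = pack_padded_sequence(x, lens, batch_first=True, enforce_sorted=False)
out, _ = pad_packed_sequence(packed, batch_first=True)

# pad_packed_sequence pads only up to lens.max(), so the rebuilt
# tensor loses a timestep relative to the original width ...
print(out.shape[1])  # 125, not 126

# ... and combining it with a width-126 tensor (as the shared embedding
# dropout does when summing masks) reproduces:
# RuntimeError: The size of tensor a (126) must match the size of
# tensor b (125) at non-singleton dimension 1
```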

possible solution

  1. use only one GPU

  2. ~pass lens as model input instead of calculating it from words.~ this won't work; use https://github.com/yzhangcs/crfpar/issues/6#issuecomment-723940212 instead

yzhangcs commented 3 years ago

Thank you for reporting the bug. I will fix this later. You can use pad instead, which provides a `total_length` arg.
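A minimal sketch of the suggested fix, assuming it refers to the `total_length` argument of `torch.nn.utils.rnn.pad_packed_sequence` (the shapes are the hypothetical ones from above): forcing the output back to the replica's original padded width keeps dim 1 consistent with `word_embed` under `DataParallel`.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Hypothetical replica chunk: padded width 126, longest sentence 125.
batch, width, hidden = 8, 126, 100
lens = torch.full((batch,), 125)
x = torch.randn(batch, width, hidden)

packed = pack_padded_sequence(x, lens, batch_first=True, enforce_sorted=False)
# total_length pads the output back to the replica's original width,
# instead of stopping at lens.max(), so dim 1 matches word_embed again.
out, _ = pad_packed_sequence(packed, batch_first=True, total_length=width)
print(out.shape[1])  # 126
```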