yzhangcs / crfpar

[ACL'20, IJCAI'20] Code for "Efficient Second-Order TreeCRF for Neural Dependency Parsing" and "Fast and Accurate Neural CRF Constituency Parsing".
https://www.aclweb.org/anthology/2020.acl-main.302
MIT License

bug: dim[1] unmatch on DataParallel #6

Closed xsthunder closed 3 years ago

xsthunder commented 3 years ago

background

running with `--feat char` on two GPUs

error

  File "/data/user/jupyter-ws/sematic/crfpar/parser/model.py", line 98, in forward
    word_embed, feat_embed = self.embed_dropout(word_embed, feat_embed)
  File "/home/user/anaconda3/envs/crfpar/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/user/jupyter-ws/sematic/crfpar/parser/modules/dropout.py", line 56, in forward
    total = sum(masks)
RuntimeError: The size of tensor a (126) must match the size of tensor b (125) at non-singleton dimension 1

analyse

https://github.com/yzhangcs/crfpar/blob/8abb95f177e5cbf4d7ebc494bcaf9ca15af3e3da/parser/cmds/train.py#L73

see DataParallel — PyTorch 1.7.0 documentation

https://github.com/yzhangcs/crfpar/blob/8abb95f177e5cbf4d7ebc494bcaf9ca15af3e3da/parser/model.py#L77

DataParallel scatters inputs across GPUs, which means a scattered chunk's true max length may no longer match `words.shape[1]` (each replica keeps the original padded width). Thus, calculating `lens` from the mask leads to a mismatched max sequence length.
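The mismatch can be reproduced without any GPUs. A minimal sketch (the tensor values are made up for illustration): `DataParallel` scatters along dim 0, so a replica's chunk keeps the full padded width while its longest sentence may be shorter.

```python
import torch

# Hypothetical padded batch of word ids (0 = pad index), width 5.
words = torch.tensor([
    [1, 2, 3, 4, 5],   # length 5
    [1, 2, 3, 0, 0],   # length 3
    [1, 2, 0, 0, 0],   # length 2
    [1, 0, 0, 0, 0],   # length 1
])

# DataParallel scatters along dim 0; with two GPUs each replica
# receives half the batch but keeps the original padded width.
chunk_a, chunk_b = words.chunk(2, dim=0)

mask_b = chunk_b.ne(0)        # mask of non-pad positions
lens_b = mask_b.sum(dim=1)    # per-sentence lengths from the mask

# The second replica's true max length no longer matches the width
# its tensors were padded to, so anything rebuilt from lens_b will
# have a shorter dim 1 than chunk_b itself.
print(chunk_b.shape[1])      # 5
print(lens_b.max().item())   # 2
```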

https://github.com/yzhangcs/crfpar/blob/8abb95f177e5cbf4d7ebc494bcaf9ca15af3e3da/parser/model.py#L88

Rebuilding `feat_embed` from `lens` can then produce a mismatch at dim 1, namely the max-sequence-length dimension of `feat_embed`, resulting in the error above.
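This is a sketch of how the error surfaces (the shapes 126/125 mirror the traceback; the tensors and hidden size are hypothetical): `pad_packed_sequence` pads only up to `lens.max()`, so the rebuilt tensor is one timestep shorter than the replica's padded width.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Hypothetical shapes mirroring the traceback: one replica receives a
# chunk padded to width 126 whose longest sentence has only 125 tokens.
batch, width, hidden = 8, 126, 100
lens = torch.full((batch,), 125)

x = torch.randn(batch, width, hidden)
packed = pack_padded_sequence(x, lens, batch_first=True, enforce_sorted=False)
out, _ = pad_packed_sequence(packed, batch_first=True)

# pad_packed_sequence pads only up to lens.max(), so the rebuilt
# tensor loses a timestep relative to the original width ...
print(out.shape[1])  # 125, not 126

# ... and combining it with a width-126 tensor (as the shared embedding
# dropout does when summing masks) reproduces:
# RuntimeError: The size of tensor a (126) must match the size of
# tensor b (125) at non-singleton dimension 1
```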

possible solution

  1. use only one GPU

  2. ~pass lens as model input instead of calculating it from words.~ this won't work; use https://github.com/yzhangcs/crfpar/issues/6#issuecomment-723940212 instead

yzhangcs commented 3 years ago

Thank you for reporting the bug. I will fix this later. You can use pad instead, which provides a `total_length` arg.
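A minimal sketch of the suggested fix, assuming it refers to the `total_length` argument of `torch.nn.utils.rnn.pad_packed_sequence` (the shapes are the hypothetical ones from above): forcing the output back to the replica's original padded width keeps dim 1 consistent with `word_embed` under `DataParallel`.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Hypothetical replica chunk: padded width 126, longest sentence 125.
batch, width, hidden = 8, 126, 100
lens = torch.full((batch,), 125)
x = torch.randn(batch, width, hidden)

packed = pack_padded_sequence(x, lens, batch_first=True, enforce_sorted=False)
# total_length pads the output back to the replica's original width,
# instead of stopping at lens.max(), so dim 1 matches word_embed again.
out, _ = pad_packed_sequence(packed, batch_first=True, total_length=width)
print(out.shape[1])  # 126
```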