yangzhipeng1108 / DeepSpeed-Chat-ChatGLM


RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter #7

Closed ZJXNEFU closed 1 year ago

ZJXNEFU commented 1 year ago

In step 3, the following error occurred:

  File "/root/miniconda3/envs/dschat/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 329, in _end_of_forward_hook
    self.get_param_coordinator(training=False).reset_step()
  File "/root/miniconda3/envs/dschat/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 186, in reset_step
    raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:

yangzhipeng1108 commented 1 year ago

I'm not quite clear on the error you're seeing.

ZJXNEFU commented 1 year ago

Here is the full log:

    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/dschat/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 329, in _end_of_forward_hook
    self.get_param_coordinator(training=False).reset_step()
  File "/root/miniconda3/envs/dschat/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 186, in reset_step
    raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:
tensor([[ 0.0013, -0.0014,  0.0025,  ...,  0.0007, -0.0027, -0.0055],
        [ 0.0026,  0.0035,  0.0003,  ...,  0.0018,  0.0021, -0.0041],
        [ 0.0116,  0.0205, -0.0006,  ..., -0.0009,  0.0020,  0.0120],
        ...,
        [ 0.0058,  0.0279, -0.0001,  ...,  0.0261, -0.0128,  0.0019],
        [ 0.0040,  0.0012,  0.0173,  ..., -0.0191,  0.0126, -0.0192],
        [-0.0190,  0.0046, -0.0148,  ..., -0.0181,  0.0168,  0.0344]],
       device='cuda:1', dtype=torch.float16, requires_grad=True)>, <bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:
tensor([[-0.0084, -0.0384, -0.0023,  ..., -0.0134, -0.0236, -0.0379],
        [-0.0055, -0.0047, -0.0119,  ...,  0.0226,  0.0031, -0.0503],
        [-0.0258,  0.0072,  0.0002,  ...,  0.0017, -0.0178, -0.0003],
        ...,
        [-0.0304,  0.0079, -0.0116,  ...,  0.0218,  0.0223,  0.0403],
        [-0.0183, -0.0341,  0.0096,  ...,  0.0475,  0.0385,  0.0127],
        [-0.0378, -0.0103,  0.0185,  ...,  0.0053,  0.0032, -0.0298]],
       device='cuda:1', dtype=torch.float16, requires_grad=True)>]
RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:
tensor([[ 0.0112,  0.0361,  0.0010,  ...,  0.0086,  0.0147,  0.0186],
        [-0.0133, -0.0179, -0.0212,  ...,  0.0334,  0.0030, -0.0277],
        [-0.0127, -0.0009, -0.0359,  ...,  0.0013, -0.0176,  0.0163],
        ...,
        [ 0.0058,  0.0279, -0.0001,  ...,  0.0261, -0.0128,  0.0019],
        [ 0.0040,  0.0012,  0.0173,  ..., -0.0191,  0.0126, -0.0192],
        [-0.0190,  0.0046, -0.0148,  ..., -0.0181,  0.0168,  0.0344]],
       device='cuda:7', dtype=torch.float16, requires_grad=True)>, <bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:
tensor([[-0.0084, -0.0384, -0.0023,  ..., -0.0134, -0.0236, -0.0379],
        [-0.0055, -0.0047, -0.0119,  ...,  0.0226,  0.0031, -0.0503],
        [-0.0258,  0.0072,  0.0002,  ...,  0.0017, -0.0178, -0.0003],
        ...,
        [-0.0304,  0.0079, -0.0116,  ...,  0.0218,  0.0223,  0.0403],
        [-0.0183, -0.0341,  0.0096,  ...,  0.0475,  0.0385,  0.0127],
        [-0.0378, -0.0103,  0.0185,  ...,  0.0053,  0.0032, -0.0298]],
       device='cuda:7', dtype=torch.float16, requires_grad=True)>]

yangzhipeng1108 commented 1 year ago

Please refer to this issue: https://github.com/microsoft/DeepSpeed/issues/3156
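
For readers arriving from the traceback: the message is raised by a consistency check in `reset_step()`, which runs from the end-of-forward hook. Below is a toy sketch of that kind of check, not DeepSpeed's actual code. `ToyParamCoordinator`, `fetch`, and `release` are hypothetical names invented for illustration. It only shows why a parameter that was gathered but never released would trip the error.

```python
class ToyParamCoordinator:
    """Toy stand-in for a ZeRO-3 style parameter coordinator (hypothetical)."""

    def __init__(self):
        # Names of parameters that have been gathered and not yet released.
        self.inflight = set()

    def fetch(self, name):
        # Mark a parameter as gathered (in flight).
        self.inflight.add(name)

    def release(self, name):
        # Mark a parameter as partitioned again.
        self.inflight.discard(name)

    def reset_step(self):
        # End-of-step check: nothing may still be in flight.
        if self.inflight:
            raise RuntimeError(f"still have inflight params {sorted(self.inflight)}")


coordinator = ToyParamCoordinator()
coordinator.fetch("layer0.weight")
coordinator.fetch("layer1.weight")
coordinator.release("layer0.weight")  # layer1.weight is never released

try:
    coordinator.reset_step()
except RuntimeError as err:
    message = str(err)
    print(message)
```

In the log above, the list in the message contains two fp16 weight tensors per rank, which under this reading are the parameters that were still marked as in flight when the forward pass ended.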