Closed (ZJXNEFU closed this 1 year ago)
I'm not quite clear on what your error is.
All of the logs are here:
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/dschat/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 329, in _end_of_forward_hook
self.get_param_coordinator(training=False).reset_step()
File "/root/miniconda3/envs/dschat/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 186, in reset_step
raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:
tensor([[ 0.0013, -0.0014, 0.0025, ..., 0.0007, -0.0027, -0.0055],
[ 0.0026, 0.0035, 0.0003, ..., 0.0018, 0.0021, -0.0041],
[ 0.0116, 0.0205, -0.0006, ..., -0.0009, 0.0020, 0.0120],
...,
[ 0.0058, 0.0279, -0.0001, ..., 0.0261, -0.0128, 0.0019],
[ 0.0040, 0.0012, 0.0173, ..., -0.0191, 0.0126, -0.0192],
[-0.0190, 0.0046, -0.0148, ..., -0.0181, 0.0168, 0.0344]],
device='cuda:1', dtype=torch.float16, requires_grad=True)>, <bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:
tensor([[-0.0084, -0.0384, -0.0023, ..., -0.0134, -0.0236, -0.0379],
[-0.0055, -0.0047, -0.0119, ..., 0.0226, 0.0031, -0.0503],
[-0.0258, 0.0072, 0.0002, ..., 0.0017, -0.0178, -0.0003],
...,
[-0.0304, 0.0079, -0.0116, ..., 0.0218, 0.0223, 0.0403],
[-0.0183, -0.0341, 0.0096, ..., 0.0475, 0.0385, 0.0127],
[-0.0378, -0.0103, 0.0185, ..., 0.0053, 0.0032, -0.0298]],
device='cuda:1', dtype=torch.float16, requires_grad=True)>]
RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:
tensor([[ 0.0112, 0.0361, 0.0010, ..., 0.0086, 0.0147, 0.0186],
[-0.0133, -0.0179, -0.0212, ..., 0.0334, 0.0030, -0.0277],
[-0.0127, -0.0009, -0.0359, ..., 0.0013, -0.0176, 0.0163],
...,
[ 0.0058, 0.0279, -0.0001, ..., 0.0261, -0.0128, 0.0019],
[ 0.0040, 0.0012, 0.0173, ..., -0.0191, 0.0126, -0.0192],
[-0.0190, 0.0046, -0.0148, ..., -0.0181, 0.0168, 0.0344]],
device='cuda:7', dtype=torch.float16, requires_grad=True)>, <bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:
tensor([[-0.0084, -0.0384, -0.0023, ..., -0.0134, -0.0236, -0.0379],
[-0.0055, -0.0047, -0.0119, ..., 0.0226, 0.0031, -0.0503],
[-0.0258, 0.0072, 0.0002, ..., 0.0017, -0.0178, -0.0003],
...,
[-0.0304, 0.0079, -0.0116, ..., 0.0218, 0.0223, 0.0403],
[-0.0183, -0.0341, 0.0096, ..., 0.0475, 0.0385, 0.0127],
[-0.0378, -0.0103, 0.0185, ..., 0.0053, 0.0032, -0.0298]],
device='cuda:7', dtype=torch.float16, requires_grad=True)>]
In step 3, the following error occurred: