yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/

Things slow down when I use DDP #55

Closed — zeeshansayyed closed this issue 3 years ago

zeeshansayyed commented 3 years ago

Hello, thanks for sharing this project. I was trying to train the parser using DDP as shown in the README. Unfortunately, this is twice as slow compared to using a single GPU. On a single GPU, an epoch takes about 25 seconds, whereas with two GPUs (installed on the same machine) training just seems to pause at the start of each new epoch, and an epoch takes about 58 seconds.

I understand that this may not be an issue with your code, but I was wondering if you could provide any pointers. I am using exactly the same style of command that is given in the README.

yzhangcs commented 3 years ago

@zeeshansayyed I think the acceleration brought by DDP is very limited, since the batch size on each device is 1/n of the single-device batch size (where n is the number of devices you use); the main benefit is that it lets you train with a large effective batch on small GPUs like the 1080 Ti. But in practice it can still speed up training slightly. I'm not sure why DDP slowed down your training; could you give me more details?
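As a rough illustration of that splitting (a generic sketch using PyTorch's `DistributedSampler`, not the actual data pipeline in this repo), each of the n ranks draws a disjoint 1/n of the data, so the per-GPU batch shrinks while the global batch per step stays the same:

```python
# Generic sketch of DDP batch splitting; not taken from this repo.
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def make_loader(rank: int, world_size: int, global_batch_size: int = 32) -> DataLoader:
    dataset = TensorDataset(torch.arange(1024))  # toy data standing in for a treebank
    # Each rank only iterates over its own 1/world_size shard of the data.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    # Split the global batch so that world_size ranks together still see
    # `global_batch_size` samples per optimization step.
    per_device_batch = global_batch_size // world_size
    return DataLoader(dataset, batch_size=per_device_batch, sampler=sampler)

# e.g. with world_size=2 each rank loads batches of 16; gradients are
# averaged across ranks, so one step is equivalent to a single batch of 32.
```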

zeeshansayyed commented 3 years ago

Thanks @yzhangcs. To be honest, I don't know what the issue was; a reboot fixed it. What was happening was that the usage of all GPUs would max out at 100% and training would get stuck there. I think the issue is not with your code but rather some kind of race condition somewhere else. I will close this for now and will reopen it if I am able to reproduce the issue.
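In case it comes back, here is a generic debugging sketch (not something from this repo): when all GPUs spin at 100% and nothing progresses, the ranks are usually stuck waiting in an NCCL collective that one process never reached. Enabling NCCL logging and giving the process group a finite timeout makes such a hang fail with an error instead of stalling silently. `init_ddp` below is a hypothetical helper, not part of supar:

```python
# Generic NCCL-hang debugging sketch; `init_ddp` is a hypothetical helper,
# not part of this repo.
import datetime
import os

import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")       # log NCCL setup and collective calls
os.environ.setdefault("NCCL_BLOCKING_WAIT", "1")  # make the timeout below apply to NCCL

def init_ddp(rank: int, world_size: int) -> None:
    # Assumes MASTER_ADDR / MASTER_PORT are provided by the launcher.
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(minutes=10),   # error out instead of hanging forever
    )
```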