zihangdai / xlnet

XLNet: Generalized Autoregressive Pretraining for Language Understanding
Apache License 2.0

Multi-gpu slower than single-gpu #269

Open weiyx15 opened 3 years ago

weiyx15 commented 3 years ago

Hi, I found that with the same hyper-parameters but different num_core_per_host (num_core_per_host=1 for single-GPU and num_core_per_host=6 for multi-GPU), the global_step/sec of multi-GPU is slightly lower than that of single-GPU. num_core_per_host=6:

INFO:tensorflow:global_step/sec: 1.09456
INFO:tensorflow:loss = 1.490116e-08, step = 401200 (91.361 sec)

num_core_per_host=1:

INFO:tensorflow:global_step/sec: 1.21364
INFO:tensorflow:loss = 0.053051353, step = 62400 (82.396 sec)

Is this phenomenon reasonable, and if so, why?
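(One way to sanity-check this: under synchronous data parallelism, each global step on 6 GPUs may process more examples than a step on 1 GPU, so comparing global_step/sec alone can be misleading. A minimal sketch, assuming the batch size is per-core so the per-step workload scales with the number of GPUs; the batch size of 8 below is a hypothetical value, not taken from my logs:)

```python
# Compare throughput in examples/sec instead of global_step/sec.
# Assumption: train_batch_size is the per-core batch, so a step on
# 6 GPUs processes 6x the examples of a step on 1 GPU.
per_core_batch = 8  # hypothetical illustration value

single_gpu_steps_per_sec = 1.21364  # from the num_core_per_host=1 log
multi_gpu_steps_per_sec = 1.09456   # from the num_core_per_host=6 log

single_gpu_examples_per_sec = single_gpu_steps_per_sec * per_core_batch * 1
multi_gpu_examples_per_sec = multi_gpu_steps_per_sec * per_core_batch * 6

print(f"1 GPU : {single_gpu_examples_per_sec:.2f} examples/sec")
print(f"6 GPUs: {multi_gpu_examples_per_sec:.2f} examples/sec")
```

(If the batch size were instead a per-host total split across cores, the two runs would be doing the same work per step and the lower global_step/sec would indeed mean slower training.)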

System Information: cuda V10.0.130 cudnn 7.4.1 nccl 2.6.4 tensorflow-gpu 1.13.1 (from pip in conda virtual environment)

Best Regards

guotong1988 commented 3 years ago

I guess the multi-GPU loss decreases faster than the single-GPU loss.