eturner303 opened 7 years ago
I am investigating this issue. It took me around 10 hours to run 200 epochs on a Pascal GPU.
There are mainly three reasons, in my opinion:
Training facades took 10 hours on a GTX 1080 (~180 sec per epoch).
My test with a GRID K520 GPU (Amazon g2.2xlarge) using my own dataset shows that pix2pix/Torch runs about 30 times faster than the pix2pix/TensorFlow version. Monitoring with `watch nvidia-smi` shows that the TensorFlow version is not using the GPU at all.
@kaihuchen sorry for the obvious question, but did you install "tensorflow-gpu"?
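For anyone hitting the same symptom, here is one quick stdlib-only way to check which TensorFlow pip package is actually installed. In the TF 0.x/1.x era, `tensorflow` (CPU-only) and `tensorflow-gpu` were separate packages that shared one import name, so training silently falling back to CPU usually meant the CPU package was the one present. This is a hypothetical helper, not code from the repo:

```python
# Hypothetical check: list installed pip packages whose names start with
# "tensorflow" to see whether the CPU or GPU build is installed.
from importlib import metadata

def installed_tf_packages():
    """Return the set of installed pip package names that look like TensorFlow."""
    names = {(dist.metadata["Name"] or "").lower()
             for dist in metadata.distributions()}
    return {n for n in names if n.startswith("tensorflow")}

pkgs = installed_tf_packages()
if "tensorflow-gpu" in pkgs:
    print("tensorflow-gpu is installed")
elif pkgs:
    print("only these TF packages found:", sorted(pkgs))
else:
    print("no TensorFlow package installed")
```

`nvidia-smi` showing 0% utilization during training is the other tell-tale sign.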
@kaihuchen I'm sure that I'm training this code with a GPU. Can you tell me how you installed TensorFlow? Sidenote: it looks like you're a fellow alumnus of National Tsing Hua University in Taiwan 😄
@eyaler I've updated the codebase a lot recently (the speed is now comparable to the Torch version; I'll upload it later).
@yenchenlin My bad! I have many servers, and it seems I ran the test on a server with the CPU version of TensorFlow, not the GPU one.
@eyaler I also had a TensorFlow implementation here. It takes 43 seconds per epoch (400 iterations at batch size 1 on the facades dataset) on a GTX 1080, while the Torch version takes 42 seconds.
Thanks @ppwwyyxx for the info!
@eyaler I think the code mentioned above currently works better! However, I'll still update the code here within the next 3 days.
@yenchenlin Any update on this? I don't see any recent commits pertaining to speed. Otherwise, I may be forced to use the code provided by @ppwwyyxx. I have tested the Tensorpack implementation, and it is 4-5x faster and uses approximately 1/3 the memory of this implementation.
The code looks clean and straightforward... I really can't get my head around why it's slow... It's pretty much a standard GAN, so why is it so slow?! The answer to this question has become one of the reasons I check this thread every now and then...
I have one idea.
Feed dicts are awfully slow. We should do what Tensorpack does: load, say, 50 images at a time, keep them in a queue of numpy arrays, and then feed them in with a queue runner. This alone might be responsible for the speed difference, since `feed_dict` doubles the number of copies needed and causes a lot of expensive switching between Python and TensorFlow's C++ code.
Reference to issue from Tensorflow: https://github.com/tensorflow/tensorflow/issues/2919
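The prefetching idea above can be sketched in plain Python (stdlib only). The real fix would use TF's queue runners, but this shows the producer/consumer pattern that decouples data loading from the training step; all names here are illustrative:

```python
# Minimal sketch of queue-based prefetching: a background thread keeps a
# bounded queue of preloaded "images" full while the training loop consumes
# them, instead of loading synchronously inside every step.
import queue
import threading

def load_image(i):
    # Stand-in for real disk I/O + decoding.
    return [i] * 4

def producer(q, n_images):
    for i in range(n_images):
        q.put(load_image(i))   # blocks when the queue is full
    q.put(None)                # sentinel: no more data

def train(n_images=10, prefetch=50):
    q = queue.Queue(maxsize=prefetch)
    threading.Thread(target=producer, args=(q, n_images), daemon=True).start()
    steps = 0
    while True:
        batch = q.get()
        if batch is None:
            break
        steps += 1             # stand-in for session.run(train_op, feed-free)
    return steps

print(train())  # → 10
```

With `feed_dict`, the loop instead blocks on loading and copies every array from Python into the runtime each step; the queue keeps the GPU fed.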
@Skylion007 hmmm... How about the fact that this network uses a fully connected layer in the discriminator? Last I checked, Tensorpack uses a 1x1 convolution as the last layer instead of a fully connected layer... couldn't it be because of this?
That's another issue; there was a pull request to address it, but it was rejected because it made the edges slightly blurrier. I'm open to trying that to see if it improves the speed. Do you want to experiment with that pull request and see if it yields any results? My GPU is currently in use by another experiment.
> My GPU is currently in use by another experiment.
That's exactly my case too! I've been running one for 3 days and last night it started showing acceptable improvements, I really don't want to stop it for at least 3 more days...
The graphs for each network look very different as well. @ppwwyyxx's implementation's graph looks like this, for instance, while the network in this repo seems to have so many dependencies that its graph looks more like a straight line than a tree. A very different appearance from the one below:
Not entirely sure how much of that is due to good Tensorboard formatting and how much is a fundamental difference in architecture between the networks.
@Skylion007 Tensorboard tends to organize ops under the same name scope together, so what you see in the above figure isn't the real architecture but more about summaries and utilities. You can open the "gen" and "discrim" block in the above figure, and they will contain the model architecture for generator and discriminator.
Yeah, I see that now. I am just so confused why the other code is so much faster. I just discovered Tensorboard, so I was trying to see what I could gain from it. I will say that GPU memory use is much higher in this implementation, and I am really curious why that would be the case. Maybe that could explain why it's slower. Any ideas, @ppwwyyxx? Any special tricks your code is doing?
@Skylion007 it's probably the fully connected layer... those things take a lot of memory...
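As a back-of-envelope check of the memory point (shapes assumed for illustration, not read from the repo): a single-logit dense layer over a late conv feature map holds vastly more parameters than a 1x1 convolution over the same map:

```python
# Assumed final discriminator feature map: 32x32 spatial, 512 channels.
h, w, c = 32, 32, 512

# Flatten -> one logit: one weight per input element, plus a bias.
dense_params = h * w * c + 1          # 524289

# 1x1 conv, 512 in-channels -> 1 out-channel: one weight per channel, plus bias.
conv1x1_params = 1 * 1 * c * 1 + 1    # 513

print(dense_params, conv1x1_params)
```

Note the 1x1 conv also produces a grid of per-patch logits (PatchGAN-style) rather than one global logit, so it is a modeling change as well as a memory saving.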
With (2) and (3) I could bring the epoch time down from 180s to 110s (facades on GTX 1080). Also doing (5) brought it down to 85s. Still a factor of 2 too slow.
Thanks @eyaler, it's 2 and 3 IMO, and 2 is the crucial point.
I'm really sorry that I'm dealing with some other annoying stuff recently 😭
@eyaler This was an eye-opener... I had so many misconceptions about performance in this project! Thanks for your time... please keep going!
Can anyone tell me why we do this:

```python
self.fake_AB = tf.concat(3, [self.real_A, self.fake_B])
self.D_, self.D_logits_ = self.discriminator(self.fake_AB, reuse=True)
```

Why do we concat `real_A` and `fake_B` and give them BOTH to the discriminator, when what we want is to give it just one image (the generated fake, `self.fake_B`)? Doesn't this force the discriminator to accept dual images (one half the real image, the other half the generated one) and double the time needed to process them?
Hello @Neltherion, please see this figure from the paper:
Hmm... you're right... just giving the fake images to the discriminator probably isn't enough... my bad! Thanks for the quick reply...
Normally, a conditional GAN will send the conditioning data (e.g., class, attribute, text, image) together with the synthesized image to the discriminator. See this paper for a more complicated discriminator.
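One way to see why the concat doesn't double the spatial work in the way feared: the two images are stacked along the channel axis, so the spatial size is unchanged and only the first conv layer sees more input channels. A numpy sketch with assumed shapes (batch 1, 256x256 RGB), mirroring what `tf.concat(3, [...])` does:

```python
# Channel-wise concatenation of the conditioning image and the generated
# image, as fed to a conditional discriminator. Shapes are illustrative.
import numpy as np

real_A = np.zeros((1, 256, 256, 3))   # conditioning input image
fake_B = np.zeros((1, 256, 256, 3))   # generator output
fake_AB = np.concatenate([real_A, fake_B], axis=3)  # axis 3 = channels (NHWC)

print(fake_AB.shape)  # (1, 256, 256, 6)
```

The discriminator then judges whether `fake_B` is a plausible translation *of* `real_A`, not merely whether it looks real in isolation.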
Some benchmarks for the community:
image_iterations/sec:

5.2  phillipi        K80/torch/cuda8
1.1  yenchenlin      K80/tf0.12.1/cuda7.5
1.2  yenchenlin      K80/tf0.12.1/cuda8
1.2  yenchenlin      K80/tf1.0/cuda8
2.2  yenchenlin      1080/tf0.12.0/cuda8
2.3  yenchenlin_mod  K80/tf0.12.1/cuda7.5
2.5  yenchenlin_mod  K80/tf0.12.1/cuda8
2.5  yenchenlin_mod  K80/tf1.0/cuda8
4.7  yenchenlin_mod  1080/tf0.12.0/cuda8
4.7  affinelayer     K80/tf1.0/cuda8
5.5  tensorpack      K80/tf1.0/cuda8
So it seems that tensorpack is the fastest, and that the 1080 is twice as fast as the K80.
All experiments are on the facades dataset and use cuDNN 5.1.
phillipi = https://github.com/phillipi/pix2pix
yenchenlin = https://github.com/yenchenlin/pix2pix-tensorflow
yenchenlin_mod = https://github.com/yenchenlin/pix2pix-tensorflow/issues/4#issuecomment-273221087
tensorpack = https://github.com/ppwwyyxx/tensorpack
affinelayer = https://github.com/affinelayer/pix2pix-tensorflow
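These rates can be sanity-checked against the per-epoch times reported earlier in the thread. Assuming facades with 400 images at batch size 1 (so one epoch = 400 iterations, as stated above):

```python
# Convert benchmark iterations/sec into seconds per epoch on facades.
ITERS_PER_EPOCH = 400  # 400 images, batch size 1

def epoch_seconds(iters_per_sec):
    return ITERS_PER_EPOCH / iters_per_sec

print(round(epoch_seconds(1.2)))  # ~333 s/epoch: yenchenlin on K80
print(round(epoch_seconds(5.5)))  # ~73 s/epoch: tensorpack on K80
```

The ~73 s/epoch for tensorpack on a K80 is roughly consistent with the 43 s/epoch reported above on the faster GTX 1080.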
Curious what sort of training times you're seeing with this implementation.
I'm using a GRID K520 GPU (Amazon g2.2xlarge) and seeing each epoch take around 1200 seconds, which seems wrong.
From the original paper:
"Data requirements and speed We note that decent results can often be obtained even on small datasets. Our facade training set consists of just 400 images (see results in Figure 12), and the day to night training set consists of only 91 unique webcams (see results in Figure 13). On datasets of this size, training can be very fast: for example, the results shown in Figure 12 took less than two hours of training on a single Pascal Titan X GPU."
Granted, I'm not using a Pascal GPU, which has 2496 CUDA cores, while the g2.2xlarge has around 1500 CUDA cores. At the current rate, 200 epochs would take about 3 days, as opposed to the 2 hours quoted in the original paper.
Are you seeing similar training times when running this code? I'm wondering why there is such a discrepancy compared to the original paper/Torch implementation.
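The arithmetic in the report above checks out, and it quantifies the gap (assuming the observed 1200 s/epoch and the paper's under-2-hours figure for 200 epochs):

```python
# Observed: ~1200 s/epoch on the K520; paper: <2 hours for 200 epochs.
total_secs = 1200 * 200
print(total_secs / 86400)   # ≈ 2.78 days at the observed rate
print(2 * 3600 / 200)       # 36.0 s/epoch implied by the paper
print(round(1200 / 36))     # ≈ 33x slower than the paper's setup
```

The hardware difference alone (K520 vs Pascal Titan X) cannot plausibly account for a ~33x gap, which is what points to a software bottleneck.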