yenchenlin / pix2pix-tensorflow

TensorFlow implementation of "Image-to-Image Translation Using Conditional Adversarial Networks".
MIT License

Train time? #4

Open eturner303 opened 7 years ago

eturner303 commented 7 years ago

Curious what sort of train times you're seeing with this implementation.

I'm using a GRID K520 GPU (Amazon g2.2xlarge) -- I'm seeing each epoch take around 1200 seconds, which seems wrong.

From the original paper:

"Data requirements and speed We note that decent results can often be obtained even on small datasets. Our facade training set consists of just 400 images (see results in Figure 12), and the day to night training set consists of only 91 unique webcams (see results in Figure 13). On datasets of this size, training can be very fast: for example, the results shown in Figure 12 took less than two hours of training on a single Pascal Titan X GPU."

Granted, I'm not using a Pascal GPU -- which has 2496 CUDA cores -- but the g2.2xlarge has around 1500 CUDA cores. At the current rate, 200 epochs would take 3 days, as opposed to the 2 hours quoted in the original paper.

Are you seeing similar train times when running this code? I'm wondering why there is such a discrepancy compared to the original paper/Torch implementation.

yenchenlin commented 7 years ago

I am investigating this issue. It took me around 10 hours to run 200 epochs on a Pascal GPU.

In my opinion, there are three main reasons:

  1. In this implementation (inherited from DCGAN-tensorflow), the generator is updated twice in each iteration, which slows down training a lot (see the sketch after this list).
  2. Since the project is inherited from DCGAN-tensorflow, it uses a fully connected layer in the discriminator.
  3. The data preprocessing is currently performed on the fly during training, which could be improved.
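
Regarding (1), here is a rough, runnable sketch of the double-update pattern. The model below is a toy stand-in, and `d_optim`, `g_optim`, and `real_data` are assumed names rather than the exact repo code; only the loop structure is the point:

    import numpy as np
    import tensorflow as tf

    # Toy stand-ins so the loop runs; in the repo these would be the pix2pix
    # model, its losses, and real facade batches.
    real_data = tf.placeholder(tf.float32, [None, 4])
    w = tf.Variable(tf.ones([4, 1]))
    d_loss = tf.reduce_mean(tf.matmul(real_data, w))
    g_loss = -d_loss
    d_optim = tf.train.AdamOptimizer(2e-4).minimize(d_loss, var_list=[w])
    g_optim = tf.train.AdamOptimizer(2e-4).minimize(g_loss, var_list=[w])

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(5):  # iterations within one epoch
            batch = np.random.rand(1, 4).astype(np.float32)
            sess.run(d_optim, feed_dict={real_data: batch})  # one discriminator update
            sess.run(g_optim, feed_dict={real_data: batch})  # first generator update
            sess.run(g_optim, feed_dict={real_data: batch})  # second generator update: the extra cost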
eyaler commented 7 years ago

Training facades took 10 hours on a GTX 1080, i.e. ~180 seconds per epoch.

kaihuchen commented 7 years ago

My test with a GRID K520 GPU (Amazon g2.2xlarge) on my own dataset shows that pix2pix/Torch runs about 30 times faster than the pix2pix/TensorFlow version. Monitoring with `watch nvidia-smi` shows that the TensorFlow version is not using the GPU at all.

eyaler commented 7 years ago

@kaihuchen sorry for the obvious question, but did you install "tensorflow-gpu"?

yenchenlin commented 7 years ago

@kaihuchen I'm sure that I'm training this code with a GPU. Can you tell me how you installed TensorFlow? Side note: it looks like you're a senior alumnus of National Tsing Hua University in Taiwan 😄

@eyaler I've updated the codebase a lot recently (it now achieves speed comparable to the Torch version; I'll upload it later).

kaihuchen commented 7 years ago

@yenchenlin My bad! I have many servers, and it seems I ran the test on a server with the CPU version of TensorFlow rather than the GPU one.

ppwwyyxx commented 7 years ago

@eyaler I also have a TensorFlow implementation here. It takes 43 seconds per epoch (400 iterations with batch=1 on the facades dataset) on a GTX 1080, while the Torch version takes 42 seconds.

yenchenlin commented 7 years ago

Thanks @ppwwyyxx for the info!

@eyaler I think the code mentioned above currently works better! However, I'll still update the code here within the next 3 days.

Skylion007 commented 7 years ago

@yenchenlin Any update on this? I don't see any recent commits pertaining to speed. Otherwise, I'll be forced to use the code provided by @ppwwyyxx. I have tested the Tensorpack implementation: it is 4-5x faster and uses approximately 1/3 of the memory of this implementation.

Neltherion commented 7 years ago

The code looks clean and straightforward... I really can't get my head around why it's slow. It's pretty much a standard GAN, so why is it so slow?! The answer to this question has become one of the reasons I check this thread every now and then...

Skylion007 commented 7 years ago

I have one idea.

feed_dicts are awfully slow. We should do what Tensorpack does: load, say, 50 images at a time, keep them in a queue of numpy arrays, and then feed them in with a queue runner. This alone might account for the speed difference, since feed_dict doubles the number of copies needed and causes a lot of expensive switching between Python and the TensorFlow C++ runtime.

Reference to issue from Tensorflow: https://github.com/tensorflow/tensorflow/issues/2919
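
For reference, a minimal sketch of what a queue-runner-based input pipeline could look like. This is not this repo's code; the file pattern and the 256x512 side-by-side paired-image layout are assumptions:

    import tensorflow as tf

    # Build a pipeline that decodes and preprocesses images in background threads,
    # so sess.run() never waits on Python-side I/O or feed_dict copies.
    filenames = tf.train.match_filenames_once("datasets/facades/train/*.jpg")  # assumed path
    filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

    reader = tf.WholeFileReader()
    _, contents = reader.read(filename_queue)
    image = tf.image.decode_jpeg(contents, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32) * 2.0 - 1.0  # scale to [-1, 1]
    image.set_shape([256, 512, 3])
    # Split the side-by-side pair; which half is A and which is B depends on the dataset layout.
    real_A, real_B = image[:, :256, :], image[:, 256:, :]

    # Background threads keep a queue of up to 50 preprocessed examples ready.
    batch_A, batch_B = tf.train.batch([real_A, real_B], batch_size=1, capacity=50)

    with tf.Session() as sess:
        sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        a, b = sess.run([batch_A, batch_B])  # feed these tensors straight into the model graph
        coord.request_stop()
        coord.join(threads)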

Neltherion commented 7 years ago

@Skylion007 hmmm... How about the fact that this network is using a fully connected layer in the discriminator... last I checked Tensorpack uses a 1x1 Convolution in the last layer (instead of a fully connected layer)... couldn't it be because of this?
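
For concreteness, a hedged sketch of what such a final layer could look like (not this repo's code; the feature-map shape and names are assumptions):

    import tensorflow as tf

    def patch_logits(features, reuse=False):
        # Replace the final fully connected layer with a 1x1 convolution,
        # giving one real/fake logit per spatial location (i.e. per patch).
        with tf.variable_scope("d_out", reuse=reuse):
            channels = features.get_shape().as_list()[-1]
            w = tf.get_variable("w", [1, 1, channels, 1],
                                initializer=tf.truncated_normal_initializer(stddev=0.02))
            b = tf.get_variable("b", [1], initializer=tf.constant_initializer(0.0))
            return tf.nn.conv2d(features, w, strides=[1, 1, 1, 1], padding="SAME") + b

    # e.g. a 30x30x512 discriminator feature map (assumed shape)
    h = tf.placeholder(tf.float32, [None, 30, 30, 512])
    logits = patch_logits(h)  # -> [None, 30, 30, 1]; average the per-patch losses instead of flattening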

Skylion007 commented 7 years ago

That's another issue; there was a pull request to address this, but it was rejected because it made the edges slightly blurrier. I'm open to trying that and seeing if it improves the speed. Do you want to experiment with that pull request and see if it yields any results? My GPU is currently in use by another experiment.

Neltherion commented 7 years ago

> My GPU is currently in use by another experiment.

That's exactly my case too! I've been running one for 3 days, and last night it started showing acceptable improvements; I really don't want to stop it for at least 3 more days...

Skylion007 commented 7 years ago

The graphs for each network look very different as well. The graph of @ppwwyyxx's implementation looks like the screenshot below, for instance, while the network in this repo seems to have so many dependencies that its graph looks more like a straight line than a tree. A very different appearance from the one below:

[TensorBoard graph screenshot of @ppwwyyxx's implementation]

I'm not entirely sure how much of that is due to good TensorBoard formatting and how much is a fundamental difference in architecture between the networks.

ppwwyyxx commented 7 years ago

@Skylion007 TensorBoard tends to organize ops under the same name scope together, so what you see in the figure above isn't the real architecture; it's mostly summaries and utilities. If you open the "gen" and "discrim" blocks in the figure, they contain the model architecture for the generator and discriminator.
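
A small, self-contained illustration of that behaviour (not the repo's code): ops created under the same name scope are collapsed into one expandable block in TensorBoard's graph view.

    import tensorflow as tf

    # Everything created under a name scope shows up as one collapsible block
    # ("gen", "discrim") in TensorBoard's graph tab.
    with tf.name_scope("gen"):
        x = tf.placeholder(tf.float32, [None, 256, 256, 3], name="input")
        g = tf.nn.relu(x, name="fake_B")

    with tf.name_scope("discrim"):
        d = tf.reduce_mean(g, name="score")

    # Write the graph so TensorBoard can render it; open the blocks to see the ops inside.
    tf.summary.FileWriter("logs", graph=tf.get_default_graph())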

Skylion007 commented 7 years ago

Yeah, I see that now. I'm just confused about why the other code is so much faster. I just discovered TensorBoard, so I was trying to see what I could learn from it. I will say that GPU memory use is much higher in this implementation, and I'm really curious why that would be the case; maybe that explains why it's slower. Any ideas, @ppwwyyxx? Any special tricks your code is doing?

Neltherion commented 7 years ago

@Skylion007 It's probably the fully connected layer... those things take a lot of memory...

eyaler commented 7 years ago
  1. Changing the last layer from fully connected to a convolution, as in the original pix2pix implementation, did not give me any speedup.
  2. I think we should not run the G optimizer twice. It is against common wisdom to try to balance D and G by hand, and some even suggest training D twice and G once.
  3. Preprocessing alone can take up to ~50% of epoch time (in a specific case I had); it should be done only once, before training.
  4. I tried holding all facade training images in memory (instead of loading preprocessed versions from an SSD); this did not help (this approach is not scalable, but could be done in chunks).
  5. Not evaluating losses after each batch; I assume there is a better way to get these from the training run() (see the sketch below)?

With (2) and (3) I could bring the epoch time down from 180s to 110s (facades on a GTX 1080). Also doing (5) brought it down to 85s. Still a factor of 2 too slow.
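
For point (5), a minimal runnable sketch of the idea, with toy stand-ins for the repo's `d_optim`/`g_optim` and `d_loss`/`g_loss`: fetch the loss values in the same `sess.run()` call that runs the training ops, instead of re-evaluating the graph afterwards just to log them.

    import numpy as np
    import tensorflow as tf

    # Toy model so the snippet runs; the names mirror (but are not) the repo's.
    x = tf.placeholder(tf.float32, [None, 4])
    w = tf.Variable(tf.ones([4, 1]))
    g_loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))  # stand-in generator loss
    d_loss = tf.reduce_mean(tf.abs(tf.matmul(x, w)))     # stand-in discriminator loss
    g_optim = tf.train.AdamOptimizer(2e-4).minimize(g_loss)
    d_optim = tf.train.AdamOptimizer(2e-4).minimize(d_loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        batch = np.random.rand(4, 4).astype(np.float32)

        # slower: extra forward passes only to read the losses
        sess.run(d_optim, feed_dict={x: batch})
        sess.run(g_optim, feed_dict={x: batch})
        errD = d_loss.eval(feed_dict={x: batch}, session=sess)
        errG = g_loss.eval(feed_dict={x: batch}, session=sess)

        # cheaper: one run() returns the losses computed during the training step itself
        _, _, errD, errG = sess.run([d_optim, g_optim, d_loss, g_loss], feed_dict={x: batch})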

yenchenlin commented 7 years ago

Thanks @eyaler, it's (2) and (3) IMO, and (2) is a crucial point.

I'm really sorry, I've been dealing with some other annoying stuff recently 😭

Neltherion commented 7 years ago

@eyaler This was an eye-opener... I had so many misconceptions about performance in this project! Thanks for your time... please keep going!

Neltherion commented 7 years ago

Can anyone tell me why we do this:

        self.fake_AB = tf.concat(3, [self.real_A, self.fake_B])
        self.D_, self.D_logits_ = self.discriminator(self.fake_AB, reuse=True)

Why do we concat real_A and fake_B and give them BOTH to the discriminator, when what we want is to give it just one image (the generated fake, self.fake_B)?

Doesn't this force the discriminator to accept dual images (one half the real image and the other half the generated one) and double the time needed to process them?

yenchenlin commented 7 years ago

Hello @Neltherion, please see the image from paper:

[screenshot of the figure from the paper: the discriminator observes the input image together with the (real or generated) output image]

Neltherion commented 7 years ago

Hmm... you're right... just giving the fake images to the discriminator probably isn't enough... my bad! Thanks for the quick reply...

yenchenlin commented 7 years ago

Normally, a conditional GAN feeds the conditioning data (e.g., class, attribute, text, image) to the discriminator together with the synthesized image. See this paper for a more complicated discriminator.
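
To make the conditioning concrete, here is a minimal sketch (names assumed, using the same pre-1.0 `tf.concat(dim, values)` argument order as the repo): the input image A is concatenated channel-wise with either the real target B or the generated fake_B, so the discriminator judges (input, output) pairs rather than lone images.

    import tensorflow as tf

    def discriminator(pair, reuse=False):
        # A deliberately tiny stand-in for the real discriminator network.
        with tf.variable_scope("discriminator", reuse=reuse):
            w = tf.get_variable("d_w", [4, 4, 6, 1],
                                initializer=tf.truncated_normal_initializer(stddev=0.02))
            logits = tf.nn.conv2d(pair, w, strides=[1, 2, 2, 1], padding="SAME")
            return tf.nn.sigmoid(logits), logits

    real_A = tf.placeholder(tf.float32, [None, 256, 256, 3])  # condition / input image
    real_B = tf.placeholder(tf.float32, [None, 256, 256, 3])  # real target image
    fake_B = tf.placeholder(tf.float32, [None, 256, 256, 3])  # generator output (stand-in here)

    # Channel-wise concat (axis 3), not a side-by-side image: the result is [N, 256, 256, 6].
    real_AB = tf.concat(3, [real_A, real_B])  # D should call this pair "real"
    fake_AB = tf.concat(3, [real_A, fake_B])  # D should call this pair "fake"
    D, D_logits = discriminator(real_AB, reuse=False)
    D_, D_logits_ = discriminator(fake_AB, reuse=True)  # same weights, reused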

eyaler commented 7 years ago

Some benchmarks for the community:

| image iterations/sec | implementation | setup |
| --- | --- | --- |
| 5.2 | phillipi | K80/torch/cuda8 |
| 1.1 | yenchenlin | K80/tf0.12.1/cuda7.5 |
| 1.2 | yenchenlin | K80/tf0.12.1/cuda8 |
| 1.2 | yenchenlin | K80/tf1.0/cuda8 |
| 2.2 | yenchenlin | 1080/tf0.12.0/cuda8 |
| 2.3 | yenchenlin_mod | K80/tf0.12.1/cuda7.5 |
| 2.5 | yenchenlin_mod | K80/tf0.12.1/cuda8 |
| 2.5 | yenchenlin_mod | K80/tf1.0/cuda8 |
| 4.7 | yenchenlin_mod | 1080/tf0.12.0/cuda8 |
| 4.7 | affinelayer | K80/tf1.0/cuda8 |
| 5.5 | tensorpack | K80/tf1.0/cuda8 |

So it seems that tensorpack is the fastest, and that the GTX 1080 is about twice as fast as the K80.

All experiments are on the facades dataset and use cuDNN 5.1.

phillipi = https://github.com/phillipi/pix2pix
yenchenlin = https://github.com/yenchenlin/pix2pix-tensorflow
yenchenlin_mod = https://github.com/yenchenlin/pix2pix-tensorflow/issues/4#issuecomment-273221087
tensorpack = https://github.com/ppwwyyxx/tensorpack
affinelayer = https://github.com/affinelayer/pix2pix-tensorflow