Closed: sebastian-lapuschkin closed this issue 6 years ago.

Running the tutorials in different device contexts, I noticed on my workstation (20-core Intel Xeon @ 2.8 GHz) that, when using the CPU, the code written from scratch tends to use only one CPU core at a time while the rest idle, whereas gluon code under the same context declaration fully uses all CPU cores.

However, with the same model (I set the hidden layers in tutorials P03-C01-scratch and P03-C02-gluon to 256 and 128 neurons and used ReLU activations in both scripts), the same evaluation function (the one from the "from scratch" tutorial), and measuring the time required to train on one epoch of data (after resetting the iterator), I found that the gluon-based code takes slightly more than double the time per epoch, even though all CPU cores are in use.

I obtained similar results for the P04 tutorials, which deal with CNN architectures: the gluon.nn-based model requires 4.8 times the time per epoch, while the model inheriting from gluon.Block "only" requires roughly double, compared to the model built from scratch.

Is this a hardware-related issue (e.g., would an i7 or an AMD processor benefit from gluon and experience no penalty)? Are the models too shallow for efficient parallelization, or is mxnet poorly optimized in its use of multiple CPU cores? Why is gluon, despite using all CPU cores instead of just one, so much slower?

Executing all notebooks in a GPU context gives gluon a slight runtime advantage. Inheriting a model from gluon.Block instead of building the same model with gluon.nn is about 20% faster in training for the P04 tutorials.
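For reference, the per-epoch timings were taken along these lines (a minimal sketch, not the tutorials' exact code; `train_step` is a hypothetical helper standing in for each script's forward/backward/update, and the `mx.nd.waitall()` calls matter because mxnet executes asynchronously):

```python
import time
import mxnet as mx

def time_epoch(train_data, train_step):
    """Time one full pass over the data; train_step(batch) does one update."""
    train_data.reset()      # reset the iterator before timing, as in the tutorials
    mx.nd.waitall()         # drain any pending asynchronous work first
    start = time.time()
    for batch in train_data:
        train_step(batch)
    mx.nd.waitall()         # force all queued operations to complete
    return time.time() - start
```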
Thanks Sebastian for the feedback! We're looking into this issue and will get back to you. Looping in @piiswrong @mli @smolix
The tutorial networks are so small that most of the time cost is Python overhead. Gluon is slower because it adds more abstraction.
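To make the abstraction point concrete, here is a rough sketch (hypothetical, loosely modeled on the P03 tutorials' layer sizes) of the two styles being compared; in gluon every forward call passes through `Block.__call__` and parameter bookkeeping, while the from-scratch version is bare ndarray calls:

```python
import mxnet as mx
from mxnet import nd, gluon

# From scratch: the forward pass is plain ndarray ops, minimal Python overhead.
W1, b1 = nd.random.normal(shape=(784, 256)), nd.zeros(256)
def net_scratch(X):
    return nd.relu(nd.dot(X, W1) + b1)

# Gluon: each layer is a Block; forward calls go through Block.__call__,
# which adds Python-level indirection on top of the same ndarray ops.
net_gluon = gluon.nn.Sequential()
net_gluon.add(gluon.nn.Dense(256, activation='relu'))
net_gluon.initialize()

X = nd.random.uniform(shape=(64, 784))
print(net_scratch(X).shape, net_gluon(X).shape)
```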
Depending on build flags, some operators use a single core while others use multiple cores through OpenMP.
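If threading is suspected, the usual knobs are the OpenMP and mxnet engine thread counts (a sketch; the value 4 is arbitrary, and both variables must be set before mxnet is imported):

```python
import os

# OMP_NUM_THREADS controls threads within a single operator;
# MXNET_CPU_WORKER_NTHREADS controls how many operators the engine runs in parallel.
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4"

import mxnet as mx  # must come after the environment variables are set
```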
@sebastian-lapuschkin Indeed, did you run this on a Mac, compiled with clang as opposed to gcc 6 or higher (with OpenMP enabled)?
I did not build mxnet myself, but pip-installed it via

```
pip install --user --pre mxnet-cu80
```

Is there room for improvement when installing via pip? The machine I tested on runs Ubuntu 16.04.
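A quick way to check what a prebuilt wheel was compiled with is `mx.runtime`, which ships in mxnet releases newer than the mxnet-cu80 build above (a sketch, assuming such a release is installed):

```python
import mxnet as mx

# Lists the compile-time features of the installed build,
# e.g. whether OPENMP or MKLDNN were enabled.
print([f.name for f in mx.runtime.feature_list() if f.enabled])
```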
@sebastian-lapuschkin It might be because most metrics in mxnet are implemented through numpy, and numpy may not be efficient with multithreading.
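A rough illustration of the difference (a sketch, not mxnet's actual metric code): computing a metric via numpy forces a blocking device-to-host copy for every batch, whereas an ndarray-based version stays inside mxnet's asynchronous engine:

```python
import numpy as np
import mxnet as mx

preds = mx.nd.random.uniform(shape=(256, 10))
labels = mx.nd.array(np.random.randint(0, 10, size=256))

# numpy-style metric: each .asnumpy() blocks until results reach host memory
acc_np = (preds.asnumpy().argmax(axis=1) == labels.asnumpy()).mean()

# ndarray-style metric: computed lazily in mxnet's engine; the single
# .asscalar() at the end is the only synchronization point
acc_nd = (preds.argmax(axis=1) == labels).mean().asscalar()
print(acc_np, acc_nd)
```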
@mli are the metrics converted to mx.ndarray yet? When should we close this issue?
Closing. All should be copacetic on this by now...