Closed: sebastian-lapuschkin closed this issue 6 years ago.

Running the tutorials in different device contexts, I noticed on my workstation (20-core Intel Xeon @ 2.8 GHz) that, when using the CPU, the code written from scratch tends to use only one CPU core at a time while the rest idle, whereas gluon code under the same context declaration fully uses all CPU cores.

However, with the same model (I set the hidden layers in tutorials P03-C01-scratch and P03-C02-gluon to 256 and 128 neurons and used ReLU activations in both scripts), the same evaluation function (the one from the "from scratch" tutorial), and measuring the time required to train on one epoch of data (after resetting the iterator), I found that the gluon-based code takes slightly more than double the time per epoch, even though all CPU cores are in use.

I obtained similar results for the P04 tutorials, which deal with CNN architectures: the gluon.nn-based model requires 4.8 times the time per epoch, while the model inheriting from gluon.Block "only" requires roughly double, compared to the model built from scratch.

Is this a hardware-related issue (e.g., would an i7 or an AMD processor benefit from gluon and experience no penalty)? Are the models too shallow for efficient parallelization, or is mxnet poorly optimized in its use of multiple CPU cores? Why is gluon, despite using all CPU cores instead of just one, so much slower?

Executing all notebooks in a GPU context gives gluon a slight runtime advantage. Inheriting a model from gluon.Block instead of building the same model with gluon.nn is about 20% faster in training for the P04 tutorials.
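For reference, the per-epoch timings were taken along these lines (a minimal sketch, not the tutorials' exact code; `train_step` is a hypothetical helper standing in for each script's forward/backward/update, and the `mx.nd.waitall()` calls matter because mxnet executes asynchronously):

```python
import time
import mxnet as mx

def time_epoch(train_data, train_step):
    """Time one full pass over the data; train_step(batch) does one update."""
    train_data.reset()      # reset the iterator before timing, as in the tutorials
    mx.nd.waitall()         # drain any pending asynchronous work first
    start = time.time()
    for batch in train_data:
        train_step(batch)
    mx.nd.waitall()         # force all queued operations to complete
    return time.time() - start
```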
Thanks Sebastian for the feedback! We're looking into this issue and will get back to you. Looping in @piiswrong @mli @smolix
The tutorial networks are so small that most of the time cost is Python overhead. Gluon is slower because it adds more abstraction.
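To make the abstraction point concrete, here is a rough sketch (hypothetical, loosely modeled on the P03 tutorials' layer sizes) of the two styles being compared; in gluon every forward call passes through `Block.__call__` and parameter bookkeeping, while the from-scratch version is bare ndarray calls:

```python
import mxnet as mx
from mxnet import nd, gluon

# From scratch: the forward pass is plain ndarray ops, minimal Python overhead.
W1, b1 = nd.random.normal(shape=(784, 256)), nd.zeros(256)
def net_scratch(X):
    return nd.relu(nd.dot(X, W1) + b1)

# Gluon: each layer is a Block; forward calls go through Block.__call__,
# which adds Python-level indirection on top of the same ndarray ops.
net_gluon = gluon.nn.Sequential()
net_gluon.add(gluon.nn.Dense(256, activation='relu'))
net_gluon.initialize()

X = nd.random.uniform(shape=(64, 784))
print(net_scratch(X).shape, net_gluon(X).shape)
```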
Depending on build flags, some operators use a single core while others use multiple cores through OpenMP.
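If threading is suspected, the usual knobs are the OpenMP and mxnet engine thread counts (a sketch; the value 4 is arbitrary, and both variables must be set before mxnet is imported):

```python
import os

# OMP_NUM_THREADS controls threads within a single operator;
# MXNET_CPU_WORKER_NTHREADS controls how many operators the engine runs in parallel.
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4"

import mxnet as mx  # must come after the environment variables are set
```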
@sebastian-lapuschkin Indeed, did you run this on a Mac, compiled with clang as opposed to gcc 6 or higher (with OpenMP enabled)?
I did not build mxnet myself, but pip-installed it via

```
pip install --user --pre mxnet-cu80
```

Is there room for improvement when installing via pip? The machine I tested on runs Ubuntu 16.04.
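A quick way to check what a prebuilt wheel was compiled with is `mx.runtime`, which ships in mxnet releases newer than the mxnet-cu80 build above (a sketch, assuming such a release is installed):

```python
import mxnet as mx

# Lists the compile-time features of the installed build,
# e.g. whether OPENMP or MKLDNN were enabled.
print([f.name for f in mx.runtime.feature_list() if f.enabled])
```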
@sebastian-lapuschkin It might be because most metrics in mxnet are implemented through numpy, and numpy may not be efficient with multithreading.
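A rough illustration of the difference (a sketch, not mxnet's actual metric code): computing a metric via numpy forces a blocking device-to-host copy for every batch, whereas an ndarray-based version stays inside mxnet's asynchronous engine:

```python
import numpy as np
import mxnet as mx

preds = mx.nd.random.uniform(shape=(256, 10))
labels = mx.nd.array(np.random.randint(0, 10, size=256))

# numpy-style metric: each .asnumpy() blocks until results reach host memory
acc_np = (preds.asnumpy().argmax(axis=1) == labels.asnumpy()).mean()

# ndarray-style metric: computed lazily in mxnet's engine; the single
# .asscalar() at the end is the only synchronization point
acc_nd = (preds.argmax(axis=1) == labels).mean().asscalar()
print(acc_np, acc_nd)
```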
@mli are the metrics converted to mx.ndarray yet? When should we close this issue?
Closing. All should be copacetic on this by now...