zy97140 / omp-benchmark-for-pytorch

Benchmark omp threshold for pytorch.

Does setting a lower number of cores help? #2

Closed mratsim closed 6 years ago

mratsim commented 6 years ago

Super useful analysis.

I'm trying to get a sense of where the overhead comes from.

Question 1: On the i7-5960X, does OpenMP create 8 or 16 threads?

Question 2: If we use the same number of threads on the Platinum 8180 and the E5-2699v4 as on the i7-5960X do we get the same openmp thresholds?

MlWoo commented 6 years ago

@mratsim A1: There are 8 cores and 16 hyperthreads on the i7-5960X. I tend to turn off Hyper-Threading in the BIOS because HT scheduling by the OS is itself an overhead.
A2: I'm afraid not. Keep in mind that the threshold is a trade-off between OpenMP overhead and effective computation, and both vary across platforms, so it is hard to give a quantified formula. In practice you can use the scripts to estimate the threshold by compiling two versions with different OpenMP thresholds.
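To make that trade-off concrete, here is a hedged Python sketch (not from the benchmark scripts; the function names and cost numbers are hypothetical) that estimates the crossover threshold from a toy cost model where the parallel time is a fixed per-thread spawn overhead plus the work divided across threads:

```python
def serial_time(n, cost_per_elem=1.0):
    """Toy model: serial cost grows linearly with the workload size n."""
    return n * cost_per_elem

def parallel_time(n, threads, overhead, cost_per_elem=1.0):
    """Toy model: pay a fixed spawn overhead per thread, then split the work."""
    return overhead * threads + n * cost_per_elem / threads

def estimate_threshold(threads, overhead):
    """Smallest workload size where the parallel version wins in this model."""
    n = 1
    while parallel_time(n, threads, overhead) >= serial_time(n):
        n *= 2  # exponential search for an upper bound
    lo, hi = n // 2, n
    while lo + 1 < hi:  # binary search for the exact crossover point
        mid = (lo + hi) // 2
        if parallel_time(mid, threads, overhead) < serial_time(mid):
            hi = mid
        else:
            lo = mid
    return hi

# With 8 threads and a spawn overhead worth 100 elements of work,
# parallelizing only pays off above 915 elements in this model.
print(estimate_threshold(8, 100))
```

The real thresholds have to be measured, as discussed above, but the model shows why they shift: a higher per-thread overhead or more threads pushes the crossover point up.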

mratsim commented 6 years ago

A1: Interesting, @laurae2 found that HT helps a lot (see the article), especially on XGBoost/LightGBM workloads.

A2: Thanks, I'll write some Nim scripts based on those in the future. Unfortunately PyTorch is moving super fast, and I'm still tracking down the commits behind those changes: https://github.com/pytorch/pytorch/pull/2764 and https://github.com/pytorch/pytorch/pull/5584 (?).

Instead of thresholding on the total amount of work and the CPU type, I will guarantee a minimum grain size for each CPU and select the number of threads at runtime. This assumes that OpenMP overhead is linear in the number of threads to create (which might not be true). I.e. the formula for the number of threads is `min(omp_get_max_threads(), max(1, ompsize div omp_grain_size))`; the whole implementation in Nim, for simple parallel for and for parallel chunk/block-range processing, is here: https://github.com/numforge/laser/blob/9fd7d7cea6b6f700216b9a733f33e1233f25066f/laser/openmp/omp_parallel.nim
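That thread-count formula can be sketched in Python as a direct transcription of the Nim expression (Nim's `div` is integer division, so it maps to `//`; the parameter names mirror the ones above):

```python
def num_threads(omp_size, omp_grain_size, omp_max_threads):
    """Pick a thread count that guarantees at least omp_grain_size
    elements of work per thread, capped at the available threads."""
    return min(omp_max_threads, max(1, omp_size // omp_grain_size))

print(num_threads(100, 1024, 16))        # tiny workload: stays serial (1 thread)
print(num_threads(10_000, 1024, 16))     # medium workload: 9 threads
print(num_threads(1_000_000, 1024, 16))  # large workload: capped at 16 threads
```

The idea is that a too-small workload never spawns threads at all, and the thread count grows with the work until it hits `omp_get_max_threads()`.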

MlWoo commented 6 years ago

@mratsim A1: Why did the author title the article "Destroying the Myth of 'number of threads = number of physical cores'"? "Number of threads = number of physical cores" is used very widely in traditional HPC. I am not familiar with XGBoost/LightGBM. However, I am sure those two models are not "traditional" HPC: they are based on tree searching (the more threads created, the more paths can be explored, until the overhead is too high), not on heavy math operations (each node in the compute graph of a DL model is calculated only once, and that operation is mostly heavy math).

A2: Nope. You only need to compile the latest version of PyTorch twice to get two builds with two thresholds (one very, very large and the other very, very small). You can read #1 for more details.

The assumption is not correct. The overhead of creating one HT also depends on the CPU frequency, which in turn depends on many factors. But I think you can still make that assumption within the same CPU model if you want to approximate it; you should still calibrate the formula across different CPU models.

BTW, I am no longer at Intel, so I have neither the time nor the machines to verify my opinions, and some of them may not be entirely correct. I think @mingfeima could give you more help with these problems.

Laurae2 commented 6 years ago

@MlWoo A hyperthread provides a practical 35-40% per-thread performance boost in applications. Benefiting from it requires the overhead of the additional thread to be lower than that boost, which rarely happens for a single application across many cores unless you avoid hitting another bottleneck (usually memory bandwidth or cache).
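As a rough back-of-the-envelope check of that figure (the 35-40% number is from the comment above; the helper names and the linear model are made up for illustration), running two hyperthreads per core at a 35% boost means each logical thread delivers only about two thirds of a full core:

```python
def ht_throughput(physical_cores, ht_boost=0.35):
    """Total throughput in core-equivalents with 2 threads per core,
    assuming HT adds ht_boost extra throughput per core (toy model)."""
    return physical_cores * (1 + ht_boost)

def per_thread_share(physical_cores, ht_boost=0.35):
    """Fraction of a full core each logical thread gets under that model."""
    return ht_throughput(physical_cores, ht_boost) / (2 * physical_cores)

# On an 8-core i7-5960X with a 35% HT boost: ~10.8 core-equivalents total,
# i.e. each of the 16 logical threads runs at ~67.5% of a full core.
print(ht_throughput(8), per_thread_share(8))
```

This is why HT helps throughput-bound workloads but hurts when per-thread overhead or memory pressure eats the extra ~35% before it is realized.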

That post was specifically about typical Excel users who got brainwashed by the common myth of using the same number of threads for a process as the number of physical cores (when Intel first introduced hyperthreading, it was executed so poorly that it was better to disable it on all machines).

> This is assuming that OpenMP overhead is linear with the number of threads to create (which might not be true).

OpenMP overhead is not linear. It depends on the turbo boost frequency, the chunk size, the parallel scheduler type, the last-level cache size, and related factors (NUMA node, Sub-NUMA Clustering if activated, etc.), along with thermal and power constraints.

> But I think you can still assume that in the same CPU model if you want to approximate it. But you should still calibrate the formula in different CPU models.

Calibration should be done and applied to a single model generation and similar bandwidth. For instance, a dual Xeon Platinum 8180 has:

Which means it's not going to be linear at all for that CPU model. Even worse is if specific settings for certain environments are activated in BIOS, such as:

> If we use the same number of threads on the Platinum 8180 and the E5-2699v4 as on the i7-5960X do we get the same openmp thresholds?

No, they do not generalize at all. The tests have to be redone specifically for each CPU, as they behave differently (and differently again depending on the motherboard you put them in, the BIOS settings, and the hardware they have access to, specifically the RAM and its physical location).

MlWoo commented 6 years ago

@Laurae2 I agree with you.

  1. You'd better turn off HT for HPC-style workloads such as training DL models.
  2. I do not think a preset OpenMP threshold (fixed or formulated) in a program can achieve the best CPU performance once and for all. You have to do a lot of benchmarking to calibrate the values for a specific program. As you mentioned, core utilization, SIMD instruction usage, temperature, etc. will cause the CPU to downclock.

mratsim commented 6 years ago

Thank you both.

Both `threshold` and `grain_size` in my own library are defined at compile time by environment variables, with hopefully sane defaults. Those in search of performance can always recompile and tune those variables for their CPU family.