webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Allow to hint number of threads for CPU MLContext #436

Open huningxin opened 1 year ago

huningxin commented 1 year ago

Framework use cases

Multi-core architectures are widely available in modern CPUs, and ML frameworks commonly exploit them to parallelize operator computation when inferring a model.

However, the preferred number of threads (degree of parallelism) depends on the usage scenario. For example, for small models, single-threaded execution may be preferred because the task-scheduling overhead can outweigh the speedup of parallel execution.
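This trade-off can be sketched with a toy cost model (a rough illustration only; the overhead constant and function name are assumptions, not measurements from any framework):

```javascript
// Toy cost model: parallel latency = work / threads + per-thread scheduling overhead.
// `workMs` and `overheadMsPerThread` are illustrative assumptions, not measurements.
function estimatedLatencyMs(workMs, threads, overheadMsPerThread = 1) {
  return workMs / threads + overheadMsPerThread * threads;
}

// Small model (~2 ms of work): one thread wins.
const small1 = estimatedLatencyMs(2, 1);   // 3 ms
const small4 = estimatedLatencyMs(2, 4);   // 4.5 ms — slower with more threads

// Large model (~100 ms of work): four threads win.
const large1 = estimatedLatencyMs(100, 1); // 101 ms
const large4 = estimatedLatencyMs(100, 4); // 29 ms
```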

So ML frameworks usually let users control the number of threads according to their requirements. For example, ONNX Runtime allows configuring intra_op_num_threads for its CPU execution provider, and TensorFlow Lite provides a setNumThreads method on its interpreter.

Native ML APIs

Native CPU ML APIs/libraries commonly employ a thread pool for thread-level parallelism, and the pool usually allows configuring its number of threads, for example:

XNNPACK utilizes pthreadpool, which allows configuring threads_count when creating the thread pool.

MLAS utilizes onnxruntime::concurrency::ThreadPool, which can construct a thread pool that runs with degree_of_parallelism threads.

BNNS allows setting n_threads, which controls the number of worker threads used to execute a kernel.

Other references

The Model Loader API already extends MLContextOptions with numThreads, which allows JS code to set the number of threads to use when computing a model.

Proposal

WebNN may adopt the MLContextOptions.numThreads extension and allow frameworks to hint the number of threads used to run operators in parallel for a CPU MLContext.
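Framework code using such a hint might look roughly like the following. This is a hypothetical sketch: numThreads is not part of the current WebNN spec, and navigator.ml only exists in WebNN-enabled browsers.

```javascript
// Hypothetical sketch of the proposed extension; `numThreads` is not in the
// current WebNN spec, and `navigator.ml` only exists in WebNN-enabled browsers.
function buildContextOptions(numThreads) {
  return { deviceType: 'cpu', numThreads };
}

async function createCpuContext(numThreads) {
  const ml = globalThis.navigator?.ml;
  if (!ml) return null; // WebNN not available in this environment
  return ml.createContext(buildContextOptions(numThreads));
}

// A framework could hint single-threaded execution for a small model:
const opts = buildContextOptions(1);
// → { deviceType: 'cpu', numThreads: 1 }
```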

/cc @pyu10055 @wacky6

fdwr commented 10 months ago

MLContextOptions.threadCount seems a useful option to me, at least as an ignorable hint. Do JS users have enough information to set an appropriate value for it? Setting a thread count higher than the actual number of cores would be useless (though the API could just clamp to the actual count). The two most useful and most commonly set values would presumably be either 1 or the number of physical cores.
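The clamping behavior described above could be as simple as the following helper (a sketch; the name resolveThreadCount and the clamping policy are assumptions, not spec text):

```javascript
// Sketch of a clamp-to-cores policy for a thread-count hint.
// `resolveThreadCount` is a hypothetical helper, not part of WebNN.
function resolveThreadCount(requested, coreCount) {
  if (!Number.isInteger(requested) || requested < 1) return 1;
  return Math.min(requested, coreCount);
}

resolveThreadCount(16, 8); // → 8 (clamped to actual cores)
resolveThreadCount(1, 8);  // → 1 (single-threaded)
```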

(for naming, I'd follow the "use whole words" identifier advice and avoid fragments like "num")

dontcallmedom commented 10 months ago

If it's a hint rather than a configuration, and if setting the exact number depends on information not exposed to developers (the number of cores), maybe a better approach would be an enum that hints toward single- or multi-threaded execution?
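Inside an implementation, such an enum hint might map to a concrete thread count along these lines (the enum values and function name are purely illustrative; nothing like this exists in the spec):

```javascript
// Hypothetical mapping from an execution-mode enum hint to a thread count.
// The values 'single-threaded' / 'multi-threaded' are illustrative only.
function threadsForHint(hint, coreCount) {
  switch (hint) {
    case 'single-threaded': return 1;
    case 'multi-threaded':  return coreCount; // implementation-chosen degree
    default: throw new TypeError(`unknown hint: ${hint}`);
  }
}

threadsForHint('single-threaded', 8); // → 1
threadsForHint('multi-threaded', 8);  // → 8
```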

wacky6 commented 10 months ago

navigator.hardwareConcurrency is available to developers, so I think the number of cores is a known value?

Single vs multi-thread doesn't provide sufficient granularity.

@huningxin Is there public data we can share here? Like an X-Y plot of (thread count, performance) for the models we want to support?

dontcallmedom commented 10 months ago

my understanding is that hardwareConcurrency is not supported in Safari, so it may still be advantageous not to rely on that number being exposed; but I defer to actual experts on the level of granularity needed for effective optimization.

huningxin commented 10 months ago

> @huningxin Is there public data we can share here? Like an X-Y plot of (thread count, performance) for the models we want to support?

Yes. We collected inference latency for some MediaPipe models on the Chromium WebNN XNNPACK CPU prototype with different numbers of threads (1, 2, and 4).

In the current Chromium prototype implementation, the number of threads is capped to the minimum of 4 and the number of available cores. And because the parallel inference jobs are scheduled by Chromium's ThreadPool, there is no guarantee that the number of threads set by the user will actually be allocated.
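The capping policy described above amounts to something like the following (a sketch of the described behavior, not Chromium's actual code):

```javascript
// Sketch of the cap described for the Chromium prototype:
// effective threads = min(requested, 4, available cores).
function effectiveThreads(requested, availableCores, cap = 4) {
  return Math.min(requested, cap, availableCores);
}

effectiveThreads(8, 16); // → 4 (capped at 4)
effectiveThreads(2, 16); // → 2
effectiveThreads(4, 2);  // → 2 (capped at available cores)
```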

In the following chart, multi-threaded inference speedup is normalized to single-thread (numThreads=1) performance. As the chart illustrates, for some models, such as SelfieSegmenter (landscape), MobileNetV3 (small_075), BlazeFace (short-range), Blendshape, and FaceDetector, setting a higher number of threads doesn't help. These models are usually small.

[Chart: multi-threaded inference speedup vs. thread count (1, 2, 4), normalized to single-thread performance, for the MediaPipe models listed above]

dontcallmedom commented 10 months ago

if I read the chart correctly, there is only one case where setting the number of threads to something other than 1 or the max leads to better performance (GestureClassifier) - can anyone hint as to why 2 threads are optimal for that particular model?

huningxin commented 9 months ago

@dontcallmedom

> can anyone hint as to why 2 threads are optimal for that particular model?

I suppose the context-switching / job-scheduling overhead would outweigh the inference-time reduction from adding two more threads / jobs for that particular model.

fdwr commented 9 months ago

> can anyone hint as to why 2 threads are optimal for that particular model?

🤔 It's also possible, due to graph topology, that an odd number of threads assigns nodes such that more sequential dependencies occur with 3 threads (edit: oops, you said 4 above), whereas with 2 threads, more long-running operators happen to align nicely.

@huningxin : Would this new MLContextOptions.threadCount represent inter-operator threading or intra-operator threading? (or really, however the backend chooses to interpret it?)

huningxin commented 9 months ago

@fdwr

> 🤔 It's also possible due to graph topology that an odd number of threads assigns nodes such that more sequential dependencies occur with 3 threads, whereas with 2 threads, more long-running operators happen to align nicely.

This seems possible, although we didn't test with 3 threads.

> @huningxin : Would this new MLContextOptions.threadCount represent inter-operator threading or intra-operator threading? (or really however the backend chooses to interpret it?)

That's a good point. The current prototype implementation interprets it as intra-operator threading. Should we allow developers to hint inter-operator threading and intra-operator threading separately?
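If the two were hinted separately, the options might mirror ONNX Runtime's intra_op / inter_op split, e.g. (purely hypothetical field names, not spec text):

```javascript
// Hypothetical split of the hint into intra-op and inter-op thread counts,
// mirroring ONNX Runtime's intra_op_num_threads / inter_op_num_threads.
// Neither field exists in the WebNN spec; this is an illustration only.
function buildCpuContextOptions({ intraOpNumThreads = 1, interOpNumThreads = 1 } = {}) {
  return { deviceType: 'cpu', intraOpNumThreads, interOpNumThreads };
}

const opts = buildCpuContextOptions({ intraOpNumThreads: 4 });
// → { deviceType: 'cpu', intraOpNumThreads: 4, interOpNumThreads: 1 }
```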