huningxin opened 1 year ago
`MLContextOptions.threadCount` seems a useful option to me, as an ignorable hint at least. Do JS users have enough information to set an appropriate value for it? Setting a higher thread count than the actual core count would be useless (though the API could simply clamp to the actual count). The two most useful and most commonly set values would presumably be either 1 or the number of physical cores.
(for naming, I'd follow the "use whole words" identifier advice and avoid fragments like "num")
If it's a hint rather than a configuration, and if setting the exact number depends on information not exposed to developers (the number of cores), maybe a better approach would be an enum that hints toward single- or multi-threaded execution?
`navigator.hardwareConcurrency` is available to developers, so I think the number of cores is a known value?
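For illustration, a clamping helper along these lines could combine a developer's request with the reported core count (a sketch only; `pickThreadCount` is a hypothetical name, not part of any spec):

```javascript
// Hypothetical helper: clamp a requested thread count to what the
// platform reports. Not part of WebNN; for illustration only.
function pickThreadCount(requested, reportedCores) {
  // Fall back to 1 when the core count is unavailable
  // (navigator.hardwareConcurrency is not exposed everywhere).
  const cores = reportedCores || 1;
  // Requesting more threads than cores would be useless, so clamp.
  return Math.min(Math.max(1, requested), cores);
}

// In a browser:
// pickThreadCount(8, navigator.hardwareConcurrency)
```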
Single vs multi-thread doesn't provide sufficient granularity.
@huningxin Is there public data we can share here? Like an X-Y plot of (thread count, performance) for the models we want to support?
My understanding is that `hardwareConcurrency` is not supported in Safari, so it may still be advantageous not to rely on that number being exposed; but I defer to actual experts on the level of granularity that would be needed for effective optimization.
> @huningxin Is there public data we can share here? Like an X-Y plot of (thread count, performance) for the models we want to support?
Yes. We collected inference latency of some MediaPipe models on the Chromium WebNN XNNPACK CPU prototype with different thread-count settings (1, 2 and 4).
In the current Chromium prototype implementation, the thread count is capped at the minimum of 4 and the number of available cores. And because the parallel inference jobs are scheduled by Chromium's ThreadPool, there is no guarantee that the number of threads set by the user will actually be allocated.
In the following chart, the multi-threaded inference speedup is normalized to single-thread (`numThreads=1`) performance. As the chart illustrates, for some models, such as SelfieSegmenter (landscape), MobileNetV3 (small_075), BlazeFace (short-range), Blendshape and FaceDetector, setting a higher number of threads doesn't help. These models are usually small.
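The capping rule described for the Chromium prototype amounts to something like the following (a sketch of the described behavior, not the actual Chromium code):

```javascript
// Sketch of the described capping rule: the prototype caps the thread
// count at min(4, available cores). The real implementation also
// leaves scheduling to Chromium's ThreadPool, so fewer threads than
// this may actually run.
function cappedThreadCount(availableCores) {
  return Math.min(4, availableCores);
}
```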
If I read the chart correctly, there is only one case where setting the number of threads to something other than 1 or the maximum leads to better performance (GestureClassifier). Can anyone hint as to why 2 threads are optimal for that particular model?
@dontcallmedom
> can anyone hint as to why 2 threads are optimal for that particular model?
I suppose the context-switching / job-scheduling overhead would outweigh the inference-time reduction from adding two more threads / jobs for that particular model.
> can anyone hint as to why 2 threads are optimal for that particular model?
🤔 It's also possible due to graph topology that an odd number of threads assigns nodes such that more sequential dependencies occur with 3 threads (edit: oops, you said 4 above), whereas with 2 threads, more long-running operators happen to align nicely.
@huningxin : Would this new `MLContextOptions.threadCount` represent interoperator threading or intraoperator threading? (or really however the backend chooses to interpret it?)
@fdwr
> 🤔 It's also possible due to graph topology that an odd number of threads assigns nodes such that more sequential dependencies occur with 3 threads, whereas with 2 threads, more long-running operators happen to align nicely.
This seems possible, although we didn't test with 3 threads.
> @huningxin : Would this new `MLContextOptions.threadCount` represent interoperator threading or intraoperator threading? (or really however the backend chooses to interpret it?)
This is a good point. The current prototype implementation interprets it as intra-operator threading. Should we allow developers to hint inter-operator threading and intra-operator threading separately?
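If the two kinds of parallelism were hinted separately, the context options could hypothetically look like this (neither field exists in WebNN today; both names are invented purely for illustration):

```javascript
// Hypothetical option names, invented for illustration only — not
// part of the WebNN spec or of any proposal in this thread:
const options = {
  deviceType: 'cpu',
  intraOpThreadCount: 4, // threads used inside a single operator's kernel
  interOpThreadCount: 2, // independent operators executed concurrently
};
```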
Framework use cases
Multi-core architectures are widely available in modern CPUs and are commonly utilized by ML frameworks to parallelize operator computation when inferring a model.
However, the appropriate number of threads (degree of parallelism) may depend on the usage scenario; e.g., for small models, single-threaded execution may be preferred because the task-scheduling overhead can outweigh the speedup of parallel execution.
So, ML frameworks usually allow users to control the number of threads according to their requirements. For example, ONNX Runtime allows configuring `intra_op_num_threads` for its CPU execution provider, and TensorFlow Lite provides a `setNumThreads` method for its interpreter.

Native ML APIs
Native CPU ML APIs and libraries commonly employ a thread pool for thread-level parallelism, and the thread pool usually allows configuring its number of threads, for example:

- XNNPACK utilizes pthreadpool, which allows configuring `threads_count` when creating the thread pool.
- MLAS utilizes `onnxruntime::concurrency::ThreadPool`, which constructs a thread pool running with `degree_of_parallelism` threads.
- BNNS allows setting `n_threads`, which controls the number of worker threads used to execute a kernel.

Other references
Model Loader API already extends `MLContextOptions` with `numThreads`, which allows JS code to set the number of threads to use when computing a model.

Proposal
WebNN may adopt the `MLContextOptions.numThreads` extension and allow frameworks to hint the number of threads used to run operators in parallel for a CPU `MLContext`.

/cc @pyu10055 @wacky6
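Script usage of the proposed hint could look roughly like this (a sketch: `numThreads` is the proposed option, not yet in the WebNN spec, and an implementation may clamp or ignore it):

```javascript
// Choose a hint: the requested value, clamped to the reported core
// count (falling back to 1 where hardwareConcurrency is unavailable).
const requested = 4;
const cores = (typeof navigator !== 'undefined' &&
               navigator.hardwareConcurrency) || 1;
const numThreads = Math.min(requested, Math.max(1, cores));

// Proposed usage (not yet in the WebNN spec):
// const context = await navigator.ml.createContext({
//   deviceType: 'cpu',
//   numThreads, // hint only; the backend may clamp or ignore it
// });
```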