MLContextOptions.deviceType seems unnecessary outside of conformance testing

WebKit has a few objections to MLContextOptions.deviceType:

MLContextOptions.deviceType is currently unimplementable via CoreML

CoreML is the only public framework on macOS for running computations on the NPU / ANE. As pointed out in https://github.com/webmachinelearning/webnn/issues/623#issuecomment-2025767956 and possibly elsewhere, CoreML allows you to specify MLComputeUnitsCPUOnly, MLComputeUnitsCPUAndGPU, MLComputeUnitsAll, and MLComputeUnitsCPUAndNeuralEngine, but there is no option that selects only the Neural Engine (no MLComputeUnitsNeuralEngine, for instance). Even assuming we could request that CoreML add ANE-only support, it would ship only in a future, currently unreleased version of macOS.

Also note that MLComputeUnitsCPUAndNeuralEngine is only available on macOS 13.0 and later, and I'm not sure whether some browser vendors want to support earlier versions of macOS.
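To make the gap concrete, here is a minimal TypeScript sketch of the closest mapping a CoreML-backed implementation could offer for each deviceType. The function name `closestComputeUnits` and the `exact` flag are hypothetical and only illustrate the point above; the CoreML option names are written as plain strings, and this is not WebKit's actual implementation.

```typescript
// WebNN device types as discussed in this thread.
type DeviceType = "cpu" | "gpu" | "npu";

// The four CoreML compute-unit options named above, as plain strings.
type CoreMLComputeUnits =
  | "MLComputeUnitsCPUOnly"
  | "MLComputeUnitsCPUAndGPU"
  | "MLComputeUnitsAll"
  | "MLComputeUnitsCPUAndNeuralEngine";

interface Mapping {
  units: CoreMLComputeUnits;
  // True only when the option restricts execution to the requested device alone.
  exact: boolean;
}

// Hypothetical helper: picks the nearest CoreML option for a requested device.
function closestComputeUnits(deviceType: DeviceType): Mapping {
  switch (deviceType) {
    case "cpu":
      return { units: "MLComputeUnitsCPUOnly", exact: true };
    case "gpu":
      // CoreML may still schedule work on the CPU.
      return { units: "MLComputeUnitsCPUAndGPU", exact: false };
    case "npu":
      // No ANE-only option exists, so the CPU is always a possible target.
      return { units: "MLComputeUnitsCPUAndNeuralEngine", exact: false };
  }
}
```

Only "cpu" maps exactly, which is why a strict reading of deviceType cannot be honored for "gpu" or "npu" on this backend.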

MLContextOptions.deviceType, as currently specified, would lead to additional fragmentation: it works on some devices and fails on others

Under Section 4 we see the wording:

if this type cannot be satisfied, an "OperationError" DOMException is thrown,

It would be preferable to fall back instead of failing. Otherwise, a website that specifies NPU and runs on a variety of NPUs, but not on the NPU of a certain device, would not run on that device unless the author handles fallback. This seems unnecessary when the browser could fall back to the GPU or CPU itself.
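For illustration, this is roughly the fallback logic every website author would have to write under the current wording. It is a minimal sketch: `createContextWithFallback` is a hypothetical helper, and the context factory is passed in as a parameter so the example does not depend on any particular browser API shape.

```typescript
type DeviceType = "cpu" | "gpu" | "npu";

// Any function that creates a context for a device type and rejects
// (for example with an "OperationError") when the type cannot be satisfied.
type CreateContext<T> = (options: { deviceType: DeviceType }) => Promise<T>;

// Hypothetical helper: tries each device type in order and returns the first
// context that can be created, together with the device type that worked.
async function createContextWithFallback<T>(
  createContext: CreateContext<T>,
  preferences: DeviceType[],
): Promise<{ context: T; deviceType: DeviceType }> {
  let lastError: unknown = new Error("no device types requested");
  for (const deviceType of preferences) {
    try {
      const context = await createContext({ deviceType });
      return { context, deviceType };
    } catch (error) {
      // Remember the failure and move on to the next preference.
      lastError = error;
    }
  }
  throw lastError;
}
```

In a page, the factory would presumably be something like `(options) => navigator.ml.createContext(options)` with a preference list of `["npu", "gpu", "cpu"]`. An author who omits this loop gets a hard failure on devices whose NPU is unsupported, which is the fragmentation concern raised above.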

The browser has better insight into workloads than the website author

It was pointed out in https://github.com/webmachinelearning/webnn/pull/322 that a website author may want their models to not run on the GPU so they don't impact WebGPU rendering performance. While a good idea in theory, the browser implementation and native frameworks have much greater insight into the graphics chip utilization and the most appropriate resource to run a model on without impacting target frameworks.

The answer to this question, where to run the model, cannot be easily decided by the website, as the website does not know the underlying hardware it is running on due to privacy reasons and possibly other reasons. Additionally, while a browser vendor can trivially compute the amount of time a GPUCommandBuffer takes to submit from WebGPU, such timing is not available to the website. For example, if all the GPUCommandBuffers in the WebGPU workload complete in <3ms on a Mac Studio, and the website calls requestAnimationFrame every ~16ms, then restricting the WebNN model from running on the GPU seems unnecessary. However, on a less powerful device, like an iPhone 12 mini, if the WebGPU workload is falling behind the requested update rate, the browser implementation may move the model processing to the device least in use. To the website, a browser implementation may present both devices as having the same capabilities for privacy reasons, so there is no way for the website to make the correct decision about which device to run on.
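The kind of decision described above could look like the following sketch on the browser side. Everything here is hypothetical: the `Telemetry` shape, the `pickDevice` name, and the 50% headroom threshold are assumptions chosen for illustration, not how any browser actually schedules work.

```typescript
type Device = "cpu" | "gpu" | "npu";

// Runtime information available to the browser but not to the website.
interface Telemetry {
  gpuFrameTimeMs: number; // time the page's GPUCommandBuffers take per frame
  frameBudgetMs: number; // interval between requestAnimationFrame callbacks
  utilization: Record<Device, number>; // assumed per-device load, 0 to 1
}

// Hypothetical policy: keep the model on the GPU while the WebGPU workload
// leaves enough headroom, otherwise move it to the least utilized device.
function pickDevice(t: Telemetry, headroom: number = 0.5): Device {
  if (t.gpuFrameTimeMs <= t.frameBudgetMs * headroom) {
    return "gpu";
  }
  const devices: Device[] = ["cpu", "gpu", "npu"];
  return devices.reduce((best, d) =>
    t.utilization[d] < t.utilization[best] ? d : best,
  );
}
```

With a 3ms GPU workload and a 16ms frame budget the sketch keeps the model on the GPU; when the GPU workload exceeds the budget it picks whichever device reports the lowest load. The website cannot make this choice because it sees none of these inputs.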

Often with graphics intensive WebGPU applications, we find the GPU is still the least utilized processor as the CPU overhead can be high. Long running shaders or complex compute kernels are two exceptions. In any case this is device dependent and not something a website author can easily predict.

To summarize, it seems MLDeviceType is useful for testing in browser implementations, but counter-productive for framework authors and general websites / wasm applications.

To summarize the points above:

deviceType option is hard to standardize because of the heterogeneity of the compute units across various platforms, and even across their versions.
fallback is preferable instead of failing, and implementations/the underlying platforms should determine the fallback type based on runtime information.
implementations, browsers, and the OS have a better grasp of the system/compute/runtime/apps state than websites, therefore control should be relinquished to them.

Some of these points were mentioned in #696 (adding "npu" to deviceType) and discussed in #623 (support for NPU, QDQ, minimum op set).

We had a short discussion on this during the last call (section 4). Some questions for further consideration:

Are there use cases where web applications are part of larger solutions and would like to affect the type of AI accelerators used with the model selected (by them or by their user)? For the reasons above, and also because of security issues, this could possibly be only a hint.
If there are such use cases, can we defer them to non-breaking future changes in the API or its behavior?
How do you propose to change the API: would we keep/change the powerPreference ContextOptions and remove deviceType like in #322?
Or, is it possible to keep deviceType (the name can also change) but standardize it better, so that it reflects the use case/developer intentions without getting into possible fragmentation issues? What about calling it devicePreference to emphasize that this is a preference, not a required option? It would be handled as a hint and mapped to the real underlying compute structure in whatever way the implementation/underlying platform decides is best. For instance, an "npu" preference might mean NPU on one platform, NPU+CPU on another, or the nearest fallback (which can be GPU). We might add an informative mapping array to the spec for known frameworks.
Or, remove both and rethink whether we could pass any other (standardized) information about the model/graph that might help the underlying impl/platform figure out how best to run the graph, and/or express developer intentions for compute preferences?
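As a rough illustration of the devicePreference idea, the sketch below resolves a hint against the compute-unit combinations a platform can actually offer. The names `resolvePreference` and `fallbackOrder`, the fallback ordering, and the "smallest matching set" rule are all assumptions made for the example; none of them come from the spec.

```typescript
type Device = "cpu" | "gpu" | "npu";

// A combination of compute units the platform can be asked to use together.
type ComputeSet = Device[];

// Assumed fallback order per preference; a real policy could differ.
const fallbackOrder: Record<Device, Device[]> = {
  npu: ["npu", "gpu", "cpu"],
  gpu: ["gpu", "cpu"],
  cpu: ["cpu"],
};

// Hypothetical resolver: treats the preference as a hint and returns the
// smallest supported set containing the preferred device, or the nearest
// fallback when the preferred device is not offered at all.
function resolvePreference(
  preference: Device,
  supported: ComputeSet[],
): ComputeSet | undefined {
  for (const wanted of fallbackOrder[preference]) {
    const candidates = supported
      .filter((set) => set.indexOf(wanted) !== -1)
      .sort((a, b) => a.length - b.length);
    if (candidates.length > 0) {
      return candidates[0];
    }
  }
  return undefined;
}
```

On a platform offering NPU-only execution, an "npu" hint resolves to the NPU alone; on a CoreML-like platform it resolves to CPU plus NPU; on a platform with no NPU it falls back to the GPU. In every case the call succeeds instead of throwing.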

I wonder if we could achieve some of the use cases by generalizing the concept and abstracting it a little?

E.g. (not a formal proposal),

interface ML {
  Promise<MLContext> createContext(optional MLContextOptions options = {});
  Promise<MLContext> createContext(GPUDevice gpuDevice);
};

becomes:

interface ML {
  Promise<MLContext> createContext(optional MLContextOptions options = {});
  Promise<MLContext> createContext(GPUDevice gpuDevice);

  record<USVString, MLOpSupportLimits> opSupportLimitsPerAdapter();
};

where an implementation may choose to implement opSupportLimitsPerAdapter in a way such that it returns a dictionary of MLOpSupportLimits for each compute device on a system (CPU, GPU, NPU).

Alternatively an implementation may choose to return a single value in the dictionary. Furthermore, an implementation may choose to return values which preserve battery life, offer higher performance at the expense of battery life, avoid impacting other NPU or GPU operations, etc.

Then, MLContextOptions can become:

dictionary MLContextOptions {
  MLOpSupportLimits requiredLimits = {};
};

where requiredLimits corresponds to a value returned by opSupportLimitsPerAdapter and indicates to the implementation which limits will be required during the lifetime of the MLContext.

Then the website/app author / framework developer chooses the set of MLOpSupportLimits that maps most directly to their needs and passes it to the ML.createContext call.

I think that would support the DirectML use case, where an implementation needs to know which device the tensors will run on, while also supporting CoreML, which doesn't allow requiring the NPU. Additionally, it can preserve privacy, as an implementation can decide the granularity of the information returned from opSupportLimitsPerAdapter.
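A framework consuming this proposal might select its requiredLimits along the following lines. This is a sketch under stated assumptions: `SimplifiedLimits` is a deliberately reduced stand-in for MLOpSupportLimits (operator name mapped to supported data types), and `chooseRequiredLimits` is a hypothetical helper, not part of the proposal.

```typescript
// Simplified stand-in for MLOpSupportLimits: operator name -> data types.
type SimplifiedLimits = Record<string, string[]>;

// What the model needs: each operator together with the data type it uses.
interface Need {
  op: string;
  dataType: string;
}

// Hypothetical helper: returns the first entry from the per-adapter record
// whose limits cover every operator/data type pair the model needs.
function chooseRequiredLimits(
  perAdapter: Record<string, SimplifiedLimits>,
  needs: Need[],
): SimplifiedLimits | undefined {
  for (const key of Object.keys(perAdapter)) {
    const limits = perAdapter[key];
    const satisfiesAll = needs.every(
      (need) => (limits[need.op] ?? []).indexOf(need.dataType) !== -1,
    );
    if (satisfiesAll) {
      return limits;
    }
  }
  return undefined;
}
```

The chosen value would then be passed as `requiredLimits` to createContext. The keys of the record are treated as opaque here, so the sketch works whether the implementation returns one entry per compute device or a single coarse entry.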

I'd be fine with requiredLimits, but we have a problem with enumerating devices.

The spec currently says that as a security mitigation,

In order to not allow an attacker to target a specific implementation that may contain a flaw, the § 6.2 Device Selection mechanism is a hint only, and the concrete device selection is left to the implementation - a user agent could for instance choose never to run a model on a device with known vulnerabilities. As a further mitigation, no device enumeration mechanism is defined.

So if possible, we are trying to avoid exposing system specific information to web pages. IMHO we already have stretched the limits of being vulnerable to fingerprinting in PR #755 that solves issue #463.

I wonder if we could achieve the use cases with a preference/hint from the web page, and an algorithm/policy documented in the spec. Specifically, could we distill a simplifying pattern across models for mapping the web page's needs to op support limits, as formulated below?

the website/app author / framework developer chooses the set of MLOpSupportLimits that maps most directly to their needs and passes it to the ML.createContext call.
