webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Google Chrome Feedback on WebNN: aiming for broad device coverage and maintainability #453

Open vsekhar opened 1 year ago

vsekhar commented 1 year ago

This issue is a high-level summary of the Chrome team's feedback on WebNN, posting here for further discussion with WG members.

--

Google strongly supports the work of the Web ML WG to bring on-device ML capabilities to the open Web and we recognize the long-term contributions from many participants, in particular Intel who spearheaded the WebNN effort.

Since browser vendors will need to keep the resulting API up-to-date over many years, Google feels the proposal warrants special scrutiny to ensure the API remains relevant while imposing a manageable long-term support and maintenance cost, including after the initial WG contributors may have moved on to other projects.

To that end, several senior technical staff members on the Google Chrome team who are familiar with Web APIs, the Web standards process, and the technical implementation of various advanced browser APIs and capabilities, have carefully reviewed the WebNN proposal. This document summarizes their feedback. While we draw on the expertise of other ML research and infrastructure teams at Google (e.g. those working on TensorFlow, TensorFlow Lite, JAX, OpenXLA), we do not aim to speak for them or their projects.

Our feedback on the WebNN proposal is informed by our observation that, for new or single-vendor OS APIs or hardware accelerators, we must assume that most Web users don't have them. While we too aim to create compelling and performant experiences for users of the latest hardware and OS platforms, we have an obligation to ensure a workable experience for other users as well.

Our goal for an ML API for the Web is not to demonstrate performance with specific accelerators or drivers. Instead, Chrome's goal is to achieve 80% of a device's theoretical hardware-accelerated ML runtime performance across 80% of devices on the Web, and to do so while imposing a manageable long-term support burden on browser vendors. Users of other devices, with hardware accelerators or architectures that differ significantly from the mainstream and are not integrated by browser or OS vendors, should still benefit from workable execution of ML models on the CPU and GPU.

The ML ecosystem is still rapidly evolving, making it difficult for any API to keep up. For example, the long short-term memory (LSTM) approach to ML has largely been superseded by Transformers, and softmax has been succeeded by various approximate and memory-access-efficient variants and implementations. Accelerators and hardware architectures continue to evolve as well.

Consider what would be involved in adding a new high-level operator like FlashAttention to the current API. Implementers would need to connect it to each equivalent OS API operator (where one exists), implement it as a GPU shader (when a GPU is available), or emulate it in CPU code. Current plans across the ecosystem for new models, operations, and hardware may already present an intractable roadmap for WebNN implementers who prioritize broad reach.

To address this issue, we favor adopting RISC-style tensor operations in the mathematical domain, drawing on the basics of tensor math that are unlikely to change in the near term, in contrast to less stable, higher-level CISC-style operations like hardSwish or softmax that are frequently superseded by newer operations. The ML community is building consensus around certain low-level operator sets that work across frameworks and toolchains, and we believe this work could benefit WebNN, particularly those operator sets that specifically target long-term stability.
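To illustrate the distinction, a CISC-style operator such as hardSwish can be expressed entirely in terms of RISC-style elementwise primitives (add, clamp, multiply, divide). The following pure-Python sketch is illustrative only, not the WebNN API; the function names are hypothetical:

```python
def clamp(xs, lo, hi):
    """Elementwise clamp -- a RISC-style primitive."""
    return [min(max(x, lo), hi) for x in xs]

def add_scalar(xs, c):
    """Elementwise add of a scalar constant."""
    return [x + c for x in xs]

def mul(xs, ys):
    """Elementwise multiply of two tensors."""
    return [x * y for x, y in zip(xs, ys)]

def div_scalar(xs, c):
    """Elementwise divide by a scalar constant."""
    return [x / c for x in xs]

def hard_swish(xs):
    """CISC-style hardSwish(x) = x * clamp(x + 3, 0, 6) / 6,
    decomposed into the primitives above."""
    return div_scalar(mul(xs, clamp(add_scalar(xs, 3.0), 0.0, 6.0)), 6.0)
```

If only the primitives are specified, a newly fashionable activation function becomes a library- or framework-level composition rather than a browser-level API addition that implementers must support indefinitely.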

We recognize that, in their current form, OS APIs for ML may not yet be conducive to RISC-style tensor operations. However we hope the WebNN effort will produce an API design that is performant, portable and stable, and that it will in turn have a positive influence on the evolution and long-term maintainability of OS APIs as well. We expect to evolve our own OS APIs in this way as well.

Based on the above, we recommend building on the WebNN proposal in the following ways:

  1. Request public positions from major browser implementers on the WebNN spec as currently proposed
  2. Reduce the long term support burden of WebNN by streamlining the API surface
    • Consider evolving towards operator sets emerging from the ML community, especially those targeting long-term stability
    • Remove model-specific instructions like lstm and gru
    • Remove CISC-style operators like hardSwish and softmax
    • Limit tensor layout specifications to functions that read or write buffers
    • Complete the set of basic scalar and tensor math operations
  3. Demonstrate WebNN performance for CPU and GPU execution across multiple OS platforms
    • Suggestion: consider implementing WebNN as a polyfill on top of WebAssembly and WebGPU to reuse the compatibility work already done for these APIs
  4. Demonstrate WebNN performance gains utilizing OS- and hardware-specific optimizations
    • Extend WebNN implementations in a pluggable fashion, where HW and OS vendors contribute, maintain and deprecate backends for their platforms
    • Gated on demonstrated performance gains on targeted platforms and no regressions or performance cliffs on other platforms or fallback WebAssembly/WebGPU implementation
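The pluggable-backend approach in item 4 might look roughly like the following sketch: a registry tries vendor-contributed backends in order and always falls back to a baseline WebAssembly/WebGPU implementation. All interface and class names here are hypothetical, not from the WebNN spec:

```python
class Backend:
    """Hypothetical interface a HW/OS vendor backend would implement."""
    name = "base"
    def supported(self, device):
        raise NotImplementedError
    def run(self, graph, inputs):
        raise NotImplementedError

class WasmFallback(Backend):
    """Baseline CPU path, assumed always available."""
    name = "wasm-fallback"
    def supported(self, device):
        return True
    def run(self, graph, inputs):
        return f"ran {graph} via {self.name}"

class VendorNPU(Backend):
    """Example vendor-maintained accelerator backend."""
    name = "vendor-npu"
    def supported(self, device):
        return device == "npu"
    def run(self, graph, inputs):
        return f"ran {graph} via {self.name}"

def select_backend(backends, device):
    """Pick the first backend that claims support for the device.
    The fallback is listed last and matches everything, so selection
    never fails and vendors can deprecate backends without breaking users."""
    return next(b for b in backends if b.supported(device))

backends = [VendorNPU(), WasmFallback()]
print(select_backend(backends, "npu").name)  # vendor-npu
print(select_backend(backends, "cpu").name)  # wasm-fallback
```

Gating each vendor backend on demonstrated gains then reduces to comparing `run` measurements against the fallback on the same graph.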

With regard to OS- and hardware-specific optimizations, we further propose an engineering approach that clearly demonstrates their value to Web users, giving the ecosystem a reason to adopt and maintain them over the long term:

  1. Select 2-5 demonstrative ML models, for example (source):
    • Segment Anything
    • Stable Diffusion
    • Whisper Tiny
  2. Run on a demonstrative set of platforms with accelerator hardware:
    • Apple Neural Engine on M2 MacBook Pro via CoreML
    • Intel VPU on Meteor Lake desktop via DirectML
    • Mainstream mobile devices running iOS and Android
    • ... other platforms at the suggestion of the Working Group
  3. Evaluate latency, throughput and power efficiency between:
    • Lowering WebNN for execution on typical CPUs and GPUs on the above platforms
    • Lowering WebNN for execution on hardware accelerators on the above platforms
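Step 3's latency and throughput comparison can be sketched with a simple timing harness. This is illustrative only; a real evaluation would drive the actual WebNN runtimes on each lowering path and additionally sample power counters for efficiency:

```python
import time

def benchmark(run_fn, warmup=2, iters=10):
    """Time repeated inference calls.

    Returns (mean latency in seconds, throughput in inferences/sec).
    Warmup iterations are discarded so one-time compilation or
    driver initialization does not skew the steady-state numbers.
    """
    for _ in range(warmup):
        run_fn()
    start = time.perf_counter()
    for _ in range(iters):
        run_fn()
    elapsed = time.perf_counter() - start
    latency = elapsed / iters
    return latency, 1.0 / latency

# Stand-in workload; a real harness would invoke the compiled model.
latency, throughput = benchmark(lambda: sum(range(100_000)))
```

Running the same harness once per lowering path (CPU/GPU baseline vs. hardware accelerator) on each platform yields the comparison matrix the proposal calls for.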

We look forward to continuing the discussion with WG participants on deploying powerful ML capabilities across the Web's many platforms and benefiting all of the Web's users.

anssiko commented 1 year ago

Thank you @vsekhar and @inexorabletash. We discussed this feedback at WebML WG Teleconference – 24 August 2023.

I encourage the WG to use this issue for general discussion and cross-link to this issue from topic-specific GH issues as appropriate.

anssiko commented 5 months ago

I observe that a subset of these recommendations has been or is being discussed in topic-specific issues; for example, #456 and #573 are currently open. Furthermore, the group has focused on the models and hardware targets mentioned in this high-level summary, in both specification and prototyping efforts.

I would like to revisit this high-level issue in a future meeting to see what has been done, what remains to be done, and to discuss any new information that may have come up since, and also to understand whether a revision to this high-level summary would be appropriate.

anssiko commented 4 months ago

To follow up on https://www.w3.org/2024/05/16-webmachinelearning-minutes.html#t09, I'd ask @mwyrzykowski to file an issue for WebKit's standards position repo. Mike is an active WG participant, familiar with both this API and Apple platforms, and as such well positioned to file the issue in a way that includes the details important to Apple.

@inexorabletash, I'd like to invite one or more Mozillians to our meeting to discuss this topic and familiarize them with the API. Perhaps someone from WebGPU-land might be interested?