Hybrid AI Exploration

Authors

Michael McCool
Geoff Gustafson
Sudeep Divakaran

Introduction

ML on the client supports many use cases better than server-based approaches, and with lower cost for the application provider. However, clients can vary significantly in capabilities. A hybrid approach that can flexibly shift work between server and client can support elasticity and avoid the problem of developers targeting only the weakest clients’ capabilities.

The overall goal of hybrid AI is to maximize the user experience in machine learning applications by providing the web developer the tools to manage the distribution of data and compute resources between servers and the client.

For example, ML models are large. This creates network cost, transfer time, and storage problems. As mentioned, client capabilities can vary. This creates adaptation, partitioning, and versioning problems. We would like to discuss potential solutions to these problems, such as shared caches, progressive model updates, and capability/requirements negotiation.

Requirements and Goals

For the end user, most of the existing WebNN use cases share common user requirements:

Enhance User Experience
- Reduce load times
- Meet latency targets for human interactions
Portability and Elasticity
- Minimize compute, storage, and network transfer costs
- Support clients of different capability levels, including older/newer clients
- Adaptive to varying resource availability
Data Privacy
- User choice for location of data storage and computation
- Video and audio streams both high bandwidth and generally private
- Personal data (personally-identifiable information)
- Confidential business information

Even though it is not a primary requirement, developer ease of use is a factor for adoption. An approach that easily allows a developer to shift load between the server and the client using simple, consistent abstractions will allow for more Hybrid AI applications to be developed faster than one with completely different programming models.

Open Issues

Current implementations of hybrid AI applications (see User Research and References) have the following problems when targeting many of the WebNN use cases:

If the model runs on the server, then large amounts of (possibly private) data may need to be streamed to the server. This incurs a per-use latency.
If the model runs on the client, large models need to be downloaded, possibly multiple times in different contexts. This incurs a startup latency.
Users need control over private data, so the choice of whether or not a model needs to run on the client may have to be overridden by the user's preferences in some cases.
Clients vary in capabilities, so the developer does not know in advance how to split up the computational work.
Models are large and can consume significant storage on the client, which needs to be managed.
Applications may use multiple models that need to communicate with each other, but may each run on either the client or server.
Multiple applications may be present and need to share resources such as storage, memory, and compute. This may cause the actual capabilities of the client to vary over time.
It may be necessary to hide the exact capabilities of the client from the developer to avoid fingerprinting. However, the platform must be able to match models with client capabilities. An application may provide a choice of multiple models to support elasticity. See Performance Adaptation.

Non-goals

Protecting proprietary models downloaded to the client from interception is a non-goal (but may be addressed in implementations or later work).
Automatic factoring of models is a non-goal. The developer needs to break models into pieces that are managed atomically, each of which runs on either the server or client.
Automatic optimization of models is a non-goal. The developer needs to consider how to minimize the size of models; however they may be able to use generic features expected on clients such as small data types.
Model training is a non-goal. The system will be focused on inference. However, fine tuning may be used in limited circumstances.
Complete modelling of the client’s capabilities is a non-goal.
Extreme scalability is a non-goal. While there may be multiple applications on the client there should be only a handful that a single user can use at once.
Extreme performance is a non-goal. While important, other goals such as portability and security, which are also important to the user experience, create trade-offs.
Managing models outside the web client is a non-goal. For example, internal platform models may be present and used to support system features but the proposed system would not manage them. However, access to those separately managed platform models may be useful.

User Research and References

Existing WebNN Use Cases - Set of agreed-upon use cases for WebML. Most of these have latency or privacy requirements and several require large models (e.g. language translation).
Storage APIs for caching and sharing large models across origins discussion in the WebML WG - Some previous discussion within the WebML on the problem of model caching. Includes discussion of experience with prototype and highlights cross-site sharing as a key issue.
Google on Hybrid AI - A general discussion of Hybrid AI based on using smaller models on client and falling back to server only when necessary. Mentions need to cache models, and potentially breaking up model between client and server. Mentions potential privacy benefits of partitioned models, surveys several applications already using hybrid approach.
Qualcomm - Getting Personal with On-Device AI - Describes several interesting personalized AI client use cases.
Moor Insights on AI PC - Describes the AI PC opportunity and lists several client AI applications, with a focus on Windows and Microsoft.
Priority of Constituencies - The general W3C framework for prioritizing requirements.
WNIG Cloud-Edge Coordination Use Cases - Two explicit ML use cases but others may have ML aspects, e.g. video editing. While these emphasize edge computing (offload from the client) several can also be interpreted as use cases simply needing additional performance on the client.

@mmccool As discussed in today's call, here are two examples of audio models that would benefit from improved storage and caching mechanisms, primarily due to their reliance on sub-models and/or adapters:

MMS:

mms-1b-all is a 1B parameter model that uses adapters (~2M parameters each) to enable automatic speech recognition across over 1000 languages.

SeamlessM4T:

As stated in the HF model docs:

SeamlessM4T-v2 enables multiple tasks without relying on separate models:

Speech-to-speech translation (S2ST)

Speech-to-text translation (S2TT)

Text-to-speech translation (T2ST)

Text-to-text translation (T2TT)

Automatic speech recognition (ASR)

SeamlessM4Tv2Model can perform all the above tasks, but each task also has its own dedicated sub-model.

Discussed on WebML WG Teleconference – 7 March 2024.

Thanks to the authors for the presentation and the entire group for your feedback that will inform the direction of this exploration.

I'm interested in the problem of sharing big / relatively big models across sites. Past a certain size, this problem puts the viability of client-side AI/ML under question, even if the device is more than capable of running the model.

While it's really hard to solve the generic problem of sharing common resources across origins, I'm hopeful that we can find a solution for AI/ML models. In particular, I believe that the following elements would help:

Understanding what the developer is trying to do among a list of predefined use cases.
An understanding of what a given model is known to be capable of / good at, under which circumstances (e.g. maybe limited to specific languages).
A way for the client (browser) to lookup the best model(s) for a given use case, modulo specific constraints (e.g. storage, performance related factors, languages, etc).
A way for the developer to suggest a model they would like to use, and their priorities between solving the use case with any relevant model vs. solving it with their preferred model (if any).

With these elements, one could design something where:

The use case can be addressed if the client has a model that's known to be good at said use case. The client could still decide to schedule a download for the suggested model if it's meaningfully better than what the client has handy (modulo storage, hardware capabilities, user preferences, etc).
If the client doesn't have any decent model, it can either find a model for the use case, or consider going with the model suggested by the developer. Because the download might take a while (or may not happen if the device doesn't have enough storage, insufficient hardware capabilities, etc), the developer would likely want to fallback to a server side option.

I believe that these elements would avoid some of the problems with trying to tack on cross-origin sharing on top of what currently exists for regular web resources:

when resources are specified by a URL, you end up with multiple locations for the same thing (i.e. self hosting, hosted by CDN X/Y/Z).
when resources are versionized, you end up with web sites & apps with hardcoded versions resulting in having many variants of a given resource used across the web, and without an easy way to understand backward/forward compatibility.
when a resource is specified by a location, you can't easily tell what the resource is, and how it compares to what you might already have. In some cases, the resources are bundled in a format that the client doesn't natively support which makes it even harder for the client to understand what's going on and how these might relate to previously downloaded resources.
when a resource is specified by a location (non canonical), you run the risk of ossification where it's hard to move the ecosystem toward a better option (or a safer option) when one arises.

I'm curious to heard what folks think about this high level approach.

Thank you @xenova and @KenjiBaheux for your insights, much appreciated. The project team has acknowledged your input and my expectation is the team will share updates on their progress in this issue and will check back with you. We may also schedule another group discussion in the near future.

On another related topic, to everyone watching, please note a newly published write-up Understanding and managing the impact of Machine Learning models on the Web by @dontcallmedom that welcomes review and feedback.

This document in part discusses topics that intersect with this Hybrid AI exploration and may provide complementary perspectives to this exploration, quoting:

When looking more specifically at the browser-mediated part of the Web which remains primarily a client/server architecture, AI models can be run either on the server-side or on the client-side (and somewhat more marginally at this point, in a hybrid-fashion between the two). On the client side, they can either be provided and operated by the browser (either at the user's request, or at the application's request), or entirely by the client-side application itself.

It's also worth noting that as AI systems are gaining rapid adoption, their intersection with the Web is bound to evolve and possibly trigger new systemic impact; for instance, emerging AI systems that combine Machine Learning models and content loaded from the Web in real-time may induce revisiting in depth the role and user experience of Web browsers in consuming or searching content.

Thank you @dontcallmedom for producing this document.

Hello all, thanks for the great discussions above, I'm Jason Mayes, Web AI Lead at Google - just wanted to weigh in with some thoughts I have been thinking about the past few years given we now have this centralized space for discussion:

I believe there are 2 forms of Hybrid AI that will naturally exist:

a) In the first instance you have models either running on client side machine or server. Some sort of check occurs to see if machine is powerful enough to run a given model, the model is downloaded, and inference happens entirely locally on device. If the device is not powerful enough, those lower power devices fall back to a server side API for inference (hence hybrid approach).

b) In the second instance I also see a hybrid approach evolving in the name of model security, where by the model itself is split over the client and server. Lets take a simple multi layered perceptron style model. In this hypothetical example maybe you run the lower layers of the model on the client side, which allows you to somewhat encode the raw data into some high level embedding representation which is quite nice for the user as they gain some level of privacy of the raw data (though I guess a dedicated attacker could somehow reverse engineer depending on model architecture), and then the final classification head is kept on the server. This means if the model is stolen from the client side for a proprietary model it is not terribly useful to the person who stole it without the classification head. The benefits of this approach are that the company providing the service get model security while also offloading compute to the client side for significant cost savings (that will get better with time as hardware evolves), and client is not sending raw data to server (some level of privacy retained).

In both cases above, for things like Gen AI models eg LLMs this still represents a significant one time download for these larger models (though traditional AI models like body pose, segmentation, object detection etc can be pretty small and performant). I therefore think it makes a lot of sense for these larger more complex models to be part of the web browser natively and exposed to JS developers in such a way that:

a) They can call / query the model via a standardized API for common base models that can be relied upon b) Can be used across domains as the model is not tied to a domain c) Is opt in. I could envision a permission window (much like webcam access) for when a website needs to leverage such a model, it asks the user permission to download the 1GB file that is a one time thing and can be used by all websites that need to use an LLM. The website could request a specific model from a list of whitelisted models by that browser, or load the default if none specified.

Which then brings me on to my final point: If such an API was implemented this has the advantage that these resulting models could then be finetuned using LoRA weights which are much much smaller to download than the base model itself. Maybe a few MB in size only. Allowing websites to refine a model for a given context to perform well at a given task without GB downloads per site. The main concern that needs ironing out here though is that if the browser did not allow user to specify a model to download and use from some known whitelist, then if the browser changes or updates the model (or some other site overwrites it) then the specific prompt engineering needed for a model to work well could be different - leading to things not working as expected. This is something that would need more thought though I think with the right API and flow in place could work.

Cheers. Would love to hear your thoughts.

@jasonmayes thanks for joining the discussion and sharing your insights! There are many exciting opportunities to explore. I've asked the project team to follow up with a summary of feedback provided so we can refine the next steps together. I'll invite you to our future meetings when this topic is on the agenda next, as well as others interested.

The project team obviously agrees with your prediction that 2024 will be the year of Hybrid AI approaches. Your article is acknowledged in the references :-) Also thanks for your contributions over the years, including the Opportunities & Challenges for TensorFlow.js and beyond talk at our 2020 workshop that informed the creation of the WebML WG and influenced the technical direction of the WebNN API.

Looking forward to creating more awesome things together in this space.

Thanks for your input! Here is a summary of the comments above as we understand them. If there are any points we missed please let us know. It would also be helpful to know which of these are higher priority.

Models/Use Cases that would benefit from proposal:
- MMS – ASR
  - Large (1B parameters)
  - Uses adapters for different languages (2M each, over 1000 languages)
- SeamlessM4T – general speech tasks
  - Sub-models for specific tasks: S2ST, S2TT, T2ST, T2TT, ASR
  - Sub-model is a model component, e.g. a partial model
Pain points
- Sharing large models across sites
  - General solution is hard -> specific solution for ML probably easier and acceptable
- Generalizing models
  - Use case -> potential model choices
  - Model categories (e.g. any ASR model, any noise reducer, …)
  - Dev prioritization to select among options
  - Versioning (e.g. any model >= 1.3.* is acceptable)
  - However - prompt engineering may be specific to particular versions of LLMs
- Same resources
  - At different URLs -> duplicates in cache when using URL as key (e.g. Service Worker Cache)
    - Specification by URL -> ossification, ecosystem brittleness
    - Encoded in different formats
    - Optimized for different clients
- Want to expose built-in models
  - If models provided by system (OS or browser), should avoid downloads
- Need solution that can handle adapters
  - Multiple applications sharing a base model
  - Foundational models large, adapters are small
Comments
- Two forms of hybrid AI
  - Server OR client
    - E.g. server as fallback if client not powerful enough
    - Need to check if client is powerful enough before download
  - Split models
    - Can also enhance model security (complete model not downloaded)
    - Some privacy advantages for user (raw data not sent to cloud)

Discussed on WebML WG Teleconference – 21 March 2024.

Summary: Acknowledged the insightful feedback provided in this issue, noted the summary of the feedback is available. The project team to share a proposal for the initial technical approaches to be explored with a prototype for further review and comment.

We felt it would be helpful if we summarized the technical approach we are exploring (although – this is just a prototype to test-fly some ideas, and feedback is welcome), and then describe how this would address (some of) the points mentioned. There were several issues raised, and we feel we should prioritize. We will start with the “large model”, “cross-origin”, “adapter”, “models bound to URLs”, and the “built-in foundational model” problems.

Our basic idea is to cache individual nodes in the computational graph (specifically weight/bias tensors, the majority of the storage cost) separately, using keys (specifically, hashes) based on their content. This is similar to the Service Worker caches already tested, however we are looking at an approach that can be cross origin and is keyed by each node’s content, not the URL it is loaded from.

The advantage of this approach is that it can be implemented at the API level and so is independent of the serialization. We would be computing hashes over the buffer contents passed to the API, not the serialization string. This also would automatically account for models sharing components or foundational models, as long as e.g. adapters are expressed as part of the model graph (e.g. as constant tensor expressions) and not baked into other tensors. It would also optimize the downloading of sub-models if those share components (e.g. embedders/encoders/decoders).

The way the API works in practice would also support “built-in” models. Basically, the cache API would be extended to allow “loading” a particular node given its hash. If it exists in the cache, OR is a built-in model, the API call would succeed. If it is not available locally, then the call would fail. In this case, the application would catch the error and have to download that particular node. We are also considering extensions that would allow entire graphs or collections of nodes to be hashed and cached as a group (again, with “built-in” models behaving as if they were “already” in the cache – but caching full graphs in these cases would provide better protection).

This does, however, have its own problems. First, it leads to a need for “modular” file formats and representations of models so that nodes can be downloaded separately. It is, however, not too difficult to automatically expand current file formats into parts on the server. It also means for adapters to work with the cache developers should not bake them in, but express them as constant computations. On the other hand, this gives benefits like sharing parts among models and downloading parts in parallel. Finally, the client code needs to know the hashes of the nodes it wants. However, this is the same as needing to know the URLs of the model, but hashes avoid being tied to particular servers. In practice hashes can be baked into the code or stored in metadata files.

We feel that some of the other pain points mentioned, e.g. version management and “category-based” selection of models can be built on top of this capability. For example, a semantic versioning system could have a database of hashes and provide an interface to select a model from a version wildcard, e.g. “1.3.*”.

Do people feel this direction is worth exploring? Does anyone see any specific problems with the above approach?

@mmccool That is an interesting approach. I had not considered the hashing of sub graphs of the model to cache and then download only the subgraphs that are missing. If that can actually work in a way that is compatible with common converted model formats that could be interesting (I may be biased here due to my exposure with people I have interacted with so please do expand as needed) but right now I see that as:

Pytorch -> Microsoft ONNX Web
TensorFlow -> TensorFlow.js (or existing TJFS models that are already in that form - not from Python land)
Raw WebGPU or WASM implementations that are bespoke (I see a growing number of people from Rust language converting directly to WebGPU / WASM for example such as the Whisper Web Turbo project for speech recognition or MediaPipe's custom WASM/WebGPU implementation for Gemma LLM.

In the first 2 cases - this is likely more well defined, with point 3 being quite wide in how that may come about.

Again please do extrapolate from here though as I can only comment on the things I have seen myself - I am sure there may be others that emerge or exist that I am unaware of.

On that note however, it may be easier to offer an official "conversion" binary that can take saved models from these common formats and "compile them" to a web safe format to be used with this new Web AI standard that would work with such a proposed implementation, That way if something new comes along in the future it could be supported if critical mass for usage is obtained by the web community / proves to be useful. The downside is of course is as new things come one would need to add a conversion path if it was something substantially new/different where no other converters exist.

Agreed, we need to figure out how this will work with existing model representations and file formats. We are looking into the details of the systems you mentioned as well as how Hugging Face represents models. We want to avoid defining yet another model representation.

We are still finding our way around the various model representations and would appreciate any input or guidance you and others with more experience can provide. That said, we feel there are couple of possible approaches here.

It seems most existing file formats allow for separate storage of tensors and metadata/topology already, in fact this seems necessary for large models due to buffer size limitations. Most representations also seem flexible enough to accommodate additional metadata in their "header" (or whatever part of the representation gets loaded first before the weights are). So one option would be to add hash metadata to the headers of existing model representations. This can be done over time - if the hash metadata does not exist in a particular model representation, it will still work, but the browser may download the model redundantly. This will still populate the cache if the model is not already in it, which will benefit any later use, even for another site using the same model. However, developers should be motivated to add the necessary metadata to files since it will improve the user experience of their users by avoiding wait times for downloads. It seems that some representations already include hashes for validation purposes so this is not that large a change. Of course our hashes could also support validation of downloads. With this approach a "converter" could just "upgrade" a model by adding hash metadata.

If updating existing representations is not possible, or if a model with an "older" format is to be loaded, then hash metadata can be computed and stored separately, perhaps as part of a manifest file. Hugging Face already has JSON manifests for nodes, for example, and the hashes could be stored there. In this case the "converter" would just generate a relatively simple manifest file containing the hashes associated with each node and/or the entire graph. In practice, if only the hash for the entire graph is needed, it can be embedded in the JS code (just like the name of the model or URL would be).

It would be good to know if you had any particular models in mind to look at for test cases. For example, we have been looking at mistral-7b and models derived from it with adapters, and how this model is represented in different formats.

All comments and feedback welcome.

Per our discussion a dedicated repo has been created under the Web Machine Learning Community Group to continue this discussion in a structured manner (i.e. discussions split into topic-specific issues etc.):

🆕 https://github.com/webmachinelearning/hybrid-ai

Thanks everyone for your feedback and comments! Please watch the new repo.

I added a basic readme with ground rules. Simply put, the new repo is for discussion on Hybrid AI topics and possible specification incubation work that may follow needs a recharter.

@grgustaf and @mmccool please migrate applicable content from this proposal issue to the dedicated repo and loop interested folks in. You can close this issue when the migration is completed. Thank you!

We are moving this content to the above repo and reorganizing it. Please go to https://github.com/webmachinelearning/hybrid-ai for further comments. We will leave this issue open for now.

webmachinelearning / proposals