webmachinelearning / proposals

🚀 Proposals for future work

Hybrid AI Exploration #5

Open grgustaf opened 4 months ago

grgustaf commented 4 months ago

Hybrid AI Exploration

Authors

Introduction

ML on the client supports many use cases better than server-based approaches, and with lower cost for the application provider. However, clients can vary significantly in capabilities. A hybrid approach that can flexibly shift work between server and client can support elasticity and avoid the problem of developers targeting only the weakest clients’ capabilities.

The overall goal of hybrid AI is to maximize the user experience in machine learning applications by providing the web developer the tools to manage the distribution of data and compute resources between servers and the client.

For example, ML models are large. This creates network cost, transfer time, and storage problems. As mentioned, client capabilities can vary. This creates adaptation, partitioning, and versioning problems. We would like to discuss potential solutions to these problems, such as shared caches, progressive model updates, and capability/requirements negotiation.

Requirements and Goals

For the end user, most of the existing WebNN use cases share common user requirements:

Even though it is not a primary requirement, developer ease of use is a factor for adoption. An approach that easily allows a developer to shift load between the server and the client using simple, consistent abstractions will allow for more Hybrid AI applications to be developed faster than one with completely different programming models.

Open Issues

Current implementations of hybrid AI applications (see User Research and References) have the following problems when targeting many of the WebNN use cases:

Non-goals

User Research and References

mmccool commented 3 months ago

PDF of presentation slides: WebML Discussion - Hybrid AI for the Web - Slides.pdf

xenova commented 3 months ago

@mmccool As discussed in today's call, here are two examples of audio models that would benefit from improved storage and caching mechanisms, primarily due to their reliance on sub-models and/or adapters:

MMS:

mms-1b-all is a 1B parameter model that uses adapters (~2M parameters each) to enable automatic speech recognition across over 1000 languages.

SeamlessM4T:

As stated in the HF model docs:

SeamlessM4T-v2 enables multiple tasks without relying on separate models:

  • Speech-to-speech translation (S2ST)
  • Speech-to-text translation (S2TT)
  • Text-to-speech translation (T2ST)
  • Text-to-text translation (T2TT)
  • Automatic speech recognition (ASR)

SeamlessM4Tv2Model can perform all the above tasks, but each task also has its own dedicated sub-model.

anssiko commented 3 months ago

Discussed on WebML WG Teleconference – 7 March 2024.

Thanks to the authors for the presentation and the entire group for your feedback that will inform the direction of this exploration.

KenjiBaheux commented 3 months ago

I'm interested in the problem of sharing big / relatively big models across sites. Past a certain size, this problem calls the viability of client-side AI/ML into question, even if the device is more than capable of running the model.

While it's really hard to solve the generic problem of sharing common resources across origins, I'm hopeful that we can find a solution for AI/ML models. In particular, I believe that the following elements would help:

With these elements, one could design something where:

I believe that these elements would avoid some of the problems with trying to tack on cross-origin sharing on top of what currently exists for regular web resources:

I'm curious to hear what folks think about this high level approach.

anssiko commented 3 months ago

Thank you @xenova and @KenjiBaheux for your insights, much appreciated. The project team has acknowledged your input and my expectation is the team will share updates on their progress in this issue and will check back with you. We may also schedule another group discussion in the near future.

On another related topic, to everyone watching, please note a newly published write-up Understanding and managing the impact of Machine Learning models on the Web by @dontcallmedom that welcomes review and feedback.

This document in part discusses topics that intersect with this Hybrid AI exploration and may provide complementary perspectives to this exploration, quoting:

When looking more specifically at the browser-mediated part of the Web which remains primarily a client/server architecture, AI models can be run either on the server-side or on the client-side (and somewhat more marginally at this point, in a hybrid-fashion between the two). On the client side, they can either be provided and operated by the browser (either at the user's request, or at the application's request), or entirely by the client-side application itself.

It's also worth noting that as AI systems are gaining rapid adoption, their intersection with the Web is bound to evolve and possibly trigger new systemic impact; for instance, emerging AI systems that combine Machine Learning models and content loaded from the Web in real-time may induce revisiting in depth the role and user experience of Web browsers in consuming or searching content.

Thank you @dontcallmedom for producing this document.

jasonmayes commented 3 months ago

Hello all, thanks for the great discussions above. I'm Jason Mayes, Web AI Lead at Google. I just wanted to weigh in with some thoughts I have been mulling over for the past few years, given we now have this centralized space for discussion:

  1. I believe there are 2 forms of Hybrid AI that will naturally exist:

a) In the first instance, models run either on the client machine or on a server. Some sort of check occurs to see whether the machine is powerful enough to run a given model; if so, the model is downloaded and inference happens entirely locally on device. If the device is not powerful enough, it falls back to a server-side API for inference (hence the hybrid approach).

b) In the second instance I also see a hybrid approach evolving in the name of model security, whereby the model itself is split across client and server. Let's take a simple multi-layer perceptron style model. In this hypothetical example you might run the lower layers of the model on the client side, encoding the raw data into a high-level embedding representation. This is quite nice for the user, who gains some level of privacy for the raw data (though a dedicated attacker could potentially reverse engineer it, depending on the model architecture). The final classification head is kept on the server, so if a proprietary model is stolen from the client side it is not terribly useful to the thief without the classification head. The benefits of this approach are that the company providing the service gets model security while also offloading compute to the client side for significant cost savings (which will improve as hardware evolves), and the client is not sending raw data to the server (some level of privacy is retained).
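The split-model idea above can be sketched in a few lines. This is purely illustrative: the functions, shapes, and weights are made up, and a real deployment would send the embedding to the server over the network rather than calling a local function.

```javascript
// Client-side stand-in for the "lower layers": one linear layer + ReLU.
// Only the resulting embedding, not the raw input, would leave the device.
function clientEmbed(input, weights) {
  return weights.map(row =>
    Math.max(0, row.reduce((sum, w, i) => sum + w * input[i], 0))
  );
}

// Server-side stand-in for the proprietary classification head:
// compute logits from the embedding and return the argmax class index.
function serverClassify(embedding, headWeights) {
  const logits = headWeights.map(row =>
    row.reduce((sum, w, i) => sum + w * embedding[i], 0)
  );
  return logits.indexOf(Math.max(...logits));
}

// In practice the hand-off would be a network call, e.g.:
//   fetch('/classify', { method: 'POST', body: JSON.stringify(embedding) })
const embedding = clientEmbed([1, 2], [[0.5, 0.5], [1, -1]]);
const label = serverClassify(embedding, [[1, 0], [0, 1]]);
```

Stealing the client-side weights alone yields only an embedder; without the server-held head, the classifier is incomplete.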

  2. In both cases above, for generative AI models such as LLMs, this still represents a significant one-time download (though traditional models like body pose, segmentation, and object detection can be pretty small and performant). I therefore think it makes a lot of sense for these larger, more complex models to be part of the web browser natively and exposed to JS developers in such a way that:

a) They can be called / queried via a standardized API for common base models that can be relied upon.

b) They can be used across domains, as the model is not tied to a domain.

c) They are opt-in. I could envision a permission prompt (much like webcam access) for when a website needs to leverage such a model: it asks the user's permission to download the 1GB file, a one-time thing that can then be used by all websites that need an LLM. The website could request a specific model from a list of models whitelisted by that browser, or load the default if none is specified.

  3. Which then brings me to my final point: if such an API were implemented, these models could be fine-tuned using LoRA weights, which are much, much smaller to download than the base model itself, perhaps only a few MB in size. This would allow websites to refine a model for a given context to perform well at a given task without GB-scale downloads per site. The main concern that needs ironing out is that if the browser did not allow the user to specify a model from a known whitelist, and the browser changed or updated the model (or another site overwrote it), then the specific prompt engineering needed for the model to work well could break, leading to things not working as expected. This needs more thought, though I think with the right API and flow in place it could work.
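To see why LoRA adapters are so much smaller than base models, consider the arithmetic: instead of shipping a new d_out x d_in weight matrix, a site ships two low-rank factors B (d_out x r) and A (r x d_in), and the client reconstructs W' = W + scale * (B x A). A minimal sketch (all matrices here are tiny, made-up examples):

```javascript
// Naive matrix multiply: B is (m x r), A is (r x n), result is (m x n).
function matmul(B, A) {
  return B.map(bRow =>
    A[0].map((_, j) => bRow.reduce((sum, b, k) => sum + b * A[k][j], 0))
  );
}

// Apply a LoRA delta to base weights W: W' = W + scale * (B x A).
function applyLora(W, B, A, scale = 1.0) {
  const delta = matmul(B, A);
  return W.map((row, i) => row.map((w, j) => w + scale * delta[i][j]));
}

// A 2x2 base matrix patched by a rank-1 adapter: the adapter carries only
// 2 + 2 values instead of 4, and the gap grows quadratically with dimension.
const W = [[1, 0], [0, 1]];
const B = [[1], [2]];     // d_out x r, with r = 1
const A = [[0.5, 0.5]];   // r x d_in
const patched = applyLora(W, B, A);
```

For a 4096 x 4096 layer with rank 8, the adapter is roughly 2 * 4096 * 8 values versus 16.7M for the full matrix, which is why per-site fine-tunes can be a few MB.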

Cheers. Would love to hear your thoughts.

anssiko commented 3 months ago

@jasonmayes thanks for joining the discussion and sharing your insights! There are many exciting opportunities to explore. I've asked the project team to follow up with a summary of feedback provided so we can refine the next steps together. I'll invite you to our future meetings when this topic is on the agenda next, as well as others interested.

The project team obviously agrees with your prediction that 2024 will be the year of Hybrid AI approaches. Your article is acknowledged in the references :-) Also thanks for your contributions over the years, including the Opportunities & Challenges for TensorFlow.js and beyond talk at our 2020 workshop that informed the creation of the WebML WG and influenced the technical direction of the WebNN API.

Looking forward to creating more awesome things together in this space.

mmccool commented 3 months ago

Thanks for your input! Here is a summary of the comments above as we understand them. If there are any points we missed please let us know. It would also be helpful to know which of these are higher priority.

anssiko commented 3 months ago

Discussed on WebML WG Teleconference – 21 March 2024.

Summary: Acknowledged the insightful feedback provided in this issue, noted the summary of the feedback is available. The project team to share a proposal for the initial technical approaches to be explored with a prototype for further review and comment.

mmccool commented 3 months ago

We felt it would be helpful to summarize the technical approach we are exploring (although this is just a prototype to test-fly some ideas, and feedback is welcome), and then describe how it would address some of the points mentioned. Several issues were raised, and we feel we should prioritize. We will start with the “large model”, “cross-origin”, “adapter”, “models bound to URLs”, and “built-in foundational model” problems.

Our basic idea is to cache individual nodes in the computational graph (specifically weight/bias tensors, the majority of the storage cost) separately, using keys (specifically, hashes) based on their content. This is similar to the Service Worker caches already tested; however, we are looking at an approach that can be cross-origin and is keyed by each node’s content, not the URL it is loaded from.

The advantage of this approach is that it can be implemented at the API level and so is independent of the serialization. We would be computing hashes over the buffer contents passed to the API, not the serialization string. This also would automatically account for models sharing components or foundational models, as long as e.g. adapters are expressed as part of the model graph (e.g. as constant tensor expressions) and not baked into other tensors. It would also optimize the downloading of sub-models if those share components (e.g. embedders/encoders/decoders).

The way the API works in practice would also support “built-in” models. Basically, the cache API would be extended to allow “loading” a particular node given its hash. If it exists in the cache, OR is a built-in model, the API call would succeed. If it is not available locally, then the call would fail. In this case, the application would catch the error and have to download that particular node. We are also considering extensions that would allow entire graphs or collections of nodes to be hashed and cached as a group (again, with “built-in” models behaving as if they were “already” in the cache – but caching full graphs in these cases would provide better protection).
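The load-or-download flow described above might look roughly like this. All names are hypothetical, and a plain `Map` stands in for the shared, hash-keyed browser cache (with built-in models behaving as pre-populated entries):

```javascript
// Stand-in for a cross-origin, content-hash-keyed tensor cache.
const tensorCache = new Map();

// Try to load a node by its content hash; on a miss, the application
// downloads just that node and populates the cache so later loads,
// even from another site, succeed locally.
async function loadTensor(hash, fetchNode) {
  if (tensorCache.has(hash)) {
    return tensorCache.get(hash); // cache hit, or a built-in model node
  }
  const buffer = await fetchNode(hash);
  tensorCache.set(hash, buffer);
  return buffer;
}
```

In the API sketched in the comment above, the miss would surface as a failed call that the application catches before fetching; the control flow is the same either way.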

This does, however, have its own problems. First, it leads to a need for “modular” file formats and representations of models so that nodes can be downloaded separately. It is, however, not too difficult to automatically expand current file formats into parts on the server. It also means that, for adapters to work with the cache, developers should not bake them in but should express them as constant computations. On the other hand, this gives benefits like sharing parts among models and downloading parts in parallel. Finally, the client code needs to know the hashes of the nodes it wants. However, this is no different from needing to know the URLs of the model, and hashes avoid being tied to particular servers. In practice hashes can be baked into the code or stored in metadata files.

We feel that some of the other pain points mentioned, e.g. version management and “category-based” selection of models can be built on top of this capability. For example, a semantic versioning system could have a database of hashes and provide an interface to select a model from a version wildcard, e.g. “1.3.*”.
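As a concrete illustration of the versioning layer suggested above, a registry could map semantic versions to content hashes, with a wildcard like “1.3.*” resolving to the newest matching entry. Everything here (registry shape, version strings, hashes) is invented for the sketch:

```javascript
// Hypothetical version-to-hash database for one model family.
const registry = {
  '1.3.0': 'hash-aaa',
  '1.3.1': 'hash-bbb',
  '1.4.0': 'hash-ccc',
};

// Resolve a version pattern such as '1.3.*' (or an exact version)
// to the content hash of the newest matching release.
function resolveVersion(registry, pattern) {
  const prefix = pattern.replace(/\.\*$/, '.');
  const matches = Object.keys(registry)
    .filter(v => v.startsWith(prefix))
    .sort((a, b) => a.localeCompare(b, undefined, { numeric: true }));
  const latest = matches[matches.length - 1];
  return latest ? registry[latest] : undefined;
}

const hash = resolveVersion(registry, '1.3.*'); // 'hash-bbb'
```

The resolved hash then feeds directly into the content-addressed cache lookup, so version management stays a thin layer on top rather than a separate mechanism.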

Do people feel this direction is worth exploring? Does anyone see any specific problems with the above approach?

jasonmayes commented 3 months ago

@mmccool That is an interesting approach. I had not considered hashing subgraphs of the model to cache, and then downloading only the subgraphs that are missing. If that can actually work in a way that is compatible with common converted model formats, that could be interesting. Right now I see those formats as (I may be biased here based on the people I have interacted with, so please do expand as needed):

  1. PyTorch -> Microsoft ONNX Web
  2. TensorFlow -> TensorFlow.js (or existing TFJS models that are already in that form, not from Python land)
  3. Raw WebGPU or WASM implementations that are bespoke (I see a growing number of people converting directly from Rust to WebGPU / WASM, for example the Whisper Web Turbo project for speech recognition, or MediaPipe's custom WASM/WebGPU implementation for the Gemma LLM).

In the first two cases this is likely more well defined, with point 3 being quite open-ended in how it may come about.

Again please do extrapolate from here though as I can only comment on the things I have seen myself - I am sure there may be others that emerge or exist that I am unaware of.

On that note, however, it may be easier to offer an official "conversion" binary that can take saved models from these common formats and "compile" them to a web-safe format for use with such a proposed implementation. That way, if something new comes along in the future, it could be supported once it gains critical mass of usage in the web community / proves to be useful. The downside, of course, is that as new things come along, one would need to add a conversion path for anything substantially new or different where no other converters exist.

mmccool commented 3 months ago

Agreed, we need to figure out how this will work with existing model representations and file formats. We are looking into the details of the systems you mentioned as well as how Hugging Face represents models. We want to avoid defining yet another model representation.

We are still finding our way around the various model representations and would appreciate any input or guidance you and others with more experience can provide. That said, we feel there are a couple of possible approaches here.

It seems most existing file formats already allow for separate storage of tensors and metadata/topology; in fact this seems necessary for large models due to buffer size limitations. Most representations also seem flexible enough to accommodate additional metadata in their "header" (or whatever part of the representation gets loaded before the weights do). So one option would be to add hash metadata to the headers of existing model representations.

This can be done over time: if the hash metadata does not exist in a particular model representation, it will still work, but the browser may download the model redundantly. This will still populate the cache if the model is not already in it, which will benefit any later use, even for another site using the same model. Developers should be motivated to add the necessary metadata, since it will improve the experience of their users by avoiding wait times for downloads.

It seems that some representations already include hashes for validation purposes, so this is not that large a change, and of course our hashes could also support validation of downloads. With this approach a "converter" could simply "upgrade" a model by adding hash metadata.

If updating existing representations is not possible, or if a model with an "older" format is to be loaded, then hash metadata can be computed and stored separately, perhaps as part of a manifest file. Hugging Face already has JSON manifests for nodes, for example, and the hashes could be stored there. In this case the "converter" would just generate a relatively simple manifest file containing the hashes associated with each node and/or the entire graph. In practice, if only the hash for the entire graph is needed, it can be embedded in the JS code (just like the name of the model or URL would be).
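A manifest along these lines might look as follows. The manifest shape, node names, and hashes are all invented for illustration; the point is that a loader can diff the manifest against the local cache and download only the nodes that are missing:

```javascript
// Hypothetical JSON manifest listing the content hash of each graph node.
const manifest = {
  model: 'example-model', // illustrative name
  nodes: [
    { name: 'encoder.weight', hash: 'hash-enc' },
    { name: 'decoder.weight', hash: 'hash-dec' },
    { name: 'adapter.lora_A', hash: 'hash-lora' },
  ],
};

// Diff the manifest against the set of hashes already cached locally,
// returning the names of nodes that still need to be downloaded.
function missingNodes(manifest, cachedHashes) {
  return manifest.nodes
    .filter(node => !cachedHashes.has(node.hash))
    .map(node => node.name);
}

// If the encoder is already cached (e.g. shared with another model),
// only the remaining nodes are fetched.
const toDownload = missingNodes(manifest, new Set(['hash-enc']));
// → ['decoder.weight', 'adapter.lora_A']
```

When only a whole-graph hash is needed, it can live directly in the JS code, as noted above, with the per-node manifest reserved for partial downloads.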

It would be good to know if you had any particular models in mind to look at for test cases. For example, we have been looking at mistral-7b and models derived from it with adapters, and how this model is represented in different formats.

All comments and feedback welcome.

anssiko commented 2 months ago

Per our discussion a dedicated repo has been created under the Web Machine Learning Community Group to continue this discussion in a structured manner (i.e. discussions split into topic-specific issues etc.):

🆕 https://github.com/webmachinelearning/hybrid-ai

Thanks everyone for your feedback and comments! Please watch the new repo.

I added a basic readme with ground rules. Simply put, the new repo is for discussion on Hybrid AI topics; any specification incubation work that may follow would need a recharter.

@grgustaf and @mmccool please migrate applicable content from this proposal issue to the dedicated repo and loop interested folks in. You can close this issue when the migration is completed. Thank you!

mmccool commented 2 months ago

We are moving this content to the above repo and reorganizing it. Please go to https://github.com/webmachinelearning/hybrid-ai for further comments. We will leave this issue open for now.