Thanks @jbingham for reaching out and opening this discussion. I am interested in this topic. In particular, the explainer is extremely helpful to me - thanks for that.
The TensorFlow team raised some concerns that a graph API may not be the right level of abstraction
In my understanding, the concerns are mainly about the way a graph/model is defined/constructed by wiring operations together, as EXAMPLE 2 in the current spec shows.
If there is a consensus on model loading, a possible solution could be to allow creating a Model object by loading a URL:
partial interface NeuralNetworkContext {
// Create a Model object by loading a file.
Promise<Model> createModel(DOMString modelUrl);
};
With this, EXAMPLE 2 would be changed to:
const modelUrl = 'url/to/ml/model';
const model = await nn.createModel(modelUrl);
Then the model object can be compiled and executed the same way as EXAMPLE 3 and EXAMPLE 4 show.
What do you think?
Thank you @jbingham for the proposal. Tagging @RafaelCintron and @gregwhitworth from Microsoft.
All - please review the proposal and provide your feedback in this issue.
I added this issue as a discussion topic for our next call that takes place tomorrow: WebML CG Teleconference – 6 February 2020 - 15:00-16:00 UTC+0
The group has learned a lot since we last discussed this topic ~1 year ago in #3, so I feel now is a good time to revisit the issue. Given adequate support, I'll start the process to expand the Community Group's scope per the charter change process to explicitly bring this proposal in scope of this group.
We should definitely discuss this topic at our upcoming workshop in Berlin. It fits nicely into the workshop scope and touches many broad topics that have both near-term and long-term implications. It'd be great to have @jbingham give a lightning talk on the topic at the workshop. I encourage you to register and submit a position statement. Anyone else lurking on this repo should do the same :-)
Thanks @jbingham for putting together a very clearly articulated proposal.
I do agree that a loadmodel API is useful and I'm supportive of adding it. I also agree with @huningxin that it is complementary to the graph API and not a replacement of it, because from the design point of view, the graph API is just the next level of abstraction, and one that doesn't require a serialization format. Looking at it this way the model format is a convenience wrapper to help with packaging and distribution.
I think the more important discussion though is about whether the lowest-level primitive of this API should be at the operator level or something lower than the operator level, a question Jonathan rightfully raises in his proposal. This I believe is at the heart of this matter because this primitive is the contract that represents not only what the browser implementer must provide to the developers, but also what the underlying OS must provide to support the browsers. Also note that this issue applies even if we choose to have a single loadmodel API and without the graph API because the primitive would still live inside the file format that the loadmodel API would parse and subsequently execute.
I think although the operator-level primitives have seen their own share of evolution over the years, they do provide a reasonable contract between the provider and the consumer of ML. I'm one of the implementers of ML operators with DirectML, so I understand these difficulties firsthand.
The operator-level primitives would allow the browsers the freedom to implement them as optimally as the underlying OS can support while still producing consistent results across diverse OS and hardware platforms. Any ML expert will attest that nothing is more frustrating than having the same model behave differently on different platforms or hardware. In practice it'll be virtually impossible to test the model with every possible combination of OS and hardware, old and new.
The concern I have over a primitive, as a contract, at a level lower than the operator level, is that the lower you go, the harder the browsers' job of providing a reliable and consistent service becomes. From the OS perspective, it is far more reliable to provide the browsers a contract that allows both the browsers and the OS to evolve in parallel over time; as new hardware or drivers show up, they automatically work with the existing browsers, while the browsers continue to evolve and control how they want to maintain or extend their commitment to developers on their own timeline. This healthy ecosystem is hard to achieve if the browsers insist on a primitive at a level too low for the OS to do a good job with.
I think there are definitely considerations we can apply to how we define the operators going forward so that they're more modular and easier to maintain over time. Current studies such as the conv2d compatibility exercise are helpful as a way to separate out core functionalities from the more auxiliary aspects of an operation already diversely defined by various APIs in the industry today. But going lower, to the level of a compiler's intermediate format, may be a bridge too far in my opinion.
I agree with @wchao1115 .
(a) Load-model vs the so-called "graph API" (I prefer to think of it as a model-builder API). A load-model API seems useful. But I agree that there is no conflict between the two; they can co-exist if we think they are both useful. But if we decide to use just one of them, I am fine with that too.
(b) The "level" of the API: do we use a small set of primitive ops (what Jonathan calls instructions) or a large set of "big" ops? This definitely seems to be the harder, and more controversial, question.
With regards to the second question, let me mention one compromise that was adopted in ONNX (and I believe that the MLIR community is also thinking about something similar): ONNX has a notion of "functions": these are operations with precisely defined semantics as well as a "lowering" into expanded code in terms of other operations. The idea is that an implementation that recognizes a function FooBar can use an optimized kernel/implementation for the op, while other implementations that do not recognize FooBar can simply replace FooBar by its expanded form. I think this is worth considering.
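For illustration, here is a minimal JavaScript sketch of the idea (ONNX defines functions at the IR level, not in JavaScript; the backend object, its primitive methods, and the gelu composite below are hypothetical): a composite op carries both an optimized path and a lowering into primitive ops, so a consumer that lacks a fused kernel can still execute the expansion.
// Sketch only: a composite op with a defined lowering, in the spirit of ONNX functions.
// The backend object and its primitives (mul, add, div, erf, constant) are assumed, not a real API.
const functions = {
  gelu: {
    // Fused path, used when the backend recognizes the op.
    optimized: (backend, x) => backend.gelu(x),
    // Lowering: gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))), expressed with primitive ops.
    expansion: (backend, x) =>
      backend.mul(
        backend.mul(backend.constant(0.5), x),
        backend.add(backend.constant(1),
                    backend.erf(backend.div(x, backend.constant(Math.SQRT2))))),
  },
};

function runOp(backend, name, ...args) {
  const fn = functions[name];
  // Prefer the backend's fused kernel if it exists; otherwise fall back to the expansion.
  return typeof backend[name] === 'function'
      ? fn.optimized(backend, ...args)
      : fn.expansion(backend, ...args);
}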
Thanks, everyone for the comments.
@huningxin , yes, I definitely think we can reuse some of the API work you've done already -- and we should, wherever it makes sense. I'm not sure if the compilation step is necessary for a model-loading API, or if it could be a no-op. If it's needed, the way you've defined it looks pretty good to me.
Thanks @anssiko , I've sent a position statement for the Berlin meeting.
And @wchao1115, thanks for the thoughtful comments. I agree that these APIs are not mutually exclusive, and a key question is what level of abstraction the model format is at.
Yes, it would be possible to implement a model loader API on top of a graph API, at least if the model format is a graph of operations, like ONNX or TFlite. If the model format is lower level, implementing it on top of a graph API might require decompiling the model into a graph of operations, and then recompiling it before executing. I'm not sure if that would always be possible, and it could incur some performance hit. Not sure how big.
And thanks, @gramalingam , for introducing the concept of ONNX functions. I'd like to learn more about that. It sounds like this might address the concern about custom ops needing to be defined in javascript, by providing a way for them to be lower level than that.
There are definitely pros and cons to big ops and lower-level representations. Let's discuss further!
yes, I definitely think we can reuse some of the API work you've done already -- and we should, wherever it makes sense.
Thanks @jbingham !
I'm not sure if the compilation step is necessary for a model-loading API, or if it could be a no-op. If it's needed, the way you've defined it looks pretty good to me.
IMO, in a model-loading API, the loaded model is supposed to be in a platform-independent format, so developers would be able to inspect its representation. After compilation, the compiled model would be optimized by the optimizer based on the compilation preferences, and its representation would be platform-dependent and opaque to developers. So decoupling the loaded model and the compiled model makes sense to me.
Yes, agree that model formats need to be platform-independent.
Does the browser need to do any compilation and expose a javascript API for compilation, or is it sufficient to validate the file during the load step, and then hand it off to the system ML compiler/runner, like WinML, TFlite, or CoreML?
Separate question: If the system does not have an ML execution engine installed, what should the browser do?
Options:
1, 3. There's precedent for the browser saying no for things that aren't available, like media formats that aren't installed. An alert could be provided with a link to a separate bundle that implements the runner.
But browser vendors are not ML experts. Realistically, they would have to bundle code written by other teams. Eg, the Chrome team would look to the TensorFlow team for help, the Edge team would ask the WinML team, Safari would ask CoreML, and Firefox would look to one of the ML frameworks.
Any bundled implementation could get out of sync with the system version. It would likely be big, increasing the browser download size.
If the default implementation isn't performant, it might not just be slower; it might be infeasible to run. A model that takes 10x longer to execute, or soaks up all of the memory, might be best not to run at all. A server-based fallback, or a lighter model, might be required in practice.
Anyway, just some things to think about!
Thanks @jbingham , my comments,
Yes, agree that model formats need to be platform-independent.
Great. So do you think the Model interface can expose some platform-independent properties of a model, like name, format, version, inputs and outputs, etc.?
For model file validation and property queries, the browser can also leverage OS ML APIs, e.g. LearningModel of WinML, FlatBufferModel of TF-Lite and MLModel of CoreML.
Does the browser need to do any compilation and expose a javascript API for compilation, or is it sufficient to validate the file during the load step, and then hand it off to the system ML compiler/runner, like WinML, TFlite, or CoreML?
I agree the browser should leverage the OS ML compiler/runner to compile the model. The Compilation interface allows configuring the options/preferences that help the OS ML compiler decide the target device and optimization strategy. For example, LearningModelSession of WinML allows configuring deviceToRunOn and learningModelSessionOptions, Interpreter of TF-Lite allows configuring options for GPU/NNAPI/DSP delegates, and MLModel of CoreML allows configuring MLPredictionOptions.
So IMO, this process would have two steps: 1) load model: transfer the model file from the network, leverage the OS ML API to verify the model file, create the in-memory model and populate its properties. 2) compile model: configure compilation options, run the OS ML compiler with the model and options, and get the compiled model.
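A rough JavaScript sketch of that two-step flow, reusing the createModel(url) idea from above for step 1 and the Compilation interface from EXAMPLE 3 for step 2 (the method names and the preference constant follow my reading of the current spec draft and should be treated as placeholders):
// 1) Load model: fetch the file, let the OS ML API validate it, and populate
//    platform-independent properties on the in-memory Model object.
const model = await nn.createModel('url/to/ml/model');
console.log(model.name, model.inputs, model.outputs); // hypothetical properties

// 2) Compile model: pass preferences through to the OS ML compiler/runner,
//    e.g. LearningModelSession options on WinML or delegate options on TF-Lite.
const compilation = await model.createCompilation();
compilation.setPreference(nn.PREFER_FAST_SINGLE_ANSWER);
await compilation.finish();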
@jbingham I agree with @huningxin that a separate compilation step is necessary and that it should belong to the OS layer. In practice, except in a naive implementation, the OS rarely consumes the model as-is, but rather transforms it into a form that's more suitable for the target hardware before executing it. There are also situations where the OS is unable to fulfill the request on the target hardware due to mismatches between what the model requires to run and what the target hardware can support. In this situation, the OS would fail the request, and the browser or the JavaScript code should provide a fallback option.
Related to fallback options, I believe the browser must provide a reasonable fallback on the CPU by default unless the default behavior is overridden by the developer. There are cases where the model must be executed at a certain performance level or the user experience may not make sense e.g. a word processor web app running an NLP spellchecking model on the client. In cases like this, falling back to a suboptimal performance on a weak CPU due to the lack of sufficient GPU support may directly degrade the end user's experience to the point that the feature must instead be disabled.
@huningxin and @wchao1115 : Thanks for clarifying.
To recap: A compilation step will happen, implemented at the OS level. Model validation will also happen at the OS level.
I understand that the OS will use lots of ML model properties for validation and compilation. Now I'm trying to understand what is an implementation detail at the OS level, what the browser needs to know about, and what web developers need to know about.
Questions:
Maybe the ideal would be if it's all at the OS level, and things are nice and simple for web developers and browser vendors. I expect that there will be some things that need to be exposed at those layers, though.
Next topic:
My hunch is we'd want any CPU fallback implementation to be provided at the OS level as well. Reason: otherwise, if the browser has to ship with a full-featured, compatible, performant CPU implementation, the requirement would be a big burden for browser vendors, because they would have to implement or bundle a very complex piece of code, which could also be large on disk.
Does that sound right?
@jbingham At a minimum, the web developer would need to understand how to use the model. This includes how to feed the model with the kind of data it is trained for, and how to interpret the model's output correctly.
The web browser, on the other hand, needs to be able to work with the underlying OS API to satisfy the model's requirements, which includes how to represent the data format used, how to carry out the operations expressed in the model graph, and how to respond to execution failures such as missing functionality in the platform, to name a few. Again, this is orthogonal to the JavaScript exposure, i.e. the browser needs to be able to handle all of these aspects of model execution regardless of the exposure, because all the OS APIs are OS-specific and the browser API is not.
I think there are good points to be made either for or against a CPU implementation in the OS. But I tend to agree that it may be more appropriate to leave it up to the OS how it should want to supply its CPU implementation to the browser e.g. ARM-based vs. Intel-based etc.
@wchao1115 Sounds good. Yes, agree the browser needs to know enough to call the underlying OS APIs, with whatever information they require. I guess it's going to vary by OS, and the browser version running on each OS will need to deal with it.
There are a few work streams here:
There is already an API (along with ongoing discussions) about Compilation/Execution. I assume that "Loading a model" needs to just introduce an alternative way to create a Model (by directly loading, instead of constructing it via a builder API), and that it should be able to share the Compilation/Execution API?
@jbingham , thanks for laying out the plan.
- Getting an end-to-end prototype working with one browser and OS. @huningxin is maybe done with this one :)
It makes sense to me.
The existing WebNN Proof-Of-Concept (POC) is based on Chromium and implements the model-builder API on top of Android NNAPI, Windows DirectML and macOS MPS/BNNS.
For the new model-loader API prototype, it would be based on WinML, TF-Lite, CoreML, etc. So I'd like to propose extending the existing Chromium-based POC and starting with WinML.
What do you think?
@huningxin Architecturally, the LoadModel API should be implemented as a veneer above the graph builder API. The model format should simply be a serialization format of the operator graph expressible through the graph builder API. Only this way could we evolve the graph API and its serialization format together and not separately. There could even be a separate offline tool that parses the serialization format and automatically generates the JavaScript code representing the graph programmatically.
Based on discussion in this issue and on the call, we seem to have general agreement this group should incubate the "load model API" proposal alongside the existing graph builder API.
I took an action to look at what changes, if any, are needed in the Community Group Charter to make this "load model API" officially part of the Community Group deliverables.
Here's my summary:
The scope of work does not explicitly mention the "load model API". However, it can be reasonably argued that the proposed "load model API" is just a convenience wrapper that, given a model (~= a serialization of the graph), constructs an in-memory representation of the graph. Given this, no changes are required to the scope of work.
However, if this group in addition to the above, aspires to define a model format as @jbingham proposed, we may want to clarify the out of scope section a bit. Currently it says:
This Community Group does not attempt to mandate a specific neural network or Machine Learning model schema or format. Other groups are expected to address these requirements of this evolving area.
Suggestions welcome to help frame the model format work stream in more detail, relationship to MLIR etc.
@anssiko I think we've been using the terms "model format" and "serialization format" interchangeably in this discussion so far because up to this point the distinction between these two notions did not need to be made. Now that we're discussing what is in or out of scope for the work of this community group, I want to call attention to this distinction, because it matters in our interpretation of what we want to consider as part of our work.
If we start by looking at the in-memory representation of a constructed ML graph and ask ourselves how should we serialize it so that we can load it back later, we'll find that there are many ways to serialize a graph, as protobuf, flatbuffers, JSON or even XML to name a few. Should we want to define what serialization format to use? What about compression, encryption or DRM? Those processes alter the serialization format as well.
One way to think about this is whether it makes more sense for this group to define what would be the in-memory representation of a model graph (aka the model format) while leaving the choice of its serialization format to another group or standards body more directly involved in this area. Or, do we believe that the choice of serialization format has such significance to the success of the loadmodel API that it also needs to be defined by us in conjunction with the model format?
@wchao1115 , thanks for your comments.
@huningxin Architecturally, the LoadModel API should be implemented as a veneer above the graph builder API.
Do you mean the LoadModel API should be implemented in JavaScript on top of the graph builder API? If the answer is yes, as part of the existing POC, we've prototyped the ONNXModelImporter.js and TFliteModelImporter.js. They are able to load ONNX or TF-Lite models and translate them to a WebNN graph for compilation and execution. In this approach, the JS model loaders are independent of the underlying OS model-loading APIs. I am not sure whether that is what @jbingham proposed.
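For illustration, a stripped-down sketch of that importer pattern (the real ONNXModelImporter.js and TFliteModelImporter.js are much more involved; parseSerializedGraph and addNodeToModel below are hypothetical helpers, and the builder calls assume the model-builder API in the current spec draft):
async function importModel(nn, url) {
  const bytes = new Uint8Array(await (await fetch(url)).arrayBuffer());
  const graph = parseSerializedGraph(bytes);   // format-specific parsing (ONNX, TF-Lite, ...)
  const model = await nn.createModel();
  for (const node of graph.nodes) {
    addNodeToModel(model, node);               // replay each node through the graph builder API
  }
  model.identifyInputsAndOutputs(graph.inputIndices, graph.outputIndices);
  await model.finish();
  return model;                                // compiled and executed as in EXAMPLE 3 and 4
}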
@wchao1115 thanks for your feedback, it really helps frame this better. Updated proposal below.
To keep the scope reasonable, I'm hearing we should say:
All - further suggestions welcome. If no concerns heard, I'll push a PR to let you all review the proposal in more detail.
@huningxin I agree it makes sense from an implementation perspective to build the model loader API on top of an in-memory graph representation. As long as we're exploring both a model loader API and a graph API, doing this keeps them in sync.
One downside to using an in-memory graph representation as the implementation of the model loader API is that it may not be as expressive as the serialized source format. Eg, if the serialized format supports 250 ops, and the in-memory representation can't handle all of them, it might not be possible to use the intermediate representation for some ML models. Passing the model directly to the OS might enable all of the operations.
Eventually, if we decided to move ahead with a web standard for a model loader API, but not a graph API, we might be able to simplify the implementation by leaning more on the underlying OS APIs, and possibly not having an in-memory graph representation at all.
About the serialization format, @anssiko : At minimum, the model loader API requires us to choose one or more serialization formats to accept as an argument. If we determine that the developer needs to access any metadata, it might be necessary to agree on some standard attributes. I agree that it would be preferable for the serialization format to be defined by an external standards body, like tensorflow.org or ONNX.
I agree with @huningxin on the model importer approach as it ensures that there is only 1 canonical model format that the browser is required to support while all other external formats can still be used as long as it can be converted to the canonical format. I also think @jbingham raises a good point about the potential functionality gap and the importer's ability to keep up over time. On this I believe the obligation is on the webml work group to make sure that the canonical format is and remains functionally sufficient while working with vendors to ensure timely updates of the format importer libraries. I believe this is a tractable problem partly because the size of the inference operator set is significantly smaller than the training set. For example, the ONNX (DNN inference) spec as it is defined today is only a little over 100 operators in total, but it has already covered an extensive set of models both old and new in production today.
About the serialization format, @anssiko : At minimum, the model loader API requires us to choose one or more serialization formats to accept as an argument.
That sounds like a good requirement for the API. Taking an example from the world of image file formats, HTMLPictureElement does exactly that, but for images. (Not suggesting we go declarative, just drawing parallels.)
If we determine that the developer needs to access any metadata, it might be necessary to agree on some standard attributes.
The analog here would be the HTMLImageElement and its Image() constructor: the HTML spec does not prescribe which image file formats a browser must support, it just defines common attributes of the in-memory representation of an image, such as width and height.
In this context we'd specify an equivalent to HTMLImageElement, say WebMLModel, with common attributes that can be layered atop OS ML APIs.
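Sketched in WebIDL, purely for illustration (the interface name, the attribute set, and the createModel overload are placeholders, not a proposal):
// Common, format-agnostic attributes, by analogy with HTMLImageElement.
interface WebMLModel {
  readonly attribute DOMString name;
  readonly attribute FrozenArray<DOMString> inputNames;
  readonly attribute FrozenArray<DOMString> outputNames;
};

partial interface NeuralNetworkContext {
  // Accepted serialization formats would be up to the implementation,
  // much like image formats are for Image().
  Promise<WebMLModel> createModel(DOMString modelUrl);
};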
I agree that it would be preferable for the serialization format to be defined by an external standards body, like tensorflow.org or ONNX.
This is similar to how image file formats are standardized and defined in their own groups outside W3C or WHATWG (two exceptions are PNG that was donated to W3C, and SVG defined in W3C).
Does this approach overgeneralize the problem too much? Any other caveats we haven't discussed yet?
All feedback welcome!
Hi @jbingham
One downside to using an in-memory graph representation as the implementation of the model loader API is that it may not be as expressive as the serialized source format.
I agree.
Eg, if the serialized format supports 250 ops, and the in-memory representation can't handle all of them,
For this case, I believe the model loader should be built into an ML library, like TF.js or ONNX.js. For example, apps would rely on TF.js to load a model in the TensorFlow serialized format. TF.js should know how to handle all the ops of that format. If WebNN is available, TF.js would be able to identify a sub-graph supported by WebNN and create a WebNN in-memory graph. At execution time, TF.js could offload the WebNN sub-graph to hardware acceleration and run the other ops with its own kernels. This approach is discussed in custom op #6 .
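A toy JavaScript sketch of that partitioning step (real delegation has to carve out connected sub-graphs and manage tensor transfer at the boundaries; the graph/node shapes and the supportedOps set here are made up):
// Split a library-level graph into the portion WebNN can accelerate and the rest.
function partition(graph, supportedOps) {
  const webnnNodes = [];
  const fallbackNodes = [];
  for (const node of graph.nodes) {
    (supportedOps.has(node.op) ? webnnNodes : fallbackNodes).push(node);
  }
  // The library would build a WebNN in-memory graph from webnnNodes and run
  // fallbackNodes with its own JS/WASM/WebGL kernels, stitching results together.
  return { webnnNodes, fallbackNodes };
}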
it might not be possible to use the intermediate representation for some ML models.
I agree if the IR is below the operation abstraction level.
Passing the model directly to the OS might enable all of the operations.
I agree. Then the supported model formats would be OS dependent.
@anssiko Really interesting to think about the analogies to HTMLImageElement. It might be easiest for us at Google (on TensorFlow, Android, and ChromeOS) if the ML model format (like image formats) was up to the particular browser and platform. We might be able to move more quickly in that world.
I would also anticipate some resistance from us at Google (on Chrome) about browser vendors having no mandatory format to support, since a great promise of the web is interop.
In the case of images, in practice, all browsers support gif, jpeg, png, and more. For ML, that's possible too, though it seems potentially harder.
FYI, I've been reading up about XLA, since there's some support from PyTorch, ONNX, and TensorFlow already. I found the diagram on this page to be really helpful.
@jbingham I completely agree that the core promise of the web is interop, which is why a web format needs to be an open and cross-platform format. In other words, a web developer must not have to worry about what OS they run on as long as their browser supports it. Likewise for the browser vendors: they should not have to worry about what hardware they run on as long as the OS supports it. If a web developer must know, ahead of time, what browser, OS, or hardware they can or cannot run on, then there would be no advantage to programming for the web in the first place.
The comparison with the image formats is indeed interesting. At the time GIF was supported in the browser as the original web image format, it was already available in all the platforms the browser would run on at the time. I'm not aware of an ML format with such a level of support today. The platform and hardware diversity that exists today across PC and mobile form factors makes the shortcoming even more obvious and the job of this CG more challenging.
Thank you @nsthorat and @dsmilkov for your proposal on XLA-HLO during our CG meeting on 3/5/20. XLA is a great body of work from Google that I think could benefit the work of this CG. On that note I have a few questions if I may.
It looks like there's some work to add XLA support to both PyTorch and ONNX:
https://github.com/pytorch/xla https://github.com/onnx/onnx-xla
@wchao1115
Resolution from the Community Group meeting today:
anssik: proposed resolution: add the load and run a model API to the group's scope, the group does not attempt to mandate a specific ML schema or format. No changes to the charter with this clarification.
anssik: how about starting with all in one spec and split later?
LGTMs from @huningxin @jbingham and Rafael Cintron
For the record, the link to the resolution recorded in the 19 March 2020 WebML CG call minutes.
@jbingham to come back with a plan on how to move this proposal forward to a more formal spec shape. Thanks!
the group does not attempt to mandate a specific ML schema or format
I'm intrigued that this would be maintained in the charter when it seems we've all agreed that a core tenet of the web is interoperability. As such I think that we HAVE to have a canonical format that others can be translated to in order to ship this to production. As folks have referenced the image analogy, I actually don't want to recreate that: the picture element is really a solution to a standardization problem born of format battles, rather than being able to depend on support of formats across the web.
I think I prefer what @jbingham proposed at one point which is to require a canonical format but allow browsers to support others if they desire.
What are people's thoughts?
I think what Anssi meant is that the existing charter of the community group doesn't mandate a specific schema or format, so there's no change being proposed here.
For a web standard that can actually ship, I agree we need to clarify this question. One or more canonical formats, plus zero or more optional formats, is a reasonable recommendation. There is precedent for not specifying any required format at all: eg, web standards don't specify what image formats must be supported, but in practice there are several formats that all browsers do support. This is less ideal, but could be workable, especially if offline conversion utilities exist.
I think what Anssi meant is that the existing charter of the community group doesn't mandate a specific schema or format, so there's no change being proposed here.
Ok, if that's the case then cool.
For a web standard that can actually ship, I agree we need to clarify this question.
Agreed
There is precedent for not specifying any required format at all: eg, web standards don't specify what image formats must be supported, but in practice there are several formats that all browsers do support.
Yep, I agree that there is precedent but I don't think it's something we should follow. To your point, there is a "standard", but it took interop pain for web devs, UAs, and users to get there; I wish to avoid standardizing in that way. As @wchao1115 noted above, in order to do that we'll need to agree on the layer at which we standardize this canonical format, which other model formats would then either translate to via tooling or conform to directly, while browsers may still implement additional formats.
I agree with @gregwhitworth; that comment is a bit confusing. I too think agreeing on a canonical format is important for a standard.
Thanks for your comments. Here’s more context to the resolution taken:
Our charter has the following text in its Out of Scope:
This Community Group does not attempt to mandate a specific neural network or Machine Learning model schema or format. Other groups are expected to address these requirements of this evolving area.
When discussing possible charter changes in this context, we heard a preference from Microsoft to not change the current charter since it triggers a costly internal review process. The details of that concern were not scribed, but @RafaelCintron can confirm.
If Microsoft’s position has changed, we can revisit the rechartering discussion. My recommendation is, however, to not block this or other ongoing incubations in this group on that detail at this stage of exploration and cross that bridge when we come to it.
Can we revisit the idea of an API to load and run a model?
I've written up a draft explainer for what a Web ML Inference API might look like, as well as a bunch of issues and questions around it.
The idea was discussed way back in issue 3, and probably even earlier by many people in the group. For various reasons, the group decided to pursue a graph API instead.
Why revisit now? The TensorFlow team raised some concerns that a graph API may not be the right level of abstraction, due to how fast ML is evolving, and their experience with the rapid growth in operations in TensorFlow models. After digging a bit to understand where this caution came from, I learned about the efforts around MLIR, and how the TensorFlow team sees that fitting into the picture. Also, I had a chance to talk with Greg and others at Microsoft about the original reasons not to go with an inference API, and it seems like things may have changed.
If this is an interesting enough topic to people, we could consider talking about it during the face-to-face in Berlin.