webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Integration with real-time video processing #226

Open dontcallmedom opened 2 years ago

dontcallmedom commented 2 years ago

The WebRTC Working Group is working on an API to allow fast processing of real-time video, with two proposals under discussion towards convergence: https://github.com/alvestrand/mediacapture-transform and https://github.com/jan-ivar/mediacapture-transform (see also relevant issues on convergence). Chromium has started shipping an implementation based on the first proposal which should allow for initial experimentation with the overall approach.

Since we can expect a lot of this real-time processing to be done based on Machine Learning models, and as suggested by the Web Machine Learning Working Group charter, we should ensure that models loaded via WebNN-backed JS frameworks can be used in the context of that API (in particular, of a WHATWG Streams-based API, running in a worker context, with video frames coming from a webcam likely stored in a GPU memory context), and that it delivers actual performance improvements (in particular, that any boost from the hardware acceleration provided by WebNN doesn't get overtaken by the costs associated with e.g. memory copies).

My sense is that the best way to determine this would be:

While the real-time video processing framework in WebRTC is still somewhat in flux, I think we have enough convergence on the overall picture and a good enough basis for experimentation with the Chromium implementation to get started with such work. The WebRTC Samples repo has a few examples of that API in action (video-crop in particular exercises it in a worker context).

/cc @aboba @alvestrand @jan-ivar

anssiko commented 2 years ago

Thanks @dontcallmedom for starting this discussion that clearly benefits from coordination between the WebML and WebRTC WGs.

A prototype integrating the WebNN API with the mediacapture-transform API (explainer, Chrome Status, crbug) would be an interesting exploration. I think we'll first need to look at the mediacapture-transform API in more detail in this group to understand it better. It would be an informative exercise to evaluate how integration with the mediacapture-transform API in a worker context affects performance compared to what we currently experience.

Looking at our existing work, for background blur, we have made available a WebNN Semantic Segmentation sample (source) that uses DeepLab V3 MobileNet V2 from TFLite models. This sample uses the webnn-polyfill, which is based on TensorFlow.js. With the polyfill the sample performs OK-ish, but we expect a substantial improvement in performance with a native implementation. One possibility would be to expand this sample, or build upon the WebRTC samples @dontcallmedom referenced.

@huningxin can possibly share the expected speedup when using WebNN-GPU or WebNN-CPU backends for semantic segmentation using the above-mentioned model compared to the polyfill and its CPU/Wasm and GPU/WebGL backends.

Currently, we have a test bench based on Electron.js that we can use to approximate the performance of a native browser implementation. We shared some performance data using this test bench earlier for other use cases. There may be some gaps in API coverage in Electron.js for prototyping this; that needs investigation.

We're starting to upstream WebNN to Chromium, which should make it easier for us to identify implementation optimization opportunities and issues, but that work will take a while to land.

[Edit: updated mediacapture-transform to point to the TR version.]

huningxin commented 2 years ago

A prototype integrating the WebNN API with the mediacapture-transform API (explainer, Chrome Status, crbug) would be an interesting exploration.

+1

/cc @ibelem

@huningxin can possibly share the expected speedup when using WebNN-GPU or WebNN-CPU backends for semantic segmentation using the above-mentioned model compared to the polyfill and its CPU/Wasm and GPU/WebGL backends.

As shown in the TPAC WebNN demo video (the semantic segmentation demo starts at 1:17, the performance summary at 2:00) shared by @Honry, there was a 3.4x speedup on GPU and a 7.2x speedup on CPU for the DeepLab V3 MobileNet V2 model on the test device.

I am curious whether real-time audio processing is in scope. The noise suppression use case is supported by WebNN, e.g., for video conferencing applications. The WebNN samples support two noise suppression models: RNNoise and NSNet2.

aboba commented 2 years ago

The mediacapture-transform API is based on WHATWG Streams, which enables media processing to be represented as a TransformStream. In this model, audio/video frames (raw samples) are provided as input, and the output is enqueued for processing by the next stage. A presentation on the model and some sample applications are provided here.
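For anyone new to the model, here is a minimal sketch of such a pipeline using the MediaStreamTrackProcessor / MediaStreamTrackGenerator variant shipped in Chromium; blurFrame() is a hypothetical stand-in for the ML-backed processing:

```js
// Sketch of a TransformStream-based video pipeline (Chromium's
// MediaStreamTrackProcessor / MediaStreamTrackGenerator variant).
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [track] = stream.getVideoTracks();

const processor = new MediaStreamTrackProcessor({ track });
const generator = new MediaStreamTrackGenerator({ kind: 'video' });

const transformer = new TransformStream({
  async transform(frame, controller) {
    const processed = await blurFrame(frame); // hypothetical ML-based step
    frame.close();                            // release the input frame promptly
    controller.enqueue(processed);            // hand off to the next stage
  },
});

// Capture -> transform -> a new MediaStreamTrack usable like any other track.
processor.readable.pipeThrough(transformer).pipeTo(generator.writable);
const processedStream = new MediaStream([generator]);
```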

anssiko commented 2 years ago

Thanks for the presentation @aboba, highly informative. Any caveats in the TransformStream or mediacapture-transform implementation we should be aware of for this experiment? Can MediaStreamTracks be transferred yet?

Per the demo @huningxin pointed us to, we see ~26 fps background blur using semantic segmentation (internals: Electron.js with WebNN Node.js binding, GPU backend, Core i7 & integrated graphics).

We'll look into experimenting with background blur in a transform stream in a worker context. Any specific metrics you're interested in, let us know.

aboba commented 2 years ago

As far as I know, transferable MediaStreamTracks are not implemented yet, so for experimentation you'd need to use transferable streams. The presentation included a demo of a media pipeline implemented entirely in workers using transferable streams, so you can just define another TransformStream and plug it into the pipeline. Wrapping encode/decode and serialize/deserialize in a TransformStream was pretty straightforward, though I have come to understand that additional work is needed to do a good job of cleaning up after errors.
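To make that concrete, here is a rough sketch of what the worker wiring could look like with transferable streams; transform-worker.js and processFrame() are hypothetical names:

```js
// main.js: transfer the readable/writable ends to a worker.
const processor = new MediaStreamTrackProcessor({ track });
const generator = new MediaStreamTrackGenerator({ kind: 'video' });
const worker = new Worker('transform-worker.js');
worker.postMessage(
  { readable: processor.readable, writable: generator.writable },
  [processor.readable, generator.writable] // transferred, not copied
);

// transform-worker.js: plug a TransformStream into the transferred pipe.
self.onmessage = ({ data: { readable, writable } }) => {
  const transformer = new TransformStream({
    async transform(frame, controller) {
      const processed = await processFrame(frame); // hypothetical processing
      frame.close();
      controller.enqueue(processed);
    },
  });
  readable.pipeThrough(transformer).pipeTo(writable);
};
```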

anssiko commented 2 years ago

@aboba, thanks for the media pipeline in workers demo and for confirming my hunch that we need some workarounds to transfer MediaStreamTracks across. I also want to acknowledge @tomayac, who wrote an article that includes a pointer to a QR code scanner demo, and @dontcallmedom for pointing us to the WebRTC Samples repo initially.

Would folks prefer us to take a specific sample, e.g. insertable-streams/video-processing, as a starting point, move the processing to a worker and add a neural network accelerated background blur transform option to it? We could also use our semantic segmentation sample and bolt a TransformStream in a worker onto it.

I'm trying to define this prototyping task so it could become more than just a one-off experiment, something that folks interested in this could use to build more experiments with. I think it'd be good to host this (also) in the canonical WebRTC samples repo, if there's one. WebRTC use cases are important to our group working on WebNN API.

miaobin commented 2 years ago

We have implemented a real-time noise suppression (RNNoise) sample based on the mediacapture-transform API. This sample constructs a pipeline that suppresses noise in the audio data captured by the microphone and sends it to the device's speaker or earphones. BTW, we still have some uncertainties about using this API, such as how to adjust the number of audio frames and the sample rate of the audio. (We now know that the default numberOfFrames is 480 and sampleRate is 48000.) @dontcallmedom @huningxin
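For reference, the audio side of such a pipeline could look roughly like the sketch below; denoise() is a hypothetical stand-in for the RNNoise-based WebNN inference, and the captured AudioData is assumed to be mono 'f32-planar':

```js
const processor = new MediaStreamTrackProcessor({ track: micTrack });
const generator = new MediaStreamTrackGenerator({ kind: 'audio' });

const transformer = new TransformStream({
  async transform(audioData, controller) {
    // numberOfFrames and sampleRate are decided by the capture pipeline
    // (480 frames at 48000 Hz in the observation above); the app reads
    // them from the incoming AudioData rather than configuring them here.
    const samples = new Float32Array(audioData.numberOfFrames);
    audioData.copyTo(samples, { planeIndex: 0 });
    const denoised = await denoise(samples); // hypothetical WebNN inference
    controller.enqueue(new AudioData({
      format: 'f32-planar',
      sampleRate: audioData.sampleRate,
      numberOfFrames: audioData.numberOfFrames,
      numberOfChannels: 1,
      timestamp: audioData.timestamp,
      data: denoised,
    }));
    audioData.close();
  },
});

processor.readable.pipeThrough(transformer).pipeTo(generator.writable);
```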

dontcallmedom commented 2 years ago

great to hear!

If I may, I would suggest focusing on video processing rather than audio if at all possible, for two reasons:

Honry commented 2 years ago

I've integrated mediacapture-transform API for video processing in our semantic segmentation sample.

I just reused the original post-processing, which is a bit complicated: it calculates a segmentation map for the different detected objects, renders the output into a canvas element, and provides features for filling in customized colors, images, etc. I then converted the output canvas to a VideoFrame and enqueued it to the mediacapture-transform API's controller.

This is the cheapest way to integrate the mediacapture-transform API into the current webnn-samples, but it is not efficient. We may need to figure out a new post-processing approach suitable for this API to improve the performance.

Source code: https://github.com/Honry/webnn-samples/tree/mediacapture-transform/semantic_segmentation
Preview: https://honry.github.io/webnn-samples/semantic_segmentation/index.html

anssiko commented 2 years ago

Thanks @Honry and @miaobin for your work on these prototypes.

@Honry, it seems your mediacapture-transform API prototype performance is on par with the original semantic segmentation sample used as a starting point. I think that was expected.

I think the next step would be to move expensive processing to workers, and as of now, that requires the use of transferable streams, I believe. This task will be challenging due to limited availability of these work-in-progress APIs in Chromium, so the prototype may need to be revised as the browser implementation of the mediacapture-transform API evolves. @aboba shared some tips above that may be helpful.

The WG will review and discuss these prototypes on our next call. Thank you for your contributions, this is important prototyping work.

dontcallmedom commented 2 years ago

thanks indeed, good progress!

My reading of the code shows that there is at least a GPU→CPU transfer when turning the video frame into an input tensor; I'm not sure if the model inference is happening on the CPU or GPU. Ideally, and notwithstanding @anssiko's remarks about making this run in a worker, we would want to write a full-GPU-only pipeline, if at all possible with no memory copy.

Can you already identify gaps in the APIs that would make this hard or impossible?

dogben commented 2 years ago

I think the next step would be to move expensive processing to workers, and as of now, that requires the use of transferable streams, I believe. This task will be challenging due to limited availability of these work-in-progress APIs in Chromium, so the prototype may need to be revised as the browser implementation of the mediacapture-transform API evolves.

FWIW, there is a worker sample here: https://webrtc.github.io/samples/src/content/insertable-streams/video-crop/

The API used is not the API that ended up being standardized, but no browser supports the standardized API yet. The APIs used by the sample are stable in Chromium.

huningxin commented 2 years ago

My reading of the code shows that there is at least a GPU→CPU transfer when turning the video frame into an input tensor;

According to the WebNN spec, an MLContext could be created from a specific GPU device such as GPUDevice or WebGLRenderingContext that is already in use by the application, in which case the corresponding GPUBuffer or WebGLBuffer resources used as graph constants, as well as the GPUTexture and WebGLTexture as graph inputs, must also be created from the same device.

And there are corresponding extensions/proposals for importing a video frame into a GPU texture: the WebGL WEBGL_webcodecs_video_frame extension and the proposal to import a VideoFrame from WebCodecs into WebGPU.

So it looks possible for the app to avoid the GPU-CPU transfer by importing the video frame into a GPU texture and feeding it into a WebNN graph created from the same GPU device.
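A rough sketch of what that could look like, based on the spec text quoted above (the exact createContext shape is still evolving, so treat this as an assumption rather than final API):

```js
// Create the WebNN context from the same GPUDevice used by the rest of
// the video pipeline, so graph constants and inputs/outputs can stay in
// GPU resources on that device.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
const mlContext = navigator.ml.createContext(device); // overload per the quoted spec text
const builder = new MLGraphBuilder(mlContext);
// Graph constants/inputs could then be backed by GPUBuffer/GPUTexture
// resources created from `device`, avoiding a CPU round-trip.
```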

dontcallmedom commented 2 years ago

thanks @huningxin for the pointers!

So assuming importExternalTexture is extended to accept VideoFrame as source, I'm still not clear how we would feed this into MLGraph.compute which accepts GPUTexture, but not GPUExternalTexture (and as far as I can tell, the latter cannot be IDL-cast into the former).

I'm also not clear how we would go from the resulting GPU buffers that compute would calculate back to VideoFrames that could be fed into the MediaStreamTrack - the VideoFrame constructor only recognizes image elements, offscreen canvases and video frames as seeds at the moment.

Separately, it seems that all the machinery I have identified as needed so far in WebGPU/WebNN would be available in the worker in which MediaCapture Transform would operate, assuming there would be no particular challenge in having the ML model data available in that context. But do we have a mechanism to load the ML model directly in a GPU buffer, or does this necessarily go through some CPU copy first?

alvestrand commented 2 years ago

Both transferable streams and MediaStreamTrackGenerator / MediaStreamTrackProcessor are released APIs in Chrome (transferable streams were released in Chrome M87 per https://chromestatus.com/feature/5298733486964736)

The version that the WG is currently iterating on for MediaStreamTrackGenerator is slightly different from what Chrome implements (among other things, it's called VideoTrackGenerator), but the differences in semantics are very small so far.

Kangz commented 2 years ago

If the ML graph supports using a GPUTexture input then it can probably use a GPUExternalTexture input as well. However it's not clear how this would be doable without some recompilation of the underlying shaders, or a copy.

I'm surprised that the interaction with WebGPU is already integrated in WebNN without any discussions with the WebGPU group, and it's not immediately clear if passing a GPUTexture directly is always possible. It seems that it would depend on the underlying architecture of the WebGPU and WebNN interactions. I'd suggest opening an issue in gpuweb/gpuweb with an investigation/description of how this WebNN / WebGPU interop could be happening. Same thing for WebGL actually.

aboba commented 2 years ago

In WebCodecs we have PR https://github.com/w3c/webcodecs/pull/412 for conversion of VideoFrame to GPUExternalTexture, but it seems premature to merge this without understanding how to resolve the issues that @dontcallmedom raises.

huningxin commented 2 years ago

I'd suggest opening an issue in gpuweb/gpuweb with an investigation/description of how this WebNN / WebGPU interop could be happening.

Good suggestion. https://github.com/gpuweb/gpuweb/issues/2500 opened.

huningxin commented 2 years ago

So assuming importExternalTexture is extended to accept VideoFrame as source, I'm still not clear how we would feed this into MLGraph.compute which accepts GPUTexture, but not GPUExternalTexture (and as far as I can tell, the latter cannot be IDL-cast into the former).

The sample could probably use a WebGPU shader to convert the GPUExternalTexture to a GPUBuffer. The WebNN model takes a tensor in 'nchw' layout. The sample today uses JS (getInputTensor) to do the pre-processing and convert the canvas to an ArrayBufferView. For a GPU pipeline, it could use a WebGPU compute shader to do the pre-processing, convert the GPUExternalTexture to a GPUBuffer, and feed that GPUBuffer to MLGraph::compute.
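A sketch of what that preprocessing compute shader could look like; the resolution and normalization are assumptions based on a 513x513 DeepLabV3 input, and the actual sample may differ:

```js
// WGSL: read the imported external texture and write a float32 'nchw'
// tensor (1x3x513x513) into a storage buffer usable as the WebNN input.
const preprocessWGSL = `
  @group(0) @binding(0) var src : texture_external;
  @group(0) @binding(1) var<storage, read_write> dst : array<f32>;

  @compute @workgroup_size(8, 8)
  fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let size = vec2<u32>(513u, 513u);          // assumed model input resolution
    if (gid.x >= size.x || gid.y >= size.y) { return; }
    let rgba = textureLoad(src, vec2<i32>(gid.xy));
    let plane = size.x * size.y;
    let idx = gid.y * size.x + gid.x;
    // Planar RGB ('nchw' with N=1), normalized to [-1, 1].
    dst[0u * plane + idx] = rgba.r * 2.0 - 1.0;
    dst[1u * plane + idx] = rgba.g * 2.0 - 1.0;
    dst[2u * plane + idx] = rgba.b * 2.0 - 1.0;
  }`;

const inputBuffer = device.createBuffer({
  size: 3 * 513 * 513 * Float32Array.BYTES_PER_ELEMENT,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});
// After dispatching this shader, inputBuffer would be passed to the WebNN
// graph compute as the input, per the interop discussed in this thread.
```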

I'm also not clear how we would go from the resulting GPU buffers that compute would calculate back to VideoFrames that could be fed into the MediaStreamTrack - the VideoFrame constructor only recognizes image elements, offscreen canvases and video frames as seeds at the moment.

The sample today uses JS (drawOutput) to render the output ArrayBufferView to a canvas. For a GPU pipeline, the sample could use a GPUBuffer as the output of MLGraph::compute and use a WebGPU shader to render the output to a canvas. Then, according to https://github.com/w3c/mediacapture-transform/issues/34#issuecomment-839584916, a VideoFrame could be created from a CanvasImageSource.
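The export side could then be as simple as this sketch, assuming the blended output has been drawn into an OffscreenCanvas in the transform step:

```js
// Wrap the rendered canvas in a VideoFrame and enqueue it back into the
// mediacapture-transform pipeline, carrying over the input timestamp.
const outputFrame = new VideoFrame(offscreenCanvas, {
  timestamp: inputFrame.timestamp,
});
inputFrame.close();
controller.enqueue(outputFrame);
```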

But do we have a mechanism to load the ML model directly in a GPU buffer, or does this necessarily go through some CPU copy first?

MLGraphBuilder::constant could load weights from a GPUBuffer directly.

I proposed the above GPU pipeline processing steps in https://github.com/webmachinelearning/webnn-samples/issues/124.

chcunningham commented 2 years ago

cc @sandersdan

wchao1115 commented 2 years ago

The on-ramp and off-ramp from VideoFrame and GPUExternalTexture that @huningxin describes can also be defined as part of the WebNN spec. That way the same backend implementation can be shared across different use cases for consistency and completeness.

dontcallmedom commented 2 years ago

Another question raised by the usage of an RGB-based canvas in the current version of the prototype: VideoFrame comes in a variety of pixel formats and color spaces; I assume most ML models are built for a specific format/color space (RGBA), but I'm not sure that's what would most systematically be available from a WebRTC pipeline.

I'm unclear in the first place about who decides what format/space is used in a VideoFrame and whether developers have any agency over it. If not, I'm not sure how we could get developers to run their models without incurring further GPU memory copies if they get frames in a format not aligned with what their models need as input.

sandersdan commented 2 years ago

When constructing from an ArrayBuffer, the pixel format and color space are set by the app. Otherwise it's up to the UA. In Chrome for example it's possible for webcam capture to be any of I420, NV12, RGBA, or BGRA, and the color space could be BT.601, BT.709, or sRGB.

The Chrome capture pipeline is optimized for WebRTC use which means a preference for I420 or NV12 depending on the platform.

Capture from canvas is similar: an RGB format is logical, but if we expect to be encoding then it can be more efficient to produce I420 or NV12 directly at capture time. In practice we usually have an RGBA texture from canvas capture and defer pixel format conversion to encode time.

dontcallmedom commented 2 years ago

@sandersdan when the underlying pixel format and color space were a purely internal matter for optimization by the UA, leaving this entirely under UA control made sense; but does that remain workable once we start exposing these internal aspects to developers?

sandersdan commented 2 years ago

As I see it, apps that want to be efficient should support multiple formats, and all apps should use resampling for formats they do not support. It is rarely going to be more efficient to do the resampling at capture time, and in many cases it would prevent the UA from doing the most efficient thing.

Sometimes we don't even know what the underlying format is, such as when it is provided by the platform as an external texture. In that case the only option is to sample it, and deferring that is likely to be more efficient.

We could work more on the import ergonomics. Sometimes we do have a backing that could directly become a Buffer; other times we could hide the sampling and provide a Buffer in a requested format.

This is moot with just importExternalTexture(): in that case you have no choice but to sample, so it doesn't matter what the pixel format is. Depending on your needs, the colorspace adaptation can likely be part of that same step.

anssiko commented 2 years ago

(FYI: https://www.w3.org/TR/mediacapture-transform/ was released as a FPWD today, so we have a canonical URL for this spec now. Congrats WebRTC WG!)

huningxin commented 2 years ago

I created a background blur sample based on the video processing sample of insertable streams:

Main thread version: https://huningxin.github.io/webrtc-samples/src/content/insertable-streams/video-processing/
Worker thread version: https://huningxin.github.io/webrtc-samples/src/content/insertable-streams/video-processing-worker/

Currently it supports two transforms (hopefully full-GPU-only processing pipeline):

WebGL segmentation and blur

The details of the WebGL processing pipeline (webgl-background-blur.js):

  1. VideoFrame import: VideoFrame - (createImageBitmap) -> ImageBitmap - (gl.texImage2D) -> Texture (see the sketch after this list).
  2. Image blur: the shader implementation is based on @Volcomix 's virtual-background project (thanks!).
  3. Background segmentation: it is based on the TF.js WebGL backend, which runs the TF.js DeepLabV3 model.
  4. Image blend: the segmentation result of TF.js is copied into a texture (Tensor.dataToGPU). Another WebGL fragment shader is used to blend the original input and the blurred one based on the result texture. The final output is drawn into an offscreen canvas.
  5. VideoFrame export: create VideoFrame from the offscreen canvas.
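A sketch of step 1 above, i.e. getting the VideoFrame into a WebGL texture via an ImageBitmap (gl and frame are assumed to come from the surrounding transform code):

```js
const bitmap = await createImageBitmap(frame);
const texture = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, texture);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, bitmap);
bitmap.close();
frame.close(); // the pipeline re-creates a VideoFrame from the canvas at the end
```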

WebNN segmentation and WebGPU blur

The details of the WebGPU/WebNN processing pipeline (webgpu-background-blur.js):

  1. VideoFrame import: VideoFrame - (createImageBitmap) -> ImageBitmap - (GPUQueue::copyExternalImageToTexture) -> GPUTexture (see the sketch after this list).
  2. Image blur: the shader implementation is based on @austinEng 's WebGPU samples project (thanks!).
  3. Input tensor preprocessing: it is implemented in a WebGPU compute shader. Its input is a GPUTexture and the output is a GPUBuffer. The GPUBuffer is fed to the WebNN graph compute as the input.
  4. Background segmentation: it is implemented by a WebNN graph (webnn_deeplabv3.js). The weights come from the TFLite DeepLabV3 model. This TFLite model and the TF.js DeepLabV3 model (used by the WebGL pipeline) are based on the same TF model (tensorflow/deeplabv3/1).
  5. Image blend: the WebNN graph puts the segmentation results (segmap) into the output GPUBuffer. Another WebGPU compute shader is used to blend the original input and the blurred one based on the segmap. The final output is drawn into an offscreen canvas.
  6. VideoFrame export: create VideoFrame from the offscreen canvas.
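And a sketch of step 1 of this pipeline, i.e. copying the VideoFrame into a GPUTexture via an ImageBitmap (device and frame are assumed from the surrounding code):

```js
const bitmap = await createImageBitmap(frame);
const texture = device.createTexture({
  size: [bitmap.width, bitmap.height],
  format: 'rgba8unorm',
  usage: GPUTextureUsage.TEXTURE_BINDING |
         GPUTextureUsage.COPY_DST |
         GPUTextureUsage.RENDER_ATTACHMENT, // required by copyExternalImageToTexture
});
device.queue.copyExternalImageToTexture(
  { source: bitmap },
  { texture },
  [bitmap.width, bitmap.height]
);
bitmap.close();
frame.close();
```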

To test the WebGPU/WebNN processing pipeline, you may need to download the WebNN Chromium prototype; currently only a Windows build is available. This prototype supports the DirectML backend and implements a WebNN/WebGPU interop API that accepts GPUBuffer as WebNN graph constants and inputs/outputs.

Other notes (known issues):

  1. (fixed) There is a GPU memory leak in the WebGPU blur pipeline (even without WebNN). Tracked by a Chromium issue; @shaoboyan is fixing it.
  2. There is no WebGPU-based background segmentation implementation yet. This is blocked by a TF.js WebGPU backend issue: it doesn't support the DeepLabV3 model.
  3. Running the WebGL background blur on an entry-level GPU may cause the browser tab to freeze. Tracked by a Chromium issue.
  4. (done) The web worker support is a next step. The worker version: https://huningxin.github.io/webrtc-samples/src/content/insertable-streams/video-processing-worker/

A screenshot of the WebGPU/WebNN transform running in the WebNN Chromium prototype: [image]

anssiko commented 2 years ago

Thanks @huningxin for your continued work on this topic, impressive proof of concept and the WebNN Chromium prototype.

I've put this topic on the WG agenda for this week.

dontcallmedom commented 2 years ago

Thanks @huningxin for another amazing piece of work! Beyond the implementation limitations you noted, are there new API lessons that have emerged from this new round of prototyping?

Trying it out, some (probably very unreliable) measurements on my laptop:

I haven't investigated either of these aspects in any depth - the CPU usage may be linked to running this in the main thread rather than in a worker, creating contention?

huningxin commented 2 years ago

@dontcallmedom

The worker version is now available at:

https://huningxin.github.io/webrtc-samples/src/content/insertable-streams/video-processing-worker/

Feel free to test it on your laptop.

the CPU usage may be linked to running this in the main thread rather than in a worker, creating contention?

AFAIK, running the transform in a worker would not reduce the CPU usage. It just moves the workload off the main thread, with some inter-thread communication overhead. I suppose it would help free the main/UI thread if the transform is a blocking call, e.g., a Wasm function (TF.js Wasm backend?) or the sync version of WebNN graph compute (#229).

huningxin commented 2 years ago
  1. There is a GPU memory leak in the WebGPU blur pipeline (even without WebNN). Tracked by a Chromium issue; @shaoboyan is fixing it.

This issue has been fixed. The updated WebNN prototype based on Chromium 102.0.4973.0 includes this fix. Thanks much @shaoboyan and @Kangz !

huningxin commented 2 years ago

@dontcallmedom

  • both still show a fairly important CPU usage (20%-40%)

According to my initial profiling, the transform loop spends about 35% of the total time on createImageBitmap and 20% on GC, which is pretty high.

I'll look into whether importExternalTexture could help and reduce the object allocations as much as possible.
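For reference, the importExternalTexture path would look roughly like this, assuming the WebCodecs change allowing a VideoFrame as the source (discussed above) lands:

```js
// Skip createImageBitmap entirely: bind the frame as an external texture
// and sample it directly in the preprocessing/blur shaders.
const externalTexture = device.importExternalTexture({ source: frame });
// externalTexture is then bound in the shader's bind group in place of the
// ImageBitmap-backed texture; close the frame once the GPU work is submitted.
```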

anssiko commented 2 years ago

To keep everyone up to date: this background blur prototype was reviewed with the WebRTC WG:

See the minutes for the next steps.

huningxin commented 2 years ago

According to resolution 2 of the WebML WG Teleconference – 16 June 2022, I am going to remove the "cr" label from this issue. This use case depends on the WebNN / WebGPU interop capability. #257 introduced the MLCommandEncoder interface that interops with the WebGPU command buffer and queue. There are remaining open issues:

dontcallmedom commented 1 year ago

FYI, @tidoust wrote up some of his extensive research in nearby spaces: https://webrtchacks.com/real-time-video-processing-with-webcodecs-and-streams-processing-pipelines-part-1/ and https://webrtchacks.com/video-frame-processing-on-the-web-webassembly-webgpu-webgl-webcodecs-webnn-and-webtransport/ - the latter mentions WebNN explicitly.

anssiko commented 1 year ago

I was about to link these two fantastic articles here but @dontcallmedom beat me to it. Great work @tidoust and @dontcallmedom! I love the game of dominoes analogy. From now on it is my mental model for the video processing pipeline :-)

anssiko commented 5 months ago

@dontcallmedom, is it correct that the WebRTC WG has been actively working on https://w3c.github.io/webrtc-encoded-transform/ as a (functionally equivalent?) replacement for the earlier proposal https://alvestrand.github.io/mediacapture-transform/ ?

Per https://chromestatus.com/feature/5499415634640896 the earlier proposal shipped in Chrome.

I wanted to document the WebRTC WG's most recent direction here, should the WebML WG aspire to do further work in this space in the future.

alvestrand commented 5 months ago

mediacapture-transform (shipping in Chrome; https://w3c.github.io/mediacapture-transform/ is what's been agreed upon but not yet implemented) is about raw buffers. webrtc-encoded-transform (Chrome also ships a different API for this) is for encoded frames. For WebML image/sound processing purposes, the latter is likely irrelevant, since ML doesn't want to have to decode the frames.

anssiko commented 5 months ago

@alvestrand thanks for the clarification. I had forgotten mediacapture-transform had transitioned (no pun intended) from an Unofficial Proposal Draft https://alvestrand.github.io/mediacapture-transform/ to its own Working Draft https://w3c.github.io/mediacapture-transform/

Feel free to ping this issue when you think it'd be a good time for the WebML WG to review the latest API again.