xiph / rav1e

The fastest and safest AV1 encoder.

Possible speed upgrade: use HW HEVC encoding for first pass #2945

Open LuisB79 opened 2 years ago

LuisB79 commented 2 years ago

Maybe something could be made for the first pass to run on a hardware accelerator like NVENC or VCE, and then pass its data to the AV1 encoder.
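
To make the proposed hand-off concrete, here is a minimal sketch (hypothetical C++ glue, nothing from rav1e's actual code) assuming the HW first pass can report each frame's compressed size:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical glue (none of these names exist in rav1e): given per-frame
// compressed sizes reported by a fast HW first pass, derive relative
// complexity weights that a second-pass rate control could distribute bits by.
std::vector<double> complexity_weights(const std::vector<uint64_t>& hw_frame_bits) {
    double total = 0.0;
    for (uint64_t bits : hw_frame_bits) total += static_cast<double>(bits);
    if (hw_frame_bits.empty() || total == 0.0) return {};

    const double mean = total / static_cast<double>(hw_frame_bits.size());
    std::vector<double> weights;
    weights.reserve(hw_frame_bits.size());
    for (uint64_t bits : hw_frame_bits)
        weights.push_back(static_cast<double>(bits) / mean);  // 1.0 == average frame
    return weights;
}
```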

LuisB79 commented 2 years ago

Unlikely, unless one of the big three creates the feature or gives access to the source code; a possible route is reverse engineering.

namibj commented 2 years ago

> Unlikely, unless one of the big three creates the feature or gives access to the source code; a possible route is reverse engineering.

ME-only is an officially supported mode of operation for NVENC (it wasn't there at the start, but all modern cards have it). One could probably also run a single-pass HW encode and inspect the resulting bitstream to estimate per-frame complexity; HEVC at least should allow enough reference frames to match the rav1e/CPU side in a representative fashion, though frame types and frame references/reordering have to be kept in sync to avoid misleading data.

Using the HW encoder's ME facilities should help particularly at high speeds and when many references have to be checked. If that is something that would be realistic, I'd be happy to help. (AV1 screen sharing better than x264 would be really nice to have, and I think this could help a lot if applicable.)
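
For anyone wanting to poke at this: the ME-only path is exposed through `NvEncRunMotionEstimationOnly` in the NVENC C API. A heavily abridged sketch follows (session creation, surface registration, and error handling omitted; struct fields per my reading of nvEncodeAPI.h, so verify against the SDK):

```cpp
#include <cstdint>
#include "nvEncodeAPI.h"  // NVIDIA Video Codec SDK

// Abridged sketch: assumes an already-opened ME-only session ("encoder"),
// registered input surfaces, and an allocated output buffer. No error handling.
void run_me_only(NV_ENCODE_API_FUNCTION_LIST& api, void* encoder,
                 NV_ENC_INPUT_PTR current, NV_ENC_INPUT_PTR reference,
                 NV_ENC_OUTPUT_PTR mv_out, uint32_t width, uint32_t height) {
    NV_ENC_MEONLY_PARAMS me = {};
    me.version        = NV_ENC_MEONLY_PARAMS_VER;
    me.inputWidth     = width;
    me.inputHeight    = height;
    me.inputBuffer    = current;    // frame to find motion for
    me.referenceFrame = reference;  // frame to search in
    me.mvBuffer       = mv_out;     // receives per-macroblock MV + cost data
    api.nvEncRunMotionEstimationOnly(encoder, &me);
    // Locking mv_out then yields NV_ENC_H264_MV_DATA records that a
    // hypothetical rav1e-side consumer could read back.
}
```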

LuisB79 commented 2 years ago

What about AMD or Intel cards?

FreezyLemon commented 2 years ago

FWIW, AMD has a PreAnalysis component in its AMF framework. AMF also includes the AVC and HEVC hardware encoders, both of which use this PreAnalysis. According to the docs (see section 3), it can also be used standalone.

The output of this pre-analysis is described as "activity maps". I'm not sure what that means exactly, but it might be helpful? They have a sample application in their repo that uses the pre-analysis; maybe that could help with analyzing the output.
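
For reference, driving that standalone component should follow the usual AMF create/init/submit pattern; a sketch based on my reading of the AMF headers and samples (error handling omitted, surface format an assumption to verify):

```cpp
#include "public/include/core/Factory.h"
#include "public/include/components/PreAnalysis.h"

// Sketch only: assumes "factory" was loaded from the AMF runtime the way the
// AMD sample apps do it; all error checking omitted.
amf::AMFComponentPtr create_preanalysis(amf::AMFFactory* factory,
                                        amf::AMFContextPtr& context,
                                        amf_int32 width, amf_int32 height) {
    factory->CreateContext(&context);
    context->InitDX11(nullptr);  // or InitVulkan()/InitOpenCL() per platform

    amf::AMFComponentPtr pa;
    factory->CreateComponent(context, AMFPreAnalysis, &pa);
    // General mode takes NV12 frames; the docs describe R32 surfaces for the
    // standalone interface, so this format choice is an assumption to verify.
    pa->Init(amf::AMF_SURFACE_NV12, width, height);
    return pa;
}

// Per frame: pa->SubmitInput(surface), then pa->QueryOutput(&data) to fetch
// the activity map / scene-change flags the docs describe.
```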

namibj commented 1 year ago

@FreezyLemon:

> The output of this pre-analysis is described as "activity maps". I'm not sure what that means exactly, but it might be helpful?

Unfortunately it seems to just attempt intra-coding 16x16 chunks and turn the resulting size into the output value, if it even does something that fancy (it might just eyeball the amplitudes of the spatial frequency spectrum to judge how quantization-friendly the block is).
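
To make "spatial complexity" concrete, a metric of the flavor being guessed at here could be as cheap as summing absolute gradients per 16x16 luma block. Purely illustrative, since nobody outside AMD knows what the firmware actually computes:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Illustrative guess at an "activity" metric: sum of absolute horizontal and
// vertical gradients inside each 16x16 luma block. Not AMD's actual algorithm.
std::vector<uint32_t> activity_map(const uint8_t* luma, int width, int height,
                                   int stride) {
    const int bw = (width + 15) / 16, bh = (height + 15) / 16;
    std::vector<uint32_t> map(static_cast<size_t>(bw) * bh, 0);
    for (int y = 0; y + 1 < height; y++) {
        for (int x = 0; x + 1 < width; x++) {
            const int p  = luma[y * stride + x];
            const int dx = std::abs(p - luma[y * stride + x + 1]);
            const int dy = std::abs(p - luma[(y + 1) * stride + x]);
            map[(y / 16) * bw + (x / 16)] += static_cast<uint32_t>(dx + dy);
        }
    }
    return map;
}
```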

While this API won't be useless for rav1e (should be useful for cheap adaptive quantization and scene change detection, like how AMD's HW encoders (H.264 & H.265) use it (and possibly load-balancing a frame across slices with a slice border based on correlating the block complexity estimate with time spent coding the block)).

From the PDF:

General description:

> The AMF PA component accepts raw input images in NV12 format. It calculates a metric for content activity of different blocks across every image, as well as video property flags such as scene change and static scene flags. In the current release, spatial complexity is used as a measure of activity inside each block.

Standalone mode interface:

> The standalone AMF PA component accepts AMF_SURFACE_R32 surfaces as input and produces activity maps also stored in AMF_SURFACE_R32 surfaces. The resulting activity maps consist of one activity value (32-bit unsigned) for each 16x16 pixel block.
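
So consuming the standalone output would amount to walking a grid of 32-bit values, e.g. to collapse it into a whole-frame complexity figure. A sketch, assuming "pitch" is the row stride in bytes as AMF planes usually report it:

```cpp
#include <cstddef>
#include <cstdint>

// Walking the standalone PA output: one 32-bit activity value per 16x16 block,
// in a blocks_w x blocks_h grid inside an AMF_SURFACE_R32 surface. Collapsing
// it to a single number gives a crude whole-frame complexity figure.
uint64_t frame_activity(const uint8_t* plane, size_t pitch,
                        int blocks_w, int blocks_h) {
    uint64_t total = 0;
    for (int by = 0; by < blocks_h; by++) {
        const uint32_t* row =
            reinterpret_cast<const uint32_t*>(plane + by * pitch);
        for (int bx = 0; bx < blocks_w; bx++)
            total += row[bx];
    }
    return total;
}
```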

LuisB79 commented 1 year ago

> balancing a frame across slices with a slice border based on correlating the block complexity estimate with time spent coding the block

While this API what?

namibj commented 1 year ago

> balancing a frame across slices with a slice border based on correlating the block complexity estimate with time spent coding the block

> While this API what?

Oh, I see, I accidentally wrote invalid grammar there.

Sorry about that, @LuisB79.

In lieu of trying to fix the broken wording, I'll re-phrase the information contained within that sentence:

While this API won't be useless for rav1e, it's not going to help with motion vector search, or even with getting decent estimates of (bitrate/encoded-size) complexity at the frame level, let alone at more granular levels.

AMD's HW encoding seems to use it for psychovisual adaptive quantization (like how x264's CRF mode scales the quantizer up during high-movement sequences, since the viewer won't have time to spot the artifacts in real time) and for scene change detection (I-frame placement aligned to scene changes, for the obvious efficiency gains over a fixed interval). We could maybe use it for the same purposes.

Additionally, we might be able to use its estimation of per-block (IIUC fixed 16x16 pixel blocks) complexity to split the frame into slices of approximately equal complexity for multi-threaded encoding, on the basis that blocks with more complexity tend to require more bits to encode and thus deserve more effort to shave off some of those bits, independently of whether they end up being coded directly, as intra-references, or as inter-references.
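
A sketch of that last idea with made-up inputs: given per-block-row activity sums (e.g. aggregated from the 16x16 map), a greedy pass can place slice boundaries so each slice carries roughly the same estimated complexity. Hypothetical helper, not rav1e code:

```cpp
#include <cstdint>
#include <vector>

// Greedy slice balancing (illustrative): rows of blocks are assigned to the
// current slice until it has accumulated ~1/num_slices of the total activity,
// using activity as a stand-in for expected encode effort.
std::vector<int> slice_boundaries(const std::vector<uint64_t>& row_activity,
                                  int num_slices) {
    if (num_slices <= 1 || row_activity.empty()) return {};

    uint64_t total = 0;
    for (uint64_t a : row_activity) total += a;
    const uint64_t per_slice = total / static_cast<uint64_t>(num_slices);

    std::vector<int> bounds;  // block-row index where each new slice starts
    uint64_t acc = 0;
    for (int row = 0; row < static_cast<int>(row_activity.size()); row++) {
        if (static_cast<int>(bounds.size()) < num_slices - 1 && acc >= per_slice) {
            bounds.push_back(row);  // open a new slice at this row
            acc = 0;
        }
        acc += row_activity[row];
    }
    return bounds;  // e.g. {12, 26, 41} for num_slices == 4
}
```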

I also decided to dig into the topic of using Intel hardware encoding for motion estimation:

Access to Intel's H.264 pre-transform/quantize HW encoding stages (called ENC) exists, and it spits out macroblock-level encoding decisions for inter/intra, specific references, and motion vectors. Officially the application is then free to tweak these before feeding them into the hardware's transform/quantize/loop-filter/entropy-coding stages, but there is no requirement to ever call (IIUC, not even to initialize) that later encoding stage.

There's also the PreENC mode (documented two sections above ENC), which is fed exactly two reference frames plus optional macroblock-level motion vector predictions, and then performs motion estimation/search, using the predictions as starting points if any were provided, producing motion vectors and SAD statistics.
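
For the curious, PreENC is driven through the ENC entry points of the Media SDK's FEI extension. A heavily abridged sketch from my reading of the docs (field names from memory, so verify against mfxfei.h; allocation, ext-buffer wiring, and error handling omitted):

```cpp
#include "mfxvideo.h"
#include "mfxfei.h"  // Intel Media SDK FEI extension (HW-specific)

// Abridged PreENC call: the session is assumed to be initialized for
// MFX_FEI_FUNCTION_PREENC; the two references go in as L0/L1 surfaces, and
// MV + distortion statistics come back in ext buffers (omitted here).
void preenc_frame(mfxSession session, mfxFrameSurface1* current,
                  mfxFrameSurface1* ref_l0, mfxFrameSurface1* ref_l1) {
    mfxENCInput in = {};
    in.InSurface  = current;
    in.NumFrameL0 = 1;
    in.L0Surface  = &ref_l0;   // "past" reference
    in.NumFrameL1 = 1;
    in.L1Surface  = &ref_l1;   // "future" reference
    // in.ExtParam would carry mfxExtFeiPreEncCtrl (search window, predictors).

    mfxENCOutput out = {};
    // out.ExtParam would carry mfxExtFeiPreEncMV / mfxExtFeiPreEncMBStat, which
    // receive the per-macroblock motion vectors and SAD statistics.

    mfxSyncPoint sync = nullptr;
    MFXVideoENC_ProcessFrameAsync(session, &in, &out, &sync);
    MFXVideoCORE_SyncOperation(session, sync, 60000 /* ms timeout */);
}
```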

ME-only access on AMD

I looked again for lower-level API access that would get motion estimation without quantization/entropy coding out of AMD VCE/VCN, but there doesn't seem to be any "proper" SDK/API access. I even looked into the Linux kernel source and poked around the AMF source for hints of accessible modularity, but everything points towards the hardware blocks being plumbed together courtesy of the GPU firmware blob. TL;DR: I do not expect (NDA-free) access to any AMD HW motion estimation functionality to turn up, even with a deeper search.

LuisB79 commented 1 year ago

Thanks for the clarification and the time you took to investigate this. I wonder why AMD keeps its motion estimation so locked down. A 5% performance improvement is still an improvement, so it would be nice (5% is just a random number to illustrate the point).

lolzballs commented 1 year ago

> everything points towards the hardware blocks being plumbed together courtesy of the GPU firmware blob.

One of the features touted for LiquidVR is motion estimation between two source frames. Even though it seems like abandonware at this point, it shows this should be possible.

I know for a fact it is possible, since I implemented some of the PA features in AMF that utilize MV hints/RD costs from a "first pass" encoding in a manner similar to what's being proposed here. Maybe it'll be exposed in a future SDK release.

LuisB79 commented 1 year ago

Wouldn't it be a good idea to ask for the feature on LiquidVR's GitHub?