xiph / rav1e

The fastest and safest AV1 encoder.
BSD 2-Clause "Simplified" License

Don't restrict partition sizes to exactly fit on the right or bottom frame boundaries #2166

Closed ycho closed 4 years ago

ycho commented 4 years ago

Affected areas of codebase can be:

ycho commented 4 years ago

@rzumer

Example of partitions decided by current rav1e: the 2nd frame (inter frame) of 'blue_sky_360p_60f.y4m', encoded by rav1e with -s 6. The input frame size is 640x360, so a total of 10 x 6 (col x row) SBs (SuperBlocks) are encoded in a frame.

The bottom SB row has 40 visible pixels in height. Note that all SBs in the bottom-most SB row are SPLIT down to two 32x32 and two 16x4 partitions. This can be problematic because the encoder has no option to choose larger partitions when their RD performance is better, which eventually causes an overall coding loss for the frame. Our goal is to enable rav1e to not split such an SB, which is allowed by the AV1 spec. Note that AV1 only allows PARTITION_NONE if more than half of the block size is available in both rows and columns, as stated in https://aomediacodec.github.io/av1-spec/#decode-partition-syntax.
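The boundary rule referenced above can be sketched in a few lines. This is only an illustration of the hasRows / hasCols test from the spec's partition syntax, not rav1e code: the `Allowed` enum and the `allowed_partitions` name are made up for this example, all positions and sizes are in 4x4-block units, and the special case for blocks smaller than 8x8 is left out.

```rust
/// Partition choices the AV1 partition syntax leaves open for a square block.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
enum Allowed {
  Any,         // hasRows && hasCols: any type may be signalled, incl. PARTITION_NONE
  SplitOrHorz, // only hasCols: PARTITION_SPLIT or PARTITION_HORZ
  SplitOrVert, // only hasRows: PARTITION_SPLIT or PARTITION_VERT
  SplitOnly,   // neither: PARTITION_SPLIT is implied
}

/// `row4`, `col4`: block position; `bsize4`: block width/height;
/// `mi_rows`, `mi_cols`: frame size. All in units of 4x4 blocks.
fn allowed_partitions(
  row4: usize, col4: usize, bsize4: usize, mi_rows: usize, mi_cols: usize,
) -> Allowed {
  let half = bsize4 / 2;
  let has_rows = row4 + half < mi_rows;
  let has_cols = col4 + half < mi_cols;
  match (has_rows, has_cols) {
    (true, true) => Allowed::Any,
    (false, true) => Allowed::SplitOrHorz,
    (true, false) => Allowed::SplitOrVert,
    (false, false) => Allowed::SplitOnly,
  }
}
```

For the 640x360 frame (160 x 90 in 4x4 units), `allowed_partitions(80, 0, 16, 90, 160)` describes a 64x64 block in the bottom SB row and yields `Allowed::Any`, while its lower 32x32 child, `allowed_partitions(88, 0, 8, 90, 160)`, yields `Allowed::SplitOrHorz`.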

Screen Shot 2020-02-13 at 9 27 12 AM

Captured from https://beta.arewecompressedyet.com/analyzer/?maxFrames=4&decoder=https://people.xiph.org/~mbebenita/analyzer/inspect.js&decoderName=master-1955d6d49b692e687661ba3bed95a26e91877645&file=https://beta.arewecompressedyet.com/runs/master-1955d6d49b692e687661ba3bed95a26e91877645/objective-1-fast/blue_sky_360p_60f.y4m-252.ivf

And shown with its motion vectors. We (and the rav1e encoder) probably want to use larger partitions on the bottom SB row, given that the scene is panning!

Screen Shot 2020-02-13 at 9 36 03 AM

ycho commented 4 years ago

In contrast, here are the partitions decided by libaom (plus my modification that sets min partition size = 64x64, like "*.default_min_partition_size = BLOCK_64X64")

Screen Shot 2020-02-13 at 9 47 40 AM

And shown with its motion vectors:

Screen Shot 2020-02-13 at 9 49 02 AM

In case you want the bitstream file for above, bluesky.ivf.gz

tmatth commented 4 years ago

~@ycho nice analysis, do you know if/how libaom is behaving compared to rav1e for this?~ Nevermind, you just answered this before I asked :+1:

ycho commented 4 years ago

> ~@ycho nice analysis, do you know if/how libaom is behaving compared to rav1e for this?~ Nevermind, you just answered this before I asked 👍

Hey, thanks for asking, and for enjoying the art of screen capturing!

ycho commented 4 years ago

@barrbrain, if a block straddles the frame border (with my dev branch), luma_ac() panics at https://github.com/xiph/rav1e/blob/8f273bcbde77e5f3711138a57c13dec9dc793973/src/encoder.rs#L1859. I think this is because subregion() in plane_regions.rs limits the w and h of the region to the tile (or the frame, if there is one tile) w and h, https://github.com/xiph/rav1e/blob/8f273bcbde77e5f3711138a57c13dec9dc793973/src/encoder.rs#L1852.

ycho commented 4 years ago

A little progress: with all partitions in the bottom SB row forced to skip, the revised partition decision chooses 64x64, because splitting into four 32x32s is not available. Two of those 32x32s would have to be partitioned further by the definition of the AV1 spec, which the top-down partition search does not do.

Screen Shot 2020-02-24 at 4 14 22 PM

Speed 9 is used, where 32x32 or 64x64 partition sizes are available. See the bottom SB row, where 64x64 partitions can now be used even though they do not exactly fit the frame size.

ycho commented 4 years ago

What's done:

1) Since rav1e currently doesn't allow accessing pixels outside the frame (or tile), prediction is done for the inside pixels only. (However, as in other encoders and decoders, rav1e might need to access pixels outside the tile for encoding-time efficiency reasons.)

2) When computing the residue signal for the tx input, rav1e has no definition of what to do about pixels outside the frame, so the outside residue values are set to zero. So, for a tx-block that straddles the frame border, while intra prediction can be done for the whole tx-block, its residue block should not count pixels outside the frame.
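The second point can be sketched as follows. This is a minimal illustration on plain slices, assuming an unpadded 8-bit plane whose stride equals its width; `residual_zero_outside` is a hypothetical helper, not a function from rav1e.

```rust
/// Residual of a `bw` x `bh` block whose top-left corner is at (x0, y0).
/// `src` and `pred` are `width` x `height` planes stored row by row.
/// Samples of the block that fall outside the frame get a residual of zero.
fn residual_zero_outside(
  src: &[u8], pred: &[u8], width: usize, height: usize,
  x0: usize, y0: usize, bw: usize, bh: usize,
) -> Vec<i16> {
  // Start from an all-zero block so out-of-frame samples stay zero.
  let mut res = vec![0i16; bw * bh];
  // Visible part of the block inside the frame.
  let vis_w = bw.min(width.saturating_sub(x0));
  let vis_h = bh.min(height.saturating_sub(y0));
  for y in 0..vis_h {
    for x in 0..vis_w {
      let i = (y0 + y) * width + (x0 + x);
      res[y * bw + x] = i16::from(src[i]) - i16::from(pred[i]);
    }
  }
  res
}
```

Starting from a zeroed buffer, rather than an uninitialized one, keeps the forward transform input well defined for the part of the block that lies outside the frame.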

The decoded image below is from dav1d, since aomdec failed to decode it (hence no image from AOMAnalyzer).

My next problem to solve is then: why is the last SB column corrupted, except in the first SB row?

Screen Shot 2020-03-03 at 11 27 02 AM

Screen Shot 2020-03-03 at 4 21 36 PM

Since there is brighter, shifted content at SB position (9,1) (col, row), it seems that the ref value for DC prediction is obtained incorrectly. The brightness level gets higher, which means a ref DC value larger than the correct one is obtained.

ycho commented 4 years ago

So, "let mut residual_storage: Aligned<[i16; 64 * 64]> = Aligned::uninitialized();" in encode_tx_block() can leave the array holding values with magnitude greater than 255; I think any i16 value is possible. If I initialize it with random values in that range, my test still works (these values fill the out-of-tile residue, which is input to the forward tx). The best values for the out-of-tile residue are then those that generate fewer bits, for example smaller magnitudes and fewer AC coeffs. This also means the convention that the encoder predicts the whole block makes some sense, provided the original source frame is reasonably padded on the bottom/right frame aprons with values close enough to the predicted values.

ycho commented 4 years ago

Fixed the corrupted last SB column.

Screen Shot 2020-03-05 at 4 50 31 PM

ycho commented 4 years ago

When a tx-block straddles the frame border, I see that inverse_transform_add() does not allow access outside the frame. In that function, output.rows_iter_mut(), which is based on RowsIterMut (i.e. the 2nd stage, for the vertical 1D transform), does not return rows outside the frame: it iterates only until remaining = plane.cfg.height as isize - self.y; is exhausted, which prevents the 2D transform from working correctly.
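The effect of a row iterator that stops at the frame height can be reproduced with plain slices. The sketch below is not rav1e's RowsIterMut; it assumes an unpadded 8-bit plane whose stride equals its width and whose length is a whole number of rows, with `x0` inside the plane, and `add_residual_clipped` is a made-up name.

```rust
/// Adds a `bw` x `bh` residual block to `plane` at (x0, y0), visiting only
/// the rows and columns that exist in the plane. Returns the number of rows
/// actually written, which is less than `bh` when the block straddles the
/// bottom edge.
fn add_residual_clipped(
  plane: &mut [u8], stride: usize, x0: usize, y0: usize,
  residual: &[i16], bw: usize, bh: usize,
) -> usize {
  let vis_w = bw.min(stride.saturating_sub(x0));
  let mut rows_written = 0;
  // `chunks_mut` yields only rows inside the plane, like a clipped iterator.
  let rows = plane.chunks_mut(stride).skip(y0).take(bh);
  for (row, res_row) in rows.zip(residual.chunks(bw)) {
    for (p, r) in row[x0..x0 + vis_w].iter_mut().zip(res_row) {
      *p = i16::from(*p).saturating_add(*r).clamp(0, 255) as u8;
    }
    rows_written += 1;
  }
  rows_written
}
```

With a 4-pixel-wide, 6-row plane and a 4x4 block at row 4, only 2 of the 4 rows are visited. Any stage that expects to see all `bh` rows therefore needs either padded storage or explicit handling of the missing rows.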

ycho commented 4 years ago

To investigate the corrupted pixels in my trial implementation of open partition, and especially to see how things behave when a tx-block straddles the frame border, I am testing function by function along the major path of encoding and reconstructing a tx-block, i.e. encode_tx_block(). One of these tests is rewriting diff() (which generates the residue signal between the original source and the predicted pixels in a block) as below.

https://github.com/xiph/rav1e/blob/master/src/encoder.rs#L1066 called at https://github.com/xiph/rav1e/blob/master/src/encoder.rs#L1201

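// Residual between two plane regions over a full width x height block,
// without clipping to the frame. Relies on rav1e's PlaneRegion and Pixel types.
fn diff2<T: Pixel>(
  dst: &mut [i16], src1: &PlaneRegion<'_, T>, src2: &PlaneRegion<'_, T>,
  width: usize, height: usize,
) {
  let stride1 = src1.plane_cfg.stride;
  let stride2 = src2.plane_cfg.stride;

  for y in 0..height {
    for x in 0..width {
      // SAFETY (caller's responsibility): both planes must have allocated
      // padding covering every (x, y) read here; otherwise this is
      // undefined behavior.
      unsafe {
        let v1 = src1.data_ptr().add(y * stride1 + x);
        let v2 = src2.data_ptr().add(y * stride2 + x);
        dst[y * width + x] = i16::cast_from(*v1) - i16::cast_from(*v2);
      }
    }
  }
}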

While diff() at https://github.com/xiph/rav1e/blob/master/src/encoder.rs#L1066 does not seem to access pixels outside the frame via .rows_iter(), the diff2() function can be called without clipping the width and height of a tx-block that has pixels outside the frame. So, using unsafe { } access, we can access pixels outside the frame without changing the current tile code, if required.

ycho commented 4 years ago

I've tried this open-partition experiment on an older commit from Nov 2018, where non-square partitions were not yet introduced. https://github.com/ycho/rav1e/commits/708a806c_2018_1104

test_dec_dav1d.y4m.gz test.ivf.gz

Screen Shot 2020-03-25 at 11 54 10 AM

This makes me think something in the coefficient (i.e. residue pixels) encoding stage is wrong for open partition, for example diff() or inverse_transform_add().

ycho commented 4 years ago

By allocating the Frame blocks for a multiple of the SB size (i.e. # of SBs x 16), some corruption has been fixed. This fix was applied because the coefficient coding for a tx block on the frame boundary needs to use the coeff context even outside the frame. Possibly there is still some info that is missed when encoding coeffs or other symbols.

The relevant changes in my work branches are at: https://github.com/ycho/rav1e/commit/1dbdf34831349a33c1df53ea79d45080a69a2a82 and https://github.com/ycho/rav1e/commit/d93a7e0c517a22cfc23294698fb0bec2a96a77e9 (for the Nov 2018 branch)
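The "# of SBs x 16" sizing can be written out as a small calculation. This sketch assumes 64x64 superblocks and 4x4 block units; the constants and the `sb_aligned_mi` name are illustrative and are not taken from the commits linked in this comment.

```rust
const SB_SIZE: usize = 64; // superblock width/height in pixels (assumed)
const MI_SIZE: usize = 4; // one block unit covers 4x4 pixels

/// Number of 4x4 block units needed along one dimension when the
/// allocation is rounded up to a whole number of superblocks.
fn sb_aligned_mi(pixels: usize) -> usize {
  let sbs = (pixels + SB_SIZE - 1) / SB_SIZE; // ceiling division
  sbs * (SB_SIZE / MI_SIZE) // 16 block units per superblock
}
```

For a 112-pixel dimension this gives 32 block units instead of the 28 that cover only the visible frame, so contexts for the out-of-frame part of a boundary SB have somewhere to live.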

ycho commented 4 years ago

A 48x48 test input, kindly provided by Nathan (@negge): twitter.y4m.gz

ycho commented 4 years ago

New test images covering the 2x2, 3x3 and 3x2 SB cases, to test 64x64 open partitions on frame boundaries. Their widths and/or heights are not multiples of 64 pixels, and the numbers of pixels in the horizontal and vertical directions are chosen so that PARTITION_SPLIT remains optional for the AV1 encoder, i.e. there is no mandatory split in the horizontal and/or vertical direction.
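A quick way to check whether a frame dimension has this property is sketched below. It assumes 64x64 superblocks and restates the "more than half of the block available" condition in pixels; `avoids_mandatory_split` is a hypothetical name used only here.

```rust
/// True if the last 64x64 superblock along a dimension of `pixels` is either
/// complete or has more than half of its extent inside the frame, so a
/// 64x64 block there is not forced to split along that dimension.
fn avoids_mandatory_split(pixels: usize) -> bool {
  let rem = pixels % 64;
  rem == 0 || rem > 32
}
```

The dimensions used by the test images (112, 128 and 176) all pass this check, whereas a dimension such as 96 would not.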

128x112 test03_128x112 112x112 test03_112x112 176x112 test03_176x112 176x176 test03_176x176

test03_112x112.y4m.gz test03_128x112.y4m.gz test03_176x112.y4m.gz test03_176x176.y4m.gz

https://ibb.co/5jbhwCG https://ibb.co/P1stvTL https://ibb.co/W26mZqQ https://ibb.co/VttWLFt

ycho commented 4 years ago

Currently, I can reproduce corruption of the U and V channels for the 128x112 input, https://github.com/xiph/rav1e/files/4458167/test03_128x112.y4m.gz, with the work branch https://github.com/ycho/rav1e/commit/b21c744437a9022a6312bf09d1b968c82e04ad39 and the enc command: ./target/release/rav1e test03_128x112.y4m -o test.ivf -r test_rec.y4m --quantizer 50 --speed=9 --limit=1 --rdo-lookahead-frames=1 --low_latency

ycho commented 4 years ago

Currently, it works with 112x112 (i.e. wxh = 1.75 x 1.75 SBs), but does not work with 128x112 (2x1.75 SBs)

Screen Shot 2020-04-14 at 4 32 19 PM

ycho commented 4 years ago

I've fixed the corruption finally!

Screen Shot 2020-04-20 at 8 05 38 PM

ycho commented 4 years ago

A brand new test input to test 4x4 partition sizes (and all other sizes as well :) ). test03_254x254.y4m.gz

test03_254x254

ycho commented 4 years ago

One example from the objective_1_fast test sequence set, where the open-partition branch uses open partitions while current master uses partitions that match the frame boundaries.

Intra frame:

Screen Shot 2020-05-08 at 3 23 26 PM

ycho commented 4 years ago

In inter frame,

Screen Shot 2020-05-08 at 3 25 46 PM

ycho commented 4 years ago

Top-down partition search with bottom-up partition search on the frame boundary: at speed 6, I still have ~1.5% coding loss.

https://beta.arewecompressedyet.com/?job=test_open_partition_btmup_on_boundary_s6%402020-05-27T23%3A16%3A14.788Z&job=master_af7f474fb%402020-05-25T16%3A43%3A25.801Z

For one of the largest regressions, I captured some screenshots for "rush_hour_1080p25_60f.y4m", at QIndex 172 for the 1st frame.

First, the SB grid only, so you can toggle between the two and see the quality difference on the last SB row (i.e., on the bottom frame boundary). The frame size is 1920x1080, so the last SB row has 56 pixels in its height.
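The visible height of the last SB row follows directly from the frame height. A minimal sketch, with `last_sb_visible` as a made-up helper name and the superblock size passed in as a parameter:

```rust
/// Number of pixels of the last superblock that lie inside the frame along
/// one dimension; a frame that is an exact multiple of `sb` shows all of it.
fn last_sb_visible(pixels: usize, sb: usize) -> usize {
  match pixels % sb {
    0 => sb,
    rem => rem,
  }
}
```

With 64x64 superblocks, a height of 1080 leaves 56 visible rows in the last SB row, and a height of 360 leaves 40.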

master:

Screen Shot 2020-06-03 at 10 32 32 AM

open-partition branch

Screen Shot 2020-06-03 at 10 32 41 AM

Then, partition info.

master:

Screen Shot 2020-06-03 at 10 32 51 AM

open-partition branch

Screen Shot 2020-06-03 at 10 32 58 AM

See the "Bits" (per frame) info in the middle of the right panel: the open-partition branch uses 130,982 - 130,477 = 505 bits fewer than master.

ycho commented 4 years ago

Result at speed 2

https://beta.arewecompressedyet.com/?job=test_open_partition_btmup_on_boundary_s2%402020-05-27T23%3A17%3A13.047Z&job=master_c6bf0bf_s2%402020-05-20T23%3A40%3A23.097Z

Some consistent observations are: 1) coding gain in VMAF, 2) coding gains in 360p

ycho commented 4 years ago

Result at speed 1

https://beta.arewecompressedyet.com/?job=master_af7f474fb_s1%402020-05-27T21%3A09%3A32.074Z&job=test_open_partition_fix2_s1%402020-05-27T06%3A14%3A22.687Z

Result at speed 0

https://beta.arewecompressedyet.com/?job=master_c6bf0bf_1f_s0%402020-05-20T23%3A17%3A10.299Z&job=test_open_partition_fix2_s0%402020-05-27T06%3A14%3A13.233Z

ycho commented 4 years ago

FYI, current work branch is https://github.com/ycho/rav1e/commits/test_open_partition_use_bottomup_on_frame_boundary

ycho commented 4 years ago

For QIndex = 252 (but the QIndex for 1st frame is 231),

https://beta.arewecompressedyet.com/runs/master_af7f474fb@2020-05-25T16:43:25.801Z/objective-1-fast/rush_hour_1080p25_60f.y4m-252.ivf https://beta.arewecompressedyet.com/runs/test_open_partition_btmup_on_boundary_s6@2020-05-27T23:16:14.788Z/objective-1-fast/rush_hour_1080p25_60f.y4m-252.ivf

With SB grid.

master

Screen Shot 2020-06-07 at 4 59 20 PM

open-partition

Screen Shot 2020-06-07 at 4 59 24 PM

With partition info.

master

Screen Shot 2020-06-07 at 4 59 31 PM

open-partition

Screen Shot 2020-06-07 at 4 59 37 PM
ycho commented 4 years ago

Result for still image sets, subset1 For awcy mono subset1, speed 0

awcy subset1, speed 0

There were several fixes for varying image frame sizes that are not found in objective-1-fast. https://github.com/ycho/rav1e/commit/805d68d9f521158c311b94185067b653bb99c3b5 https://github.com/ycho/rav1e/commit/aaa36d0b5bc189b32b51a5db0b04670dc9976023 https://github.com/ycho/rav1e/commit/3ead6b6feaabfc02885a15dae58a7187dec59b0d

They mostly address A) fixing the size of the tx contexts in w and h for the coefficient encoder, and B) letting CFL access (read) the reconstructed luma pixels outside the frame to do luma_ac().

ycho commented 4 years ago

The work branch has moved to https://github.com/ycho/rav1e/tree/make_partition_strong, and I plan to split one large patch into many smaller ones. Thank you!

ycho commented 4 years ago

Adding more test images:

test03_252x252.y4m.gz test03_253x253.y4m.gz

ycho commented 4 years ago

Another possible cause of the coding loss with open partition is the undefined SATD when pre-screening intra modes. It is good that SATD is computed with unsafe { } (since it can cross the plane region), but I have not had a chance to check what it does on frame boundaries. So, one way to exclude this factor is to skip the intra mode decision and fix every intra block to a single mode, such as DC prediction. I will try that soon.

ycho commented 4 years ago

> Another possible cause of the coding loss with open partition is the undefined SATD when pre-screening intra modes. It is good that SATD is computed with unsafe { } (since it can cross the plane region), but I have not had a chance to check what it does on frame boundaries. So, one way to exclude this factor is to skip the intra mode decision and fix every intra block to a single mode, such as DC prediction. I will try that soon.

I checked this, and judging from the awcy result for 1 frame only at speed 1, it does not seem to be the reason.

ycho commented 4 years ago

AWCY for "--tune psnr" for subset1 at s0 and objective-1-fast at s1 will soon be available at following links:

https://beta.arewecompressedyet.com/?job=master_af7f474fb_1f_s0_psnr_subset1%402020-06-19T20%3A56%3A36.133Z&job=test_open_partition_s0_subset1_psnr%402020-06-19T20%3A58%3A57.092Z

https://beta.arewecompressedyet.com/?job=master_c6bf0bf_s1_low_latency_psnr%402020-06-19T21%3A02%3A47.414Z&job=test_open_partition_rebase3_s1_low_latency_psnr%402020-06-19T21%3A00%3A18.744Z

ycho commented 4 years ago

Comparing inter frame (2nd frame), Qindex 172 given to encoder (actual 291) https://beta.arewecompressedyet.com/runs/master_af7f474fb@2020-05-25T16:43:25.801Z/objective-1-fast/rush_hour_1080p25_60f.y4m-172.ivf

https://beta.arewecompressedyet.com/runs/test_open_partition_btmup_on_boundary_s6@2020-05-27T23:16:14.788Z/objective-1-fast/rush_hour_1080p25_60f.y4m-172.ivf

master vs open-partition

Screen Shot 2020-06-19 at 3 01 44 PM Screen Shot 2020-06-19 at 3 01 53 PM

ycho commented 4 years ago

Regarding whether an MV into the padded area is used: it seems so. Please look at the bottom-left corner, which uses bi-directional prediction with MVs (0,-70), (0,70) (shown under "Motion Vectors" in the right side panel); the blue arrow fetches from outside the frame, i.e. the padded area, I think.

Screen Shot 2020-06-19 at 3 03 33 PM

ycho commented 4 years ago

test03_146x146.y4m.gz

test03_248x248.y4m.gz

ycho commented 4 years ago

All right, I think we found the clue to why it has not worked! I forgot that I had left in two lines of code that set the deblocking strength to zero. The two lines are here: https://github.com/ycho/rav1e/blob/01f57b36019f1e8f0d93a543c5cda233211d6ee4/src/encoder.rs#L3088

Many times, when I begin a new video codec task, I turn off as many coding features as possible, including the three in-loop filters, that could touch or interfere with other coding tools. This time, the task was large for me, my whole mind and eyes were taken up with seeking the reason for the coding loss, and I completely forgot to turn one of the in-loop filters, the deblocking filter, back on.

I have not verified it yet, but the awcy result is coming soon!

ycho commented 4 years ago

Okay, my fix was correct and we finally start seeing the coding gains! YAY!

awcy speed 6, psnr, no frame reorder

| PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
| --- | --- | --- | --- | --- | --- | --- |
| -1.5582 | -1.7591 | -1.8580 | -1.5413 | -1.6655 | -1.5900 | -1.7712 |

awcy speed 6

(with encoding time increase of 5.2% ~ 9.7%)

| PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
| --- | --- | --- | --- | --- | --- | --- |
| -2.0395 | -2.6533 | -2.9701 | -2.0350 | -2.1050 | -1.9871 | -2.2677 |

ycho commented 4 years ago

awcy speed 0, subset1

| PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
| --- | --- | --- | --- | --- | --- | --- |
| -0.4097 | N/A | -0.6440 | -0.3641 | -0.4035 | -0.3906 | -0.5393 |

ycho commented 4 years ago

awcy, speed 1 (all bottom up), low latency, psnr

| PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
| --- | --- | --- | --- | --- | --- | --- |
| -1.6251 | -2.3216 | -2.4041 | -1.6014 | -1.7150 | -1.6379 | -2.0424 |

ycho commented 4 years ago

If I don't use bottom-up search on the frame boundary and use the same top-down search instead, there is a big coding loss, because top-down search cannot split and look ahead more than one level deep when a split is required by the AV1 spec; it just uses the largest partition (64x64).

awcy, default speed 6

This mostly happens with 360p test sequences.

| PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
| --- | --- | --- | --- | --- | --- | --- |
| 2.3520 | 0.6730 | 1.1094 | 1.5370 | 2.3579 | 1.7963 | 1.2295 |
master vs open-partition

1st frame (key frame), speed 6, master vs open-partition (without bottom-up on the frame boundary for top-down), smallest QIndex = 80 in awcy.

Screen Shot 2020-06-21 at 9 00 42 PM Screen Shot 2020-06-21 at 9 00 47 PM

2nd frame (i.e., 1st inter frame), speed 6, master vs open-partition (without bottom-up on the frame boundary for top-down), smallest QIndex = 80 in awcy.

Screen Shot 2020-06-21 at 9 01 05 PM Screen Shot 2020-06-21 at 9 01 10 PM

ycho commented 4 years ago

And most superset awcy result, speed 0

| PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
| --- | --- | --- | --- | --- | --- | --- |
| -1.8174 | -2.0760 | -2.1534 | -1.7602 | -1.8175 | -1.7785 | -2.0376 |

ycho commented 4 years ago

speed_bag_640x360_60f.y4m, QIndex 252, speed 0

master vs open-partition, frame 1 (key frame, intra frame) (read: cyan color is for SB grid, white is for partition split, and yellow is for tx-split.)

Screen Shot 2020-06-21 at 9 21 19 PM Screen Shot 2020-06-21 at 9 21 24 PM

ycho commented 4 years ago
master vs open-partition, frame 2 (i.e. 1st inter frame)

Screen Shot 2020-06-21 at 9 21 34 PM Screen Shot 2020-06-21 at 9 21 39 PM

ycho commented 4 years ago

As a sanity check, master vs "open-partition branch with old partition searches"

+0.0041% for s2, and -0.0107% for s0. // min block size 4x4
0.0% change for s3, s5, s6, s9 // min block size 8x8
0.0% change for s9 // min block size 32x32

awcy s0

awcy s2

awcy s2, psnr

awcy s3

awcy s5

awcy s6

awcy s6, psnr

awcy s9

ycho commented 4 years ago

More screenshots from the bitstream analyzer, showing how open-partition helps reduce the bit rate by using larger partition sizes and thus encoding fewer MVs, in addition to less partition info.

speed_bag_640x360_60f.y4m, QIndex 252, speed 0

master vs open-partition, 4th frame

Screen Shot 2020-06-23 at 9 56 50 AM Screen Shot 2020-06-23 at 9 56 55 AM

(bitstreams: master, open-partition)

ycho commented 4 years ago

Regarding the regression with top-down partition search, i.e. speed levels 2~10, I think I have found a better explanation and a possible improvement. Previously, I explained that the top-down search revised in the open-partition branch does not split when it looks ahead and finds that any of its child partitions requires a mandatory split by the AV1 spec. This is due to the nature of the algorithm; if we ever changed it with major revisions so that it could, we would realize it is doing what bottom-up search already achieves, only in a more difficult way.

The better explanation I've found now is that top-down search in the open-partition branch does not use non-square partitions at all. This means that when top-down is able to split down to the smallest possible partition size on the right or bottom frame boundaries, it must use square partitions even when non-square is the best choice in the RDO sense. Note the important fact that the mandatory split required by AV1 (i.e. either hasRows or hasCols is false in decode_partition( ) in https://aomediacodec.github.io/av1-spec/#decode-partition-syntax) applies to square partitions only, so non-square partitions can be used on frame boundaries under any form of straddling. And for most input content that needs smaller partitions on the bottom/right frame boundaries, non-square partitions are more likely to be the better choice in the RDO sense when an SB straddles those frame boundaries.

FYI, we stopped using non-square partitions for top-down some time ago, because using them in the middle of splitting down leads to worse partition decisions (coding loss). We have also tried using non-square partitions for terminal cases, i.e. the smallest partition sizes allowed, which gave a coding gain, but the feature was disabled for the sake of codebase maintenance.

Now the possible fix is to move the lookahead (in rdo_partition_simple( )) that checks whether a child partition requires a mandatory partition into encode_partition_topdown( ), right before it calls rdo_partition_decision( ). Then, if any child requires a mandatory partition, drop PARTITION_SPLIT (i.e. stop further quad-partitioning) and add PARTITION_VERT or PARTITION_HORZ (depending on the has_cols and has_rows conditions) to the partition_types parameter that is passed to rdo_partition_decision( ).
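A rough sketch of how that candidate list could be built is given below. It is not the code from the branch: `PartitionType`, `has_rows_cols` and `topdown_candidates` are hypothetical names, positions and sizes are in 4x4-block units, the parent block is assumed to already satisfy both has_rows and has_cols, and the mapping "child lacks rows, so offer HORZ; child lacks cols, so offer VERT" is an assumption about the intended behavior.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum PartitionType { None, Horz, Vert, Split }

/// The spec's hasRows / hasCols test for a square block, in 4x4 units.
fn has_rows_cols(
  row4: usize, col4: usize, bsize4: usize, mi_rows: usize, mi_cols: usize,
) -> (bool, bool) {
  let half = bsize4 / 2;
  (row4 + half < mi_rows, col4 + half < mi_cols)
}

/// Candidate partition types for a top-down search step. If any child of a
/// split would itself face a mandatory split, PARTITION_SPLIT is replaced
/// by the matching non-square type(s).
fn topdown_candidates(
  row4: usize, col4: usize, bsize4: usize, mi_rows: usize, mi_cols: usize,
) -> Vec<PartitionType> {
  let half = bsize4 / 2;
  let (mut lacks_rows, mut lacks_cols) = (false, false);
  for &(dr, dc) in [(0, 0), (0, half), (half, 0), (half, half)].iter() {
    let (r, c) = (row4 + dr, col4 + dc);
    if r >= mi_rows || c >= mi_cols {
      continue; // child lies entirely outside the frame and is not coded
    }
    let (has_rows, has_cols) = has_rows_cols(r, c, half, mi_rows, mi_cols);
    lacks_rows |= !has_rows;
    lacks_cols |= !has_cols;
  }
  let mut types = vec![PartitionType::None];
  if !lacks_rows && !lacks_cols {
    types.push(PartitionType::Split);
  } else {
    if lacks_rows { types.push(PartitionType::Horz); }
    if lacks_cols { types.push(PartitionType::Vert); }
  }
  types
}
```

For a 64x64 block in the bottom SB row of a 640x360 frame, `topdown_candidates(80, 0, 16, 90, 160)` offers `None` and `Horz`, while an interior block such as `topdown_candidates(0, 0, 16, 90, 160)` keeps `None` and `Split`.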

ycho commented 4 years ago

Completed with #2396.