Don't restrict partition sizes to exactly fit on the right or bottom frame boundaries

ycho commented 4 years ago

Current block partitioning approach on the right or bottom frame boundaries in rav1e
Both the topdown and bottomup functions keep splitting the input SB (SuperBlock) until no partitioned block straddles the right or bottom frame boundary. (Whether to keep splitting is decided by 'must_split', which is set to true if the current partition straddles the frame boundary, or under other conditions such as the current size being larger than the desired max partition size, as shown in the code: https://github.com/xiph/rav1e/blob/8f273bcbde77e5f3711138a57c13dec9dc793973/src/encoder.rs#L2546)
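As a rough illustration of the check described above, here is a minimal sketch. The function names and parameters are hypothetical and are not rav1e's actual code; positions and sizes are assumed to be in luma pixels.

```rust
/// Hypothetical helper: true if a block at (x, y) with size bw x bh
/// extends past the right or bottom edge of the frame.
fn straddles_frame_boundary(
  x: usize, y: usize, bw: usize, bh: usize, frame_w: usize, frame_h: usize,
) -> bool {
  x + bw > frame_w || y + bh > frame_h
}

/// Hypothetical model of the current behaviour: a square block must be
/// split if it is larger than the desired maximum size or if it crosses
/// the frame boundary.
fn must_split_current(
  x: usize, y: usize, bsize: usize, max_bsize: usize, frame_w: usize,
  frame_h: usize,
) -> bool {
  bsize > max_bsize
    || straddles_frame_boundary(x, y, bsize, bsize, frame_w, frame_h)
}
```

For a 640x360 frame, a 64x64 block at y = 320 covers rows 320..384 and so is forced to split under this rule, while the same block at y = 256 is not.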

Affected areas of codebase can be:

[x] Partition search functions should not restrict partition sizes to exactly fit on the right or bottom frame boundaries
[x] Distortion compute functions to use the visible area, which can be any size not defined by av1 partition sizes
~~[x] Intra prediction to predict visible pixels only~~
[x] Ref pixels for intra pred (i.e. intra edge pixels) should not use invisible pixels.
~~[ ] Inter prediction (Motion Estimation) to predict visible pixels only~~
[ ] Define the pixel values for the invisible area as input to forward transforms, i.e. what kind of padding to use for the input and reconstructed frames? Already defined: extension of the directly adjacent, last available pixel.
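The "extension of the last available pixel" padding mentioned in the last item can be sketched as follows. This is only an illustration on a plain byte buffer; `pad_replicate` is a hypothetical name and not a rav1e function.

```rust
/// Extends a visible `w` x `h` plane to `pw` x `ph` by repeating the last
/// visible pixel of each row to the right and the last visible row downward.
fn pad_replicate(
  src: &[u8], w: usize, h: usize, pw: usize, ph: usize,
) -> Vec<u8> {
  assert!(w > 0 && h > 0 && pw >= w && ph >= h && src.len() == w * h);
  let mut dst = vec![0u8; pw * ph];
  for y in 0..ph {
    // Clamp the source row to the last visible row.
    let sy = y.min(h - 1);
    for x in 0..pw {
      // Clamp the source column to the last visible column.
      let sx = x.min(w - 1);
      dst[y * pw + x] = src[sy * w + sx];
    }
  }
  dst
}
```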

ycho commented 4 years ago

@rzumer

Example of partitions decided by current rav1e: the 2nd frame (inter frame) of 'blue_sky_360p_60f.y4m' encoded by rav1e with -s 6. The input frame size is 640x360, and a total of 10 x 6 (col x row) SBs (SuperBlocks) are encoded in a frame.

The bottom SB row has 40 visible pixels in height. See that all SBs in the bottom-most SB row are SPLIT down to two 32x32 and two 16x4 partitions. This can be problematic because the encoder has no option to choose larger partitions when their rd performance is better, and this will eventually cause an overall coding loss for the frame. Our goal is to enable rav1e to not split this SB, which is allowed by the av1 spec. Note that av1 only allows PARTITION_NONE if more than half of the block's pixels are available in both rows and columns, as stated in https://aomediacodec.github.io/av1-spec/#decode-partition-syntax.
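The rule referenced above can be sketched as follows, modeled on the hasRows / hasCols logic of decode_partition( ) in the av1 spec. The type and function names are my own, all positions are in 4x4 units, and the sub-8x8 case of the spec is ignored here.

```rust
/// Which partition types the bitstream can signal at a given block.
#[derive(Debug, PartialEq, Eq)]
enum Allowed {
  Any,         // has_rows && has_cols: PARTITION_NONE and all other types
  SplitOrHorz, // only has_cols: the bottom half of the block is unavailable
  SplitOrVert, // only has_rows: the right half of the block is unavailable
  SplitOnly,   // neither: PARTITION_SPLIT is implied
}

/// Frame size in 4x4 units (MiRows / MiCols in the spec).
fn mi_size(pixels: usize) -> usize {
  2 * ((pixels + 7) >> 3)
}

/// `r`, `c`: block position in 4x4 units; `num_4x4`: block width in 4x4 units.
fn allowed_partitions(
  r: usize, c: usize, num_4x4: usize, mi_rows: usize, mi_cols: usize,
) -> Allowed {
  let half = num_4x4 >> 1;
  let has_rows = r + half < mi_rows;
  let has_cols = c + half < mi_cols;
  match (has_rows, has_cols) {
    (true, true) => Allowed::Any,
    (false, true) => Allowed::SplitOrHorz,
    (true, false) => Allowed::SplitOrVert,
    (false, false) => Allowed::SplitOnly,
  }
}
```

For the 640x360 case, mi_rows is 90 and the bottom SB row starts at r = 80. A 64x64 block there (num_4x4 = 16) gives 80 + 8 < 90, so PARTITION_NONE is allowed, whereas a 32x32 block at r = 88 gives 88 + 4 >= 90 and must be split further or use a horizontal partition.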

Captured from https://beta.arewecompressedyet.com/analyzer/?maxFrames=4&decoder=https://people.xiph.org/~mbebenita/analyzer/inspect.js&decoderName=master-1955d6d49b692e687661ba3bed95a26e91877645&file=https://beta.arewecompressedyet.com/runs/master-1955d6d49b692e687661ba3bed95a26e91877645/objective-1-fast/blue_sky_360p_60f.y4m-252.ivf

And shown with its motion vectors: we (and the rav1e encoder) probably want to use larger partitions on the bottom SB row, given that the scene is panning!

ycho commented 4 years ago

By contrast, here are the partitions decided by libaom (plus my modification that sets the min partition size to 64x64, like "*.default_min_partition_size = BLOCK_64X64")

And shown with its motion vectors:

In case you want the bitstream file for above, bluesky.ivf.gz

tmatth commented 4 years ago

~@ycho nice analysis, do you know if/how libaom is behaving compared to rav1e for this?~ Nevermind, you just answered this before I asked :+1:

ycho commented 4 years ago

Hey~ thanks for asking, and for enjoying the art of screen-capturing!

ycho commented 4 years ago

@barrbrain, if bsize straddles the frame border (with my dev branch), luma_ac() panics at https://github.com/xiph/rav1e/blob/8f273bcbde77e5f3711138a57c13dec9dc793973/src/encoder.rs#L1859. I think it is because subregion() in plane_regions.rs limits the w and h of the region to at most the tile (or frame, if there is only one tile) w and h, https://github.com/xiph/rav1e/blob/8f273bcbde77e5f3711138a57c13dec9dc793973/src/encoder.rs#L1852.

ycho commented 4 years ago

A little progress, with the partitions in the bottom SB row all forced to skip: the revised partition decision chooses 64x64 because splitting into four 32x32s is not available, since two of the 32x32s would require further partitioning by the definition of the av1 spec, which the topdown partition search does not do.

Speed 9 is used, where 32x32 or 64x64 partition sizes are available. See the bottom SB row, where 64x64 partitions, which do not exactly match the frame size, can now be used.

ycho commented 4 years ago

What's done:
1) Since rav1e currently doesn't allow accessing pixels outside the frame (or tile), prediction is done for inside pixels only. (However, as in other encoders and decoders, rav1e might need to access pixels outside the tile for encoding-time efficiency.)
2) When computing the residue signal for the tx input, rav1e has no definition of what to do about pixels outside the frame, so the outside residue values are set to zero. So, for a tx-block that straddles the frame border, while intra prediction can be done for the whole tx-block, the residue block for it should not count pixels outside the frame.
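Point 2 can be illustrated with a small sketch that computes the residue only over the visible part of a tx-block and leaves the rest at zero. This works on plain slices rather than rav1e's plane types, and `residual_zero_outside` is a hypothetical name.

```rust
/// Residue for a `w` x `h` tx-block of which only the top-left
/// `vis_w` x `vis_h` pixels are inside the frame. `src` and `pred` hold the
/// visible pixels only, row-major with `vis_w` pixels per row.
fn residual_zero_outside(
  src: &[u8], pred: &[u8], vis_w: usize, vis_h: usize, w: usize, h: usize,
) -> Vec<i16> {
  assert!(vis_w <= w && vis_h <= h);
  assert!(src.len() == vis_w * vis_h && pred.len() == vis_w * vis_h);
  // Positions outside the frame keep their initial value of zero.
  let mut dst = vec![0i16; w * h];
  for y in 0..vis_h {
    for x in 0..vis_w {
      let i = y * vis_w + x;
      dst[y * w + x] = i16::from(src[i]) - i16::from(pred[i]);
    }
  }
  dst
}
```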

The decoded image below is from dav1d, since aomdec failed to decode it (hence no image from AOMAnalyzer).

Then my next problem to solve is: why is the last SB column corrupted, except in the first SB row?

FYI, current encoding conditions are: speed 9 but all 64x64 partitions, DC intra pred only, no CfL, no fine directional intra pred.
The input frame size is 640x360 (wxh), thus the 64x64 partitions in the last SB column have no pixels outside the frame.
This problem has in fact been there for a while, as similar corruption was visible in the previous screen capture posted above (https://github.com/xiph/rav1e/issues/2166#issuecomment-590615599), which was encoded with 32x32 or 64x64 partitions at speed 9.

Screen Shot 2020-03-03 at 11 27 02 AM

Since there is brightness-shifted content at SB position (9,1) (col, row), it seems that the ref value for DC prediction is obtained incorrectly. The brightness level gets higher, which means a larger-than-correct ref DC value is obtained.

ycho commented 4 years ago

So, "let mut residual_storage: Aligned<[i16; 64 * 64]> = Aligned::uninitialized();" in encode_tx_block() can fill the array with values > abs(255), I think any i16 value. If I initialize it with random values in that range, it works in my test (filling the out-of-tile residue values, which are input to the fwd tx). The best values for filling the out-of-tile residue would then be those that generate fewer bits, e.g. smaller magnitudes and fewer AC coeffs. Also, the convention that the encoder does prediction for the whole block then makes some sense, if the original source frame is reasonably padded on the bottom/right frame aprons with values that are close enough to the predicted values.

ycho commented 4 years ago

Fixed the corrupted last SB column.

When preparing ref pixels for intra prediction, i.e. inside get_intra_edges(), the equality condition for the need_left case was incorrect.
- A similar bug was corrected for the need_top case.
- Also found other bugs that show up for frame sizes that are not a multiple of 64: tx_size is clipped incorrectly when a tx block straddles the right or bottom frame border.
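To show why a wrong edge-availability condition shifts the brightness of a DC-predicted block, here is a minimal sketch of 8-bit DC prediction driven by which edges are available. It is an illustration only, not the code of get_intra_edges(); `dc_pred` is a hypothetical name, and the fallback value of 128 assumes 8-bit content.

```rust
/// DC prediction value from the available edges. `None` means the edge is
/// not available (e.g. outside the frame or tile).
fn dc_pred(top: Option<&[u8]>, left: Option<&[u8]>) -> u8 {
  let mut sum = 0u32;
  let mut count = 0u32;
  // Visit only the edges that are actually available.
  for edge in top.into_iter().chain(left) {
    sum += edge.iter().map(|&p| u32::from(p)).sum::<u32>();
    count += edge.len() as u32;
  }
  if count == 0 {
    128 // mid-grey when no edge is available (8-bit assumption)
  } else {
    ((sum + count / 2) / count) as u8 // rounded average
  }
}
```

If an edge is wrongly treated as available and its buffer holds brighter values than the true neighbours, the average, and with it the whole predicted block, becomes brighter.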

Screen Shot 2020-03-05 at 4 50 31 PM

ycho commented 4 years ago

I see that, when a tx-block straddles the frame border, inverse_transform_add() does not allow access outside the frame. In the function, output.rows_iter_mut(), which is based on RowsIterMut (used in the 2nd stage, the vertical 1d transform), does not return rows outside the frame; it iterates only over the rows given by remaining = plane.cfg.height as isize - self.y;, which prevents the 2D transform from working correctly.
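The effect described above can be modeled with a plain slice: a row iterator that is bounded by the plane height yields fewer rows than the tx-block height when the block straddles the bottom border. This is only a model and not rav1e's RowsIterMut; `block_rows` is a hypothetical name.

```rust
/// Rows of a block starting at row `y` with height `block_h`, taken from a
/// plane of `plane_h` rows. Iteration stops at the last row of the plane.
fn block_rows<'a>(
  plane: &'a [u8], stride: usize, plane_h: usize, y: usize, block_h: usize,
) -> impl Iterator<Item = &'a [u8]> + 'a {
  plane.chunks(stride).take(plane_h).skip(y).take(block_h)
}
```

With a 5-row plane, a 4-row block starting at row 4 yields a single row, so a second-stage transform that expects 4 rows would only ever see 1.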

ycho commented 4 years ago

To investigate the corrupted pixels in my trial implementation of open partition, especially to see how things behave when a tx-block straddles the frame border, I am testing function by function along the major path of encoding and reconstructing a tx-block, i.e. encode_tx_block(). One of these tests is writing diff() (which generates the residue signal between the original source and the predicted pixels in a block) as below.

https://github.com/xiph/rav1e/blob/master/src/encoder.rs#L1066 called at https://github.com/xiph/rav1e/blob/master/src/encoder.rs#L1201

fn diff2<T: Pixel>(
  dst: &mut [i16], src1: &PlaneRegion<'_, T>, src2: &PlaneRegion<'_, T>,
  width: usize, height: usize,
) {
  let stride1 = src1.plane_cfg.stride;
  let stride2 = src2.plane_cfg.stride;

  for y in 0..height {
    for x in 0..width {
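      // SAFETY: the caller must guarantee that both regions are backed by
      // allocations covering `width` x `height` pixels from their origin,
      // including any rows/columns that lie outside the visible frame.
      unsafe {
        let v1 = src1.data_ptr().add(y * stride1 + x);
        let v2 = src2.data_ptr().add(y * stride2 + x);
        dst[y * width + x] = i16::cast_from(*v1) - i16::cast_from(*v2);
      }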
    }
  }
}

While diff() at https://github.com/xiph/rav1e/blob/master/src/encoder.rs#L1066 does not seem able to access pixels outside the frame via .rows_iter(), the diff2() function can be called w/o clipping the width and height of a tx-block that has pixels outside the frame. So, using unsafe { } access, we can access pixels outside the frame w/o changing the current tile code, if required.

ycho commented 4 years ago

I've tried this open partition experiment on an older commit, from Nov 2018, where non-square partitions were not introduced yet. https://github.com/ycho/rav1e/commits/708a806c_2018_1104

With the first commit, try encoding at speed 10; you can then see that the bottom 64x64 partitions are all corrupted. (ex: ./target/release/rav1e nyan.y4m -o test.ivf -r test_rec.y4m --quantizer 50 --speed=10 --limit=2) https://github.com/ycho/rav1e/commit/0cee6bef8f5c615aaf7ea5bc37f9f6da8d902372 (Not decodable by aomdec, but dav1d decodes the corrupted frame: "../dav1d/build/tools/dav1d -i test.ivf -o test_dec_dav1d.y4m")

test_dec_dav1d.y4m.gz test.ivf.gz

With the 2nd commit, which skips the bottom 64x64 partitions (if they are on frame borders), the decoder (aomdec) can decode the bitstream.

This makes me think something in the coefficient (i.e. residue pixels) encoding stage is wrong for open partition, e.g. diff() or inverse_transform_add(), etc.

ycho commented 4 years ago

By allocating Frame blocks for a multiple of the SB size (i.e. # of SBs x 16), some corruption has been fixed. This fix was applied because the coefficient coding for a tx block on the frame boundary needs to use the coeff context even outside of the frame. Possibly there is still some info that is missed when encoding coeffs or other symbols.

The relevant changes in my work branches are at: https://github.com/ycho/rav1e/commit/1dbdf34831349a33c1df53ea79d45080a69a2a82 and https://github.com/ycho/rav1e/commit/d93a7e0c517a22cfc23294698fb0bec2a96a77e9 (for the Nov 2018 branch)
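The "# of SBs x 16" sizing can be sketched like this, assuming 64x64 superblocks and 4x4 block units. The constant and function names are illustrative and are not taken from the linked commits.

```rust
const SB_SIZE_LOG2: usize = 6; // 64x64 superblocks (assumption)
const MI_SIZE_LOG2: usize = 2; // 4x4 block units (assumption)

/// Number of 4x4 block units needed along one dimension when the block
/// storage is rounded up to a whole number of superblocks.
fn blocks_aligned_to_sb(pixels: usize) -> usize {
  // Superblocks needed to cover `pixels`, rounded up.
  let sbs = (pixels + (1 << SB_SIZE_LOG2) - 1) >> SB_SIZE_LOG2;
  // 16 block units per superblock.
  sbs << (SB_SIZE_LOG2 - MI_SIZE_LOG2)
}
```

For a 640x360 frame this gives 160 x 96 block units, so the bottom SB row has block storage for all 16 rows of 4x4 units even though only 40 pixel rows are visible.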

ycho commented 4 years ago

A 48x48 test input, kindly provided by Nathan (@negge): twitter.y4m.gz

ycho commented 4 years ago

New test images for the 2x2, 3x3 and 3x2 SB cases, to test using 64x64 open partitions on frame boundaries. Their width and/or height are not a multiple of 64 pixels, and the number of pixels in the horizontal and vertical directions is chosen so that the av1 encoder has PARTITION_SPLIT as an option, i.e. a split is not mandatory in either the horizontal or vertical direction.

128x112 test03_128x112 112x112 test03_112x112 176x112 test03_176x112 176x176 test03_176x176

test03_112x112.y4m.gz test03_128x112.y4m.gz test03_176x112.y4m.gz test03_176x176.y4m.gz

https://ibb.co/5jbhwCG https://ibb.co/P1stvTL https://ibb.co/W26mZqQ https://ibb.co/VttWLFt

ycho commented 4 years ago

Currently, I can reproduce corruption of u and v channel for 128x112 input, https://github.com/xiph/rav1e/files/4458167/test03_128x112.y4m.gz with the work branch https://github.com/ycho/rav1e/commit/b21c744437a9022a6312bf09d1b968c82e04ad39 and enc command: ./target/release/rav1e test03_128x112.y4m -o test.ivf -r test_rec.y4m --quantizer 50 --speed=9 --limit=1 --rdo-lookahead-frames=1 --low_latency

ycho commented 4 years ago

Currently, it works with 112x112 (i.e. wxh = 1.75 x 1.75 SBs), but does not work with 128x112 (2x1.75 SBs)

ycho commented 4 years ago

I've fixed the corruption finally!

ycho commented 4 years ago

Brand new test input to test 4x4 partition sizes (and all other sizes as well :) ): test03_254x254.y4m.gz

test03_254x254

ycho commented 4 years ago

One example from the objective_1_fast test sequence set, where open partitions are used vs current master, which uses partitions that match the frame boundaries.

Intra frame:

ycho commented 4 years ago

In inter frame,

ycho commented 4 years ago

Topdown partition with bottom-up partition on the frame boundary: at speed 6, I still have a ~1.5% coding loss.

https://beta.arewecompressedyet.com/?job=test_open_partition_btmup_on_boundary_s6%402020-05-27T23%3A16%3A14.788Z&job=master_af7f474fb%402020-05-25T16%3A43%3A25.801Z

For one of the largest regressions, I captured some screenshots for "rush_hour_1080p25_60f.y4m", at QIndex 172 for the 1st frame.

First, the SB grid only, to let you toggle and see the quality difference on the last SB row (i.e. on the bottom frame boundary). The frame size is 1920x1080, so the last SB row is 56 pixels high.

master:

open-partition branch

Then, partition info. master

open-partition branch

Please find the "Bits" (per frame) info in the middle of the right panel, which shows 130,982 - 130,477 = 505 bits fewer than master.

ycho commented 4 years ago

Result at speed 2

https://beta.arewecompressedyet.com/?job=test_open_partition_btmup_on_boundary_s2%402020-05-27T23%3A17%3A13.047Z&job=master_c6bf0bf_s2%402020-05-20T23%3A40%3A23.097Z

Some consistent observations are: 1) coding gain in VMAF, 2) coding gains in 360p

ycho commented 4 years ago

Result at speed 1

https://beta.arewecompressedyet.com/?job=master_af7f474fb_s1%402020-05-27T21%3A09%3A32.074Z&job=test_open_partition_fix2_s1%402020-05-27T06%3A14%3A22.687Z

Result at speed 0

https://beta.arewecompressedyet.com/?job=master_c6bf0bf_1f_s0%402020-05-20T23%3A17%3A10.299Z&job=test_open_partition_fix2_s0%402020-05-27T06%3A14%3A13.233Z

ycho commented 4 years ago

Copying the updates from the daala weekly meeting at 9AM PST, June 3.
Investigating why the open partition scheme introduces a coding loss of ~1.5% for both bottom-up and top-down. Visually checking with the bitstream analyzer, open partitions seem reasonably chosen during partition search, yielding fewer bits, but at larger QPs the quality seems worse compared to the reference master.
Experiment: use bottom-up partitioning within top-down on the frame boundary; otherwise, top-down just stops at the larger open partition.
This leads me to consider the possibility that the lambda used by rav1e is not right for the frame boundary case, since it has never been trained on such cases, i.e. the partition cost for an SB on the frame boundary seems to be penalized too harshly. But this is a hypothesis, just my idea needing a test (and hopefully proof); I probably need to start by learning the origin of rav1e's lambda and how it was formed.
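To make the lambda hypothesis concrete, here is the generic rate-distortion cost D + lambda * R as a sketch. The function name and the numbers in the usage note are made up for illustration and are not measurements from rav1e.

```rust
/// Generic rate-distortion cost: distortion plus lambda-weighted rate.
fn rd_cost(distortion: f64, rate_bits: f64, lambda: f64) -> f64 {
  distortion + lambda * rate_bits
}
```

With hypothetical candidates "one large open partition" (D = 1200, R = 100) and "split into smaller partitions" (D = 1000, R = 180), lambda = 1 gives costs 1300 vs 1180 and the split wins, while lambda = 4 gives 1600 vs 1720 and the large partition wins. A lambda that is too large for boundary blocks would therefore favour fewer bits at the expense of quality.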

ycho commented 4 years ago

FYI, current work branch is https://github.com/ycho/rav1e/commits/test_open_partition_use_bottomup_on_frame_boundary

ycho commented 4 years ago

For QIndex = 252 (but the QIndex for 1st frame is 231),

https://beta.arewecompressedyet.com/runs/master_af7f474fb@2020-05-25T16:43:25.801Z/objective-1-fast/rush_hour_1080p25_60f.y4m-252.ivf https://beta.arewecompressedyet.com/runs/test_open_partition_btmup_on_boundary_s6@2020-05-27T23:16:14.788Z/objective-1-fast/rush_hour_1080p25_60f.y4m-252.ivf

With SB grid . master

open-partition

With partition info master

open-partition

ycho commented 4 years ago

Result for still image sets, subset1 For awcy mono subset1, speed 0

awcy subset1, speed 0

There were several fixes for varying image frame sizes that are not found in objective-1-fast. https://github.com/ycho/rav1e/commit/805d68d9f521158c311b94185067b653bb99c3b5 https://github.com/ycho/rav1e/commit/aaa36d0b5bc189b32b51a5db0b04670dc9976023 https://github.com/ycho/rav1e/commit/3ead6b6feaabfc02885a15dae58a7187dec59b0d

They mostly address A) fixing the size of the tx contexts in w and h for the coefficient encoder, and B) letting CfL access (read) the reconstructed luma pixels outside the frame to do luma_ac().

A) occurs in two opposite cases: 1) when tx blocks are not visible but are required to be encoded, e.g. sub-8x8 blocks not inside the visible area but inside the encoded frame; 2) when part of a tx block is not visible but is inside the encoded frame.
B) Unlike other prediction, luma_ac() of CfL needs all of the reconstructed luma pixel values from the encoded tx blocks, even when some tx blocks are outside the encoded frame.
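Why B) needs every luma pixel of the block can be seen from a simplified 4:2:0 sketch of the CfL "AC" computation: each output sample is derived from a 2x2 luma group, and the block average is subtracted at the end. This is a simplified model on a plain slice and not rav1e's luma_ac(); `luma_ac_420` is a hypothetical name.

```rust
/// Simplified CfL luma AC for 4:2:0. `w` x `h` is the chroma block size, so
/// the function reads a `2w` x `2h` area of reconstructed luma.
fn luma_ac_420(
  luma: &[u8], luma_stride: usize, w: usize, h: usize,
) -> Vec<i16> {
  assert!(w > 0 && h > 0);
  let mut ac = Vec::with_capacity(w * h);
  let mut sum = 0i32;
  for y in 0..h {
    for x in 0..w {
      let p = 2 * y * luma_stride + 2 * x;
      // Sum of the 2x2 luma group, scaled by 2 (8x the average luma).
      let s = i32::from(luma[p])
        + i32::from(luma[p + 1])
        + i32::from(luma[p + luma_stride])
        + i32::from(luma[p + luma_stride + 1]);
      let v = s << 1;
      sum += v;
      ac.push(v as i16);
    }
  }
  // Subtract the rounded block average so that only the AC part remains.
  let n = (w * h) as i32;
  let avg = ((sum + n / 2) / n) as i16;
  for v in ac.iter_mut() {
    *v -= avg;
  }
  ac
}
```

Because the average is taken over the whole block, luma samples that fall outside the frame still influence every output value, so they have to be defined (for example by padding) rather than skipped.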

ycho commented 4 years ago

The work branch has moved to https://github.com/ycho/rav1e/tree/make_partition_strong, and I plan to split one large patch into many smaller ones. Thank you!

ycho commented 4 years ago

Adding more test images:

test03_252x252.y4m.gz test03_253x253.y4m.gz

ycho commented 4 years ago

Another possibility for the coding loss with open partition is the undefined satd when pre-screening intra modes. It is good that satd is computed with unsafe { } (since it can cross the plane region), but I have not had a chance to check what it does on frame boundaries. So, one way to exclude this factor is to skip the intra mode decision and fix all intra blocks to use one mode, like dc pred. I will try that soon.

ycho commented 4 years ago

Checked this, and it does not seem to be the reason, judging from the awcy result for 1 frame only at speed 1.

ycho commented 4 years ago

AWCY for "--tune psnr" for subset1 at s0 and objective-1-fast at s1 will soon be available at following links:

https://beta.arewecompressedyet.com/?job=master_af7f474fb_1f_s0_psnr_subset1%402020-06-19T20%3A56%3A36.133Z&job=test_open_partition_s0_subset1_psnr%402020-06-19T20%3A58%3A57.092Z

https://beta.arewecompressedyet.com/?job=master_c6bf0bf_s1_low_latency_psnr%402020-06-19T21%3A02%3A47.414Z&job=test_open_partition_rebase3_s1_low_latency_psnr%402020-06-19T21%3A00%3A18.744Z

ycho commented 4 years ago

Comparing inter frame (2nd frame), Qindex 172 given to encoder (actual 291) https://beta.arewecompressedyet.com/runs/master_af7f474fb@2020-05-25T16:43:25.801Z/objective-1-fast/rush_hour_1080p25_60f.y4m-172.ivf

https://beta.arewecompressedyet.com/runs/test_open_partition_btmup_on_boundary_s6@2020-05-27T23:16:14.788Z/objective-1-fast/rush_hour_1080p25_60f.y4m-172.ivf

master vs open-partition

ycho commented 4 years ago

Regarding whether MVs into the padded area are used, it seems so. Please look at the bottom-left corner: a bi-directionally predicted block has the MVs (shown in the right side panel under "Motion Vectors") (0,-70), (0,70); the blue arrow fetches from outside the frame, i.e. the padded area, I think.

ycho commented 4 years ago

test03_146x146.y4m.gz

test03_248x248.y4m.gz

ycho commented 4 years ago

All right, I think we found the clue to why it has not worked! I forgot that I had left in two lines of code that set the deblocking strength to zero; they are here: https://github.com/ycho/rav1e/blob/01f57b36019f1e8f0d93a543c5cda233211d6ee4/src/encoder.rs#L3088

Often, when I begin a new video codec task that could touch or interfere with other coding tools, I turn off as many coding features as possible, including the three in-loop filters. This time the task was large for me; my whole mind and eyes were taken up with seeking the reason for the coding loss, and I completely forgot to turn one of the in-loop filters, the deblocking filter, back on.

I have not verified it yet, but the awcy result is coming soon!

ycho commented 4 years ago

Okay, my fix was correct and we finally start seeing the coding gains! YAY!

awcy speed 6, psnr, no frame reorder

PSNR	PSNR Cb	PSNR Cr	PSNR HVS	SSIM	MS SSIM	CIEDE 2000
-1.5582	-1.7591	-1.8580	-1.5413	-1.6655	-1.5900	-1.7712

awcy speed 6 (with encoding time increase of 5.2% ~ 9.7%)

PSNR	PSNR Cb	PSNR Cr	PSNR HVS	SSIM	MS SSIM	CIEDE 2000
-2.0395	-2.6533	-2.9701	-2.0350	-2.1050	-1.9871	-2.2677

ycho commented 4 years ago

awcy speed 0, subset1

PSNR	PSNR Cb	PSNR Cr	PSNR HVS	SSIM	MS SSIM	CIEDE 2000
-0.4097	N/A	-0.6440	-0.3641	-0.4035	-0.3906	-0.5393

ycho commented 4 years ago

awcy, speed 1 (all bottom up), low latency, psnr

PSNR	PSNR Cb	PSNR Cr	PSNR HVS	SSIM	MS SSIM	CIEDE 2000
-1.6251	-2.3216	-2.4041	-1.6014	-1.7150	-1.6379	-2.0424

ycho commented 4 years ago

If I don't use bottomup search on the frame boundary and use the same topdown search instead, there is a big coding loss, because topdown search cannot split and look ahead more than one level deep when a split is required by the av1 spec, so just the largest partition (64x64) is used.

awcy, default speed 6. This mostly happens with 360p test sequences.

PSNR	PSNR Cb	PSNR Cr	PSNR HVS	SSIM	MS SSIM	CIEDE 2000
2.3520	0.6730	1.1094	1.5370	2.3579	1.7963	1.2295

master vs open-partition

1st frame (key frame), speed 6, master vs open-partition (without using bottomup on the frame boundary for topdown), smallest QIndex = 80 in awcy.

2nd frame (i.e. 1st inter frame), speed 6, master vs open-partition (without using bottomup on the frame boundary for topdown), smallest QIndex = 80 in awcy.

ycho commented 4 years ago

And most superset awcy result, speed 0

PSNR	PSNR Cb	PSNR Cr	PSNR HVS	SSIM	MS SSIM	CIEDE 2000
-1.8174	-2.0760	-2.1534	-1.7602	-1.8175	-1.7785	-2.0376

ycho commented 4 years ago

speed_bag_640x360_60f.y4m, QIndex 252, speed 0

master vs open-partition, frame 1 (key frame, intra frame) (read: cyan color is for SB grid, white is for partition split, and yellow is for tx-split.)

master vs open-partition, frame 2 (i.e. 1st inter frame)

ycho commented 4 years ago

As a sanity check, master vs "open-partition branch with old partition searches"

+0.0041% for s2, and -0.0107% for s0. // min block size 4x4
0.0% change for s3, s5, s6, s9 // min block size 8x8
0.0% change for s9 // min block size 32x32

ycho commented 4 years ago

More screenshots from the bitstream analyzer, showing how open-partition helps reduce the bit rate by using larger partition sizes and thus encoding fewer MVs, in addition to less partition info.

speed_bag_640x360_60f.y4m, QIndex 252, speed 0

master vs open-partition, 4th frame

(bitstreams: master, open-partition)

ycho commented 4 years ago

Regarding the regression with topdown partition search, i.e. speed levels 2~10, I think I found a better explanation and a possible improvement for it. Previously, I explained that the reason is that the topdown search revised in the open-partition branch does not split when it looks ahead and finds that any of its child partitions requires a mandatory split by the av1 spec. This is due to the nature of the algorithm; if we ever changed it with major revisions so that it could, we would realize it is doing the task that bottomup search achieves, in a more difficult way.

The better explanation I've found now is that topdown for open-partition does not use non-square partitions at all. This means that when topdown is able to split down to the smallest possible partition size on the right or bottom frame boundaries, it must use square partitions even when non-square is the best choice in the rdo sense. Note the important fact that the mandatory split required by av1 (i.e. either hasRows or hasCols is false in decode_partition( ) in https://aomediacodec.github.io/av1-spec/#decode-partition-syntax) applies to square partitions only, so non-square partitions can be used on frame boundaries under any form of straddling. And for most input content that needs smaller partitions on the bottom/right frame boundaries, non-square partitions are more likely to be the better choice in the rdo sense when the SB straddles those frame boundaries.

FYI, we stopped using non-square partitions for topdown some time ago, because using them in the middle of splitting down leads to worse partition decisions (coding loss). We have also tried using non-square partitions for terminal cases, i.e. the smallest partition sizes allowed, which gave a coding gain, but the feature was disabled for the sake of codebase maintenance.

Now the possible fix is to move the lookahead (in rdo_partition_simple( )) that checks whether a child partition requires a mandatory partition into encode_partition_topdown( ), right before it calls rdo_partition_decision( ). Then, if any child requires a mandatory partition, drop PARTITION_SPLIT (i.e. stop further quad-partitioning) and add PARTITION_VERT or PARTITION_HORZ (depending on the has_cols and has_rows conditions) to the partition_types parameter that is passed to rdo_partition_decision( ).
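The candidate selection proposed above might look roughly like the sketch below. The enum and function are hypothetical and not rav1e code. The mapping of the bottom boundary to PARTITION_HORZ and the right boundary to PARTITION_VERT is my assumption, since the text only says the choice depends on the has_cols and has_rows conditions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PartitionType {
  None,
  Horz,
  Vert,
  Split,
}

/// Candidate list to hand to the partition decision. `children_have_rows` /
/// `children_have_cols` are false when some child of the quad split would
/// itself need a mandatory split at the bottom / right boundary.
fn topdown_partition_types(
  children_have_rows: bool, children_have_cols: bool,
) -> Vec<PartitionType> {
  let mut types = vec![PartitionType::None];
  if children_have_rows && children_have_cols {
    // No child needs a mandatory split: the quad split remains a candidate.
    types.push(PartitionType::Split);
  } else {
    // Drop the quad split and offer non-square partitions instead.
    if !children_have_rows {
      types.push(PartitionType::Horz); // assumption: bottom boundary
    }
    if !children_have_cols {
      types.push(PartitionType::Vert); // assumption: right boundary
    }
  }
  types
}
```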

ycho commented 4 years ago

Completed with #2396.

xiph / rav1e

Don't restrict partition sizes to exactly fit on the right or bottom frame boundaries #2166