Closed ycho closed 4 years ago
@rzumer
Example of partitions decided by current rav1e: the 2nd frame (an inter frame) of 'blue_sky_360p_60f.y4m', encoded by rav1e with -s 6. The input frame size is 640x360, so a total of 10 x 6 (col x row) SBs (SuperBlocks) are encoded in a frame.
The bottom SB row has 40 visible pixels in height. Note that all SBs in the bottom-most SB row are SPLIT down to two 32x32 and two 16x4 partitions. This can be problematic because the encoder has no option to choose larger partitions when their RD performance is better, which eventually causes an overall coding loss for the frame. Our goal is to enable rav1e to not split these SBs, which is allowed by the AV1 spec. Note that AV1 only allows PARTITION_NONE if more than half of the block-size pixels are available in both rows and cols, as stated in https://aomediacodec.github.io/av1-spec/#decode-partition-syntax.
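The rule above can be sketched as follows. This is a minimal sketch of how I read the decode_partition syntax, not rav1e code: `PartitionChoice`, `mi_size` and `partition_choice` are hypothetical names, and positions are in 4x4 units as in the spec.

```rust
/// Partition choices the AV1 syntax leaves open for a square block
/// (hypothetical enum, for illustration only).
#[derive(Debug, PartialEq)]
enum PartitionChoice {
  Any,         // has_rows && has_cols: every type, including PARTITION_NONE
  SplitOrHorz, // bottom half is outside the frame
  SplitOrVert, // right half is outside the frame
  SplitOnly,   // both halves are outside the frame
}

/// Frame dimension in 4x4 units, after rounding the pixel size up to a
/// multiple of 8 (MiRows / MiCols in the spec).
fn mi_size(pixels: usize) -> usize {
  2 * ((pixels + 7) >> 3)
}

/// `r`, `c`: block position in 4x4 units; `num4x4`: block width in 4x4 units.
fn partition_choice(
  r: usize, c: usize, num4x4: usize, mi_rows: usize, mi_cols: usize,
) -> PartitionChoice {
  let half = num4x4 >> 1;
  let has_rows = r + half < mi_rows;
  let has_cols = c + half < mi_cols;
  match (has_rows, has_cols) {
    (true, true) => PartitionChoice::Any,
    (false, true) => PartitionChoice::SplitOrHorz,
    (true, false) => PartitionChoice::SplitOrVert,
    (false, false) => PartitionChoice::SplitOnly,
  }
}
```

For the 640x360 input, a 64x64 block in the bottom SB row (r = 80, num4x4 = 16) gets `Any`, so PARTITION_NONE is legal there, while its lower 32x32 children (r = 88, num4x4 = 8) get `SplitOrHorz`.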
And shown with its motion vectors: we (and the rav1e encoder) probably want to use larger partitions on the bottom SB row, given that the scene is panning!
In contrast, here are the partitions decided by libaom (plus my modification that sets the min partition size to 64x64, like "*.default_min_partition_size = BLOCK_64X64")
And shown with its motion vectors:
In case you want the bitstream file for above, bluesky.ivf.gz
~@ycho nice analysis, do you know if/how libaom is behaving compared to rav1e for this?~ Nevermind, you just answered this before I asked :+1:
Hey~ thanks for asking, and for enjoying the art of screen-capturing!
@barrbrain, if a bsize straddles the frame border (with my dev branch), luma_ac() panics at https://github.com/xiph/rav1e/blob/8f273bcbde77e5f3711138a57c13dec9dc793973/src/encoder.rs#L1859. I think this is because subregion() in plane_regions.rs limits the w and h of the region to the tile (or frame, if there is one tile) w and h, https://github.com/xiph/rav1e/blob/8f273bcbde77e5f3711138a57c13dec9dc793973/src/encoder.rs#L1852.
A little progress: with all partitions in the bottom SB row forced to skip, the revised partition decision chooses 64x64, because splitting into four 32x32s is not available. Two of those 32x32s would have to be partitioned further by the definition of the AV1 spec, which topdown partitioning does not do.
Speed 9 is used, where 32x32 or 64x64 partition sizes are available. See the bottom SB row, where 64x64 can be used now even though it does not exactly match the frame size.
What's done:
1) Since rav1e currently doesn't allow accessing pixels outside the frame (or tile), prediction is done for inside pixels only. (However, as in other encoders and decoders, rav1e might need to access pixels outside the tile for encoding-time efficiency reasons.)
2) When computing the residue signal for the tx input, rav1e has no definition of what to do about pixels outside the frame; what I did is set the outside residue values to zero. So, for a tx-block that straddles the frame border, while intra prediction can be done for the whole tx-block, its residue block should not count pixels outside the frame.
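Point 2) can be sketched like this. It is a minimal sketch assuming plain row-major `u8` planes without padding; `residual_with_zero_fill` is a hypothetical helper, not the rav1e `diff()` signature.

```rust
/// Residual of a `bw` x `bh` transform block whose top-left sample is at
/// (x0, y0) in a `width` x `height` plane stored row-major in `src` / `pred`.
/// Samples outside the plane get a zero residual.
fn residual_with_zero_fill(
  src: &[u8], pred: &[u8], width: usize, height: usize,
  x0: usize, y0: usize, bw: usize, bh: usize,
) -> Vec<i16> {
  // Start from all zeros so the invisible part never needs to be touched.
  let mut dst = vec![0i16; bw * bh];
  let visible_w = bw.min(width.saturating_sub(x0));
  let visible_h = bh.min(height.saturating_sub(y0));
  for y in 0..visible_h {
    for x in 0..visible_w {
      let i = (y0 + y) * width + (x0 + x);
      dst[y * bw + x] = i16::from(src[i]) - i16::from(pred[i]);
    }
  }
  dst
}
```

Zero is only one possible fill value; it keeps the invisible samples from adding energy of their own to the forward transform input.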
The decoded image below is from dav1d, since aomdec failed to decode it (hence no image from AOMAnalyzer)
Then my next problem to solve is: why is the last SB column corrupted, except in the first SB row?
Since there is brightness-shifted content at SB position (9,1) (col, row), it seems that the ref value for DC prediction is obtained incorrectly. The brightness level gets higher, which means a ref DC value larger than the correct one is obtained.
So, "let mut residual_storage: Aligned<[i16; 64 * 64]> = Aligned::uninitialized();" in encode_tx_block() can leave the array filled with values > abs(255), I think any i16 value. If I init it with any random values in that range, it works with my test (filling the out-of-tile residue values, which are input to the fwd tx). Then the best values to fill the out-of-tile residue with are those that generate fewer bits, for example smaller magnitudes and fewer AC coeffs. Also, the convention that the encoder does prediction for the whole block then makes some sense, if the original source frame is reasonably padded on the bottom/right frame aprons with values that are close enough to the predicted values.
Fixed the corrupted last SB column.
I see that, when a tx-block straddles the frame border, inverse_transform_add() does not allow accessing outside the frame. In the function, output.rows_iter_mut(), which is based on RowsIterMut (i.e. the 2nd stage, the vertical 1D transform), does not return rows outside the frame and iterates only until remaining = plane.cfg.height as isize - self.y; runs out, which keeps the 2D transform from working correctly.
To investigate the corrupted pixels in my trial implementation of open partition, especially to see how things behave when the tx-block straddles the frame border, I am testing function by function along the major path of encoding and reconstructing a tx-block, i.e. encode_tx_block(). One step is rewriting diff() (which generates the residue signal between the original source and the predicted pixels in a block) as below.
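To illustrate the row clipping described above, here is a toy stand-in for such a row iterator. `ClippedRows` and `rows_visited` are hypothetical names and this is not the rav1e `RowsIterMut` implementation; it only mimics the "stop at the plane height" behaviour.

```rust
/// Toy row iterator over a row-major plane that stops at the plane height.
struct ClippedRows<'a> {
  data: &'a [u8],
  stride: usize,
  height: usize, // visible plane height
  y: usize,      // next row to yield
}

impl<'a> Iterator for ClippedRows<'a> {
  type Item = &'a [u8];
  fn next(&mut self) -> Option<Self::Item> {
    // Same idea as `remaining = plane.cfg.height as isize - self.y`.
    if self.y >= self.height {
      return None;
    }
    let data: &'a [u8] = self.data;
    let row = &data[self.y * self.stride..(self.y + 1) * self.stride];
    self.y += 1;
    Some(row)
  }
}

/// Rows actually visited when a transform of `tx_rows` rows starts at `y0`.
fn rows_visited(
  data: &[u8], stride: usize, height: usize, y0: usize, tx_rows: usize,
) -> usize {
  ClippedRows { data, stride, height, y: y0 }.take(tx_rows).count()
}
```

With a 360-row plane, a 64-row transform starting at row 320 only sees 40 rows, so the remaining 24 output rows of the vertical pass are never written.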
https://github.com/xiph/rav1e/blob/master/src/encoder.rs#L1066 called at https://github.com/xiph/rav1e/blob/master/src/encoder.rs#L1201
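    // Variant of diff() that reads through raw pointers, so width and height
    // do not have to be clipped to the visible plane region.
    fn diff2<T: Pixel>(
      dst: &mut [i16], src1: &PlaneRegion<'_, T>, src2: &PlaneRegion<'_, T>,
      width: usize, height: usize,
    ) {
      let stride1 = src1.plane_cfg.stride;
      let stride2 = src2.plane_cfg.stride;
      for y in 0..height {
        for x in 0..width {
          // SAFETY: relies on the plane allocation (including padding) covering
          // the whole width x height block; this is not checked here.
          unsafe {
            let v1 = src1.data_ptr().add(y * stride1 + x);
            let v2 = src2.data_ptr().add(y * stride2 + x);
            dst[y * width + x] = i16::cast_from(*v1) - i16::cast_from(*v2);
          }
        }
      }
    }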
While diff() at https://github.com/xiph/rav1e/blob/master/src/encoder.rs#L1066 does not seem to access pixels outside a frame via .rows_iter(), the diff2() function can be called w/o clipping the width and height of a tx-block that has pixels outside the frame. So, using unsafe { } access, we can access outside the frame w/o changing the current tile code, if required.
I've tried this open partition experiment on an older commit, from Nov 2018, where non-square partitions had not been introduced yet. https://github.com/ycho/rav1e/commits/708a806c_2018_1104
test_dec_dav1d.y4m.gz test.ivf.gz
This makes me think something in the coefficient (i.e. residue pixels) encoding stage is wrong for open partition, for example diff() or inverse_transform_add(), etc.
By allocating the Frame blocks for a multiple of the SB size (i.e. # of SBs x 16), some corruption has been fixed. This fix was applied because coefficient coding for a tx block on the frame boundary needs to use the coeff context even outside of the frame. Possibly there is still some info that is missed when encoding coeffs or other symbols.
Relevant changes in my work branches are at: https://github.com/ycho/rav1e/commit/1dbdf34831349a33c1df53ea79d45080a69a2a82 and https://github.com/ycho/rav1e/commit/d93a7e0c517a22cfc23294698fb0bec2a96a77e9 (for the Nov 2018 branch)
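The "# of SBs x 16" sizing can be sketched as below. This assumes 64x64 superblocks and 4x4 block-info units as used in this thread; `sb_aligned_mi` is a hypothetical helper, not the actual allocation code in the commits.

```rust
const SB_SIZE: usize = 64; // superblock size in pixels
const MI_SIZE: usize = 4; // one block-info unit covers 4x4 pixels
const MI_PER_SB: usize = SB_SIZE / MI_SIZE; // 16 units per superblock side

/// Number of 4x4 units to allocate along one dimension so that every
/// superblock, including a partially visible one, is fully covered.
fn sb_aligned_mi(pixels: usize) -> usize {
  let sbs = (pixels + SB_SIZE - 1) / SB_SIZE; // round up to whole SBs
  sbs * MI_PER_SB
}
```

For a 360-pixel height this gives 96 units instead of the 90 that cover only the visible area, so contexts for the invisible part of the last SB row have somewhere to live.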
48x48 size test input thankfully provided by Nathan, @negge. twitter.y4m.gz
New test images to test 2x2, 3x3, and 3x2 SB cases, using 64x64 open partitions on frame boundaries. Their width and/or height are not multiples of 64 pixels, and the number of pixels in the hor and ver directions is chosen such that the AV1 encoder has PARTITION_SPLIT as an option, i.e. the split is not mandatory, in the hor and/or ver direction.
128x112 112x112 176x112 176x176
test03_112x112.y4m.gz test03_128x112.y4m.gz test03_176x112.y4m.gz test03_176x176.y4m.gz
https://ibb.co/5jbhwCG https://ibb.co/P1stvTL https://ibb.co/W26mZqQ https://ibb.co/VttWLFt
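A quick way to check whether a frame size leaves the last 64x64 SB open (no mandatory split) along one dimension is sketched below. It assumes the pixel-domain reading of the hasRows / hasCols test, i.e. more than half of the SB must be visible; `last_sb_visible` and `open_sb_allowed` are hypothetical names.

```rust
const SB_SIZE: usize = 64;

/// Visible pixels of the last superblock along one dimension.
fn last_sb_visible(pixels: usize) -> usize {
  match pixels % SB_SIZE {
    0 => SB_SIZE, // frame size is a multiple of 64: fully visible
    rem => rem,
  }
}

/// True when the last 64x64 superblock may stay unsplit along this
/// dimension, i.e. more than half of it is visible.
fn open_sb_allowed(pixels: usize) -> bool {
  last_sb_visible(pixels) > SB_SIZE / 2
}
```

The test sizes above (112, 128, 176) all pass this check with 48 or 64 visible pixels in the last SB, whereas a 96-pixel dimension would leave exactly 32 visible pixels and force a split.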
Currently, I can reproduce corruption of u and v channel for 128x112 input, https://github.com/xiph/rav1e/files/4458167/test03_128x112.y4m.gz with the work branch https://github.com/ycho/rav1e/commit/b21c744437a9022a6312bf09d1b968c82e04ad39 and enc command: ./target/release/rav1e test03_128x112.y4m -o test.ivf -r test_rec.y4m --quantizer 50 --speed=9 --limit=1 --rdo-lookahead-frames=1 --low_latency
Currently, it works with 112x112 (i.e. wxh = 1.75 x 1.75 SBs), but does not work with 128x112 (2x1.75 SBs)
I've fixed the corruption finally!
Brand new test input to test 4x4 partition sizes (and all other sizes as well :) ). test03_254x254.y4m.gz
One example from the objective_1_fast test sequence set, where open partition happens, vs current master, which uses partitions that match the frame boundaries.
Intra frame:
Inter frame:
Topdown partition with bottom-up partition of frame boundary: At speed 6, I still have ~1.5% coding loss.
For one of the largest regressions, I captured some screenshots for "rush_hour_1080p25_60f.y4m", at QIndex 172 for the 1st frame.
First, the SB grid only, to let you toggle and see the quality difference on the last SB row (i.e. on the bottom frame boundary). The frame size is 1920x1080, so the last SB row has 56 visible pixels in height.
master:
open-partition branch
Then, partition info. master
open-partition branch
Please look at the "Bits" (per frame) info in the middle of the right panel, which shows 130,982 - 130,477 = 505 bits less than master.
Result at speed 2
Some consistent observations are: 1) coding gain in VMAF, 2) coding gains in 360p
Result at speed 1
Result at speed 0
Copying the updates from the daala weekly meeting at 9AM PST, June 3.
Investigation of why the open partition scheme introduces a coding loss of ~1.5% for both bottom-up and top-down. Visually checking with the bitstream analyzer, open partitions seem reasonably chosen during partition search, yielding fewer bits, but at larger QPs the quality seems worse compared to the reference master.
Experiment: use bottom-up partitioning for top-down on the frame boundary; otherwise, top-down just stops with a larger open partition.
And this leads me to think that the lambda used by rav1e might not be right for the frame boundary case, since it has never been trained on such cases, i.e. the partition cost for an SB on the frame boundary seems to be penalized too harshly. But this is a hypothesis, just an idea of mine that needs a test (and hopefully proof). I probably need to start by learning the origin of rav1e’s lambda and how it was formed.
FYI, current work branch is https://github.com/ycho/rav1e/commits/test_open_partition_use_bottomup_on_frame_boundary
For QIndex = 252 (but the QIndex for 1st frame is 231),
https://beta.arewecompressedyet.com/runs/master_af7f474fb@2020-05-25T16:43:25.801Z/objective-1-fast/rush_hour_1080p25_60f.y4m-252.ivf https://beta.arewecompressedyet.com/runs/test_open_partition_btmup_on_boundary_s6@2020-05-27T23:16:14.788Z/objective-1-fast/rush_hour_1080p25_60f.y4m-252.ivf
With SB grid . master
open-partition
With partition info master
open-partition
Result for still image sets, subset1 For awcy mono subset1, speed 0
There were several fixes for varying image frame sizes that are not found in objective-1-fast. https://github.com/ycho/rav1e/commit/805d68d9f521158c311b94185067b653bb99c3b5 https://github.com/ycho/rav1e/commit/aaa36d0b5bc189b32b51a5db0b04670dc9976023 https://github.com/ycho/rav1e/commit/3ead6b6feaabfc02885a15dae58a7187dec59b0d
They mostly address A) fixing the size of the tx contexts in w and h for the coefficient encoder and B) letting CFL access (read) the reconstructed luma pixels outside the frame to do luma_ac().
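For B), the sketch below shows a simplified 4:2:0 luma AC computation that tolerates blocks reaching past the frame. It is not rav1e's luma_ac(): `luma_ac_clamped` is a hypothetical helper, and clamping the coordinates is my assumption standing in for reads from an edge-replicated padded area.

```rust
/// Simplified 4:2:0 CfL luma AC: each output is the 2x2 luma sum scaled by 2,
/// minus the rounded block average. Reads past the visible area are clamped to
/// the last visible row / column (edge replication).
fn luma_ac_clamped(
  luma: &[u8], width: usize, height: usize,
  x0: usize, y0: usize, cw: usize, ch: usize,
) -> Vec<i16> {
  let at = |x: usize, y: usize| -> i32 {
    i32::from(luma[y.min(height - 1) * width + x.min(width - 1)])
  };
  let mut ac = Vec::with_capacity(cw * ch);
  let mut sum = 0i32;
  for cy in 0..ch {
    for cx in 0..cw {
      let (x, y) = (x0 + 2 * cx, y0 + 2 * cy);
      let v = (at(x, y) + at(x + 1, y) + at(x, y + 1) + at(x + 1, y + 1)) << 1;
      sum += v;
      ac.push(v as i16);
    }
  }
  // Subtract the rounded average so that only the AC part remains.
  let n = (cw * ch) as i32;
  let average = ((sum + n / 2) / n) as i16;
  for v in ac.iter_mut() {
    *v -= average;
  }
  ac
}
```

`cw` and `ch` are the chroma block dimensions, so the function reads a `2 * cw` x `2 * ch` luma area starting at (x0, y0).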
The work branch has moved to https://github.com/ycho/rav1e/tree/make_partition_strong, and I plan to split one large patch into many smaller ones. Thank you!
Adding more test images:
Another possibility for the coding loss with open partition is undefined SATD behavior when pre-screening intra modes. It is good that satd computes with unsafe { } (since it can cross the plane region), but I have had no chance to check what it does on frame boundaries. So, one way to exclude this factor is to skip the intra mode decision and fix all intra blocks to one mode, like DC pred. I will try that soon.
Checked this, and it does not seem to be the reason, judging from the awcy result for 1 frame only at speed 1.
AWCY for "--tune psnr" for subset1 at s0 and objective-1-fast at s1 will soon be available at following links:
Comparing inter frame (2nd frame), Qindex 172 given to encoder (actual 291) https://beta.arewecompressedyet.com/runs/master_af7f474fb@2020-05-25T16:43:25.801Z/objective-1-fast/rush_hour_1080p25_60f.y4m-172.ivf
master vs open-partition
Regarding whether an MV into the padded area is used, it seems so. Please look at the bottom-left corner: a bi-directional prediction, which has the MVs (shown in the right side panel under "Motion Vectors") (0,-70), (0,70). The blue arrow fetches from outside the frame, i.e. the padded area, I think.
All right, I think we found the clue to why it has not worked! I forgot that I had left in two lines of code that set the deblocking strength to zero, here: https://github.com/ycho/rav1e/blob/01f57b36019f1e8f0d93a543c5cda233211d6ee4/src/encoder.rs#L3088
Many times, when I begin a new video codec task, I turn off as many coding features as possible that could touch or interfere with other coding tools, including the three in-loop filters. This time the task was large for me, my whole mind and eyes were taken up with seeking the reason for the coding loss, and I completely forgot to turn one of the in-loop filters, the deblocking filter, back on.
I have not verified it yet, but awcy result comes soon!
Okay, my fix was correct and we finally start seeing the coding gains! YAY!
awcy speed 6, psnr, no frame reorder
PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
---|---|---|---|---|---|---|
-1.5582 | -1.7591 | -1.8580 | -1.5413 | -1.6655 | -1.5900 | -1.7712 |
(with encoding time increase 5.2% ~ 9.7%)

PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
---|---|---|---|---|---|---|
-2.0395 | -2.6533 | -2.9701 | -2.0350 | -2.1050 | -1.9871 | -2.2677 |
PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
---|---|---|---|---|---|---|
-0.4097 | N/A | -0.6440 | -0.3641 | -0.4035 | -0.3906 | -0.5393 |
awcy, speed 1 (all bottom up), low latency, psnr
PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
---|---|---|---|---|---|---|
-1.6251 | -2.3216 | -2.4041 | -1.6014 | -1.7150 | -1.6379 | -2.0424 |
If I don't use bottomup search on the frame boundary and use the same topdown search, there is a big coding loss: topdown search cannot split and look ahead more than one level deep when a split is required by the AV1 spec, so just the largest partition (64x64) is used.
awcy, default speed 6. This mostly happens with 360p test sequences.
PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
---|---|---|---|---|---|---|
2.3520 | 0.6730 | 1.1094 | 1.5370 | 2.3579 | 1.7963 | 1.2295 |
master vs open-partition
1st frame (key frame), speed 6, master vs open-partition (without using bottomup on the frame boundary for topdown), smallest QIndex = 80 in awcy.
2nd frame (i.e. 1st inter frame), speed 6, master vs open-partition (without using bottomup on the frame boundary for topdown), smallest QIndex = 80 in awcy.
And most superset awcy result, speed 0
PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
---|---|---|---|---|---|---|
-1.8174 | -2.0760 | -2.1534 | -1.7602 | -1.8175 | -1.7785 | -2.0376 |
speed_bag_640x360_60f.y4m, QIndex 252, speed 0
master vs open-partition, frame 1 (key frame, intra frame) (read: cyan color is for SB grid, white is for partition split, and yellow is for tx-split.)
master vs open-partition, frame 2 (i.e. 1st inter frame)
As a sanity check, master vs "open-partition branch with old partition searches"
- +0.0041% for s2, and -0.0107% for s0. // min block size 4x4
- 0.0% change for s3, s5, s6, s9 // min block size 8x8
- 0.0% change for s9 // min block size 32x32

More screenshots from the bitstream analyzer, showing how open-partition helps reduce the bit rate by using larger partition sizes and thus encoding fewer MVs, in addition to less partition info.
speed_bag_640x360_60f.y4m, QIndex 252, speed 0
master vs open-partition, 4th frame
(bitstreams: master, open-partition)
Regarding the regression with topdown partition search, i.e. speed levels 2~10, I think I found a better explanation and a possible improvement for it. Previously, I explained that the reason is that the topdown search, as revised in the open-partition branch, does not split when it looks ahead and finds that any of its child partitions requires a mandatory split by the AV1 spec. This is due to the nature of the algorithm; if we ever changed it with major revisions so that it could, we would realize it is doing the task that bottomup search achieves, in a more difficult way.
The better reason I've found now is that topdown for open-partition does not use non-square partitions at all. This means that when topdown is able to split down to the smallest possible partition size on the right or bottom frame boundaries, it must use square partitions even when non-square is the best choice in the RDO sense. Note the important fact that the mandatory split required by AV1 (i.e. either hasRows or hasCols is false in decode_partition() in https://aomediacodec.github.io/av1-spec/#decode-partition-syntax) applies to square partitions only, so non-square partitions can be used on frame boundaries under any form of straddling. And for most input content that needs smaller partitions on the bottom/right frame boundaries, non-square partitions are more likely to be the better choices in the RDO sense when the SB straddles those frame boundaries.
FYI, we stopped using non-square partitions for topdown some time ago, because using them in the middle of splitting down leads to worse partition decisions (coding loss). We have also tried using non-square partitions for terminal cases, i.e. the smallest partition sizes allowed, which gave a coding gain, but the feature was disabled for the sake of codebase maintenance.
Now the possible fix is to move the lookahead (in rdo_partition_simple()) that checks whether a child partition requires a mandatory partition into encode_partition_topdown(), right before it calls rdo_partition_decision(). That way, if any child requires a mandatory partition, we drop PARTITION_SPLIT (i.e. stop further quad-partitioning) and add PARTITION_VERT or PARTITION_HORZ (depending on the has_cols and has_rows conditions) to the partition_types parameter that is passed to rdo_partition_decision().
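The candidate selection in the proposed fix could look roughly like this. It is a sketch under my own assumptions, not the code that landed: `PartitionType` and `topdown_candidates` are hypothetical, and I assume that a child missing rows maps to PARTITION_HORZ and a child missing cols maps to PARTITION_VERT.

```rust
/// Hypothetical stand-in for the partition types discussed here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PartitionType {
  None,
  Horz,
  Vert,
  Split,
}

/// Candidate list for the RDO partition decision of a block on the frame
/// boundary. `child_has_rows` / `child_has_cols` say whether the children of
/// a would-be split pass the has_rows / has_cols checks themselves.
fn topdown_candidates(
  child_has_rows: bool, child_has_cols: bool,
) -> Vec<PartitionType> {
  let mut types = vec![PartitionType::None];
  if child_has_rows && child_has_cols {
    // No child is forced to split further: keep the quad split.
    types.push(PartitionType::Split);
  } else {
    // Drop the split and offer non-square partitions instead.
    if !child_has_rows {
      types.push(PartitionType::Horz);
    }
    if !child_has_cols {
      types.push(PartitionType::Vert);
    }
  }
  types
}
```

The point of this shape is that the topdown search never recurses into a child it cannot finish, yet still has something smaller than the full block to evaluate.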
Completed with #2396.
Affected areas of codebase can be:
- [x] Intra prediction to predict visible pixels only
- [ ] Inter prediction (Motion Estimation) to predict visible pixels only
- [ ] Define the pixel values for the invisible area as input to forward transforms, i.e. what kind of padding to use for input and reconstructed frames? Already defined: extension of the directly adjacent, last available pixel.
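The padding named in the last item (extension of the last available pixel) can be sketched as follows, assuming a row-major `u8` plane; `pad_by_edge_extension` is a hypothetical helper and not the rav1e padding routine.

```rust
/// Pad a `width` x `height` row-major plane to `padded_w` x `padded_h` by
/// repeating the last visible column and the last visible row.
fn pad_by_edge_extension(
  src: &[u8], width: usize, height: usize, padded_w: usize, padded_h: usize,
) -> Vec<u8> {
  assert!(width > 0 && height > 0 && padded_w >= width && padded_h >= height);
  let mut out = Vec::with_capacity(padded_w * padded_h);
  for y in 0..padded_h {
    // Rows below the frame reuse the last visible row.
    let row = &src[y.min(height - 1) * width..][..width];
    out.extend_from_slice(row);
    // Columns right of the frame reuse the last visible sample of the row.
    out.extend(std::iter::repeat(row[width - 1]).take(padded_w - width));
  }
  out
}
```

Padding a 640x360 plane to 640x384 this way would give the bottom SB row a full 64 rows of defined samples.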