xmar / 360Transformations


General questions #10

Open gsabran opened 7 years ago

gsabran commented 7 years ago

Hi.

Thanks for putting this all together. This is extremely interesting. I've thought quite a bit about the problem, without implementing things directly. The two biggest questions I had and on which I was hoping to get your thoughts are:

gwendalsimon commented 7 years ago

Hey,

Here are some thoughts on your questions.

Regarding switching representations before the end of a segment: I guess it depends on the decoder, which bears the whole burden. In a given representation, some areas (A) are heavily distorted and others (B) are high quality. If you suddenly (in the middle of a GOP) pick another representation where A becomes high quality and B heavily distorted, the decoder is quite likely to misbehave: the macroblocks are different, and so are the prediction errors, so the motion vectors are probably totally different. But since both representations share some structural similarities (the original content is the same), you may be lucky and the decoder may not crash. It is worth testing. If it works reasonably well, the advantage would be that we could use longer segments (5 seconds?), since intra-segment switching could be reserved for the extreme cases of very bad head movement prediction.
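To make the "extreme cases only" idea concrete, here is a minimal sketch of a client-side rule that allows a mid-segment switch only when the head movement prediction was badly wrong. Everything in it is an assumption for illustration: the function names, the use of yaw alone, and the two thresholds are hypothetical, and whether the decoder survives the switch still has to be tested.

```python
def yaw_error_deg(predicted_yaw: float, actual_yaw: float) -> float:
    """Smallest absolute difference between two yaw angles, in degrees."""
    diff = (actual_yaw - predicted_yaw + 180.0) % 360.0 - 180.0
    return abs(diff)


def should_switch_mid_segment(
    predicted_yaw: float,
    actual_yaw: float,
    seconds_left: float,
    max_error_deg: float = 90.0,    # hypothetical threshold
    min_seconds_left: float = 1.0,  # hypothetical threshold
) -> bool:
    """Allow an intra-segment switch only for a large prediction error,
    and only if enough of the segment remains for the switch to pay off."""
    bad_prediction = yaw_error_deg(predicted_yaw, actual_yaw) > max_error_deg
    return bad_prediction and seconds_left >= min_seconds_left
```

With a rule like this, a client on 5-second segments would keep the current representation for small errors and only risk the mid-GOP switch when the user is looking far away from the high-quality area.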

Regarding your proposal of making two different "sub-representations" of the same representation: an I-less sub-representation (for a client that watched the same representation during the previous segment) and an I-full sub-representation (for a client that has just switched from another representation). This can probably be implemented in DASH, since the two sub-representations can be offered as if they were two independent representations (same resolution, same quality target, but different bit-rates and different decodability conditions). I don't know how to make this last point (the decodability conditions) explicit. The cons: you double the number of videos to process and store at the server. The pros: for clients that do not switch representation from one segment to the next, the data delivery saving is equal to the size difference between an I-frame and a P-frame, per segment. This gain depends on the encoder you use (some encoder vendors are known to generate large P-frames). Note also that this requires the last frame of each segment to be exactly the same in both sub-representations (so that the I-less sub-representation can be selected for the next segment).
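As a rough illustration of the selection rule and of the saving, here is a small sketch. The names (`choose_variant`, `bytes_saved`), the representation ids, and the frame sizes are all made up for the example; this is not how DASH signals anything, only the client-side logic described above.

```python
from typing import List, Optional


def choose_variant(prev_rep_id: Optional[str], next_rep_id: str) -> str:
    """Pick the I-less variant only when the client stays on the same
    representation; otherwise it needs the I-full variant to start decoding."""
    if prev_rep_id is not None and prev_rep_id == next_rep_id:
        return "i_less"
    return "i_full"


def bytes_saved(history: List[str], i_frame_bytes: int, p_frame_bytes: int) -> int:
    """Total saving over a session: one (I - P) size difference for every
    segment fetched without switching representation."""
    saved = 0
    prev: Optional[str] = None
    for rep_id in history:
        if choose_variant(prev, rep_id) == "i_less":
            saved += i_frame_bytes - p_frame_bytes
        prev = rep_id
    return saved


# Illustrative numbers only: 60 kB I-frames, 15 kB P-frames.
session = ["front", "front", "left", "left", "left"]
print(bytes_saved(session, 60_000, 15_000))
```

In this made-up session, three of the five segments are fetched without a switch, so the client saves three I-minus-P differences. The closer the P-frame size is to the I-frame size, the smaller the gain.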

Regarding your region-based proposal, I understand that it is similar to the concept of "tiles", which is offered in HEVC. Many recent VR proposals are based on it: the client selects the quality for each tile. See, for instance, what TNO and Harmonic showed at the last IBC conference. In the long paper describing our proposal, we give references to related papers and discuss such proposals.
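For what "the client selects the quality for each tile" could look like, here is a minimal sketch. It assumes an equirectangular frame cut into a regular grid of tiles and picks "high" for tiles whose center is within a given angle of the viewing direction. The grid size, the angle threshold, and the function names are hypothetical; real tile-based systems use their own layouts and selection policies.

```python
import math
from typing import Dict, Tuple


def angular_distance_deg(yaw1: float, pitch1: float, yaw2: float, pitch2: float) -> float:
    """Great-circle angle between two viewing directions, in degrees."""
    y1, p1, y2, p2 = (math.radians(v) for v in (yaw1, pitch1, yaw2, pitch2))
    cos_d = math.sin(p1) * math.sin(p2) + math.cos(p1) * math.cos(p2) * math.cos(y1 - y2)
    cos_d = max(-1.0, min(1.0, cos_d))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos_d))


def select_tile_qualities(
    view_yaw: float,
    view_pitch: float,
    cols: int = 4,                  # hypothetical grid size
    rows: int = 2,
    high_within_deg: float = 60.0,  # hypothetical threshold
) -> Dict[Tuple[int, int], str]:
    """Map each (row, col) tile to "high" or "low" based on how far its
    center is from the current viewing direction."""
    qualities = {}
    for r in range(rows):
        for c in range(cols):
            center_yaw = -180.0 + (c + 0.5) * 360.0 / cols
            center_pitch = 90.0 - (r + 0.5) * 180.0 / rows
            d = angular_distance_deg(view_yaw, view_pitch, center_yaw, center_pitch)
            qualities[(r, c)] = "high" if d <= high_within_deg else "low"
    return qualities
```

Compared with switching whole representations, the client here makes one decision per tile, so a wrong head movement prediction only costs quality on the tiles that were mispredicted.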

Hope it helps,

gsabran commented 7 years ago

Yes that's great and very interesting. Thanks. I'll read the references and let you know if I manage to pull something interesting up.