General questions - Githubissues

gsabran commented 7 years ago

Hi.

Thanks for putting this all together. This is extremely interesting. I've thought quite a bit about the problem, without implementing things directly. The two biggest questions I had and on which I was hoping to get your thoughts are:

Is it possible to switch between representations before the end of the current segment? When the user turns fast for instance, the already streamed current segment is not very relevant anymore and it's be great to switch to the representation that corresponds to the new orientation without waiting 2s, if possible not streaming from the beginning of the segment frames that correspond to past time. The idea I had (which is not in ligne with the DASH framework) was something like:
- each representation has two set of segment sets:
  - one where all frames are normal frames, with no key frames that we can represent like ..... ..... ..... ..... .....
  - one where all segments start with a key frame: |.... |.... |.... |.... |.....
- when a representation starts to stream, a segment from the second set is streamed
- when the same representation continues to stream, we use segments from the first set.
- The idea is that while we're on the same representation, we don't stream expensive keyframes, which in turn make is possible to have much shorter segments, which would make the direction-adaptive streaming more reactive.
Do you think there'd be issues with how the required storage and processing scale with the number of representations? For instance Facebook says they use 30 different viewports, which mean 30 times the storage and processing. So the complexity is Number of viewports * Number of available qualities What's your take on an approach that would be more similar to how google map loads data, where the space is split in regions and each is independently streamed, and represented in different quality settings. If we then need things to be more precise, this would only mean have smaller regions which would have smaller file size. So complexity would be Number of viewports * Number of available qualities / Number of viewports = Number of available qualities. That has the downside of having to stream from a number of sources and putting things together on the client, but would offer much more freedom on what is streamed and which part of the viewport should has it's quality changed.

gwendalsimon commented 7 years ago

Hey,

Here are some thoughts to your question.

Regarding the representation switch before the end of the segment. I guess that it depends on the decoder, which gets all the burden. In a given representation, some area A are very distorted and some other area B are high quality. If suddenly (in the middle of a GOP), you pick another representation where A switches to high quality and B to very distorted, the decoder has good chances to become crazy (the macroblocks are different, the prediction error too, so the motion vector are probably totally different). But since both representations share some structural similarities (the original content is the same), you may be lucky and make the decoder not crash. It is worth testing. In the case it works relatively well, the advantage would be that we could have longer segments (5 seconds?) since we could implement this intra-segment switching process for the extreme cases of very bad head movement prediction.

Regarding your proposal of making two different "sub-representations" of the same representation: an I-less sub-representation (in the case the client watched the same representation at the last segment) and an I-full sub-representation (in the case the clients just switched from another representation at the previous segment). This proposal can probably be implemented in DASH since the two sub-representations can be offered as if they were two independent representations (same resolution, same quality target but different bit-rate and different conditions of decodability). I don't know how to make sure that this latter option is explicit. The cons are: You multiply by two the number of videos to process and store at the server. The pros are: for the clients that do not switch representation from one segment to another, the overall data delivery reduction is equal to the size difference between an I frame and an P frame per segment. This gain depends on the encoder you use (some encoder vendors are known to generate large P-frames). Note also that this requires the last frame of the segments to be exactly the same on both sub-representations (to allow potential selection of I-less sub-representation for the next segment).

Regarding your region-based proposal, I understand that it is similar to the concept of "tiles", which is offered in HEVC. Many recent VR proposals are based on it: the client selects the quality for each tile. See for instance what TNO and Harmonic have shown at the last IBC conference. In the long paper of our proposal, we give some references to papers and we discuss such proposal.

Hope it helps,

gsabran commented 7 years ago

Yes that's great and very interesting. Thanks. I'll read the references and let you know if I manage to pull something interesting up.

xmar / 360Transformations

General questions #10