xbmc-imx6 / xbmc

XBMC Main Repository
http://xbmc.org

1080i deinterlace issue #70

Open Gobam opened 10 years ago

Gobam commented 10 years ago

Interlaced HD videos (1080i) are not processed properly by the existing deinterlace options in the XBMC GUI. All of the existing options cause heavy frame drops/skips. The same videos play properly on other platforms (Raspberry Pi, ION). Thanks!

wolfgar commented 10 years ago

Hi, you are right. Thanks for opening this ticket so that we can properly track this remaining issue...

Ryo99 commented 10 years ago

Do these commits to master-pr fix this issue? 38626442fdbf428848487a86a824b84bcb5fc4de e384ec1fd13254176ed91b6ea96802f2eae3bd6c 5daa4b1a87e04b37a5cf8a742611dcadff4f79a0 091b6e8aa9ca4de184e78455a06e658d3a636d60

smallint commented 10 years ago

No. Deinterlacing of HD streams suffers from either limited memory bandwidth or a bad implementation. I now have my Cubox up and running with an up-to-date kernel and will continue to work on this in the future. If the latter is the case, I already have some initial work in place towards an additional mixer thread that could help. In case of hardware limitations (VPU/IPU/GPU utilization) we have to test other options. Stephan already proposed some kernel tweaks, but all this needs time and proper testing.

susisstrolch commented 10 years ago

@smallint - can you state more precisely what you mean by "up-to-date kernel"?

piotrasd commented 10 years ago

Any progress with the deinterlace issue? For now I think this is the biggest problem :(

jillesv commented 10 years ago

It is like there is no interest in solving this issue. For me it is also very important that 1080i Live TV playback works. That is why I bought this device in the first place. My Raspberry was not powerful enough to play the 1080i stream. I thought this device was powerful enough, but now there are other issues that prevent me from watching Live TV. This issue is also not on the known issue list for Openelec and Geexbox... I don't understand why... I think it is really important.

koying commented 10 years ago

@jillesv Your priorities might not be the priorities of everybody... I personally don't care for instance, and the devs who care may have more important issues at hand.

rabeeh commented 10 years ago

Maybe someone can shed some light on where the bottlenecks are? Previously smallint mentioned that it suffers from either memory bandwidth or a bad implementation; can anyone provide more details? jillesv - for completeness, I suggest you provide a link to some 1080i test content to benchmark against.

jillesv commented 10 years ago

Link to test TS 1080i file: http://jillesathome.nl:5050/fbsharing/lwHy0Rta

wolfgar commented 10 years ago

Hi there,

@jillesv: first, you are right, the issue is annoying for the live TV use case and is definitely real. This thread is about it, so there is no attempt to hide it. But maybe distros should stress it and link to this place... I understand that you might be frustrated and would like this issue to be solved already, but you also have to understand that the people who contribute to the development are not the vendors of imx6 devices and do it for free in their spare time. So things are sometimes slow to happen.

@rabeeh: in fact the issue is that decoding a frame + deinterlacing it (using the IPU, which has to split it into several blocks because of the VDIC resolution limitation) + rendering sometimes takes longer than a frame duration, hence the jerky effect. We are not very far from getting it to work smoothly, so before significantly changing the current implementation we wanted to check whether tweaking the AXI priority can do the trick (deinterlacing does require additional memory bandwidth).

If it cannot, then we would have to change the implementation. Today deinterlacing is performed in the rendering thread and delays the rendering of the frame. We could try to improve on this with a scheme where, in the nominal case, during one frame duration: frame n+2 is decoded, frame n+1 is deinterlaced, and frame n is rendered. Of course in the end we still have to do the 3 actions within a frame duration, but here we could deinterlace and render at the same time (as far as shared hw resources allow, of course), while in the current way of working they are purely sequential operations. So conceptually it could work, but implementing it requires good synchronization and can be tricky...
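To make the difference between the two schemes concrete, here is a small timing model in Python. It is only a sketch: the per-stage costs are made-up placeholder numbers, not measurements, and it assumes that decoding already runs on its own thread and that the stages do not slow each other down through shared hardware (which is exactly the caveat raised above).

```python
def frame_period_ms(threads):
    """Steady-state time per frame: the slowest thread sets the pace.

    `threads` is a list of threads, each given as a list of per-frame
    stage durations (in ms) that run back to back on that thread.
    """
    return max(sum(stages) for stages in threads)


# Hypothetical per-frame costs in ms (illustrative only, not measured).
DECODE, DEINTERLACE, RENDER = 30.0, 25.0, 20.0
FRAME_BUDGET_MS = 1000.0 / 25  # 40 ms per frame at 25 fps

# Current scheme: deinterlacing runs inside the rendering thread.
current = frame_period_ms([[DECODE], [DEINTERLACE, RENDER]])

# Proposed scheme: decode, deinterlace and render overlap on different frames.
proposed = frame_period_ms([[DECODE], [DEINTERLACE], [RENDER]])

print(f"current:  {current} ms per frame (budget {FRAME_BUDGET_MS} ms)")
print(f"proposed: {proposed} ms per frame (budget {FRAME_BUDGET_MS} ms)")
```

With these placeholder numbers the current scheme misses the 40 ms budget while the overlapped one fits. On real hardware the stage times are not independent, so the gain could be smaller.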

jillesv commented 10 years ago

Wolfgar, Thank you for your explanation!

piotrasd commented 10 years ago

So we are waiting for a fix :) It is good to know there is hope :)

Anyway, I recently tested the latest build, and deinterlacing for SD is also not perfect: the picture is not as smooth as without deinterlacing. When you watch, it feels like small frame drops or jerkiness.

wolfgar commented 10 years ago

Yes, there is hope for sure! ;) For SD, there are in fact 2 different algorithms: try selecting one of them instead of automatic, you may get better behavior. Also, what is your board?

piotrasd commented 10 years ago

MatrixTV v2.1, and I used the latest MatrixTV OS v1.0.0.9 and also the latest OpenELEC (and an OpenELEC I compiled myself with the latest xbmc imx6 master branch). By algorithms, do you mean half and full deinterlace from the video settings in XBMC? I tried both and the effect is the same. "Jerky picture" is maybe too strong a term, but it is especially visible when text is scrolling, for example the end credits of movies or an info bar at the bottom of the picture in a news program. Without deinterlacing everything is smooth. I can record some samples for testing.

wolfgar commented 10 years ago

Yes, I was referring to half and full deinterlace from the video settings. These are XBMC names, but the 2 algorithms are wired to 2 different configurations of the VDIC motion engine. That's why I suggested trying both: one should behave better for fast motion scenes... OK, we will try to improve this.

wolfgar commented 10 years ago

@piotrasd : you should have received a mail from me

smallint commented 10 years ago

Sorry for my long absence. I am currently setting up a build environment on my Cubox with Archlinux, and xbmc is already compiled. I am going to look into the deinterlacing issue and will try different methods to test and improve performance. This is scheduled for Monday. @wolfgar: Should I fork master or is there a better branch to use with regard to the upcoming PR? I think we will need to include new enumerations for deinterlacing with better string representations. The easier later merges are, the better.

wolfgar commented 10 years ago

@smallint: Nice to see you back! As you have seen, I have removed the deinterlacing support from the initial PR in order to add proper imx dedicated values (in the enum), and I thought it would be the opportunity to work on this remaining issue before pushing it upstream... If you want to have a look, you are very welcome of course. I would advise starting from the imx-pr branch as it is the candidate for the upstream PR...

smallint commented 10 years ago

I have tested deinterlacing with a dedicated mixer thread that does deinterlacing in parallel to decoding and rendering. The results are not very promising but I will investigate further. The synchronization itself is quite tricky and once this is stable I will do more tests on performance. What I can say is that the performance with my old kernel (3.0.xx) on the Wandboard and the Yocto build was much faster. Not sure if the kernel frequency setting did the trick before. The hardest part is to do proper logging of all threads to analyze possible bottlenecks. Stay tuned ...

wolfgar commented 10 years ago

Hi Jan, yes, this approach was expected to be tricky: in particular, if we end up waiting for deinterlacing before submitting a new picture, the whole approach is useless.

Regarding your performance, I am unsure about the root cause : Maybe the kernel config CONFIG_MX6_VPU_352M ? or maybe there are deeper changes in IPU driver for new kernels (I have not checked, I had a good understanding of the 3.0.35 driver because I had to look at it to remove the flickering lines but I don't know if major changes were introduced in the >3.10 kernels)

Take your time, as I know you are working on it there is no risk of duplicate work and do not hesitate to tell me if I can help in any way...

smallint commented 10 years ago

Currently I still need to figure out exactly how the synchronization performs. I am going to build a dummy deinterlacer (e.g. sleep(30ms)) and check whether playback is still smooth. If so, I can assume that both pipelines really run in parallel and do not block each other; frames should then come out at an interval equal to the longer of the two processing times (VPU or IPU). If the performance then drops dramatically once I introduce the real IPU, it is likely that we are dealing with hardware limitations.

Take my word that I am continuously working on that during the next weeks and will let you know how things are going.
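The expected outcome of the dummy-deinterlacer experiment can be checked on paper with a tiny simulation. This is only a sketch of an ideal two-stage pipeline with an unbounded queue between the stages. The 30 ms deinterlace time is the sleep value mentioned above, while the decode times are made-up placeholders.

```python
def output_times(n_frames, decode_ms, deinterlace_ms):
    """Completion time (ms) of each frame in an ideal two-stage pipeline."""
    times = []
    decode_done = 0.0
    deint_done = 0.0
    for _ in range(n_frames):
        # The decoder never waits for the deinterlacer (unbounded queue).
        decode_done += decode_ms
        # The deinterlacer starts once the frame is decoded and it is idle.
        deint_done = max(decode_done, deint_done) + deinterlace_ms
        times.append(deint_done)
    return times


def steady_interval(times):
    """Interval between the last two output frames."""
    return times[-1] - times[-2]


# Dummy deinterlacer of 30 ms; decode times are hypothetical.
fast_decode = steady_interval(output_times(50, decode_ms=20.0, deinterlace_ms=30.0))
slow_decode = steady_interval(output_times(50, decode_ms=35.0, deinterlace_ms=30.0))

print(fast_decode)  # paced by the deinterlacer
print(slow_decode)  # paced by the decoder
```

If the measured output interval is close to the sum of both stage times rather than the larger of the two, the stages are blocking each other somewhere in the synchronization.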

cmichal2 commented 10 years ago

I recompiled the Geexbox kernel with CONFIG_MX6_VPU_352M, and it makes a big improvement. It doesn't completely fix the issue, but obvious glitches in 1080i videos are much less frequent than without it.

piotrasd commented 10 years ago

Maybe kernel 3.14 resolves some problems.

wolfgar commented 10 years ago

@cmichal2 : Thanks for your feedback : Which board and which kernel do you use for your test @352M ?

cmichal2 commented 10 years ago

I'm using a Cubox-i4. The kernel is 3.10.30 from geexbox-devel

wolfgar commented 10 years ago

Thanks a lot, I asked because contrary to the 3.0.35 version, the 3.10 freescale kernel seemed to ignore this configuration option... Thanks for your answer, I will have a deeper look at the 3.10.30 kernel used by geexbox for cuboxi...

cmichal2 commented 10 years ago

You know, I went grepping through the source, and I couldn't find anywhere that that configuration option actually affected anything. But honestly, it does seem to make a big difference.

wolfgar commented 10 years ago

Hehe, same conclusion from my previous look at the source (while in 3.0.35 this config was clearly taken into account in the code). For 3.10, maybe there is something hidden/correlated with the dts when we select this option, I will check.

Anyway, many thanks for confirming that it has the expected effect! (Even if it is not enough yet, any improvement is good at this stage...)

cmichal2 commented 10 years ago

I'll be interested to hear if you can find any effect of that config option. I'm a little unsure now of what's going on. After xbmc (and tvheadend) have been running for a day the glitches seem to be more frequent again. I tried each kernel, and immediately after booting it does seem as though with the clock speed option it is better, but I'm not that confident. Could there be a memory management aspect to this?

smallint commented 10 years ago

Funny, while working on the deinterlacing issue I came to the conclusion that the IMX6 is not powerful enough with the current kernel configuration (3.10.30, ArchLinux) to run VPU, IPU and GPU at the same time with 1080i. This is related either to limited memory bandwidth or to the IPU stalling the VPU. That's why I wanted to apply the CONFIG_MX6_VPU_352M option as well. But checking the sources, I could not find any place where this option is used at all. So from my understanding this option cannot change anything unless there is some real magic done.

I am about to publish some numbers soon. What I would like to have is a small performance test tool built with the XBMC libraries to decode and process a stream using the IMX6 codec. Does anything like that exist somewhere?

rabeeh commented 10 years ago

@smallint we had a fix ages ago for LK 3.0.35 to improve the quality of service of the internal buses between the IPU and DDR. Can you please check that register by running:

devmem 0x020e0018

If you want to get the best QoS for the IPU, then try running:

devmem 0x020e0018 32 0xffffffff
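For anyone without the devmem tool at hand, the read can be approximated in Python by mapping /dev/mem. This is a rough sketch under assumptions: it needs root on the target board, the kernel must allow /dev/mem access to that range, and the unpack may not be performed as a single 32-bit bus access. Only the register address is taken from the commands above; the helper names are made up for illustration.

```python
import mmap
import os
import struct

IPU_QOS_REG = 0x020E0018  # address quoted in the devmem commands above


def split_phys_addr(addr, page_size=mmap.PAGESIZE):
    """Split a physical address into a page-aligned base and an offset."""
    base = addr & ~(page_size - 1)
    return base, addr - base


def read_reg32(addr, dev="/dev/mem"):
    """Read one 32-bit little-endian register (hypothetical helper)."""
    base, offset = split_phys_addr(addr)
    fd = os.open(dev, os.O_RDONLY | os.O_SYNC)
    try:
        # Map only the page that contains the register, read-only.
        with mmap.mmap(fd, mmap.PAGESIZE, mmap.MAP_SHARED,
                       mmap.PROT_READ, offset=base) as mem:
            return struct.unpack_from("<I", mem, offset)[0]
    finally:
        os.close(fd)


if __name__ == "__main__" and os.path.exists("/dev/mem") and os.geteuid() == 0:
    print(hex(read_reg32(IPU_QOS_REG)))
```

Writing the register is deliberately left out here; for that, the devmem command quoted above is the safer choice.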

@wolfgar Please look at LK 3.14.14 too. It performs way better on the i.MX6 hardware. Jon Nettleton and RMK have added lots of fixes for CMA, GPU and IPU, and one feature that boosted the performance on the HummingBoard-i1 (the single CPU one) was the BFQ scheduler, which made it possible to decode 80Mbps h.264 content.

smallint commented 10 years ago

@rabeeh The performance problem does not seem to be related to the IPU, which processes frames fine within the given time frame, but to the VPU. Its performance decreases significantly if the IPU is active, and the decode times no longer fit into one frame (40 ms at 25 fps). So we need to find a good balance between the two. Is LK 3.14.14 already available as a PKGBUILD for Arch?

pepedog commented 10 years ago

Not in the official git or repo. Source plus pkg are at http://myplugbox.com/new/. I am having trouble booting hbi1 with it.

smallint commented 10 years ago

Thanks a lot, I will try it out. Btw, what is hbi1?

rabeeh commented 10 years ago

@pepedog Only LK 3.14.x has good support for HummingBoard. @smallint hbi1 is HummingBoard-i1 (i.e. the HummingBoard carrier board with the solo i.MX6 MicroSOM).

smallint commented 10 years ago

Is the current firmware compatible with LK 3.14.x or to put it in other words: can I just replace 3.10.x with 3.14.x and it should work?

pepedog commented 10 years ago

With arch it is comparable. Remember arch has very few -dev or -devel packages, they're bundled with the main pkg (xorg-server-devel is a notable exception).

smallint commented 10 years ago

@pepedog Thank you very much. I have already recompiled the current kernel on Arch, so no issue at all here. XBMC also compiles fine. I just want to replace my kernel with your package and check the performance difference.

pepedog commented 10 years ago

Will be interested if you can point out any improvements @smallint

wolfgar commented 10 years ago

Hi there,

@smallint : I have sent to you a private email with additional detailed data regarding the way to configure qos for the different blocks

Unfortunately, here we use the IPU for 2 different use cases: the first one uses the DP to display the fb on the HDMI interface, the other one uses the VDI block to deinterlace our fields. But we have a common qos for both (well, at least for reads... writes should only be used by the vdic, I think).

@cmichal2 : Apart from the CONFIG_MX6_VPU_352M option, have you changed other options compared to the default geexbox configuration ? Especially, have you changed the CONFIG_PREEMPT config ? I ask because I cannot find how the option VPU_352M would be handled and I wonder whether another option would be responsible for your improved behavior...

cmichal2 commented 10 years ago

I enabled highmem, and heavily patched the au0828 tuner driver, but that's all. And those changes are the same in the two kernels I'm comparing. It is possible I imagined the improvement - I'll try to do a better comparison of the two kernels sometime in the next few days, will test Rabeeh's suggestion of setting the ipu qos register as well.

wolfgar commented 10 years ago

Thanks for your answer.

As smallint rightly explained, it is not only about the IPU QoS: his tests show that the time the VPU needs to decode a frame increases significantly when the IPU is in use in parallel. As the most obvious common resource is the memory bus, it is possible that, on the contrary, the IPU priority is too high at the moment.

Just to give some additional feedback regarding Rabeeh's suggestion: forcing all IPU accesses to max priority is especially useful when you experience underrun, i.e. when the stream does not arrive in time to be displayed and you lose HDMI sync (the effect is that your screen turns black for a moment). And it was all the more useful when I did not use the GPU to combine video and GUI but used the DP in the IPU to do so. At that time 2 streams had to arrive on time to be combined on the fly and displayed...

smallint commented 10 years ago

[Figures: imx6-vpu-ipu-gpu-perf, imx6-vpu-ipu-perf]

Here are two plots showing the processing times of a 1080i stream. The first figure is with the GPU active while the second is without (frames are not rendered). You can easily see in the first figure how the green plot (VPU decode times per frame in ms with the IPU active in parallel) is mostly above 40 ms, while the red plot (VPU decode times per frame in ms without the IPU active) is well below this limit. The IPU performs fine and is well below the limit of 40 ms.

These figures are for LK 3.10.30 on a CuBox-i4Pro under ArchLinux.
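For readers who want to compare their own logs against the 40 ms limit, a summary along these lines may help. It is a sketch only: the sample values below are placeholders invented for illustration and are not the data behind the plots.

```python
from statistics import mean

FRAME_BUDGET_MS = 1000.0 / 25  # 40 ms per frame at 25 fps


def summarize(times_ms, budget_ms=FRAME_BUDGET_MS):
    """Mean, maximum and number of frames exceeding the frame budget."""
    late = [t for t in times_ms if t > budget_ms]
    return {
        "mean_ms": mean(times_ms),
        "max_ms": max(times_ms),
        "late_frames": len(late),
        "late_ratio": len(late) / len(times_ms),
    }


# Placeholder per-frame VPU decode times in ms (NOT measured values).
vpu_alone = [22.0, 25.0, 24.0, 23.0]
vpu_with_ipu = [41.0, 44.0, 39.0, 46.0]

print("VPU alone:   ", summarize(vpu_alone))
print("VPU with IPU:", summarize(vpu_with_ipu))
```

Counting late frames rather than only looking at the mean matters here, because a single frame over budget is already visible as a skip.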

smallint commented 10 years ago

I checked with LK 3.14 and the overall situation has not improved. The figures are almost the same ... the deinterlacing plot seems smoother now and the mean is also a bit lower. So deinterlacing improved with that kernel.

smallint commented 10 years ago

Regarding LK 3.14: some CEC patches are probably missing. I cannot use my remote anymore after restarting XBMC. Only reboot deactivates it. This is never an issue with 3.10.30.

cmichal2 commented 10 years ago

@smallint, do you see any difference if you set bpp=16 vs bpp=32 in uEnv.txt ?

smallint commented 10 years ago

I haven't checked, and my current setting is bpp=16. Do you think it would run faster? bpp=32 should need more memory bandwidth, not less. But anyhow, it is worth a try.
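As a rough back-of-the-envelope check of the bandwidth argument, the scan-out cost of the framebuffer alone can be estimated as follows. This is a sketch under assumptions: a 1920x1080 framebuffer refreshed at 60 Hz is assumed (neither value is stated in the thread), and GPU composition traffic, which also scales with bpp, is not counted.

```python
def scanout_bytes_per_second(width, height, bpp, refresh_hz):
    """Bytes read per second just to scan out one framebuffer."""
    bytes_per_frame = width * height * bpp // 8
    return bytes_per_frame * refresh_hz


# Assumed display mode: 1920x1080 at 60 Hz (not stated in the thread).
for bpp in (16, 32):
    rate = scanout_bytes_per_second(1920, 1080, bpp, 60)
    print(f"bpp={bpp}: {rate / 1e6:.1f} MB/s")
```

Under these assumptions bpp=32 simply doubles the scan-out traffic compared to bpp=16. Whether that is enough to push decoding over the frame budget would need real measurements.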

cmichal2 commented 10 years ago

It looks to me like bpp=32 is worse, but I'm learning not to trust my eyes. It would be interesting to have numbers.

cmichal2 commented 10 years ago

Baby step: in 3.0.52, setting CONFIG_MX6_VPU_352M does a number of things: sets a couple of voltages to 1.25V rather than 1.175 (pu_voltage, soc_voltage), also disables bus frequency scaling, changes the cpu frequency and the vpu frequency. It looks like all of these settings are now exported in /sys

For example, /sys/bus/platform/drivers/imx6_busfreq/busfreq.13/enable lets you turn bus frequency scaling on/off.

It looks like the voltages can be changed on the fly in /sys/bus/platform/devices/soc.1/2000000.aips-bus/20c8000.anatop/regulator-vddpu.9/regulator/regulator.5/microvolts (and regulator-vddsoc.10/regulator/regulator.6/microvolts).

The clock frequencies appear to be in /sys/kernel/debug/clk/osc/pll2_bus but I haven't figured out which is which (help!)

Searching the 3.0.52 source for VPU_352M shows the things it touches - not all of which are totally transparent to me. But presumably with the knowledge of what that config option does we could just set the settings in /sys to reproduce the effect of the old CONFIG_MX6_VPU_352M configure option.

There's a little information here: https://community.freescale.com/thread/309304
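To compare two kernels it may help to dump the sysfs entries mentioned above before and after a test run. The following is a minimal read-only sketch. The paths are copied from this comment, and the numeric suffixes (busfreq.13, regulator.5 and so on) are likely to differ between kernels and boards, so treat them as assumptions.

```python
from pathlib import Path

# Paths as quoted above; instance numbers may differ on other setups.
SYSFS_ENTRIES = {
    "busfreq_enable":
        "/sys/bus/platform/drivers/imx6_busfreq/busfreq.13/enable",
    "vddpu_microvolts":
        "/sys/bus/platform/devices/soc.1/2000000.aips-bus/20c8000.anatop/"
        "regulator-vddpu.9/regulator/regulator.5/microvolts",
}


def read_sysfs(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def snapshot(entries=SYSFS_ENTRIES):
    """Read every entry; missing or unreadable files map to None."""
    return {name: read_sysfs(path) for name, path in entries.items()}


if __name__ == "__main__":
    for name, value in snapshot().items():
        print(f"{name}: {value}")
```

On a machine without these files every value is simply None, so the script is safe to run anywhere.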

smallint commented 10 years ago

@cmichal2 I was thinking of something like that after checking the current sources. Since the option has no real effect, the VPU frequency must be changeable either on the fly or via a kernel parameter (unless it is disabled altogether). Wolfgar already told me how to increase the memory QoS parameter for the VPU, and that, along with other code optimizations, already helped to increase the deinterlacing speed. But unfortunately we are still not able to decode, deinterlace and render a 1080i stream at 25 fps. I was able to show that the VPU and IPU can work together within 20 ms, but as soon as the GPU comes into play the whole thing breaks down and we are slightly above 40 ms. For sure the old v4l way was much faster, but it comes with other drawbacks in combination with XBMC (3d, subtext, ...). This stuff is harder to solve than I thought in the beginning. ;) And we are still far away from double rate rendering for HD ...

P.S. Again, LK 3.14 did not help at all to improve the speed. I spent two days testing all kinds of combinations without a breakthrough. Btw, has anyone got GStreamer running under Arch to test playback? I was not able to install the required plugins ... probably haven't tried hard enough.