plan about my dev v2 - GitHub issues

yuq commented 6 years ago

As the previous plan is done, I'm starting a new one.

I've set up a mali450 board for mali450 development and found that the kernel driver's HW ops are not stable, e.g. the L2 cache and MMU reset commands time out. So I want to refine and fix the kernel driver, which may also help with some problems found during mali400 development. After this, I can send an RFC to the kernel DRM driver mailing list for feedback.

anarsoul commented 6 years ago

Looking forward to seeing the lima driver mainlined :)

yuq commented 6 years ago

Progress update

fixed the PP error irq and MMU fault caused by an insufficient PLB number
full mali450 support, with DLBU and BCAST used for PP jobs

still some bugs that need to be fixed in the kernel
try to use TTM as our MM if possible

anarsoul commented 6 years ago

@yuq PP error irq isn't fixed for mali400. I still get this on some runs of glmark:

[  768.373615] lima 1c40000.gpu: pp error irq state=201 status=40
[  768.390193] lima 1c40000.gpu: pp error irq state=201 status=40

anarsoul commented 6 years ago

Also I get MMU faults in kmscube -M rgba:

[ 2045.982632] lima 1c40000.gpu: mmu page fault at 0xe9bf80 from bus id 0 of type read on ppmmu1
[ 2046.001755] lima 1c40000.gpu: mmu page fault at 0xe997c0 from bus id 0 of type read on ppmmu0
[ 2046.021156] lima 1c40000.gpu: mmu resume
[ 2046.035419] lima 1c40000.gpu: mmu resume

yuq commented 6 years ago

What's your screen resolution when this kind of error happens? I remember you said you have a 2536x1440 monitor?

I fixed this error at 1920x1080 when the PLB number is not set to max. But there's another dimension I haven't tried: the PLB size. The PLB size can be 128, 256, 512 or 1024. Dumping the mali driver, I always see it set to 512, and lima-ng does the same. But maybe at higher resolutions it should be increased to 1024.
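
To put the two resolutions in perspective, here is a minimal sketch that counts the tiles per frame. It assumes the PP renders in 16x16-pixel tiles; `tile_count` is a hypothetical helper, not a lima function, and how tiles map onto PLB blocks is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the PP renders in 16x16-pixel tiles. */
#define TILE_DIM 16u

/* Hypothetical helper (not part of lima): number of tiles needed to cover
 * a width x height framebuffer, rounding partial tiles up. */
static uint32_t tile_count(uint32_t width, uint32_t height)
{
    /* Guard the round-up additions against 32-bit overflow. */
    assert(width <= UINT32_MAX - (TILE_DIM - 1));
    assert(height <= UINT32_MAX - (TILE_DIM - 1));

    uint32_t tiles_x = (width + TILE_DIM - 1) / TILE_DIM;
    uint32_t tiles_y = (height + TILE_DIM - 1) / TILE_DIM;
    return tiles_x * tiles_y;
}
```

Under that assumption, 1920x1080 needs 120 x 68 = 8160 tiles and 2560x1440 needs 160 x 90 = 14400 tiles, so the larger mode covers roughly 1.76 times as many tiles.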

anarsoul commented 6 years ago

My monitor resolution is 2560x1440.

yuq commented 6 years ago

So does increasing LIMA_CTX_PLB_BLK_SIZE to 1024 solve the error on your side?

anarsoul commented 6 years ago

No, with LIMA_CTX_PLB_BLK_SIZE = 1024 kmscube doesn't work at all - and I get this in dmesg:

[  288.667891] lima 1c40000.gpu: pp error irq state=200 status=41
[  288.683349] lima 1c40000.gpu: pp error irq state=200 status=41

yuq commented 6 years ago

OK, maybe there are other places that need to be configured for 1024 PLB, like the DLBU reg: https://github.com/yuq/mesa-lima/commit/1c7700fb32a5974867b10da2088da2d3790699b6#diff-15af9d78941ee5e81caea488e2910f77R1092

I just hard-coded 0x20000000 for 512 PLB; 1024 PLB should be 0x30000000. So there may be a similar field for mali400 that we haven't discovered. We can first dump mali at 2560x1440 and see if it uses the 1024 PLB size, and then where this field is.
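
The mapping described above could be sketched as a small lookup rather than a hard-coded constant. This is only an illustration: `dlbu_plb_size_bits` is a hypothetical name, and only the two values mentioned here (512 and 1024) are filled in; the values for 128 and 256 are not known from this thread and are deliberately not guessed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of lima): look up the DLBU register value
 * for a given PLB block size. Returns 0 and stores the value in *out on
 * success, or -1 if the size is not one of the two known cases. */
static int dlbu_plb_size_bits(uint32_t plb_blk_size, uint32_t *out)
{
    assert(out != NULL);

    switch (plb_blk_size) {
    case 512:
        *out = 0x20000000u;
        return 0;
    case 1024:
        *out = 0x30000000u;
        return 0;
    default:
        /* 128 and 256 are valid PLB sizes, but their register values
         * are unknown here, so report failure instead of guessing. */
        return -1;
    }
}
```

Returning an error for the unknown sizes keeps a wrong register value from being programmed silently if LIMA_CTX_PLB_BLK_SIZE is changed again.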

anarsoul commented 6 years ago

Here's the dump: https://drive.google.com/file/d/16WDMIvAeE6-wK4NYvepF8R0YEfJXEUHD/view?usp=sharing - I'm not really sure what to look for.

yuq commented 6 years ago

From your dump, although the gp stream mem is missing, I can see in the pp stream mem that it's still 512 PLB. But I also found in the code that LIMA_CTX_PLB_BLK_SIZE is not used everywhere it should be, so I fixed that with: https://github.com/yuq/mesa-lima/commit/376b3c82dd684299d2b3baeb70c56c4bed7dcfaa

With this fix, 1024 PLB works, could you try it again?

anarsoul commented 6 years ago

1024 PLB works now, but it's the same as 512 - I'm still getting mmu fault in 'kmscube -M rgba':

[  138.162217] lima 1c40000.gpu: mmu page fault at 0x1bd400 from bus id 0 of type read on ppmmu0
[  138.181672] lima 1c40000.gpu: mmu page fault at 0x1bd400 from bus id 0 of type read on ppmmu1
[  138.200989] lima 1c40000.gpu: mmu resume
[  138.215337] lima 1c40000.gpu: mmu resume

and pp error in glmark2-es2-drm -b build:

[  300.957500] lima 1c40000.gpu: pp error irq state=200 status=41
[  300.973596] lima 1c40000.gpu: pp error irq state=201 status=40

Btw, everything that uses textures stutters for me, i.e. textured cube or 'glmark2-es2-drm -b pulsar'

yuq commented 6 years ago

OK, then it seems it's not the PLB size problem. As for the texture stuttering, is it caused by the compiler: https://www.mail-archive.com/mesa-dev@lists.freedesktop.org/msg189216.html

anarsoul commented 6 years ago

Oh, I wasn't aware of this change in mesa-18.0. That explains stuttering.

As for the issue - I suspect it's something related to cache - since it works 4 out of 5 times fine, and fails on 5th time (that's approximately)

yuq commented 6 years ago

The compiler scalar-back-to-vec problem will get worse with 18.1. But I want to focus on the kernel for now, so I've left it with an incomplete workaround.

The issue may be a cache problem. Another possibility is the switch_delay: I found that on an Amlogic chip at high frequency (>500MHz), it has to be bigger than 0xff, otherwise the chip works in an unstable state. Not sure if this affects your chip.

anarsoul commented 6 years ago

Setting switch-delay to 0xffff doesn't help for ppmmu error, but "pp error irq state=200" goes away. Looks like Mali400 in Allwinner A64 needs switch-delay 0xffff to work properly. Does it make sense to make switch-delay = 0xffff default value?

anarsoul commented 6 years ago

And I think I understand when "ppmmu error" happens - it always happens if I run some app that uses textures and when I press ctrl+c to interrupt it. I believe driver tears down MMU mapping while PP is still running.

yuq commented 6 years ago

I don't know if it's proper to always set the switch delay to 0xffff, as some platforms set this value to 0xff and some set it to 0xffff in the mali driver; this value also depends on the clk freq. Does the proprietary A64 mali kernel driver set it to 0xffff or 0xff?

As for the ppmmu error, whether or not your guess is right, the kernel driver indeed has no mechanism to prevent this situation from happening. If the user just calls vm_unmap before the PP task is done, this result is expected. If the user is interrupted and resources are freed because the dev file descriptor is closed, we may add some code to wait for the task to finish.
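
One way to picture the missing mechanism is a single-threaded userspace model in which a mapping tracks its in-flight tasks and the teardown is deferred until the last one finishes. All names here (`vm_mapping`, `task_begin`, `task_end`, `mapping_unmap`) are hypothetical and not taken from the lima kernel driver; a real driver would need locking and would more likely block on a fence or wait queue than defer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one GPU VM mapping (not a lima structure). */
struct vm_mapping {
    unsigned int in_flight; /* tasks currently using the mapping */
    bool mapped;            /* still present in the MMU page table */
    bool unmap_pending;     /* unmap requested while tasks were running */
};

static void mapping_init(struct vm_mapping *m)
{
    assert(m != NULL);
    m->in_flight = 0;
    m->mapped = true;
    m->unmap_pending = false;
}

/* A task may only start on a live mapping with no teardown requested. */
static int task_begin(struct vm_mapping *m)
{
    if (!m->mapped || m->unmap_pending)
        return -EINVAL;
    m->in_flight++;
    return 0;
}

/* The last task to finish completes a teardown that was deferred. */
static void task_end(struct vm_mapping *m)
{
    assert(m->in_flight > 0);
    if (--m->in_flight == 0 && m->unmap_pending) {
        m->mapped = false;
        m->unmap_pending = false;
    }
}

/* Unmap immediately when idle; otherwise defer and report -EBUSY. */
static int mapping_unmap(struct vm_mapping *m)
{
    if (m->in_flight > 0) {
        m->unmap_pending = true;
        return -EBUSY;
    }
    m->mapped = false;
    return 0;
}
```

In this model an unmap issued while a task is running leaves the mapping in place, so the task never reads through a torn-down page table; the mapping disappears only when `task_end` drops the count to zero.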

anarsoul commented 6 years ago

If I read this code correctly, it uses 0x0 as delay since there's no pmu_switch_delay in device tree: https://github.com/mripard/sunxi-mali/blob/master/r6p2/src/devicedrv/mali/linux/mali_osk_mali.c#L244

What does 0x0 mean in this case? Highest possible delay?

yuq commented 6 years ago

Are you sure the switch delay reg is set to 0? According to the comment, this is the min delay or no delay.

anarsoul commented 6 years ago

I verified it, and it's setting it to 0.

yuq commented 6 years ago

Then if it's set to 0 in the lima kernel driver, does it fix your pp error too?

yuq commented 6 years ago

Progress update:

switching to TTM as MM is done, but I left buffer eviction and swap unimplemented because I don't know if GP/PP support MMU fault recovery (the mali kernel driver doesn't implement it either); this needs reverse engineering. Otherwise we may implement it by pinning/unpinning buffers at task creation/deletion.
implemented EGL_ANDROID_native_fence_sync for atomic modesetting; "kmscube -A" is supported

I'll prepare an RFC for the kernel driver soon.

anarsoul commented 6 years ago

@yuq please CC me on your RFC patches

yuq commented 6 years ago

@anarsoul no problem.

yuq commented 6 years ago

The RFC has been sent: https://lists.freedesktop.org/archives/dri-devel/2018-May/177314.html

mirh commented 6 years ago

Soo.. I noticed a guy noticed you are missing some Mali architectures there. You can even add ARCH_U8500, ARCH_HISI, ARCH_MEDIATEK, ARCH_SPRD, ARCH_ZX and ARCH_TANGO

yuq commented 6 years ago

Oh, I didn't know there are so many ARCHs. Now I've decided to just write it like this: ARM || ARM64 || COMPILE_TEST

Thanks for pointing it out.

yuq / mesa-lima

plan about my dev v2 #37