Implement optimized DrawTarget::clear

yuri91 / ili9341-rs

A WIP, no_std, generic driver for the ILI9341 (and ILI9340C) TFT LCD display

Apache License 2.0

57 stars 50 forks source link

Implement optimized DrawTarget::clear #17

Closed KarlK90 closed 4 years ago

KarlK90 commented 4 years ago

The optimized clear method draws directly to the display without bounds checking.

Thank you for the driver :slightly_smiling_face:

jamwaffles commented 4 years ago

Is it possible to benchmark the performance of this change? The default size() impl is very close to the code in this PR, but leverages more features of embedded-graphics. I'd be interested to see if this PR speeds up the clear method at all.

KarlK90 commented 4 years ago

Sure I can try to benchmark it, but it already felt faster. My reasoning was that the draw_rectangle method of this crate does check if the pixel lay inside the display bounds, if these checks are not elided in release builds there should be a noticeable speedup.

jamwaffles commented 4 years ago

Ah, I see. I was looking at the embedded_graphics code but this optimises away some driver code instead :)

KarlK90 commented 4 years ago

That was certainly interesting! System is a Raspberry Pi 3B with a 3.2" ili9341 Display, spi speed 62.5 MHz using linux-embedded-hal spidev abstraction.

"Unoptimized" version:
Optlevel 3 + LTO + codegen-units = 1: 678ms
Optlevel 3: 693ms

"Optimized" version:
Optlevel 3 + LTO + codegen-units = 1: 675ms
Optlevel 3: 693ms

So basically marginal difference, which is really interesting and shows how subjective things can be! But I saw huge (12x!) differences from using the display-interface enabled version by @therealprof implemented in https://github.com/yuri91/ili9341-rs/pull/16

"Unoptimized" version display-interface:
Optlevel 3 + LTO + codegen-units = 1: 56ms

"Optimized" version display-interface:
Optlevel 3 + LTO + codegen-units = 1: 52ms

Maybe the 4ms improvement is still relevant enough to justify the merge?

jamwaffles commented 4 years ago

But I saw huge (12x!) differences from using the display-interface enabled version

Wow! That's a huge improvement. Well worth the benchmark.

Maybe the 4ms improvement is still relevant enough to justify the merge?

Certainly can't hurt :) but I'll leave that decision up to the maintainers ;)

therealprof commented 4 years ago

Well, the trivial implementation takes each value from the iterator, turns it into an array of 2 u8s and sends it to the SPI peripheral via embedded-hal adapter using individual sends. The display-interface SPI driver implements a (currently fixed sized) buffer. Depending on how efficient the SPI embeded-hal adapter works there might be a lot of overhead per send.

It should be possible to write a Linux spidev specific display-interface driver which could go even faster. The embedded-hal traits are universal but the slice based approach is not ideal for this kind of application. The nice thing about the display-interface is that one can easily chose (or write) a platform specific driver to get the most bang for the buck and simply swap it via initialisation in the application.

yuri91 commented 4 years ago

Thanks for the PR, seems good to me and I am going to merge it!