zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.

UDP TX performance improvement #75610

Open ssharks opened 2 weeks ago

ssharks commented 2 weeks ago

Is your enhancement proposal related to a problem? On two independent boards the UDP TX performance is lower than the UDP RX performance.

On an ST-based system (nucleo_h723zg), as mentioned in https://github.com/zephyrproject-rtos/zephyr/pull/75281:

|     | TX         | RX         |
|-----|------------|------------|
| UDP | 76.08 Mbps | 93.62 Mbps |
| TCP | 74.19 Mbps | 85.51 Mbps |

And from a thesis report on an NXP board: https://is.muni.cz/th/p6jl9/

[Image: the relevant throughput table from the thesis report]

Describe the solution you'd like: Look into the reason why the UDP TX throughput is lower than the UDP RX throughput.
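
For context, the TX direction being measured is the one exercised by a plain socket send loop. Below is a minimal sketch of such a loop on Zephyr's socket API; it is not the benchmark code behind the numbers above, and the peer address, port, and payload size are arbitrary placeholders.

```c
/*
 * Minimal sketch of a UDP TX saturation loop on Zephyr's socket API.
 * Not the benchmark behind the numbers above; the peer 192.0.2.1:5001
 * and the 1 KiB payload are arbitrary placeholders.
 */
#include <zephyr/net/socket.h>

void udp_tx_flood(void)
{
	static char payload[1024];
	struct sockaddr_in peer = {
		.sin_family = AF_INET,
		.sin_port = htons(5001),
	};
	int sock;

	zsock_inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);

	sock = zsock_socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
	if (sock < 0) {
		return;
	}

	/* Each zsock_sendto() pushes one datagram down the TX path
	 * (socket -> net stack -> TX queue -> driver) that this issue
	 * is about.
	 */
	while (1) {
		(void)zsock_sendto(sock, payload, sizeof(payload), 0,
				   (struct sockaddr *)&peer, sizeof(peer));
	}
}
```

With the default configuration each of those sends ends up being handed over to the TX thread, which is where the context-switch discussion further down comes in.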

ssharks commented 2 weeks ago

@dleach02 @jukkar @rlubos I created this issue to keep track of any investigation on why the UDP TX might be slower than the UDP RX.

dleach02 commented 2 weeks ago

The other part of this is understanding why the RX side is consistently slower than lwIP.

ssharks commented 2 weeks ago

> The other part of this is understanding why the RX side is consistently slower than lwIP.

Could you detail what you mean? The table in the description shows comparable results for lwIP and Zephyr on UDP RX, and Zephyr is even faster on TCP RX.

dleach02 commented 2 weeks ago

Maybe I'm mixing up the directions. The "upload" tests are consistently slower than lwIP.

ssharks commented 1 week ago

@dleach02 mentioned earlier that one of his suspects for this issue is the number of context switches when sending packets. @rlubos / @dleach02 Do you think it makes sense to make a test version that hands over not every packet directly to the TX thread, but only every 4 packets? And do you have the possibility to test the impact? If this gives a significant difference, it would at least give some direction for solving this.
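
To make that concrete, here is a rough sketch of the batching idea. It is purely illustrative and not the actual Zephyr TX path: the FIFO, semaphore, `struct tx_item` and the batch size of 4 are hypothetical, and it assumes a single sender context.

```c
#include <zephyr/kernel.h>

#define BATCH_SIZE 4  /* hand over to the TX thread once per 4 packets */

struct tx_item {
	void *fifo_reserved;  /* first word reserved for use by k_fifo */
	/* in the real stack this would carry the packet (e.g. a net_pkt) */
};

K_FIFO_DEFINE(tx_batch_fifo);
K_SEM_DEFINE(tx_batch_sem, 0, 1);

/* Sending side: queue the packet, but only wake the TX thread once per
 * batch instead of once per packet (assumes a single sender context).
 */
void tx_batch_enqueue(struct tx_item *item)
{
	static unsigned int pending;

	k_fifo_put(&tx_batch_fifo, item);

	if (++pending >= BATCH_SIZE) {
		pending = 0;
		k_sem_give(&tx_batch_sem);  /* one wake-up per BATCH_SIZE packets */
	}
}

/* TX thread: drain everything that accumulated since the last wake-up. */
void tx_batch_thread(void *p1, void *p2, void *p3)
{
	ARG_UNUSED(p1);
	ARG_UNUSED(p2);
	ARG_UNUSED(p3);

	while (1) {
		struct tx_item *item;

		k_sem_take(&tx_batch_sem, K_FOREVER);

		while ((item = k_fifo_get(&tx_batch_fifo, K_NO_WAIT)) != NULL) {
			/* hand the packet to the driver here */
		}
	}
}
```

A real version would also need a flush mechanism (e.g. a short timeout) so that the tail of a burst smaller than the batch size does not sit in the queue indefinitely.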

rlubos commented 1 week ago

> @dleach02 mentioned earlier that one of his suspects for this issue is the number of context switches when sending packets.

That could be easily verified if you set CONFIG_NET_TC_TX_COUNT=0; that way all network stack processing (from the socket all the way down to the driver) should be done from the application thread. However, as I mentioned already in other places, at least on STM32 it gave worse results due to the driver's blocking nature.
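
For reference, a minimal configuration fragment for that experiment (just a sketch; combine it with whatever networking options the benchmark already uses):

```
# prj.conf fragment: no dedicated TX traffic-class thread, so TX processing
# runs in the context of the caller, i.e. the application thread.
CONFIG_NET_TC_TX_COUNT=0
```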

> @rlubos / @dleach02 Do you think it makes sense to make a test version that hands over not every packet directly to the TX thread, but only every 4 packets? And do you have the possibility to test the impact?

I guess that could be doable with some hacking; I may try this out when I have some time.