zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.71k stars 6.54k forks source link

Call settings_save_one in the system work queue, which will cause real-time performance degradation. #34239

Closed xiaoliang314 closed 3 years ago

xiaoliang314 commented 3 years ago

settings_save_one is a very time-consuming function, a call is usually a few hundred milliseconds to a few seconds, I think it should be preemptible. However, the function is usually called in the system work queue. Because the system work queue is not preemptive, this will cause the entire system to stagnate for a long time.

Laczen commented 3 years ago

@xiaoliang314, settings_save_one can indeed take a long time to complete, could you add some more information about the use case (e.g. the used backend, the backend storage size, ...).

xiaoliang314 commented 3 years ago

@Laczen Thank you for your reply!

Configuration

  1. backend: FCB
  2. storage size: 8KB
  3. sector size: 4KB
Laczen commented 3 years ago

@xiaoliang314, could it be that your storage needs are approaching 4kB?

xiaoliang314 commented 3 years ago

@Laczen My storage needs are less than 1KB. But the storage frequency is faster, about once every 4 seconds.

Laczen commented 3 years ago

@xiaoliang314, with this configuration and requirements you should not be getting a few seconds to do a settings_save_one. The worst case should be when a sector erase is needed. Your storage/settings area is also not polluted with settings that are no longer needed? As a test could you erase the storage area and perform the test again?

xiaoliang314 commented 3 years ago

@Laczen Yes, it is usually not so fast in normal use. This is just one of my test cases. I have an application. I received data from the hardware in the interrupt handler and placed it in a high-priority thread for processing. Because the work queue was doing settings_save_one, my thread did not process the data and release the buffer in time, resulting in data loss.

xiaoliang314 commented 3 years ago

@xiaoliang314, with this configuration and requirepents you should not be getting a few seconds to do a settings_save_one. The worst case should be when a sector erase is needed. Your storage/settings area is also not polluted with settings that are no longer needed? As a test could you erase the storage area and perform the test again?

My storage area is used after erase.

Laczen commented 3 years ago

@xiaoliang314, on some devices the flash erase blocks all execution of code. It is in general not possible to make the flash erase preemtible. For code to be executable during the erase all code should be executed from ram. If you really need the high priority task to be always done you will not be able to use any flash storage that is using internal flash.

xiaoliang314 commented 3 years ago

@Laczen Usually, the time for a flash erase or write is not very long, about 2-8ms. The time consumed by settings_save_one is generally a few hundred milliseconds or a few seconds. I have browsed the code of the settings subsystem, settings_save_one spends most of the time on the lookup flash.

I think calling "settings_save_one" in the system work queue caused this issue because it cannot be preempted.

@nvlsianpu Sorry to disturb you, maybe you understand this issue better.

Laczen commented 3 years ago

@xiaoliang314, yes during settings_save_one all your data is read to avoid rewriting the same data twice. However your storage size is so small (and reads are so fast) that I concluded the reads could not be the problem.

xiaoliang314 commented 3 years ago

@Laczen I halt the CPU, I see it is doing settings_fcb_compress most of the time.

Laczen commented 3 years ago

@Laczen I halt the CPU, I see it is doing settings_fcb_compress most of the time.

Ok, this means it is erasing sectors. You are going through the flash life fast, be carefull. Do you have the possibility to increase the number of sectors?

Laczen commented 3 years ago

@xiaoliang314, could you share exactly what you are saving (size and name), how frequent, from what you reported you are doing fcb_compress each time you are doing a save. This is not normal.

xiaoliang314 commented 3 years ago

@xiaoliang314, could you share exactly what you are saving (size and name), how frequent, from what you reported you are doing fcb_compress each time you are doing a save. This is not normal.

@Laczen name: "bt/mesh/RPL/%x", (%x is 0x0001 ~ 0x7FFF) size: 4 frequent: 4 ~ 8 second

Laczen commented 3 years ago

@xiaoliang314, this is writing data with a new name every time, if this data is never deleted, storage is full and each time a call to settings_save is done a trial is made (unsuccesfully) to create space for the new data.

xiaoliang314 commented 3 years ago

@Laczen This is part of bluetooth mesh. https://github.com/zephyrproject-rtos/zephyr/blob/b6217e1bec999c8133a21c34701d4bf0d3536099/subsys/bluetooth/mesh/rpl.c#L251-L271

Laczen commented 3 years ago

Yes, but the entry->src is not expected to have so many values.

Laczen commented 3 years ago

@xiaoliang314, can this issue be closed ?

xiaoliang314 commented 3 years ago

@Laczen I think that time-consuming functions should not be called in the system queue, which will reduce the real-time performance of system, what do you think?

xiaoliang314 commented 3 years ago

@Laczen Another solution is to add a set of APIs that modify thread preemptibility, set it to preemptible when calling settings_save, and restore preemptibility after the end.

xiaoliang314 commented 3 years ago

settings_save is preemptible because it can be called by general threads, which are preemptible threads.

Laczen commented 3 years ago

@xiaoliang314, this IMO not so important. The real problem is: a flash erase operation can take a long time during which the cpu can be halted. If this is the case it is impossible to guarantee real time operation. This can be avoided by not using internal flash storage.

The problem you had was that this happened on every write because the flash space was exhausted.

xiaoliang314 commented 3 years ago

@Laczen The erasing and writing time of my flash is not very long, about 2-8ms. If the thread can preempt, theoretically the CPU can process a higher priority task every 8 milliseconds.

Laczen commented 3 years ago

@Laczen The erasing and writing time of my flash is not very long, about 2-8ms. If the thread can preempt, theoretically the CPU can process a higher priority task every 8 milliseconds.

When and how settings_save_one is called is completely up to the user. Please time the routine in a normal use case, it should be ms.

xiaoliang314 commented 3 years ago

@Laczen The Bluetooth subsystem also uses settings_save_one in the system work queue, which is the cause of this problem. However, there seems to be no better solution. Can I keep it open until the problem is solved?

Laczen commented 3 years ago

@xiaoliang314, no problem with keeping it open. I will remove my assignment, for me this is closed.

galak commented 3 years ago

@Laczen The Bluetooth subsystem also uses settings_save_one in the system work queue, which is the cause of this problem. However, there seems to be no better solution. Can I keep it open until the problem is solved?

Please open a new bug for this with the template filled out.

github-actions[bot] commented 3 years ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.