zephyrproject-rtos / west

West, Zephyr's meta-tool
https://docs.zephyrproject.org/latest/guides/west/index.html
Apache License 2.0
218 stars 118 forks source link

Setting `zephyr.base` variable breaks parallel builds #408

Open eugmes opened 4 years ago

eugmes commented 4 years ago

I'm attempting to run multiple west build commands in parallel (to save some time used by cmake). This breaks randomly with corrupted .west/config file. I tracked this down to this line:

https://github.com/zephyrproject-rtos/west/blob/618020d1979d07003d6df40ce59b413785ce3015/src/west/app/main.py#L692

Here multiple processes may attempt to write to the same file at the same time in my scenario. It would be good to get rid of this. Right now one have to run a zephyr-provided command with valid arguments before doing the build. west boards seems to work.

Note that I have ZEPHYR_BASE environment variable set during the build, but that does not prevent west from mangling the config file.

mbolivar-nordic commented 4 years ago

Thanks for your report.

We'd like to solve these types of problems in general by divorcing zephyr base setting from west itself, now that @tejlmand implemented a Zephyr CMake package which makes the environment variable obsolete. But the transition is tricky and we haven't done it yet, and I'm sorry this is causing you trouble.

I'm a bit surprised that this is a continual problem. The code should set zephyr.base once in the config file and then not have to set it again, so as long as your .west/config has a [zephyr] base = ... everything should work in subsequent runs. Is that not what you're seeing?

Is this in a CI environment where the workspace is being set up fresh each time? If so, perhaps you can work around this issue by having the CI setup scripts run west config zephyr.base "$ZEPHYR_BASE" or so, since you say the environment variable is set?

tejlmand commented 4 years ago

@mbolivar I believe @eugmes problem could be that two west processes writes the file simultaneously, and thus the file get corrupted. Which also explains why running west boards first, before running n-west build commands in parallel work, cause then the config is set, and will not be updated by the parallel runs afterwards.

(to save some time used by cmake).

What time you try to save ? west build invokes CMake, just as you would do by hand, and then ninja afterwards. It is simply a convenience instead of doing:

cmake -DBOARD=<board_name> -Bbuild .; ninja -Cbuild
eugmes commented 4 years ago

@tejlmand Yes, the problem only occurs when one runs multiple west builds in parallel, when they started almost at the same time. There is a chance of config file corruption then, but only when no other extension command was run before.

I have a test repo that builds multiple simple applications for multiple configurations. All together there are more than 100 binaries to build, and the number is going to increase. For those applications the configuration step using cmake takes almost the same time as building the applications. Almost nothing can be done to speed up building, it is already done in parallel. But the time used to run cmake can be reduced by running west processes in parallel. That's why I'm trying to use ninja for that. That not optimal, but it still results in significant speedup.

tejlmand commented 4 years ago

But the time used to run cmake can be reduced by running west processes in parallel.

I'm just being curios, why not simply run n-CMake instances in parallel ? After all, west build invokes CMake, so you would save the overhead of west.

eugmes commented 4 years ago

@tejlmand mainly because west seems to be an official way to build zephyr apps.

mbolivar-nordic commented 4 years ago

@mbolivar I believe @eugmes problem could be that two west processes writes the file simultaneously, and thus the file get corrupted.

I understand that, but west should only write the file if there's not a value already there, hence my follow up questions. If you could provide some answers for those, @eugmes, that would be helpful. Thanks.

eugmes commented 4 years ago

@mbolivar-nordic The problem is that the process of checking and setting the config variable is not atomic. So if multiple west build processes are started in quick succession then several of them could see that the value is not set. Additionally writing to the config file is also not atomic, so when multiple process try to write to this file, it may end up being corrupted.

Consider this sequence for example:

process 1: is the value set? - no
process 1: compute the default value
process 2: is the value set? - no
process 2: compute the default value
process 1: start writing the default value
process 2: start writing the default value
process 1: write complete
process 2: write complete

In this case the config file will contain both data written by process 1 and 2, in arbitrary order, depending on exact timing of events.

mbolivar-nordic commented 4 years ago

I understand the race. Are you able to apply any of the suggested workarounds?

eugmes commented 4 years ago

@mbolivar-nordic yes, running west boards before doing anything else works fine.

tejlmand commented 4 years ago

@tejlmand mainly because west seems to be an official way to build zephyr apps.

Thanks for the reason. It is correct that Zephyr docs are describing the build process using west and it is an easy way to get started with Zephyr.

But actually west is an optional (but very convenient tool), so there is no problem in using cmake directly, especially if it solves your problems.

That being said, west should still be fixed so that this issue can be fixed. (we would like to remove zephyr.base from Cmake, but currently we need west to be backward compatible.)

mbolivar-nordic commented 4 years ago

That being said, west should still be fixed so that this issue can be fixed. (we would like to remove zephyr.base from Cmake, but currently we need west to be backward compatible.)

I agree with this.

@mbolivar-nordic yes, running west boards before doing anything else works fine.

Ok, glad to know you have a workaround, since the fix (getting rid of zephyr.base entirely) is something we don't have scheduled.

tejlmand commented 2 years ago

@mbolivar-nordic with Zephyr 2.7.0 LTS out, it now the time to remove this in west ? Users on older Zephyr (for example 1.14 LTS ) can manually set zephyr.base if they need it.

I believe we don't need to set this value automatically anymore.