zephyrproject-rtos / west

West, Zephyr's meta-tool
https://docs.zephyrproject.org/latest/guides/west/index.html
Apache License 2.0

project.py: Add new 'joblib' dependency to parallelize update #713

Open RobertGalatNordic opened 4 weeks ago

RobertGalatNordic commented 4 weeks ago

Projects from the manifest were updated one by one. A parallel approach has the potential to fully utilize the CPU and network connection and speed up the update.

Needs more testing; I checked only a few common cases and it seems fine.
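The idea can be sketched roughly like this (the helper names are hypothetical, not the PR's actual code): use joblib's `Parallel`/`delayed` when the library is available, and fall back to the old for loop otherwise:

```python
# Rough sketch; update_project / update_all are hypothetical names.
try:
    from joblib import Parallel, delayed
    HAVE_JOBLIB = True
except ImportError:
    HAVE_JOBLIB = False

def update_project(name):
    # Stand-in for the real per-project git fetch/checkout work.
    return f"updated {name}"

def update_all(projects, jobs=1):
    if HAVE_JOBLIB and jobs != 1:
        # n_jobs=-1 means "use all cores"; threads suit I/O-bound fetches.
        return Parallel(n_jobs=jobs, prefer="threads")(
            delayed(update_project)(p) for p in projects)
    # Old behavior: a plain for loop, one project at a time.
    return [update_project(p) for p in projects]
```

`Parallel` returns results in input order, so the caller sees the same list either way.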

Performance tests:

```
west init -m https://github.com/nrfconnect/sdk-nrf --mr main
time west update
```

```
real 14m35.399s
user  6m2.368s
sys   1m34.562s
```

System and network load during update: [image]

Clean up and replace west with the version from this PR:

```
rm -rf <.west and all downloaded repositories>
pip3 uninstall west
cd <west_source_dir>
pip3 install -e .
cd ..
```

Repeat the test:

```
west init -m https://github.com/nrfconnect/sdk-nrf --mr main
time west update
```

```
real 6m34.408s
user 9m25.977s
sys  2m7.339s
```

System and network load during parallel update: [image]

As we can see, the result is substantially faster. The same operation finishes in less than half of the original time (from 14.5 minutes down to 6.5 minutes).

RobertGalatNordic commented 4 weeks ago

Another tip for a faster fetch is to use ssh instead of https. Adding the following git config will automatically rewrite https GitHub URLs to ssh:

```
git config --global url.ssh://git@github.com/.insteadOf https://github.com/
```

marc-hb commented 4 weeks ago

There are a few test failures in https://github.com/zephyrproject-rtos/west/actions/runs/9462265663/job/26086666688?pr=713, please take a look at "Running the tests" in the README.rst

Ignore the macOS failures, unrelated. Fixed in #712.

But more importantly, unless I'm confused (please correct me) I believe there was a previous attempt to do this and it had to be reverted because of terminal issues on Windows, see 2nd and abandoned attempt in #551. Were you aware of it? This new #713 attempt looks surprisingly much shorter: sounds too good to be true.

@M1cha can you take a look at this #713?

RobertGalatNordic commented 4 weeks ago

I did have a minor copy-paste bug; the tests were very helpful in catching it. I reran the tests locally and got a green mark, so I hope this time it will be OK.

I looked into the previous attempt, and my approach seems simpler because I used a library that was designed as a drop-in replacement for `for` loops.

marc-hb commented 4 weeks ago

> I did have a minor copy-paste bug; the tests were very helpful in catching it.

Yes, tests save lives. Speaking of which: make sure all the existing tests stick to -j 1 by default. Again, the last thing we want is concurrency when debugging some totally unrelated issue. You should also add at least one test using a parallel update for minimal test coverage.

(pytest should run tests in parallel but that's a completely different story)

For those reasons, point 1 requires unbuffered git output; see the --help string in https://github.com/zephyrproject-rtos/west/pull/551.

If joblib provides terminal output ONLY after a job is done, then west update will now be surprisingly quiet for a possibly long time. This could mislead users into thinking they have a network issue or some proxy misconfiguration when git is actually running fine. So now I'm actually wondering whether -j 1 should maybe stay the default; among other benefits, it would avoid disorienting old users. It would also make the new dependency "more optional". Either way, when all output is buffered, print a message like "parallel update started, be patient...".

> Another tip for faster fetch ...

BTW: https://github.com/zephyrproject-rtos/west/labels/performance

marc-hb commented 4 weeks ago

Now that memories are coming back, I feel like I just started repeating the previous discussions on this topic. To avoid such a repeat, please review all past discussions in #533, #551 and the ones before that. It's a bit long, sorry, but it will be shorter than repeating them.

RobertGalatNordic commented 3 weeks ago

- Rebased on main.
- Added detection of whether joblib is available; if not, the old single-core approach is used.
- Added a `-j [JOBS]` parameter to `west update`; a bare `-j` uses all cores, same as `-j -1`.
- When joblib is not installed, `-j 1` was passed, or `-j` was not passed at all, the old single-threaded for-loop approach is used.
- Updated the tests to add parametrization to the test cases.
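The described `-j` semantics (bare `-j` meaning all cores, flag absent meaning single-threaded) can be expressed with argparse roughly like this; the names are assumptions for illustration, not the PR's exact code:

```python
import argparse

parser = argparse.ArgumentParser(prog="west update")
parser.add_argument(
    "-j", "--jobs", type=int,
    nargs="?",   # allow a bare "-j" with no value
    const=-1,    # bare "-j": use all cores, same as "-j -1"
    default=1,   # flag absent: keep the old single-threaded path
)

print(parser.parse_args([]).jobs)           # 1
print(parser.parse_args(["-j"]).jobs)       # -1
print(parser.parse_args(["-j", "4"]).jobs)  # 4
```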

RobertGalatNordic commented 3 weeks ago

I did not test these changes on Windows, as I do not work on that platform. If there is an issue there, we could disable this feature on that platform, or maybe someone more experienced on Windows could help us. But as I mentioned previously, I used an off-the-shelf library to do the heavy lifting, so it may be possible to delegate any new issues to the joblib project.

RobertGalatNordic commented 3 weeks ago

I tested it on Windows in PowerShell.

When using west update or west update -j 1, the results are as expected: nice, clean output.

On the other hand, when more threads are used, the output gets a little messy. It seems like multiple threads push to stdout at the same time, and some lines are placed in the middle of other lines.

I can see how this could be fixed, but it would require a lot more refactoring in self.update(project), which does all the work: the output is either generated and pushed to stdout with print statements, or the git commands are called without capturing their output. We would have to collect all this output and return it from the self.update(project) function.

So, to summarize: the default behavior is the same as before this PR, so you have to explicitly state that you want to use multiple threads for the update. On Windows the output with multiple threads is messy, but with a single thread there is no change compared to the version on main.

On Linux the output looks a little better, as lines are not interrupted, but you see lines from multiple processes one under another.

IMO this is not a big deal. To make it right, we would have to collect all the output and print it only after the job is completed, which would cause some idle time for the user (there is a package that provides a progress bar for joblib, https://pypi.org/project/joblib-progress/, which could solve the issue of idle output).
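The collect-then-print idea could be sketched like this (hypothetical names, standard-library thread pool as a stand-in for the real worker pool): each job returns its whole output as one string, and the main thread prints each chunk only once that job has finished, so chunks can never interleave mid-line:

```python
import concurrent.futures
import io

def update_project(name):
    # Hypothetical stand-in for self.update(project): instead of printing
    # directly, collect everything the job would print into a buffer.
    buf = io.StringIO()
    buf.write(f"=== updating {name}\n")
    buf.write(f"fetch {name}: done\n")
    return buf.getvalue()

def parallel_update(projects, jobs=4):
    chunks = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() yields results in input order; each chunk is printed whole,
        # so output from different jobs never interleaves mid-line.
        for chunk in pool.map(update_project, projects):
            print(chunk, end="")
            chunks.append(chunk)
    return chunks
```

The trade-off is exactly the idle time mentioned above: nothing is printed for a project until its job completes.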

Furthermore, there is always the option not to use this, and I imagine this feature will mostly be used in automation scripts, where we do not care about the output, only about the time.

marc-hb commented 3 weeks ago

> I believe there was a previous attempt to do this and it had to be reverted because of terminal issues on Windows, see 2nd and abandoned attempt in #551. Were you aware of it? This new #713 attempt looks surprisingly much shorter: sounds too good to be true.

> I looked into the previous attempt, and my approach seems simpler because I used a library that was designed as a drop-in replacement for `for` loops.

Ha! That explains it. Re-using a higher-level library: great idea. This makes the PR very short indeed, nice job. [...] Either way, when all output is buffered, please print a message like "parallel update started, be patient...".

> On the other hand, when more threads are used, the output gets a little messy. It seems like multiple threads push to stdout at the same time, and some lines are placed in the middle of other lines.

> I can see how this could be fixed, but it would require a lot more refactoring in self.update(project), which does all the work: the output is either generated and pushed to stdout with print statements, or the git commands are called without capturing their output. We would have to collect all this output and return it from the self.update(project) function.

Wait, so you're saying joblib/loky have no ability to capture terminal/standard outputs?

I skimmed their APIs and found https://github.com/joblib/loky/issues/306, which seems to confirm they indeed don't care about standard outputs. Their focus is parallel computation, which I bet rarely uses standard outputs, only when something goes wrong.

This is a MAJOR disappointment :-(

All this time I was confused and assumed joblib/loky offered the ability to capture standard outputs. I thought this was why the joblib solution was much shorter. Now I understand it is shorter because it does not try to capture outputs and just lets them pass through.

I wouldn't mind a first shot that does not capture outputs (others may disagree), but there would need to be a very clear path towards implementing this in the future, and I see none with joblib right now; it looks like a dead end.

> Now that memories are coming back, I feel like I just started repeating the previous discussions on this topic. To avoid such a repeat, please review all past discussions in #533, #551 and the ones before that. It's a bit long, sorry, but it will be shorter than repeating them.

As you can see, the title of the aborted fix #533 was "project: overwrite stdout/stderr objects as well"; that was a pretty strong clue. I also mentioned multiple times that the previous attempt was reverted because of terminal issues.

> Furthermore, there is always the option not to use this, and I imagine this feature will mostly be used in automation scripts, where we do not care about the output, only about the time.

Failures happen in automation too, and then output is important. In case of failure, outputs can be even more important than in interactive use, because there is often no simple access to the automation environment and little or no "trial and error" possibility to test again. Changing verbosity levels and other parameters is also more difficult at best and impossible at worst.

PS: I also discovered that the main reason loky was created was to avoid bugs in Python's multiprocessing. But those bugs have apparently been fixed in 3.7, and west requires 3.8 minimum.

marc-hb commented 2 weeks ago

> Failures happen in automation too, and then output is important.

Moreover: compilers, linkers, etc. are used to having their outputs interleaved with those of concurrent processes; it's been like this for decades. So they expect concurrency and generally make a good effort to emit standalone "chunks" of information. At the receiving end, make and ninja try to somehow respect that.

On the other hand, I really doubt git output comes prepared for such concurrency, and "naive" parallelization like the one in this PR does not either; your test in PowerShell proves that.


Trying to summarize:

  1. I would approve a PR that defaults to -j 1 and prints interleaved garbage terminal output for -j > 1 but only if it does not add any new, significant dependency (like joblib) that does not care about outputs and just lets them through. Maybe Python's built-in multiprocessing could be enough, I don't know. Also, there should be a very clear warning in the --help that -j > 1 can produce garbage.

  2. Based on past discussions (links already shared above), I suspect @mbolivar-ampere may disagree with 1. He should be back from vacation soon.

  3. Without waiting, please extract from this PR and submit separately the creation of the project_update() and project_update_some() functions. Leave aside the actual -j bits for now. This will lay the ground for ANY parallelization attempt with pretty much any library and it can't hurt so I think we can merge that preliminary bit yesterday. Let's get those mundane bits merged and out of the way.

  4. Preferably in a separate commit, please also submit your test changes like this:

```python
@pytest.mark.parametrize("options", [""])  # , "-j 1", "-j 2", "-j -1"
def test_update_some_with_imports(options, repos_tmpdir):
```

Again, this will help anyone experimenting with any sort of -j prototype. We don't want to lose this no matter what happens next.
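Regarding point 1 above, a dependency-free variant using only the standard library could look roughly like this; a thread pool rather than processes, since git fetches are I/O-bound and threads side-step pickling the worker on Windows. The names are hypothetical, not actual west code:

```python
from multiprocessing.pool import ThreadPool  # stdlib, no new dependency

def update_one(name):
    # Stand-in for the per-project git work.
    return f"updated {name}"

def update_projects(projects, jobs=1):
    if jobs == 1:
        # Old sequential path, kept as the default.
        return [update_one(p) for p in projects]
    # jobs=-1: let the pool size default to os.cpu_count().
    with ThreadPool(processes=None if jobs < 0 else jobs) as pool:
        return pool.map(update_one, projects)
```

This still does nothing about capturing or serializing the jobs' terminal output, so the interleaving concern discussed above applies to it just as much as to joblib.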