Allow using reference repositories to share objects

zephyrproject-rtos / west

West, Zephyr's meta-tool

https://docs.zephyrproject.org/latest/guides/west/index.html

Apache License 2.0


Allow using reference repositories to share objects #695

Status: Closed (hzarnani closed this issue 8 months ago)

hzarnani commented 8 months ago

Today, the only way I am aware of to efficiently clone a large repository with west update is to limit the fetch depth, as in -o=--depth=<n>. The user has to choose the depth value carefully: judging from my own trial and error and a cursory look at the Python code, <n> must be deep enough to include the specific SHA listed in the west.yml file, or else the update step fails.

An alternative to limiting the fetch depth is to share objects with a local reference repository by setting up a .git/objects/info/alternates file. The request is for a feature akin to git-clone's --reference[-if-able] option. See the documentation for more.

Ideally, one should be able to provide a reference for each repository separately. The format might also allow an option to specify a local prefix path for a particular URL base as a convenience. For example, a user may have locally cloned mirrors for all of the NCS repositories that they need under /home/<user>/git-mirrors/nrf-connect and want to associate that prefix path with the url-base https://github.com/nrfconnect.
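To make the prefix idea concrete, here is a minimal sketch of how a url-base could be mapped to a local mirror path. Nothing here is west's API: the function name `resolve_local_ref`, the `ref_bases` dictionary format, the "longest url-base wins" rule, and the `/home/user` path are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def resolve_local_ref(project_url: str, ref_bases: Dict[str, str]) -> Optional[str]:
    """Map a project's fetch URL to a path under a local mirror prefix.

    ref_bases maps a url-base to a local prefix path (hypothetical format).
    Returns None when no url-base matches the project URL.
    """
    # Assumed rule: try longer url-bases first so more specific entries win.
    for url_base in sorted(ref_bases, key=len, reverse=True):
        base = url_base.rstrip("/")
        # Require a "/" after the base so "nrfconnect" does not match "nrfconnectX".
        if project_url.startswith(base + "/"):
            repo_path = project_url[len(base) + 1:]
            return str(PurePosixPath(ref_bases[url_base]) / repo_path)
    return None


# Hypothetical mapping, following the example in the comment above.
ref_bases = {"https://github.com/nrfconnect": "/home/user/git-mirrors/nrf-connect"}
print(resolve_local_ref("https://github.com/nrfconnect/sdk-nrf", ref_bases))
```

Whether the mirror directories carry a `.git` suffix, and what to do when the resolved path does not exist, would be for the feature design to decide.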

I'll leave it to the feature designer/developer to determine when and in what format the reference repositories should be provided (and, of course, define precedence rules for ambiguities, i.e. when the local reference path for a project could be determined more than one way). One could think of an extension to the manifest format and add an optional local-refs (plural) counterpart to remotes and/or a new field local-ref (singular) for each project along with name and repo-path.

But since local references are very user- and environment-specific, perhaps paths to local references shouldn't be part of west.yml at all but rather part of another manifest, say, local-refs.yml that is similar to the format of west.yml but only accepts a subset of it, and that can be provided to west init or west update or both with an option like --reference[-if-able] similar to git-clone's. CLI option equivalents are also fine (e.g. --local-ref-base https://github.com/nrfconnect,/home/<user>/git-mirrors/nrf-connect or --local-ref https://github.com/zephyrproject-rtos/zephyr,/home/<user>/git-mirrors/zephyrproject-rtos/zephyr.git).
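As a rough sketch of the CLI variant, the `URL,PATH` values could be parsed and combined as follows. The option names `--local-ref` and `--local-ref-base` are only the ones proposed in this comment and do not exist in west; the precedence rule (an exact per-project entry beats a url-base prefix) and the helper names `parse_pairs` and `pick_reference` are likewise assumptions.

```python
from typing import Dict, List, Optional


def parse_pairs(values: List[str]) -> Dict[str, str]:
    """Parse 'URL,PATH' strings into a {url: path} mapping.

    Splits on the first comma, so this sketch assumes URLs contain no commas.
    """
    pairs: Dict[str, str] = {}
    for value in values:
        url, sep, path = value.partition(",")
        if not sep or not url or not path:
            raise ValueError(f"expected URL,PATH but got {value!r}")
        pairs[url.rstrip("/")] = path
    return pairs


def pick_reference(project_url: str,
                   local_refs: Dict[str, str],
                   local_ref_bases: Dict[str, str]) -> Optional[str]:
    """Choose a reference path for one project, or None if nothing applies."""
    url = project_url.rstrip("/")
    # Assumed precedence: an exact per-project entry wins over a url-base prefix.
    if url in local_refs:
        return local_refs[url]
    for base, prefix in local_ref_bases.items():
        if url.startswith(base + "/"):
            return prefix.rstrip("/") + "/" + url[len(base) + 1:]
    return None


# Values as they might be passed to the hypothetical options.
refs = parse_pairs(["https://github.com/zephyrproject-rtos/zephyr,"
                    "/home/user/git-mirrors/zephyrproject-rtos/zephyr.git"])
bases = parse_pairs(["https://github.com/nrfconnect,/home/user/git-mirrors/nrf-connect"])
print(pick_reference("https://github.com/zephyrproject-rtos/zephyr", refs, bases))
```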

marc-hb commented 8 months ago

Today, the only way to efficiently handle cloning a large repository using west update that I am aware of is to limit the fetch depth as in -o=--depth=<n>.

There are other (and not mutually exclusive) optimizations

must be deep enough to include the specific SHA listed in the west.yml file, or else the update step fails.

I don't think this is how west typically works. Take a look at

https://github.com/zephyrproject-rtos/west/pull/344

An alternative to limiting the fetch depth is to share objects with a local reference repository

Did you look at west update -h?

hzarnani commented 8 months ago

Today, the only way to efficiently handle cloning a large repository using west update that I am aware of is to limit the fetch depth as in -o=--depth=<n>.

There are other (and not mutually exclusive) optimizations

Consider adding config option for treeless clones (--fetch-opt=--filter=...) #638

Support for Treeless clones actions/checkout#1152

Label: performance ("How long things take")

must be deep enough to include the specific SHA listed in the west.yml file, or else the update step fails.

I don't think this is how west typically works. Take a look at

[RFC] manifest: allow projects to say where their SHAs are #344

An alternative to limiting the fetch depth is to share objects with a local reference repository

Did you look at west update -h?

The caching mechanism added by @mbolivar-nordic in https://github.com/zephyrproject-rtos/west/commit/c50d342cc687010a56a15958c2fc67264c792851 does not actually take advantage of Git's object sharing. It still clones all of the objects and the entire history of the repository, only does so locally rather than over the network, which of course is an improvement, but not what's being asked here. The crucial step is to set up a .git/objects/info/alternates file. Using the --shared option when running git-clone does that. Other tools that use Git would have to create this file themselves, which is fairly easy to do. That would result in a much smaller size for the .git directory in the project workspace as well, thus saving a lot of disk space in addition to a faster creation of the work tree.
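For illustration, a tool could create the alternates file itself along these lines. This is a sketch under stated assumptions, not how west works: `objects_dir` and `add_alternate` are hypothetical helper names, and the code assumes that `.git` in the workspace repository is a directory (not a gitdir file) and that the reference is either a bare or a regular repository. The file format itself (one object-directory path per line) is Git's.

```python
import tempfile
from pathlib import Path


def objects_dir(repo: Path) -> Path:
    """Return the object store of a non-bare (.git/objects) or bare (objects) repo."""
    non_bare = repo / ".git" / "objects"
    return non_bare if non_bare.is_dir() else repo / "objects"


def add_alternate(workspace_repo: Path, reference_repo: Path) -> Path:
    """Append the reference repository's object store to the alternates file."""
    ref_objects = objects_dir(reference_repo).resolve()
    if not ref_objects.is_dir():
        raise FileNotFoundError(f"no object store at {ref_objects}")
    alternates = workspace_repo / ".git" / "objects" / "info" / "alternates"
    alternates.parent.mkdir(parents=True, exist_ok=True)
    existing = alternates.read_text().splitlines() if alternates.exists() else []
    # Alternates holds one object-directory path per line; avoid duplicate entries.
    if str(ref_objects) not in existing:
        with alternates.open("a") as f:
            f.write(f"{ref_objects}\n")
    return alternates


# Demo on throwaway directories that only mimic the repository layout.
with tempfile.TemporaryDirectory() as tmp:
    mirror = Path(tmp) / "mirror.git"
    (mirror / "objects").mkdir(parents=True)
    workspace = Path(tmp) / "workspace" / "zephyr"
    (workspace / ".git" / "objects").mkdir(parents=True)
    print(add_alternate(workspace, mirror).read_text(), end="")
```

A repository set up this way depends on the reference staying intact: if objects are pruned from the reference, the borrowing repository can become corrupt.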

On a related note, using git-init and git-fetch is much preferred over using git-clone. In other words,

git init
git remote add <remote_name> <remote_url>
[set up .git/objects/info/alternates to point to objects in <local_cache>]
git fetch <remote_name>

is much better than

git clone <local_cache>
git remote set-url <remote_name> <remote_url>
git fetch ...

which seems to be how west works when given a cached repository.
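The init-then-fetch sequence could be driven from Python roughly as below. This is a hedged sketch, not west's implementation: `init_and_fetch` and its `run` parameter are hypothetical, the remote name and mirror path in the demo are placeholders, and the alternates file is simply overwritten on the assumption that the repository was freshly initialized.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, List


def _run_checked(cmd: List[str]) -> None:
    """Default runner: execute the git command and raise on failure."""
    subprocess.run(cmd, check=True)


def init_and_fetch(workdir: Path, remote_name: str, remote_url: str,
                   reference_objects: Path,
                   run: Callable[[List[str]], object] = _run_checked) -> None:
    """git init, add the remote, point alternates at the cache, then fetch."""
    run(["git", "init", str(workdir)])
    run(["git", "-C", str(workdir), "remote", "add", remote_name, remote_url])
    # Write alternates before fetching so objects already in the cache are reused.
    info = workdir / ".git" / "objects" / "info"
    info.mkdir(parents=True, exist_ok=True)
    (info / "alternates").write_text(f"{reference_objects}\n")
    run(["git", "-C", str(workdir), "fetch", remote_name])


# Dry run: record the commands instead of executing them.
calls: List[List[str]] = []
with tempfile.TemporaryDirectory() as tmp:
    init_and_fetch(Path(tmp) / "zephyr", "origin",
                   "https://github.com/zephyrproject-rtos/zephyr",
                   Path("/srv/git-mirrors/zephyr.git/objects"),
                   run=calls.append)
for cmd in calls:
    print(" ".join(cmd))
```

Passing a recording function as `run` keeps the sketch testable without network access; omitting it runs the real git commands.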

marc-hb commented 8 months ago

That would result in a much smaller size for the .git directory in the project workspace as well,

Much smaller disk space... if you don't count the initial repos.

only does so locally rather than over the network, which of course is an improvement, but not what's being asked here

There is no doubt --shared would be a big optimization. But as with any optimization work the most important question is: "How much?". More precisely: how much compared to existing optimizations? Greatly increasing the complexity of the code base to save a few percent would never be worth it.

So far you haven't provided any number, not even any order of magnitude. You don't sound like you've explored all available options either: your first sentence at the top is "--depth is the only efficient way I'm aware of", which is incorrect.

Interactive users clone very rarely from scratch. In our CI, west update takes 1-2 minutes from scratch (using the existing optimizations I listed) which is acceptable for us. Need some time to run tests anyway.

So what is your use case? Development normally happens to fix tangible and measurable issues, not just "cool ideas".

Before implementing one of the existing optimizations, @mbolivar-ampere spent a lot of time performing some measurements. You can find those at one of the links I shared above if you're interested.

On a related note, using git-init and git-fetch is much preferred over using git-clone.

west used to do this but it was changed in e283d9986f9d

hzarnani commented 8 months ago

That would result in a much smaller size for the .git directory in the project workspace as well,

Much smaller disk space... if you don't count the initial repos.

Think of many concurrent workspaces, not just a single one.

only does so locally rather than over the network, which of course is an improvement, but not what's being asked here

There is no doubt --shared would be a big optimization. But as with any optimization work the most important question is: "How much?".

Multiple workspaces sharing the same Git objects is very clearly a huge advantage, both in terms of storage and speed of checkout. Imagine N users, or many concurrent CI jobs, using the same Git mirrors on some NFS share locally. N workspaces sharing the same repository histories is very clearly an advantage over N workspaces and N replications of the same history. And it's faster.

More precisely: how much compared to existing optimizations?

I'm not going to do that comparison. But feel free to do some Google searching on the advantages of sharing Git objects with a reference repository.

Greatly increasing the complexity of the code base to save a few percent would never be worth it.

What exactly is the complexity? And no, it's not a few percent.

So far you haven't provided any number, not even any order of magnitude.

It should be self-evident. A single replicated history, which is a constant, versus N replicated histories.
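The argument can be written as simple arithmetic. The sketch below only restates the formula; the sizes used in the example are made-up inputs for illustration, not measurements of any real repository, and the function names are hypothetical.

```python
def replicated_mib(n: int, history_mib: int, worktree_mib: int) -> int:
    """Disk use when each of n workspaces holds its own copy of the history."""
    return n * (history_mib + worktree_mib)


def shared_mib(n: int, history_mib: int, worktree_mib: int,
               local_objects_mib: int = 0) -> int:
    """Disk use when n workspaces borrow objects from one reference repository.

    local_objects_mib covers objects a workspace fetches that the reference lacks.
    """
    return history_mib + n * (worktree_mib + local_objects_mib)


# Illustrative inputs only: 10 workspaces, 2048 MiB of history, 500 MiB work tree.
print(replicated_mib(10, 2048, 500))  # 25480
print(shared_mib(10, 2048, 500))      # 7048
```

The formula says nothing about fetch or checkout time, which would still need to be measured against the existing cache options.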

You don't sound like you've explored all available options either: your first sentence at the top is "--depth is the only efficient way I'm aware of", which is incorrect.

I indeed have. As I said, the caching implementation in west is mediocre at best and doesn't address the issue of object sharing.

Interactive users clone very rarely from scratch.

Now I am going to challenge statements like this -- please provide some numbers. How many users? How often?

In our CI, west update takes 1-2 minutes from scratch (using the existing optimizations I listed) which is acceptable for us. Need some time to run tests anyway.

The assumption in that statement is that the Git repositories involved are small. But what if large repositories are involved and they may not be using LFS?

So what is your use case?

I mentioned that earlier -- many concurrent workspaces using large repositories.

Development normally happens to fix tangible and measurable issues, not just "cool ideas".

I appreciate that.

Before implementing one of the existing optimizations, @mbolivar-ampere spent a lot of time performing some measurements. You can find those at one of the links I shared above if you're interested.

I'll take a look.

On a related note, using git-init and git-fetch is much preferred over using git-clone.

west used to do this but it was changed in e283d99

marc-hb commented 8 months ago

Interactive users clone very rarely from scratch.

Now I am going to challenge statements like this -- please provide some numbers. How many users? How often?

You're the one asking for a "clearly", "self-evident" new feature - without providing any number, reproducible use case, example, measurements of existing optimizations, prototype code or any offer to contribute or help[1]. You seem to have a performance problem to solve[2]. I don't.

Now answering your question anyway:

Doctor, it hurts when I keep cloning from scratch interactively.
Don't.

You don't sound like you've explored all available options

I indeed have.

Then share some reproducible example and actual data, not "self-evidence".

[1] "I'll leave it to the feature designer/developer..." - who is that? "I'm not going to do that comparison. Feel free to Google..."

[2] assuming it's not a https://en.wikipedia.org/wiki/XY_problem

marc-hb commented 8 months ago

What exactly is the complexity?

This was just an example. Every feature and code addition increases complexity - and bugs, and maintenance costs. If you take a quick look at the git log, you'll notice this project is not really staffed with an army of full-time developers. Not even one full time in fact, very far from it.

I have no idea what the complexity would be in this particular case, but your description of the new feature is not exactly short and still leaves a lot of open questions. If you think this would be a small effort then I can't wait for your pull requests (with some sample data to back them up). Don't forget the test code.

hzarnani commented 8 months ago

It is desirable for a tool built on top of Git to allow using the facilities that it offers for dealing with various complexities, particularly cloning large repositories. west is deficient on that front because it does not support using Git's object sharing mechanism, which is a well-known and primary feature of this tool. While I'd like to motivate the need for my feature ask with numbers, suffice it to say that some development and test environments rely on object sharing. I ask that the reader refer to the wealth of literature available on this topic to learn more.

About the problem statement being long, sure, it could have been more concise. But thoroughness was the goal.

I realize and appreciate how with limited time and resources, feature requests have to be addressed judiciously.

I can take a stab at extending west and adding the desired behavior. I'll make a pull request if I decide that what I have is presentable. And I certainly hope that then the conversation goes a little better.

hzarnani commented 8 months ago

Turns out, this feature request is closely related to (really a duplicate of) https://github.com/zephyrproject-rtos/west/issues/625.