yarnpkg / yarn

The 1.x line is frozen - features and bugfixes now happen on https://github.com/yarnpkg/berry
https://classic.yarnpkg.com

Usage of uncompressed tarballs #541

Open · Daniel15 opened 7 years ago

Daniel15 commented 7 years ago

Something to consider as a future enhancement, post-launch

Some people may want to store tarballs of all their dependencies in their source control repository, for example if they want a fully repeatable/reproducible build that does not depend on npm's servers. Storing compressed tarballs in Git or Mercurial is generally bad news: every update to a package results in a new copy of the entire file in the repo, which can make the repo very large. Every time you clone the repo, the full history is transferred, including every previous version of all the packages, so even deleting the binary files has a lasting effect until you rewrite history to get rid of them.

Instead, we should try storing uncompressed tarballs (i.e. .tar files). Since the tar files are mostly plain text, in theory Git/Mercurial should be able to diff them more easily when a new version of a module is added and an old version removed, and store just the delta rather than an entirely new blob.
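
A quick way to see why compression defeats delta storage, as an illustrative Node sketch (the data is synthetic, not a real package):

```js
const zlib = require('zlib');

// Two ~1 MB text-like buffers that differ at a single byte, mimicking a
// one-character change inside an otherwise identical tar payload.
const a = Buffer.alloc(1024 * 1024);
for (let i = 0; i < a.length; i++) a[i] = 97 + (i % 26);
const b = Buffer.from(a);
b[1000] ^= 0xff; // flip one byte early in the stream

// Count positions at which two buffers differ.
function diffBytes(x, y) {
  let n = Math.abs(x.length - y.length);
  for (let i = 0; i < Math.min(x.length, y.length); i++) {
    if (x[i] !== y[i]) n++;
  }
  return n;
}

console.log('raw  difference:', diffBytes(a, b), 'byte(s)');
console.log('gzip difference:', diffBytes(zlib.gzipSync(a), zlib.gzipSync(b)), 'byte(s)');
// The raw buffers differ in a single byte, so a VCS delta stays tiny; the gzip
// streams diverge after the changed byte, so much of the compressed output differs.
```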

Related: This was implemented in Shrinkpack: https://github.com/JamieMason/shrinkpack/issues/40 and https://github.com/JamieMason/shrinkpack/commit/7b2f341408be4f0415714ec57534debfdaaa3fbf#comments. According to the comments on the commit, this actually sped up npm install when shrinkpack implemented it, as npm no longer needed to decompress the archive every time. This makes sense since you're removing the overhead of gzip from the installation time.

bestander commented 7 years ago

A few arguments from an internal discussion:

That said, we can't deny the speed improvement when unpacking uncompressed tars, so there may be a reason to consider this feature.

joncursi commented 7 years ago

+1

shrinkpack has become a huge part of our development workflow. When packages are upgraded and the build is "shrinkpacked", individual tar files are created only for the new packages, and the outdated versions are automatically dropped. That's because the names of the resulting .tar files are a function of the package versions. Here's a short snapshot of what a node_shrinkwrap directory looks like:

[screenshot: node_shrinkwrap directory contents]

You can follow the git history on this directory to figure out which dependencies were upgraded and when, e.g. react-native-animatable in this example...

[screenshot: git history of node_shrinkwrap showing the react-native-animatable upgrade]

...with quick and easy access to the backup:

[screenshot: the backed-up tarball available in the git history]

With shrinkpack, the diffs in GitHub closely reflect the commit message and the actual changes being made. Committing and pushing the result of a new shrinkpack is a better experience, IMO, than doing the same after a yarn pack, because, as mentioned, changes are handled at the package-version level rather than the repository-version level. You're only pushing up individual .tar files, which is fast, especially if you're using Git LFS, and you don't need to touch your package.json version number at all.

bestander commented 7 years ago

@joncursi, we have an offline mirror feature that does what you want: https://yarnpkg.com/blog/2016/11/24/offline-mirror. The only thing missing is cleanup, which we skip on purpose because the tarball storage may be shared by multiple projects.
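
For reference, turning the mirror on is a single .yarnrc entry; the folder name below is only an example and can be any path:

```
yarn-offline-mirror "./npm-packages-offline-cache"
```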

joncursi commented 7 years ago

@bestander very cool, thank you for sharing that blog post. I didn't catch this feature when reading the CLI docs. It would be a lovely addition to https://yarnpkg.com/en/docs/cli/config

I use shrinkpack locally in each project, rather than globally across multiple projects. I would like to do the same with yarn, which would require old tar files to be removed when packages are upgraded. I only care about maintaining the latest working version of each package; if I need to dig up an older version, it's always there in the git history. But I don't need or want to store it directly in the mirror forever.

My use case is to implement the mirror less for offline purposes and more for maintaining a concise set of package backups in case packages are suddenly unpublished from npm. Risk control. As far as I know, that was largely the intent behind shrinkpack in the first place.

Is there a smarter way to automate package removal from the mirror when a new package version is added? Perhaps a config option in .yarnrc to specify this (feature request)? At the moment it seems I have to do it manually...

yarn add package@new-version && rm -rf yarn_mirror/package@old-version

Also, the same issue presents itself when removing a package from use in the repo entirely...

yarn remove package && rm -rf yarn_mirror/package-*

bestander commented 7 years ago

@joncursi, this is a bit off-topic for this issue; it would be better to open an RFC discussion of what is needed.

As for the cleanup, it can be a ten-line JS/bash script that you run alongside yarn until we implement it. Roughly: for each package in the offline mirror, keep the most recent tarball and delete the older ones.
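
A minimal sketch of such a cleanup script in Node (the mirror path and the `<name>-<version>.tgz` naming assumption are illustrative; adjust both to your setup):

```js
const fs = require('fs');
const path = require('path');

const MIRROR = './npm-packages-offline-cache'; // your yarn-offline-mirror folder

// Group tarballs by package name, assuming <name>-<major>.<minor>.<patch>[...].tgz.
const byName = new Map();
for (const file of fs.readdirSync(MIRROR)) {
  const match = file.match(/^(.+)-\d+\.\d+\.\d+.*\.tgz$/);
  if (!match) continue;
  const entry = { file, mtime: fs.statSync(path.join(MIRROR, file)).mtimeMs };
  if (!byName.has(match[1])) byName.set(match[1], []);
  byName.get(match[1]).push(entry);
}

// Keep only the most recently added tarball per package and delete the rest.
for (const entries of byName.values()) {
  entries.sort((a, b) => b.mtime - a.mtime);
  for (const old of entries.slice(1)) {
    console.log('removing', old.file);
    fs.unlinkSync(path.join(MIRROR, old.file));
  }
}
```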

Daniel15 commented 7 years ago

This issue is specifically for switching from compressed (.tar.gz) to uncompressed (.tar) tarballs; anything else should be discussed in a separate task 😄

bfricka commented 7 years ago

From an implementation standpoint, what sort of risks and level of effort would you foresee in simply making this a flag that can be passed to the CLI? Shrinkpack is written so that uncompressed tarballs are the default, but you can opt into compressed packages with a flag. What would the impact be of simply implementing the inverse behavior (opt in to uncompressed with a flag)?

It seems like this would address the issue of potentially unpleasant changes for those already using the offline mirror to commit modules locally, while allowing the uncompressed behavior for those who don't mind aliasing a couple of yarn commands.

Edit: Even more simply, the flag could just be defined in the .yarnrc

This is actually the main thing preventing us from switching to yarn, as it already admirably solves the determinism issue, and the offline mirror feature (thanks for the link, btw!) takes care of the rest. However, it leaves us with the (from our perspective) undesirable situation of committing binary packages. In our experience, Git does very well with plain tar: most updated packages are recognized as renames with tiny deltas, and Git's own pack compression does the rest, so the actual bandwidth used is dramatically lower.

bestander commented 7 years ago

Yarn puts the same tarballs that it downloads from the registry into the offline mirror folder.

To allow uncompressed tarballs, you would need to unzip them first and then zip them again.

Also, the tarballs have versions in their file names, so git won't be able to track version updates as small diffs.

Daniel15 commented 7 years ago

You wouldn't need to unzip and then zip again; you'd simply need to decompress the tarball. The inner .tar stays the same, it just isn't compressed any more.

Not sure about Git, but Mercurial tracks copied files, so it could track new versions of dependencies as copies of old ones if they're similar enough.
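
For what it's worth, stripping just the gzip layer is a few lines with Node's zlib; a sketch (the file names are illustrative):

```js
const fs = require('fs');
const zlib = require('zlib');

// left-pad-1.1.3.tgz -> left-pad-1.1.3.tar: same inner tar bytes, no gzip wrapper.
fs.createReadStream('left-pad-1.1.3.tgz')
  .pipe(zlib.createGunzip())
  .pipe(fs.createWriteStream('left-pad-1.1.3.tar'));
```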

bestander commented 7 years ago

Thanks, Daniel, good to know.

Although someone would need to show that this advanced Mercurial/Git tracking actually happens on a real example before we consider this change, right?

webuniverseio commented 7 years ago

Hi @bestander, we use git with Bitbucket and npm + shrinkwrap on some projects. Here is what it looks like when a minor version of the tar changes:

[screenshot: diff showing the updated tarball tracked as a rename]

Here are sample tar files for the package from the screenshot that was tracked as renamed: tars.zip

Thanks

Daniel15 commented 7 years ago

Although someone would need to show that this advanced Mercurial/Git tracking actually happens on a real example before we consider this change, right?

I've been meaning to test it out, I just haven't had time to do so.

bfricka commented 7 years ago

Hey there! It's been a while, and since you're busy, I thought I'd make this as painless as possible.

Check out this shrinkpack tar proof of concept

bestander commented 7 years ago

This seems like a reasonable idea after all.

So how would it work?

  1. (if the file is missing in the offline mirror) download the tar.gz from the registry
  2. decompress it and copy the .tar file to the offline mirror
  3. unpack the tar to the cache
  4. if prune-offline-mirror is enabled, a tarball of one version of a package was added to the offline mirror, and another version was removed, then register the add/remove with git/hg mv

Results:
A. Potential CPU wins, because step 2 will be skipped when installing from the offline mirror.
B. Space wins, if tarball contents are similar at step 4.
C. Checking in uncompressed tarballs has a negative impact on repo size.
D. Step 4 seems a bit complex, with all sorts of edge cases.

So if A + B > C + D, then why not? A, B, and C can be measured, although D is subjective.
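
C, at least, is easy to estimate on an existing mirror: compare the total size of the .tgz files with the size of their gunzipped .tar payloads. A sketch (the mirror path is illustrative):

```js
const fs = require('fs');
const path = require('path');
const zlib = require('zlib');

const MIRROR = './npm-packages-offline-cache'; // your yarn-offline-mirror folder

let compressed = 0;
let uncompressed = 0;
for (const file of fs.readdirSync(MIRROR)) {
  if (!file.endsWith('.tgz')) continue;
  const buf = fs.readFileSync(path.join(MIRROR, file));
  compressed += buf.length;                     // what gets checked in today
  uncompressed += zlib.gunzipSync(buf).length;  // what a plain .tar mirror would hold
}
console.log(`compressed:   ${(compressed / 1e6).toFixed(1)} MB`);
console.log(`uncompressed: ${(uncompressed / 1e6).toFixed(1)} MB`);
```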

bfricka commented 6 years ago

Bumpity bump! I can work on this if you guys want?

bestander commented 6 years ago

@bfricka, of course, give it a try. We would need to see a few real-life examples showing the impact this feature provides, though.