yarnpkg / yarn

The 1.x line is frozen - features and bugfixes now happen on https://github.com/yarnpkg/berry
https://classic.yarnpkg.com

Consider hardlinks rather than separate copy of packages per app #499

Closed Daniel15 closed 3 years ago

Daniel15 commented 8 years ago

This was touched on in a comment on #480, but I thought it was worth pulling into its own separate issue.

Currently, each app that uses Yarn (or npm) has its own node_modules directory with its own copies of all the modules. This results in a lot of duplicate files across the filesystem. If I have 10 sites that use the same version of Jest or React or Lodash or whatever else you want to install from npm, why do I need 10 identical copies of that package's contents on my system?

We should instead consider extracting packages into a central location (eg. ~/.yarn/cache) and hardlinking them. Note that this would be a hardlink rather than symlink, so that deleting the cache directory does not break the packages.

cpojer commented 8 years ago

Yarn initially used to use symlinks and we changed it because our internal tooling (watchman etc.) doesn't work well with symlinks. Are hardlinks different in this case? If yes, that might be worth doing.

I think the initial release should continue to use the copy approach; it is more consistent with the rest of the ecosystem and we should evaluate this behavior for a future major release.

dxu commented 8 years ago

Upon thinking about it further, another issue that might come up is that people may try to modify their local node_modules for local debugging or testing purposes, and not expect that they're actually modifying the node module linked everywhere else. I don't know how often this happens with others, but I've definitely done it (though rarely) in the past. Apart from that, hardlinks seem to make sense. I'd guess that the tooling would be fine, since a hardlink should behave the same as any other file.

The primary issue this was intended to address was the cache causing issues with hardcoded paths that result from building packages (#480).

Daniel15 commented 8 years ago

Are hardlinks different in this case?

Not sure, might be worth asking @wez whether Watchman can handle hardlinks.

another issue that might come up is that people may try to modify their local node_modules for local debugging purposes or testing purposes, and not expect that they're actually modifying the node module linked to everywhere else

I think this is the use case for npm link or Yarn's equivalent though, right? You're never supposed to directly modify files in node_modules. We could have an option to make a local copy if people want to do this though.

cpojer commented 8 years ago

I totally agree with @dxu and actually wanted to write the same thing. I do this often: I manually add some debugging code to a random node module (that I don't have checked out locally). Once I'm done, I wipe it away and run npm install. It would be a mental change for me to remember that it would also affect other projects.

Daniel15 commented 8 years ago

Yeah, that's a use case I didn't really think about... Hmm...

Oh well, we can still hold on to this idea. Maybe it could be an optional configuration setting for people that don't directly edit node_modules and would like Yarn to run faster (less copying of the same data = less disk IO = stuff is faster and disk cache efficiency is improved) 😄

sebmck commented 8 years ago

Going to close this since we decided long ago to move away from symlinks. It's required for compatibility with the existing ecosystem, as even projects like ESLint rely on this directory structure to load rules etc. There are also a lot of problems with existing tooling not supporting them. For example, when Yarn initially used Jest, it would fail and produce extremely long paths. Jest is much better now and the bug is likely fixed, but small issues like this exist in a lot of tools.

Daniel15 commented 8 years ago

Sebastian, this task is for hardlinks, not symlinks. Hardlinks shouldn't have any of the problems you mentioned.


sebmck commented 8 years ago

Hardlinks have the exact same problems and are semantically the same in this scenario. Why do you think they don't have any of the same issues?

yunxing commented 8 years ago

@kittens I haven't really tested hardlinks. But once you hardlink a file, in theory, from the filesystem's perspective it should be exactly the same as the original file -- you can remove the original file and the hardlinked file will still work.

This is different from symlinks, whose content is just a pointer to the original file.

sebmck commented 8 years ago

You can have cycles, though, which is extremely problematic if tools aren't designed to handle them (most JavaScript tools aren't, and how would they?). Hardlinks and symlinks on Windows both require admin privileges (NTFS junctions don't, but they're more synonymous with symlinks), which is a non-starter for a lot of environments.

yunxing commented 8 years ago

Good point about Windows. We could have platform-specific logic if we decide to go down this path.

How do you create a cycle with hardlinks? Note that there are no hardlinks for directories.

Daniel15 commented 8 years ago

Going to reopen this for tracking purposes. It should be doable as hardlinked files look identical to the file system. I might prototype it.

wycats commented 8 years ago

@Daniel15 one thing to keep in mind is that since hardlinks pretend to be the file system so well, deleting them usually deletes way more files than you're expecting. Since rm -rf node_modules is a common pattern, I'd want us to have some mitigation for that likelihood before unleashing this into the wild (even on an opt-in basis).

I remember unexpected deletions hitting users of n back in the day and it has left a permanent scar 😛 (not directly analogous, but it gave me serious fear about giving people rope that could cause mass deletions of shared files)

Daniel15 commented 8 years ago

I remember unexpected deletions hitting users of n back in the day and it has left a permanent scar 😛

Good point, I remember Steam on Linux accidentally running rm -rf /* too: http://www.pcworld.com/article/2871653/scary-steam-for-linux-bug-erases-all-the-personal-files-on-your-pc.html

Maybe we need a safer "clean" function rather than just doing rm -rf node_modules

also commented 8 years ago

Are issues with hard links and rm -rf node_modules actually possible? While you can create symlinks to directories, you can't create hard links to them*, so you shouldn't be able to recurse into some global directory while running rm -rf.

* On macOS you can, but you shouldn't

vjpr commented 8 years ago

Symlinking to a global cache is essential. The copying approach is very slow for large projects (which I would argue are very common), extremely slow on VMs, very slow on Windows, and insanely slow on a virtualized Windows VM running on a macOS host in Parallels/VMware.

I have a relatively simple frontend/backend project and the node_modules is 270K files and about ~300MB.

With a warm global cache, the "Linking dependencies..." step takes about 5 minutes. Symlinking would take a couple of seconds.

rm -rf node_modules takes about 15 seconds.

So when I am building my Docker images, it's taking me 5 minutes every time when it could be seconds.

It seems every package manager's authors flatly ignore real-world performance.

Is there a plan to support symlinking any time soon? I feel like it would be a simple implementation: just add a --symlink flag. Where can I find the issue?

Daniel15 commented 8 years ago

the "Linking dependencies..." step takes about 5 minutes. Symlinking would take a couple of seconds.

I wonder how long hardlinking would take. Definitely longer than symlinking as you need to hardlink each individual file, but it should be faster than copying the files over while avoiding some of the disadvantages of symlinks. I think it's worth having both a hardlink and a symlink mode, both of them opt-in.

tlbdk commented 8 years ago

We could also use a symlink or hardlink feature when doing builds on our build server, as copying node modules is by far the slowest part of the build. For example, our build time drops from 3 minutes with npm install (1:45 with Yarn) to 15 seconds if we cache and symlink the node_modules folder between builds (we hash package.json to know when to invalidate the cache). A raw copy with cp takes 45 seconds.

AlicanC commented 8 years ago

Yarn initially used to use symlinks and we changed it because our internal tooling (watchman etc.) doesn't work well with symlinks.

Lack of symlink support in Watchman blocks more than Yarn: https://github.com/facebook/react-native/issues/637

I develop React Native, Browser and Electron apps and I only had problems with symlinks in React Native and that was because of Watchman.

The reason we can't have symlinking in Yarn shouldn't be Watchman or some other internal Facebook tooling. The rest of the ecosystem appears to support it well.

Symlinking should be opt-out.

Daniel15 commented 8 years ago

Hardlinks should work fine with Watchman, and any other tool, since they look identical to "regular" files. That's one reason I suggested trying hardlinks rather than symlinks.

AlicanC commented 8 years ago

Isn't it a lot slower to create hardlinks and also harder to implement? Also, should Yarn (or any other tool) really mess with people's filesystems?

If you are planning to implement both hardlinking and symlinking, starting with symlinking would be better. I think many people would opt in and either work problem-free or fix any problems caused by other tooling.

Daniel15 commented 8 years ago

Isn't it a lot slower to create hardlinks

We'd need to benchmark. It'd be slower than symlinks (as you need to hardlink each file rather than just symlinking a directory), but it should be faster than copying the files.

Also, should Yarn (or any other tool) really mess with people's filesystems?

What do you mean by this? Yarn, by definition needs to create files on the user's system.

AlicanC commented 8 years ago

Also, should Yarn (or any other tool) really mess with people's filesystems?

What do you mean by this? Yarn, by definition needs to create files on the user's system.

I say that because hardlinks always felt like some low level "approach with caution" feature rather than a regular thing to do.

AFAIK, hardlinks are indistinguishable from regular files (unless inspected with the right tool), which makes them obscure. (If both functioned the same, I'd prefer symlinks because they are obvious.)

Also, IIRC, you can't remove a source if it's still linked to a destination (which would prevent clearing the global cache) and you can't remove a destination by usual means and you have to use the hardlink tool to remove them. I hope I'm wrong with these though.

MajorBreakfast commented 8 years ago

I'd really like to see some sort of linking as this would speed up things even further. Yarn is already lightning-fast :P

I've used hardlinks in the past on macOS and Windows and AFAIK after creating a hardlink there is no obvious way to see which file was the "original". That is actually a nice feature since that means that deleting the cache is fine anytime (unlike for symlinks): The copies in the node_module/ folders would stay perfectly unharmed.

The only gotcha for linking in general is of course that modifying files in a node_modules/ folder for debugging purposes would be a big no-no.

I currently have an install time of 280 seconds. With hardlinks I'd imagine it could be down to about 25s or so.

Daniel15 commented 7 years ago

I wonder if anyone has benchmarked Yarn on a filesystem that supports deduplication, such as zfs or btrfs. In theory, deduping data at the filesystem-level should be better than anything we could do in userland.

Also, IIRC, you can't remove a source if it's still linked to a destination (which would prevent clearing the global cache) and you can't remove a destination by usual means and you have to use the hardlink tool to remove them. I hope I'm wrong with these though.

You just delete the file like normal. If anything else uses the same file, it'll still work fine.

MajorBreakfast commented 7 years ago

[...] Yarn on a filesystem that supports deduplication [...]

@Daniel15 Neither macOS nor Windows support these file systems out of the box. Hardlinks are more practical. But I get that you're talking theory :)

[...] hardlinks pretend to be the file system so well, deleting them usually deletes way more files than you're expecting [...]

@wycats I know that this was a big issue in Broccoli. I don't quite understand it, though. Does the rm -rf problem exist if you only hardlink files (not folders)?

dfreeman commented 7 years ago

[...] Yarn on a filesystem that supports deduplication [...]

@Daniel15 Neither macOS nor Windows support these file systems out of the box. Hardlinks are more practical. But I get that you're talking theory :)

In theory both Apple's and Microsoft's next-gen filesystems will be copy-on-write, which would change the landscape here considerably. APFS is allegedly due out in 2017, but it's not clear when ReFS will be ready for general-purpose use, so ¯\_(ツ)_/¯

jeffbski commented 7 years ago

I believe hard links to the files would work well as long as you are not modifying the node_modules directly.

My program https://github.com/jeffbski/pkglink can be used to set up and maintain those hard links to your node_modules packages. In addition to checking the package version, it also checks the file size and date before deciding to link a file, so modified files will not be linked. On Windows the dates are always different, so they are not used as a comparison, but file size still is.

Hard links are not implemented in ReFS AFAIK, but they work in NTFS, HFS+, and ext. I didn't try running as a Windows user without admin permissions.

So if you aren't modifying node_modules, you can use pkglink anytime.

If you are modifying your node_modules, do that first before using pkglink so that it won't link packages that have been modified.

jeffbski commented 7 years ago

Hard links are really only supported and useful for files. Mac OS X allows them on directories, but then it only allows one link, so it is really only useful for backup purposes. pkglink thus only does hard links at the file level. I have never run into any problems blowing away the node_modules directory, since hard links are reference-counted at the filesystem level; it tracks all that for you.

andretf commented 6 years ago

What is the current status of this issue? Will it end up being implemented?

evan-scott-zocdoc commented 6 years ago

pnpm takes the hardlink approach on all platforms, if prior art is desired for strategizing: https://github.com/pnpm/pnpm

dhakehurst commented 6 years ago

I vote for an option to use symlinks

MajorBreakfast commented 6 years ago

@dhakehurst Symlinking whole folders has its problems. The pnpm team explains that here. I think their reasoning applies to all node package managers. A symlink only approach thus cannot work. As a result I think pnpm uses some kind of mixture between hard- and symlinks. But I'd be perfectly happy if Yarn offered an option to simply hardlink all the files. Also, it'd be good if it made them read-only, so that you cannot modify them accidentally.

dhakehurst commented 6 years ago

Not sure I agree with that explanation (or maybe I don't understand it)

In addition to build speed and disk space usage (which can be solved with hard links), another reason for needing this is backups. I develop on a Mac. Time Machine does not allow exclusion of folders with wildcards, only full paths. I have several Node projects and I can't add them all as exclusions. Time Machine will back up every node_modules directory, even if they are hard links.

hence I want soft links

(or a way to tell Time Machine not to be so stupid)

dhakehurst commented 6 years ago

Surely, if each module@version is downloaded and unpacked to one place on the filesystem, then any module that wants to reference it can just create a soft link to the relevant directory.

i.e.

MajorBreakfast commented 6 years ago

@dhakehurst This does not work if you have two different versions of the same package in your project. This happens when dependencies have different version requirements for a common dependency.

dhakehurst commented 6 years ago

Sorry, I don't see why not. If copying files works, why does having a symlink not work? Are the downloaded files somehow modified?

vjpr commented 6 years ago

@dhakehurst It's much more complicated than you think.

Node's resolver algorithm works with real paths, not symlinked paths. Peer deps / singletons can't be resolved properly without --preserve-symlinks.

E.g.

/dev/foo/node_modules/bar -> /dev/bar
/dev/bar

# /dev/bar symlinked search paths
/dev/bar/node_modules
/dev/node_modules
/node_modules

# /dev/bar non-symlinked search paths
/dev/foo/node_modules/bar/node_modules
/dev/foo/node_modules
/dev/node_modules
/node_modules

Notice how symlinked bar cannot access foo's node_modules anymore.

dhakehurst commented 6 years ago

If the dependency is from foo to bar, i.e. foo depends on bar, then bar should not need to access foo's node modules.

ljharb commented 6 years ago

@dhakehurst deduping means that bar and foo’s common deps get hoisted; bar wouldn’t be able to access any of its own modules.

vjpr commented 6 years ago

@dhakehurst And peer deps like react.

dhakehurst commented 6 years ago

ok, understood about peer deps. I think I see the problem.

Basically the whole approach / "Node resolver algorithm" seems broken.

I guess it's because importing is done by some kind of file inclusion, rather than a properly designed module system, and every solution is a hack on top of that.

I wonder if it could be designed to work properly, or if it is a fundamental issue in node and the javascript language. I guess I need to dig into node a bit more to find out.

vjpr commented 6 years ago

a fundamental issue in node

Yes. But you can use --preserve-symlinks to get the behavior you want but that creates more problems. There are tons of long, long threads on this.

MajorBreakfast commented 6 years ago

The node module system is probably one of the most well thought-out module systems around. The ability to effortlessly get multiple (but not more than necessary) versions of the same package in one project, being able to use paths relative to the current file (within a package) and to define peer dependencies are all strong points. Besides, this issue is about Yarn. And, whether you like these features or not, they're all things that need to work with hardlinks/symlinks.

dhakehurst commented 6 years ago

use paths relative to the current file

Therein lies the problem. A good module system is not simply about importing files, and paths.

However, as you say, "this issue is about Yarn", so problems with node/javascript are out of scope, sorry to have brought it up.

As the title of this thread is about hard links rather than copy, If hardlinks would work, then so would softlinks/symlinks. Otherwise hardlinks have the same peer dependency problem as described by @vjpr

jpeg729 commented 6 years ago

There seems to be a lot of confusion about what hardlinks actually are. Most filesystems store files in two parts: the filename, which points to the storage location, and the actual data. A symlink is a special filename that tells you to go look for another file. A hardlink is a second (or third, or ...) filename that points to the same data location.

Therefore hardlinked files do not suffer the same problems as symlinks, because they truly look like copies of the original files.

Also, assuming I only hardlink files and not directories, then if I do rm -rf node_modules, my system will delete the filename my-hardlink, but then it will notice that the underlying data storage is still referenced by yarn-cache/original-file and it won't delete the original file.

Basically, unless you are examining inode numbers, hardlinks look exactly like files copied from the originals, but they share the same storage location as the original files. So we will need to warn people not to modify the contents of their node_modules directories.

Another potential problem is that on Linux you can't make a hardlink across filesystem boundaries. I don't know about Windows or macOS. So we would need to fall back on true copying when hardlinking doesn't work.

Until something like this is implemented, I am going with the following approach...

hardlink -t -x '.*' -i '^(.*/node_modules/|/home/user/.cache/yarn/v1/npm-)' ~/.cache/yarn ~/code

Where ~/code is the directory I store all my projects in.

ljharb commented 6 years ago

@jpeg729 one problem that causes, though, is that you're supposed to be able to edit any file inside node_modules and see that change when you run your program - and if you have two places in node_modules that point to the same data location, editing one will end up editing the other, which might not be desired.

KSXGitHub commented 6 years ago

@ljharb

  1. You are not supposed to manually edit files in node_modules.

  2. We can have that feature configurable, so you can turn it off if you really want to edit node_modules.

ljharb commented 6 years ago

@KSXGitHub "you are not supposed to" - where does that rule come from? It's always been both possible, and something node and npm explicitly support.

As for being configurable, the problem is that users aren't going to know that this normal node ecosystem behavior behaves differently, and they could end up silently getting surprising behavior.

Pauan commented 6 years ago

As for being configurable, the problem is that users aren't going to know that this normal node ecosystem behavior behaves differently, and they could end up silently getting surprising behavior.

If the default is to not use hard-links, and the user has to manually enable it, then that's not a problem: they know they're using weird yarn-specific behavior.