Daniel15 closed this issue 3 years ago.
Yarn initially used to use symlinks and we changed it because our internal tooling (watchman etc.) doesn't work well with symlinks. Are hardlinks different in this case? If yes, that might be worth doing.
I think the initial release should continue to use the copy approach; it is more consistent with the rest of the ecosystem and we should evaluate this behavior for a future major release.
Upon further thought, another issue that might come up is that people may modify their local node_modules for debugging or testing purposes, without expecting that they're actually modifying the module linked everywhere else. I don't know how often others do this, but I've definitely done it (though rarely) in the past. Apart from that, hardlinks seem to make sense. I'd guess the tooling would be fine, since a hardlinked file should behave the same as any other file.
The primary issue this was intended to address was the cache causing issues with hardcoded paths that result from building packages (#480).
Are hardlinks different in this case?
Not sure, might be worth asking @wez whether Watchman can handle hardlinks.
another issue that might come up is that people may try to modify their local node_modules for local debugging purposes or testing purposes, and not expect that they're actually modifying the node module linked to everywhere else
I think this is the use case for npm link or Yarn's equivalent though, right? You're never supposed to directly modify files in node_modules. We could have an option to make a local copy if people want to do this though.
I totally agree with @dxu and actually wanted to write the same thing. I do this often: I manually add some debugging code into a random node_module (that I don't have checked out locally). Once I'm done, I wipe it away and do npm install. It would be a mental change for me to remember it would also affect other projects.
Yeah, that's a use case I didn't really think about... Hmm...
Oh well, we can still hold on to this idea. Maybe it could be an optional configuration setting for people that don't directly edit node_modules and would like Yarn to run faster (less copying of the same data = less disk IO = stuff is faster and disk cache efficiency is improved) 😄
Going to close this since we decided long ago to go away from symlinks. It's required for compatibility with the existing ecosystem, as even projects like ESLint rely on this directory structure to load rules etc. There are also a lot of problems with existing tooling not supporting them. For example, when Yarn initially used Jest, it would fail and produce extremely long paths. Jest is much better now and the bug is likely fixed, but small issues like this exist in a lot of tools.
Sebastian, this task is for hardlinks not symlinks. Hardlinks shouldn't have any of the problems you mentioned.
(Sent by email on Oct 5, 2016, in reply to Sebastian McKenzie's comment above.)
Hardlinks have the exact same problems and are semantically the same in this scenario. Why do you think they don't have any of the same issues?
@kittens I haven't really tested hardlinks. But once you hardlink a file, in theory, from the filesystem's perspective it should be exactly the same as the original file: you can remove the original file and the hardlinked file will still work.
This is different from symlinks, whose content is just a pointer to the original file.
You can have cycles though which is extremely problematic if tools aren't designed to handle them (most JavaScript tools aren't, and how would they?). Hardlinks and symlinks on Windows both require admin privileges (NTFS junctions don't but they're more synonymous with symlinks) which is a non-starter for a lot of environments.
Good point about Windows. We could have platform-specific logic maybe, if we decide to go down this path.
How do you create a cycle with hardlinks? Note that there are no hardlinks for directories.
Going to reopen this for tracking purposes. It should be doable as hardlinked files look identical to the file system. I might prototype it.
@Daniel15 one thing to keep in mind is that since hardlinks pretend to be the file system so well, deleting them usually deletes way more files than you're expecting. Since rm -rf node_modules is a common pattern, I'd want us to have some mitigation for that likelihood before unleashing this into the wild (even on an opt-in basis).
I remember unexpected deletions hitting users of n back in the day and it has left a permanent scar 😛 (not directly analogous, but it gave me serious fear about giving people rope that could cause mass deletions of shared files)
I remember unexpected deletions hitting users of n back in the day and it has left a permanent scar 😛
Good point, I remember Steam on Linux accidentally running rm -rf /* too: http://www.pcworld.com/article/2871653/scary-steam-for-linux-bug-erases-all-the-personal-files-on-your-pc.html
Maybe we need a safer "clean" function rather than just doing rm -rf node_modules.
Are issues with hard links and rm -rf node_modules actually possible? While you can create symlinks to directories, you can't create hard links to them*, so you shouldn't be able to recurse into some global directory while running rm -rf.
* On macOS you can, but you shouldn't.
Symlinking to a global cache is essential. The copying approach is very slow for large projects (which I would argue are very common), and extremely slow on VMs, and very slow on Windows, and insanely slow on a virtualized Windows VM running on a macOS host in Parallels/VMWare.
I have a relatively simple frontend/backend project, and the node_modules is 270K files and about 300 MB.
With a warm global cache, the "Linking dependencies..." step takes about 5 minutes. Symlinking would take a couple of seconds.
rm -rf node_modules takes about 15 seconds.
So when I am building my Docker images, it's taking me 5 minutes every time when it could be seconds.
It seems every package manager's authors flatly ignore real-world performance.
Is there a plan to support symlinking any time soon? I feel like it would be a simple implementation: just add a --symlink flag. Where can I find the issue?
With a warm global cache, the "Linking dependencies..." step takes about 5 minutes. Symlinking would take a couple of seconds.
I wonder how long hardlinking would take. Definitely longer than symlinking as you need to hardlink each individual file, but it should be faster than copying the files over while avoiding some of the disadvantages of symlinks. I think it's worth having both a hardlink and a symlink mode, both of them opt-in.
We could also use a symlink or hardlink feature when doing builds on our build server, as copying node modules is by far the slowest part of the build. For example, our build time drops from 3 minutes with npm install (1:45 with Yarn) to 15 seconds if we cache and symlink the node_modules folder between builds (we hash the package.json to know when to invalidate the cache). A raw copy with cp takes 45 seconds.
Yarn initially used to use symlinks and we changed it because our internal tooling (watchman etc.) doesn't work well with symlinks.
Lack of symlink support in Watchman blocks more than Yarn: https://github.com/facebook/react-native/issues/637
I develop React Native, Browser and Electron apps and I only had problems with symlinks in React Native and that was because of Watchman.
The reason we can't have symlinking in Yarn shouldn't be Watchman or some other internal Facebook tooling. The rest of the ecosystem appears to support it well.
Symlinking should be opt-out.
Hardlinks should work fine with Watchman, and any other tool, since they look identical to "regular" files. That's one reason I suggested trying hardlinks rather than symlinks.
Isn't it a lot slower to create hardlinks and also harder to implement? Also, should Yarn (or any other tool) really mess with people's filesystems?
If you are planning to implement both hardlinking and symlinking, starting with symlinking would be better. I think many people would opt in and work problem-free, or fix any problems caused by other tooling.
Isn't it a lot slower to create hardlinks
We'd need to benchmark. It'd be slower than symlinks (as you need to hardlink each file rather than just symlinking a directory), but it should be faster than copying the files.
Also, should Yarn (or any other tool) really mess with people's filesystems?
What do you mean by this? Yarn, by definition, needs to create files on the user's system.
Also, should Yarn (or any other tool) really mess with people's filesystems?
What do you mean by this? Yarn, by definition, needs to create files on the user's system.
I say that because hardlinks always felt like some low-level "approach with caution" feature rather than a regular thing to do.
AFAIK, hardlinks are indistinguishable from regular files (unless inspected with the right tool), which makes them obscure. (If both functioned the same, I'd prefer symlinks because they are obvious.)
Also, IIRC, you can't remove the source if it's still linked to a destination (which would prevent clearing the global cache), and you can't remove the destination by usual means; you have to use a hardlink tool to remove them. I hope I'm wrong about these, though.
I'd really like to see some sort of linking as this would speed up things even further. Yarn is already lightning-fast :P
I've used hardlinks in the past on macOS and Windows, and AFAIK after creating a hardlink there is no obvious way to see which file was the "original". That is actually a nice feature, since it means that deleting the cache is fine anytime (unlike for symlinks): the copies in the node_modules/ folders would stay perfectly unharmed.
The only gotcha for linking in general is of course that modifying files in a node_modules/ folder for debugging purposes would be a big no-no.
I currently have an install time of 280 seconds. With hardlinks I'd imagine it could be down to about 25s or so.
I wonder if anyone has benchmarked Yarn on a filesystem that supports deduplication, such as zfs or btrfs. In theory, deduping data at the filesystem-level should be better than anything we could do in userland.
Also, IIRC, you can't remove the source if it's still linked to a destination (which would prevent clearing the global cache), and you can't remove the destination by usual means; you have to use a hardlink tool to remove them. I hope I'm wrong about these, though.
You just delete the file like normal. If anything else uses the same file, it'll still work fine.
[...] Yarn on a filesystem that supports deduplication [...]
@Daniel15 Neither macOS nor Windows support these file systems out of the box. Hardlinks are more practical. But I get that you're talking theory :)
[...] hardlinks pretend to be the file system so well, deleting them usually deletes way more files than you're expecting [...]
@wycats I know that this was a big issue in Broccoli. I don't quite understand it, though. Does the rm -rf problem exist if you only hardlink files (not folders)?
[...] Yarn on a filesystem that supports deduplication [...]
@Daniel15 Neither macOS nor Windows support these file systems out of the box. Hardlinks are more practical. But I get that you're talking theory :)
In theory both Apple's and Microsoft's next-gen filesystems will be copy-on-write, which would change the landscape here considerably. APFS is allegedly due out in 2017, but it's not clear when ReFS will be ready for general-purpose use, so ¯\_(ツ)_/¯
I believe hard links to the files would work well as long as you are not modifying the node_modules directly.
My program https://github.com/jeffbski/pkglink can be used to set up and maintain those hard links to your node_modules packages. In addition to checking the package version, it also checks the file size and date before deciding to link a file, so modified files will not be linked. On Windows the dates are always different, so they are not used as a comparison, but file size still is.
Hard links are not implemented in ReFS AFAIK, but they work in NTFS, HFS+, and ext. I didn't try running as a Windows user without admin permissions.
So if you aren't modifying node_modules, you can use pkglink anytime.
If you are modifying your node_modules, do that first before using pkglink so that it won't link packages that have been modified.
Hard links are really only supported and useful for files. Mac OS X allows them on directories, but then it only allows one link, so that's really only useful for backup purposes. pkglink thus only does hard links at the file level. I have never run into any problems blowing away the node_modules directory; since hard links are reference-counted at the file system level, it tracks all that for you.
What is the current status of this issue? Will it be implemented in the end?
pnpm takes the hardlink approach on all platforms, if prior art is desired for strategizing: https://github.com/pnpm/pnpm
I vote for an option to use symlinks
@dhakehurst Symlinking whole folders has its problems. The pnpm team explains that here. I think their reasoning applies to all node package managers. A symlink only approach thus cannot work. As a result I think pnpm uses some kind of mixture between hard- and symlinks. But I'd be perfectly happy if Yarn offered an option to simply hardlink all the files. Also, it'd be good if it made them read-only, so that you cannot modify them accidentally.
Not sure I agree with that explanation (or maybe I don't understand it)
In addition to speed for builds and disk space usage (which can be solved with hard links), another reason for needing this is backups. I develop on a Mac, and Time Machine does not allow exclusion of folders with wildcards, only full paths. I have several Node projects and can't add them all as exclusions, so Time Machine will back up every node_modules directory, even if they are hard links. Hence I want soft links (or a way to tell Time Machine not to be so stupid).
Surely, if each module@version is downloaded and unpacked to one place on the filesystem, then any module that wants to reference it just creates a soft link to the relevant directory.
i.e.
@dhakehurst This does not work if you have two different versions of the same package in your project. This happens when dependencies have different version requirements for a common dependency.
Sorry, I don't see why not. If copying files works, why does having a symlink not work? Are the files that are downloaded somehow modified?
@dhakehurst It's much more complicated than you think.
The Node resolver algorithm works with realpaths, not symlinked paths. Peer deps / singletons can't be resolved properly without --preserve-symlinks.
E.g.
/dev/foo/node_modules/bar -> /dev/bar
/dev/bar
# /dev/bar symlinked search paths
/dev/bar/node_modules
/dev/node_modules
/node_modules
# /dev/bar non-symlinked search paths
/dev/foo/node_modules/bar/node_modules
/dev/foo/node_modules
/dev/node_modules
/node_modules
Notice how symlinked bar cannot access foo's node_modules anymore.
if the dependency is from foo to bar, i.e. foo depends on bar, then bar should not need to access foo's node modules.
@dhakehurst deduping means that bar and foo’s common deps get hoisted; bar wouldn’t be able to access any of its own modules.
@dhakehurst And peer deps like react.
ok, understood about peer deps. I think I see the problem.
basically the whole approach/"Node resolver algorithm" seems broken.
I guess it's because importing is done by some kind of file inclusion, rather than a properly designed module system, and every solution is a hack on top of that.
I wonder if it could be designed to work properly, or if it is a fundamental issue in node and the javascript language. I guess I need to dig into node a bit more to find out.
a fundamental issue in node
Yes. But you can use --preserve-symlinks to get the behavior you want, though that creates more problems. There are tons of long, long threads on this.
The node module system is probably one of the most well thought-out module systems around. The ability to effortlessly get multiple (but not more than necessary) versions of the same package in one project, being able to use paths relative to the current file (within a package) and to define peer dependencies are all strong points. Besides, this issue is about Yarn. And, whether you like these features or not, they're all things that need to work with hardlinks/symlinks.
use paths relative to the current file
Therein lies the problem. A good module system is not simply about importing files and paths.
However, as you say, "this issue is about Yarn", so problems with node/javascript are out of scope, sorry to have brought it up.
As the title of this thread is about hard links rather than copying: if hardlinks would work, then so would softlinks/symlinks. Otherwise hardlinks have the same peer-dependency problem as described by @vjpr.
There seems to be a lot of confusion about what hardlinks actually are. Most filesystems store files in two parts: the filename, which points to the storage location, and the actual data. A symlink is a special filename that tells you to go look for another file. A hardlink is a second (or third, or ...) filename that points to the same data location.
Therefore hardlinked files do not suffer the same problems as symlinks, because they truly look like copies of the original files.
Also, assuming I only hardlink files and not directories, then if I do rm -rf node_modules, my system will delete the filename my-hardlink, but it will notice that the underlying data storage is still referenced by yarn-cache/original-file and it won't delete the original file.
Basically, unless you are examining inode numbers, hardlinks look exactly like files copied from the originals, but they share the same storage location as the original files. So we will need to warn people not to modify the contents of their node_modules directories.
Another potential problem is that on Linux you can't make a hardlink across filesystem boundaries; I don't know about Windows or macOS. So we would need to fall back on true copying when hardlinking doesn't work.
Until something like this is implemented, I am going with the following approach:
hardlink -t -x '.*' -i '^(.*/node_modules/|/home/user/.cache/yarn/v1/npm-)' ~/.cache/yarn ~/code
where ~/code is the directory I store all my projects in.
@jpeg729 one problem that causes, though, is that you're supposed to be able to edit any file inside node_modules and see that change when you run your program; if you have two places in node_modules that point to the same data location, editing one will end up editing the other, which might not be desired.
@ljharb You are not supposed to manually edit files in node_modules.
We could make that feature configurable, so you can turn it off if you really want to edit node_modules.
@KSXGitHub "you are not supposed to": where does that rule come from? It's always been both possible and something node and npm explicitly support.
As for being configurable, the problem is that users aren't going to know that this normal node ecosystem behavior behaves differently, and they could end up silently getting surprising behavior.
As for being configurable, the problem is that users aren't going to know that this normal node ecosystem behavior behaves differently, and they could end up silently getting surprising behavior.
If the default is to not use hard-links, and the user has to manually enable it, then that's not a problem: they know they're using weird yarn-specific behavior.
This was touched on in a comment on #480, but I thought it's worth pulling into its own separate issue.
Currently, each app that uses Yarn (or npm) has its own node_modules directory with its own copies of all the modules. This results in a lot of duplicate files across the filesystem. If I have 10 sites that use the same version of Jest or React or Lodash or whatever else you want to install from npm, why do I need 10 identical copies of that package's contents on my system?
We should instead consider extracting packages into a central location (e.g. ~/.yarn/cache) and hardlinking them. Note that this would be a hardlink rather than a symlink, so that deleting the cache directory does not break the packages.