Looking for brilliant yarn member who has first-hand knowledge of prior issues with symlinking modules

yarnpkg / yarn

The 1.x line is frozen - features and bugfixes now happen on https://github.com/yarnpkg/berry

https://classic.yarnpkg.com

Other

41.44k stars 2.73k forks source link

Looking for brilliant yarn member who has first-hand knowledge of prior issues with symlinking modules #1761

Closed ghost closed 8 years ago

ghost commented 8 years ago

I was recently informed by @ thejameskyle that:

Before being open sourced, Yarn used to do symlinking. However it broke far too much of the ecosystem to be considered a valid option.

If you are a yarn member, who has first hand knowledge of what actually broke, and can technically explain why, I implore you to respond to this issue.

I believe I have found a solution, and my initial experiments indicate it is entirely viable. However, like thejameskyle, the only responses I seem to get as to why it wouldn't work at all, let alone on a broad scale, are entirely ancetodal.

Please, I'm desperate to find someone who knows what they are talking about, that has the technical acumen and understanding of node module resolution, as well as first hand knowledge of the issues yarn encountered, who can repudiate my solution by technical means of cause and effect, by first explaining what issues yarn encountered in its earlier attempts to exploit symlinking.

With my solution, one can have independent physical dependency tree renderings, were the specific module@versions used within a given tree remain specific to that particular physical tree, and can be locked, but where all modules across all physical trees on a given machine can all be symlinked to a single machine wide copy centrally stored, even when within each physical tree a common module@version is used in several trees, yet still resolves it's specific dependency versions based on the physical tree it is used in, and where those dependencies versions are slightly different between trees for whatever reason (but still within the semver range spec in package.json).

You're probably the right person if a) you understand exactly what I just wrote, and b) believe it's impossible.

Having this ability would mean modules no longer need to be copied all over the place, saving gigabytes of storage, and once centrally stored, all 'installs' could symlink all modules (but wouldn't have to), reducing install times from minutes to seconds.

ghost commented 8 years ago

@kittens Would you be a possible candidate to speak to this issue? Or know a more appropriate person?

sebmck commented 8 years ago

I've talked about this briefly in other places but haven't really given comprehensive reasoning as to the difficulties we faced with it. The truth is that the existing ecosystem does not cooperate very well when you start using symlinks.

Operating system differences

Symlinks are supported differently on various operating systems. In Windows they aren't allowed unless you're an administrator for example. You can however use NTFS junctions that operate in a similar way but with the following restrictions:

Only work on NTFS partitions (as the name implies).
They can only refer to folders on the same drive.
They can only operate on folders (this is fine for Yarn since we only need junctions for folders).
They don't operate across network shares.

If we want seamless Windows support then we'd need to impose restrictions on the development environment of Yarn users and when the existing alternatives don't have these same restrictions it's hard to justify.

Alternatively we could support both symlinks/junctions and the current flat version but one of the big motivators behind Yarn is determinism and having different ways of representing the files on disk that are distinguishable from one another goes against this. It'd also lead to an explosion on support since we'd be forking the workflow and internals to support symlink resolution. (In fact somewhere in the git history you'd find we once supported both of these installation methods)

Tooling not supporting file system cycles

Tooling such as Jest would run into weird recursion errors when crawling the file system since symlinks allow cycles to appear. ie. a nested directory referencing another included in it's heirarchy. Jest is a lot better now and it probably fixed but this is a common problem that existing tools don't take into consideration.

Poor support for file watching

File watching across operating system is already a massive issue with their being a lot of inconsistencies and problems with normal files and folders. This issue is even more exaseparted when you take symlinks into consideration. Tools such as watchman don't support symlinks for specific.

Tooling relying on `node_modules` hierarchy

Tools like ESLint rely on the hierarchy of node_modules to use plugins. For example eslint-plugin-foobar needs to be accessible in the hierarchy of eslint to be accessible and includable in your projects. Node resolves symlinks which means that their absolute path (unsymlinked) is used which breaks this assumption that ESLint makes. A lot of tools use this sort of resolution to work around particular issues with module loading and path issues.

To summarise the advantages of symlinks:

Slightly faster package installation.
Less disk usage

Are outweighed by the significant disadvantages of:

Operating system differences
Making Yarn less deterministic by adding multiple modes of install
File watching incompatibility
Poor existing tool compatibility
Tool recursion errors due to cycles
etc...

I'm going to leave this issue open if there's any rebuttals to the points I've made and for future discussion.

Daniel15 commented 8 years ago

Thanks for the wonderful comment, @kittens! ❤️

I'm going to leave this issue open if there's any rebuttals to the points I've made

I'm just passing through and don't know anywhere near as much as @kittens does with regards to the topic of symlinks and their usage here, but thought I'd add some quick points.

In the end, I think a filesystem that supports copy on write will make everything better. With a file system that uses copy on write, copies are "lazy". This means that creating a copy of a file doesn't actually copy the data, it simply makes the new file a pointer to the old one, similar to what you'd get if you hardlinked the file. A true copy (allocating space on disk for the file, and actually copying the bytes across) is only performed when you modify the file. The end result is that you get all the benefits of symlinking (faster installations, less disk usage) with none of the disadvantages (modifying a copy doesn't modify the original, and tooling just sees everything as regular files, since they are regular files).

On Linux, you can use btrfs or zfs for this. Even with Yarn in its current state today, using btrfs should give you a nice performance boost due to the CoW semantics. However, neither Mac OS nor Windows have a good copy-on-write filesystem today. In #499, @dfreeman said:

In theory both Apple and Microsoft's next-gen filesystems will be copy-on-write, which would change the landscape here considerably. APFS is allegedly due out in 2017, but it's not clear when ReFS will be ready for general-purpose use, so ¯_(ツ)_/¯

Hardlinks solve some of the drawbacks that symlinks have, but they also have their own issues.

Positive (advantages of hardlinks over symlinks):

Hardlinks don't have an issue with cycles, as you can only hardlink files
Hardlinks still give you the same disk space savings you'd get with symlinks, as there's just a single copy of the files
Tooling doesn't need to know anything about hardlinks, as a hardlink looks exactly the same as a regular file. No need to treat it specially

Negative (or negative-ish):

Hardlinking a lot of files is slower than simply symlinking the directory (but it should still be faster than copying the files)
Hardlinks still have one of the "problems" symlinks have: Editing a file in one project will actually modify it everywhere it's used. That can be pretty confusing. This can be solved by using a copy-on-write filesystem, neither Windows nor Mac OS support any COW filesystems so that's limited to Linux
By default, hardlinks require admin permissions on Windows :(

We have a backlog item to investigate hardlinking (#499), nobody's actively working on that right now though.

Alternatively we could support both symlinks/junctions and the current flat version but one of the big motivators behind Yarn is determinism and having different ways of representing the files on disk that are distinguishable from one another goes against this.

This is a great point! The fewer possible combinations there are, the easier debugging becomes.

ghost commented 8 years ago

Sebastian, Thank you kindly for taking the time to explain your experience with symlinking. It's greatly appreciated.

OS File System Linking Differences

Regarding file system linking, I'm presuming yarn did something similar to how ied works, in that folders lower in the physical dependency tree were linked to top level folders under project/node_modules where the modules physically existed?

If true, and given that FAT32 filesystems are also prevented from storing links when mounted to 'nixs, the only difference in linking behavior between the OS's that node supports, that would have a material impact in how yarn was using links, is that in Windows, folder links can't reference network volumes (junctions can cross local volumes, but not to network shares; see here). Would it therefore be fair to state that in the context of how node resolves module loading, all the other enumerated differences in linking behavior, while accurate, would have no relevance and could be taken out of scope?

Assuming the singular difference in linking behavior between OS's relevant to yarns use of linking, in your experience how many developers were you aware of, either first or second hand, that were attempting to run yarn in such a way that the folder from which it was ran against, containing the project's package.json, was somehow expected to have links to module folders that existed on a network share? And how was yarn configured to know to deploy the physical tree in that way?

Given the one difference that would have had some impact, when you say:

If we want seamless Windows support then we'd need to impose restrictions on the development environment of Yarn users and when the existing alternatives don't have these same restrictions it's hard to justify.

..is my understanding correct then that the only imposed restriction would have been, on non Windows systems, yarn will not create local links to network shares to modules, ensuring that is behaved identically across all OS's?

If that were something never required; i.e. a developer having local links to network shares for modules within a project being developed, would it be fair to say that, practically speaking, there really wouldn't be any issues that needed justification of restrictions to account for the difference in OS linking behavior in order to seamlessly support Windows?

I ask only because the way in which I'm proposing to use symlinks, while not at all like how ied or yarn is able to, might be impacted if lots of developers were needing, for some reason, to have the their project/node_modules folders contain local links to network module folders. I would be shocked if that were ever the case, but you've clearly had a great deal more experience and exposure to the ecosystem. If true, could you elaborate on the purpose these developers where needing that for?

This was under Operating System Differences, but I couldn't understand how it was related to OSs?

Alternatively we could support both symlinks/junctions and the current flat version but one of the big motivators behind Yarn is determinism and having different ways of representing the files on disk that are distinguishable from one another goes against this. It'd also lead to an explosion on support since we'd be forking the workflow and internals to support symlink resolution. (In fact somewhere in the git history you'd find we once supported both of these installation methods)

When you say having different ways of representing the files on disk that are distinguishable in the context of determinism, are you implying that when yarn used symlinks, it used a different folder organization than when no symlinks where used, or rather that the folder structure's were identical but the physical location of the files in the folders was some how different between the two? Or more specifically, where is your physical boundary separating what must be physically deterministic and what is irrelevant?

I ask because I can't see a practical difference with respect to how node would deterministically walk the folder structure that organizationally didn't change, regardless if the folder was a physical copy of a module folder vs. a symlink to a module folder; the contents and path would be identical from node's perspective (assuming launched with --preserve-symlinks)

The way in which I'm intending to use symlinks would not bind the folder structure in any way to any given folder's physical representation. I.e. regardless if any particular module folder was symlinked or a copy, and even if that characteristic of the folder varied from time to time, node would always deterministically walk the structure in exactly the same way, and see exactly the same content on every walk (to the degree node itself was looking for modules deterministically). If that were the case, would that qualify as deterministic enough to meet yarns guarantees of determinism?

Tooling not supporting file system cycles

If it could be guaranteed that it was physically impossible for cycles to exist in the organization of the dependency tree, as represented by the physical folder structure, even though every single module folder was a symlink, would the issue of tooling not supporting file system cycles be irrelevant and could be taken out of scope?

Poor support for file watching

While watching for changes of a project's source files is a common place thing, like how yarn itself is shipped to have gulp watch for changes to .js to then compile with babel, can you elaborate on the use cases you encountered that required watching files that lived somewhere within the project/node_folders hierarchy? I would be shocked if this ever was required, and even more so if it was a common need, but again you have a great deal more experience and exposure to the ecosystem, and your specific experiences with this use case would be immensely valuable.

Tooling relying on `node_modules` hierarchy

Node has a --preserve-symlinks switch to address such things, although it does not preserve the symlink of the entry.js arg passed on its command line. I have been told by nodejs members they would prefer the default be that all symlinks always be preserved, but there are currently some edge cases that have given them pause to change the default behaviour of resolving symlinks to their realpath.

My intended use of symlinks would require they be preserved in all cases. The only issue that's been brought to my attention regarding this behavior (and in no way am I implying it's the only one), is that if different symlinks pointed to the same physical addon.node file, node would crash as a consequence of how it loads binary dynamic link libraries and associates them to node module instances. First, the way in which I intend to use symlinks would never allow for such a state of the dependency tree to occur, but if in the remote chance there was a reason such a state would be required, the package manager would know and simply make enough copies of the addon module, linking to each copy individually, so that the OS would have no idea they were logically identical.

If symlinks were always preserved, would the issue of _tooling relying on node_modules hieararchy_ be effectively mitigated and for practical purposes could be taken out of scope?

Advantages of a different way of symlinking

I'm going to make a very bold claim:

I know of a way to reduce the total overall module storage on a given developer's machine by at least 10x's (50GB to 5GB) or more, and install times by at least 50x's (50 sec to 1 sec) or more

This is fundamentally achieved by only ever storing a single physical copy of any given module to a central machine wide location, then symlinking to it from where ever it's used anywhere on the machine. Edge cases effected by this change may be less than one half of one percent of the ecosystem, although this still needs to be empirically proven.

To summarize, assuming best case answers to above questions

Potential Advantages:

10x's or more reduction of total space required to store modules on a given machine
50x's or more reduction in install times (after first install)

Potential Disadvantages:

~~Operating system differences~~
Difference in support of links to network shares
- (It's atypical a project/node_modules folder would require this)
~~Making Yarn less deterministic by adding multiple modes of install~~
- (The solution would still allow yarn to render deterministically regardless if using symlinks or not)
~~Tooling not supporting file system cycles~~
- (Solution would guarantee this would be physically impossible)
~~Poor support for file watching~~
- (It's atypical files within a project/node_modules folder would require watching during development)
~~Tooling relying on node_modules hiearachy~~
- (Solution requires symlinks be preserved in all cases, and addresses known issues)
There may be breaking edge cases, but potentially within a minimal acceptible spectrum, and the proposed solution would be opt-in on a case by case basis to effectively mitigate while offering a path to migration

For the moment, if the above were in any way possible, would you at least say continued work on developing a POC to empirically demonstrate would be worth perusing?

I know it sounds almost too good to be true, but I want to find out, and I'm already working on the POC of this. I have done some initial experiments with tweaks to node (yes node, about 10 lines of relevant code) and npm to at least show me it's feasible, and that success is spurring me on, but I prefer to continue the package manager side of the POC with yarn.

Are you interested in the solution? In the end, it's unbelievably simple, and comes down to using a ., in addition to a / at just the right time and place!!

Daniel15 commented 8 years ago

@phestermcs - What do you think of the idea of using a copy-on-write filesystem to achieve the disk space reduction? That's something I've been thinking about a bit as well (as per my comment above)

ghost commented 8 years ago

@Daniel15 What do I think? I think any solution that fixes this !@$# problem needs to be implemented IMMEDIATELY!!! hahaha.

One of the great things of node is how it normalizes the OS to irrelevancy. If an OS effectively supports some COW solution, that is unknown to node, and if you can use it you should.

I have machines with all three major OS's (and others), and I've done lots of both OS agnostic and OS specific development on all of them. But I prefer Windows. I could of course use a VM to run linux within, but I like speed... like, alot! (hence this issue).

ghost commented 8 years ago

@Daniel15 I've spent a little time pondering you comments.

I think using hard links could be a decent option. I was so locked on to linking the whole folder (but never it's node_modules subfolder) as the absolute fastest way, I missed the hardlink approach entirely (blush)! In fact, I'm going to explore that a little more, as you're correct that it should still save just about the same amount space (by there being only one physical copy on the machine), and should still be faster than copies (my first gating test will be to take that measure). Hardlinks also have only one realpath so there would be no issue with node having to "preserve" the link as with symlinks, which is good. It also has similar constraints to linking a folder (around lifecycle events), that I had to address while researching that approach, so I think I have a tad head start on it. And my cross off's of the disadvantages @kittens raised would still be crossed off.

My approach requires a tiny change to node Modules, which is locked, and I already know faces resistance, so trying with hardlinks is more likely to sooner get into the hands of those who want the benefit (assuming it's significant enough and works with out breaking too much). Hardlinks do have the constraint they can only link to files on the same volume, but at least that's the same across OS's (speaking to @kittens concern of seamless Windows support). Assuming the benefit is significant, I'd take the trade.

Windows 8+ no longer requires a user have administrative rights to make hard links. So while that story is a little better, it's still not the best on Windows NT < 8. However, its the more common case developers are administrators of their machines, even in many enterprise IT departments, so still a pretty good story. And while creating hard links does require administrative rights on Win <8, it does not require running with elevated permissions on any Win version, contrary to creating symbolic links. So if a user has admin rights, yarn could link without having to run elevated, which means it's as user friendly as it ever was.

The way I was thinking of using symlinks, would have come with the constraint that module folders where always read only (I see them logically like *.so's or *.dll's that shouldn't be altered regardless of where used), and I would still apply that same constraint when using hardlinks.

Which leads to one of the changes that could create issues with linking to a global location, and the application of the read-only constraint. That is that the preinstall, install, and postinstall lifecycle events would only run once, when a module was first "installed" to the global location, while the build event would still need to run on every link-install for a given project (which all worked in my proto, because 'build' puts things to local folders). When I prototyped using npm, just to see if it would work at all (by making copies of projects, tweaking each slightly to have different version specs, then installing to ensure each still had their specific dependency tree), I did not find modules that altered their local install from one place to the next, based on their surroundings or specific versions of their dependencies that were installed locally with them. However, I was not at all looking super hard for that, and this is definitely something that could be a problem.

If that issue does show itself to be a problem, only if even marginally, it would obviously need to be addressed. The first way is that you just don't use linking on a particular project; that fallback will always be available (although some might use the issue as justification to just not even provide a linking feature to begin with, as they'd fear it would confuse noobs (which frankly are already usually confused by things, so what's a little more confusion? giggle)). Another approach is that packages could flag themselves as linkable, and then yarn could ignore or honor the flag based on a switch (and I can hear that same someone complain about more switches, to whom I'd say "maybe you should get into a different line of work, because this line is based on billions of switches all working together in imperfect harmony!" j/k).

Regarding COW, I'm just starting to dig into yarn code, but at first glance it appears to first unpack a module into a global cache if not already there, then copy that into a given install location. So COW might work without having to change anything in yarn. (This would be different with npm which always unpacks a tar into the install location, and that would have to change to leverage COW). But like you pointed out it's not available on all OS's, so not much to move forward with using that approach.

I'm going to run some tests to measure performance of hardlinking thousands of files, to see just how much faster it is than copying. If it's at least 4 or 5 times, I'm going to wrench yarn enough to actually use hardlinks so can start POC'ing actually using with projects (but not going to worry about things like if yarn running concurrently, and things like that; later if the POC shows further promise). yarn should still be able to guarantee determinism in the same way it currently does, if the guarantee is really about just the folder structure and hierarchy (which I would opinion is all that's actually required to prove node would be deterministic in its walk of that structure when loading modules, without actually running node)

I will say it's a quite a bit nicer working in yarn as it's better organized and using async/await; npm was really starting to show it's age, as I would at times start looking around for a jar of Ragu sauce. So thanks @kittens!

ghost commented 8 years ago

Finally took some initial measurements using react repo ~ 21k files. Ran the tests 3 times each, so here's their ballparks.

SSD "Install" (482MBs seq, 30k iops rand)

yarn lnkdep	hardlink	adj-nm & symlink
70 sec	28 sec	2 sec

SSD Delete node_modules

yarn lnkdep	hardlink	adj-nm & symlink
35 sec	35 sec	< 1 sec

7200RPM HD "Install"

yarn lnkdep	hardlink	adj-nm & symlink
118 sec	30 sec	5 sec

So hardlinking can be twice as fast on SSD's, and up to 4x's as fast on HD. It's interesting it takes about the same time regardless if SSD or HD.

But clearly, --adjacent-node-modules with symlinking is way faster

Pauan commented 8 years ago

@phestermcs Thank you for pushing for this. In my opinion, slow installation is the current biggest problem with yarn (though it's still faster than npm).

It makes sense to me that hardlinks would be the same performance on both SSD and HDD, because they both do the same amount of work: they simply increment the ref count, but don't do any actual copying.

Symlinks are faster than hardlinks because you only need to create a single symlink, rather than creating multiple directories and multiple hardlinks.

Although I am in favor of hard/soft links, there is one downside: if somebody modifies a file in node_modules, it will affect other (completely unrelated) packages.

Personally, I think people should not be doing that, instead they should use local packages if they want to make file modifications. So I see it as more of an education/documentation problem.

ljharb commented 8 years ago

@Pauan yet, the ability to do that is a long-standing important part of node, and since node's require only cares about what's on disk, i should always be able to edit a file on disk and have the changes show up appropriately.

ghost commented 8 years ago

I wouldn't necessarily characterize slow install times as a problem with any package manager in particular, as it's primarily a consequence of a constraint in node that no package manager can get around (which Is why I've submitted a PR to change, and is what the "adj-nm" refers to).

I'm still surprised creating hardlinks took the same time between the SSD and HD, because in both cases the OS still had to physically write 21k entries (although understanding each entry was probable a handful of bytes or so) into the directory structures, which is not quite just simply a ref count increment. That just tells me the OS is spending more time doing things other than writing to the disk, which was just surprising those things (cpu bound) were taking more time than actually writing to the disk (io bound); it's usually the other way around.

I should clarify, symlinking module folders in the way I'm intending (adj-nm & symlink) is not currently possible with any version of node (hence my nodejs PR), and how ied, pnpm, and presumable yarn use to symlink, would not really be much faster than straight copy, and certainly not faster than hardlinking. It's going to require a fair amount of people in the community being vocal in support of the PR before nodejs accepts, as it's a change (although an incredibly simple and low risk one) to the locked Modules subsystem.

As I mentioned in an earlier comment, any kind of linking from multiple projects to a machine wide store requires read-only access on the store, precisely to prevent a change in the store effecting multiple trees. But the way I'm intending to implement would still allow one to have most modules symlinked, and then selectively have copies of modules should they want to muck with in that particular project, while still allowing yarn to guarantee the tree to be deterministic with respect to specific versions, and the way in which the folder structure enforces nodes dependency version resolution; i.e the folder hierarchy itself would not change as a consequence of any module folder being a symlink or a copy. But again, not currently possible with any version of node.

At the moment, I'm still on the fence about creating a branch of yarn to use hardlinks given it's only a 2-3x's improvement, compared to adj-nm w/symlinks... hardlinks offer a small gain but don't require a PR on node, but with the PR we're talking 30x's or more improvement.... need to ponder a bit.

@ljharb For those that find it important to be able to change content under node_modules, and would rather have that ability than a 30x's improvement in install/update times, then I would recommend never using links. It isn't like this is something were everyone has to do it one way or the other; we each get to decide on a project by project basis, or not. Even then there would still be the option that should you want to change a module locally within a project, you could just install it again but as a copy rather than a link, and still leave all the others as links.

ghost commented 8 years ago

@ljharb I'm curious if changing content under node_modules is something you do on a routine basis? And if so what are your uses cases the necessitate that? Just curious is all.

In my own experience, I have on rare occasions tweaked an installed module while trying to understand some behavior. I've then gone on to some other thing, and when I come back to the project, I've forgotten I made a change, and then spend a fair amount of time (maybe 30 minutes) scratching my head when my stuff relying on that module doesn't seem to behave like I expect, until I slap my forehead and go "Oh that's right, I changed that module!", and then I reinstall to bring it back. So I typically really try to not do that, or limit doing it, or if I do to immediately put back to its original state.

I can say, again just my own experience, that I've spent way, way, way more time waiting for a package manager to copy modules, or have to delete node_modules and then reinstall, and so on, and if I had to trade fixing that problem, with never being able to directly change a locally installed module, I think I'd make that trade in a heart beat. Fortunately, should the install-time problem be fixed in the way I'm intending, you'd still get to have your cake and eat it too.

Just some advice, it's possible that a change to a locally installed module could be forgotten, and someone checks their stuff in that's relying on the changed module, thinking everything's working, and than another gets the project and installs the original modules, only to have things not working as expected. A problem similar to this is one of the reasons yarn was created; to ensure everyone is using the same dependencies. If this is something you do regularly, I would suggest being careful, but that's only because of my own experience's; you may have a much better memory than I :)

ptim commented 8 years ago

...I have on rare occasions tweaked an installed module while trying to understand some behavior...

I do this too, and I wonder if this could be helped by a convenience command which copies the module to the project and links it? maybe too wacky... but, in my experience, editing node_modules directly is a foot gun :laughing: not wanting to get offtopic here, but if making node-modules read only yields a 30x speed boost.. I'm in!

ghost commented 8 years ago

For those interested, I'm creating an experimental branch of yarn, that will implement a new switch: --mount. 'Mount' was chosen so one can think of modules as having been attached from a different physical location, and because link was already being used for a different thing. You could then do yarn --mount which would result in symlink-installing all modules, with the links pointing to a machine-wide, single physical copy of the modules, to which all symlink-installs from anywhere else on the machine would link to. You could then do yarn add lodash which would copy-install just the lodash module, so you could have the options to either blow some toes off, or your whole foot if you were feeling especially ambitious. With regard to yarns guarantees of determinism, I believe they would all still be held in the ways that matter (which I will thoroughly address in the readme for the branch)

This version will not use hardlinks, but rather will require a version of node that implements the --adjacent-node-modules switch. I currently have a pull request to nodejs, but there's understandable hesitance. However, I believe with some community support and evidence, and once it's clear it's a non-breaking (in the sense all current installations of node apps behave as they always have with the switch off), opt-in, tiny change (the effective change to the node source is about 10 lines, very easy to review and understand), that enables real improvements to be had, they may adopt as a type of "experimental, undocumented" feature to ease exploring it, while not in any way promoting it.

I chose not to use hardlinks, even though that approach doesn't require a change to node, for a couple of reasons. First, in my initial experiments with hardlinking, while it's 2-3 times faster, I still feel like I'm waiting way too long; going from 2 minutes to 1 minute, when I know I could go from 2 minutes to 2 seconds, is still like death-by-a-thousand-paper-cuts, but now with the added bonus of rubbing alcohol being poured over the whole time. Also, the perceived and real issues that may arise with hardlinking vs. adj-nm symlinking are identical, but in the later case the benefit is so much more enormous, there's much greater probability more will be motivated to overcome the issues, and their preconceived notions.

That means for the meantime those interested in exploring 2 sec install times will have to use my forked branches of both yarn and node. Also, as nothing like this has been done before, it's not yet known what issues will actually occur. I know there's potential for certain kinds, and already have approaches to address, but as to their real-world frequency and as to issues I wasn't imaginative enough to, well, imagine, it will be a wait-and-see.

I'm highly motivated to address this problem, and confident I have the technical where-with-all (fwiw, 35+ years building gobs of all sorts of software; I'll leave it at that) to change yarn and evaluate issues, provide solutions to address, and so on. What I don't have is a broad base of use-cases to test. I will certainly be doing my own testing, but the more out there willing to machete through the jungle to help lay a new path forward, the faster mounting modules may became business-as-usual. My experience allows me to have great confidence there absolutely is a way to make this happen that doesnt require breaking the ecosystem any more so than other changes may have, and most likely less so.

But the reality is this will take at least months, probably longer, not for any particular technical reason, but in just changing peoples minds, both that the problem is real, the solution works and provides huge benefit, and the ecosystem can continue to run just fine and thrive.

I'm doing this on the side, so it will be a few weeks before I have something for the initial brave few to step into and take for a spin. If you're interested in participating (and the more the better, so tell your friends), just comment on this issue. At some point in the future, I will mention you all on an issue within my yarn fork/branch so you know there's something real to start poking at.

S-YOU commented 8 years ago

If you are a yarn member, who has first hand knowledge of what actually broke, and can technically explain why, I implore you to respond to this issue.

Not a yarn member, and no plan to advertise my experimental 0 star repo, but I use symlink only approach on my package installer for myself and the project I work for my company. You can see how its fail on repositories which use babel, gulpfile with babel, and some module loaders which does not use built-in require('module_name').

node_modules folder is really clean with only symlinks though.

% ls node_modules                                                                                                                                    
./            cheerio@      dota-server@      fastpbkdf2@   gulp-altered@        gulp-multi-process@  marked@    ng-annotate@        passport-google-oauth@  tslint@      yarn@
../           compression@  dtsm@             fast-uglify@  gulp-any@            gulp-retouch@        monduck@   nodemailer@         passport-local@         typescript@
aws-sdk@      cookie@       errorhandler@     galk@         gulp-clean-plumber@  jshint@              mongodb@   node-sass@          passport-twitter@       uglify-js@
.bin/         coupon-code@  esprima@          gm@           gulp-compass@        jsonwebtoken@        mongoose@  npm-check-updates@  request@                vinyl@
bluebird@     del@          express@          googleapis@   gulp-dota-template@  kuni@                morgan@    passport@           slim-jade@              wiredep@
body-parser@  dota-render@  express-session@  gulp@         gulp-live-server@    lodash@              multer@    passport-facebook@  sqwish@                 xxhash@

ghost commented 8 years ago

@S-YOU Thanks for your input.

I must clarify, the approach that I'm taking uses symlinks in a way that is completely impossible to do today. In order to make it possible, two changes are required in node, one is to augment the list of search paths node evaluates when resolving require() calls regardless as to how they're made, and two, forcing node to preserve all symlinks in all cases all the time. node can do neither of these things today, but I have a branch/fork that does via an opt-in switch.

Currently, when a module's dependencies must be precisely installed, because one of those dependencies was also used somewhere else in the local tree, but @ a different version, the only physical way node allows you to be precise, is by storing the dependencies in a subfolder of the module, module/node_modules. This fundamentally prevents modules from being installed globally, yet symlinked to locally, and still preserve the specific versions of all modules within a given local tree. (yes, hardlinks could possibly work, but aren't much faster than a copy-install).

The --adjacent-node-modules switch simply allows a module's dependencies to also be installed in a folder that's adjacent to the module's folder, module.node_modules, effectively allowing node to first look in the sub-folder of the module folder, then the adjacent folder that is a real folder in the local tree, and not a subfolder under the symlinked module. With this simple change, and with node preserving symlinks in all cases, and with a package manager installing and linking in the appropriate way, it becomes technically possible to install modules globally, link to them locally, preserve all local module versions, and have things like plugins, bundlers, and lifecyle scripts be none-the-wiser

and have it only take 2 seconds instead of 2 minutes.

But I would be a fool to say there wont be any issues, and before I even created this issue I had already encountered and researched ways to address with a hacked version of npm, that I've yet discussed. This is because for some strange reason it seems there's many who just don't want this approach to work, and will take any potential issue, and even adjacent but unrelated issues, and present them as much more than they may in fact be.

We here have been using node for a couple of years now, a lot, and absolutely love it and its impact on just javascript and deployment generally, (truly said in loving jest) BUT I'M SICK AND TIRED, OF !@#$ING WAITING FOR INSTALLS, AND DELETION OF NODE_MODULES FOLDERS. I. HAVE. HAD IT!!!! (see my avatar? this isn't the first time by far, I've come up against crazy software issues, but that's what you end up looking like after going through lots of them)

So, for those of you who feel anything like me, I'm going to make every attempt to fix this problem!! Your help and support would be appreciated, as clearly it will be an uphill battle for perception more than anything :).

@S-YOU If you're interested in participating, I'll include in the list of those I notify of the first working version of yarn that does all this, if you'd like.

ghost commented 8 years ago

I'm going to close this issue, as it's served its purpose, and at the moment it doesn't seem to be of much interest to the actual yarn members.

However, if you're interested in being notified when somethings ready, please just comment on here.

S-YOU commented 7 years ago

the only physical way node allows you to be precise, is by storing the dependencies in a subfolder of the module, module/node_modules. This fundamentally prevents modules from being installed globally, yet symlinked to locally, and still preserve the specific versions of all modules within a given local tree. (yes, hardlinks could possibly work, but aren't much faster than a copy-install).

Symlinks works too with current node, here is the output of one of my module's node_modules folder

%ls -Gg node_modules/express                                                                                                                                                               
lrwxrwxrwx 1 29 Nov 15 07:36 node_modules/express -> /var/tmp/npmln/express/4.14.0/
%ls -Gg node_modules/express/node_modules                                                                                                                                                  
total 4
drwxr-xr-x 2 4096 Nov 15 07:36 ./
drwxr-xr-x 4  116 Nov 15 07:36 ../
lrwxrwxrwx 1   28 Nov 15 07:36 accepts -> /var/tmp/npmln/accepts/1.3.3/
lrwxrwxrwx 1   34 Nov 15 07:36 array-flatten -> /var/tmp/npmln/array-flatten/1.1.1/
lrwxrwxrwx 1   40 Nov 15 07:36 content-disposition -> /var/tmp/npmln/content-disposition/0.5.1/
lrwxrwxrwx 1   33 Nov 15 07:36 content-type -> /var/tmp/npmln/content-type/1.0.2/
lrwxrwxrwx 1   27 Nov 15 07:36 cookie -> /var/tmp/npmln/cookie/0.3.1/
lrwxrwxrwx 1   37 Nov 15 07:36 cookie-signature -> /var/tmp/npmln/cookie-signature/1.0.6/
lrwxrwxrwx 1   26 Nov 15 07:36 debug -> /var/tmp/npmln/debug/2.3.2/
lrwxrwxrwx 1   25 Nov 15 07:36 depd -> /var/tmp/npmln/depd/1.1.0/
lrwxrwxrwx 1   30 Nov 15 07:36 encodeurl -> /var/tmp/npmln/encodeurl/1.0.1/
lrwxrwxrwx 1   32 Nov 15 07:36 escape-html -> /var/tmp/npmln/escape-html/1.0.3/
lrwxrwxrwx 1   25 Nov 15 07:36 etag -> /var/tmp/npmln/etag/1.7.0/
lrwxrwxrwx 1   33 Nov 15 07:36 finalhandler -> /var/tmp/npmln/finalhandler/0.5.1/
lrwxrwxrwx 1   26 Nov 15 07:36 fresh -> /var/tmp/npmln/fresh/0.3.0/
lrwxrwxrwx 1   38 Nov 15 07:36 merge-descriptors -> /var/tmp/npmln/merge-descriptors/1.0.1/
lrwxrwxrwx 1   28 Nov 15 07:36 methods -> /var/tmp/npmln/methods/1.1.2/
lrwxrwxrwx 1   32 Nov 15 07:36 on-finished -> /var/tmp/npmln/on-finished/2.3.0/
lrwxrwxrwx 1   29 Nov 15 07:36 parseurl -> /var/tmp/npmln/parseurl/1.3.1/
lrwxrwxrwx 1   35 Nov 15 07:36 path-to-regexp -> /var/tmp/npmln/path-to-regexp/0.1.7/
lrwxrwxrwx 1   31 Nov 15 07:36 proxy-addr -> /var/tmp/npmln/proxy-addr/1.1.2/
lrwxrwxrwx 1   23 Nov 15 07:36 qs -> /var/tmp/npmln/qs/6.3.0/
lrwxrwxrwx 1   33 Nov 15 07:36 range-parser -> /var/tmp/npmln/range-parser/1.2.0/
lrwxrwxrwx 1   26 Nov 15 07:36 send -> /var/tmp/npmln/send/0.14.1/
lrwxrwxrwx 1   34 Nov 15 07:36 serve-static -> /var/tmp/npmln/serve-static/1.11.1/
lrwxrwxrwx 1   29 Nov 15 07:36 type-is -> /var/tmp/npmln/type-is/1.6.13/
lrwxrwxrwx 1   32 Nov 15 07:36 utils-merge -> /var/tmp/npmln/utils-merge/1.0.0/
lrwxrwxrwx 1   25 Nov 15 07:36 vary -> /var/tmp/npmln/vary/1.1.0/
%node                                                                                                                                                                                      
 > typeof require('express')
'function'

If you're interested in participating, I'll include in the list of those I notify of the first working version of yarn that does all this, if you'd like.

I would like to get notified, thanks. Not very sure about I have ability to participate in the project itself though.

ghost commented 7 years ago

@S-YOU Thanks for the input.

Your example assumes that if you had multiple express@4.14.0 projects installed on your machine, that in every case their accepts dependency would always be @ version 1.3.3. This is the fundamental constraint with node's use of a node_modules subfolder that prevents symlinking all local module folders to single global instances. There can be several ways in which a given set of express@4.14.0 projects installed on a single machine, will have ended up with different versions of the accepts dependency, where each is still within the version spec range specified in the express@4.14.0's package.json; i.e.: "accepts": "^1.0.0" means any version starting with 1.

This is also one of the fundamental reasons yarn uses a lock file for everything, because for a given spec-version of a tree (the abstract tree based on package.json dependencies defined with ranges of versions), it's logical representation (the specific latest released versions of modules at the time of install) can change over time.

This is also exacerbated by package managers bubbling common versions as far up to non conflicting ancestors as possible, which today is done to cut down on copies and shorten the depth (path name length) of the node_modules based folder hierarchy.

The other issues you ran into with bundlers, plugins, etc. where most likely a consequence of either you not running node with --preserve-symlinks or the fact even with that flag, node does not preserve the symlink of the entry.js file passed on it's command line.

ghost commented 7 years ago

@S-YOU With the approach I'm taking --adjacent-node-modules (and without bubbling), your machine might look like this

projectA     *** installed three months ago
    /node_modules     *** real folder
        /express -> /var/tmp/nm-cache/express/4.14.0      *** link to global copy sans /node_modules
        /express.node_modules     // real folder
            /accepts -> /var/tmp/nm-cache/accpets/1.1.2     *** symlink to global copy
            /accepts.node_modules     // real folder and so on

*** notice the accepts version
projectB     *** installed today
    /node_modules     *** real folder
        /express -> /var/tmp/nm-cache/express/4.14.0      *** symlink global copy without /node_modules
        /express.node_modules     // real folder
            /accepts -> /var/tmp/nm-cache/accpets/1.3.3     *** symlink to global copy
            /accepts.node_modules     // real folder and so on
                /modwithbndl ->  /var/tmp/modwithbndl/1.1.1
                    /node_modules    *** under global copy, so bundled modules still supported

projectA ended up with an accepts@1.1.2, while projectB ended up with accepts@1.3.3. The way in which you symlink can't handle this.

However, because a node_modules is still searched first, global module copies with bundled node modules still works as expected

S-YOU commented 7 years ago

projectA ended up with an accepts@1.1.2, while projectB ended up with accepts@1.3.3. The way in which you symlink can't handle this.

You are absolutely right on that case that my approach won't work, but why? how do you get two versions of accepts installed on same express/4.14.0 with same package.json?

ghost commented 7 years ago

Because the version specifiers of dependencies in package.jsons are almost always some range of versions, and not specific versions, and so what version is actually installed (and without something like yarn's lock file) dependends on what version happens to be current and within the specified range.

I just went to github.com/express/express/package.json:

dependencies": {
    "accepts": "~1.3.3",

the ~1.3.3 means any version beginning with 1.3 and >= 1.3.3. Today accepts may be at 1.3.3 in the npm registry, tomorrow they might release and publish 1.3.4. Both would be within the specified range.

S-YOU commented 7 years ago

the ~1.3.3 means any version beginning with 1.3 and >= 1.3.3. Today accepts may be at 1.3.3 in the npm registry, tomorrow they might release and publish 1.3.4. Both would be within the specified range.

I see what you mean now, with my approach projectC will still use acepts 1.3.3, until new version of express release with accepts ~1.3.4, It won't break the module loading, but that was my design choice. Thanks for the lights up.

ghost commented 7 years ago

Thats right. If we appreciate version dependencies across and down an entire tree can change quite a bit, it's possible your application can have subtly changing behavior. Hence yarn to ensure once a logical tree has been stamped to a lock file containing the exact/precise versions, that lock file can be checked in, and now everyone else can be assured they get the same behavior because the get the same versions of the entire tree

ghost commented 7 years ago

I see what you mean now, with my approach projectB will still use acepts 1.3.3, until new version of express is released. It won't break the module loading, but that was my design choice. Thanks for the lights up.

Actually with your symlinking, projectA would end up using accepts@1.3.3 once you installed projectB (assuming it overwrote the global folder), and if projectA was in some way indirectly dependent on some behavior in accepts@1.1.2 that changed in accepts@1.3.3, you might have a rough day.

ghost commented 7 years ago

@S-YOU Not really knowing the environment you're doing this in, I would not recommend it if you're dealing with either several projects and/or several developers, and/or also symlinking on production servers.... I mean I just flat out wouldn't recommend.

You're creating the potential for a nightmare situation.. but that's just my opinion... I'm just thinking about the world of hurt you might be setting yourself up for, that I would not want to be in. Things could be working on your dev machine, and you deploy to production, and several other things break, and then trying to put stuff back in way you know would work... I mean....gulp... be careful

S-YOU commented 7 years ago

if projectA was in some way indirectly dependent on some behavior in accepts@1.1.2 that changed in accepts@1.3.3, you might have a rough day.

@S-YOU Not really knowing the environment you're doing this in, I would not recommend it if you're dealing with either several projects and/or several developers, and/or also symlinking on production servers.... I mean I just flat out wouldn't recommend.

You're creating the potential for a nightmare situation.. but that's just my opinion... I'm just thinking about the world of hurt you might be setting yourself up for, that I would not want to be in. Things could be working on your dev machine, and you deploy to production, and several other things break, and then trying to put stuff back in way you know would work... I mean....gulp... be careful

Thanks for the input, and yes, you are right and I am aware, projects I am using my approach is fully under my dedicated control and I always use and test latest version of all libraries, so it's possible at least for me.

ptim commented 7 years ago

I'd like to get notified! Thanks for the mad science - love your work :100:

ghost commented 7 years ago

Some here may be interested in this node PR: Symlinks Just Work. If so, your 👍 would be appreciated.

Pauan commented 7 years ago

@phestermcs I would like to be notified as well.

I know there hasn't been a lot of vocal support in this issue (or in others), but I'm sure there are a lot of developers who have felt the pain of slow npm installations. They would appreciate faster install times, but they are unaware that these GitHub issues exist.

ghost commented 7 years ago

@ptim @pauan @Daniel15 @S-YOU I'm still trying to move the mountain. I have a new issue on nodejs/node that's simpler/shorter to consume & understand (I hope). I've since ran their citgm testing tool and the fixes/improvements passed with flying colors. I've alse created a purpose built testing tool that shows quit clearly how great symlinks could work with node.

FWIW, I have the definite impression yarn members aren't generally very interested in any of this for some strange reason. Also nodejs/node's testing tool citgm uses npm, and I think it would be yet another testament that symlinks can work just fine if I modified npm to use adjacent.node_modules to a machine store, and then let citgm use that version to install modules. So I'm going to change npm rather than yarn as another step in changing minds. I'll let you know when I've got something working, if still interested, but it will be a couple weeks or so, maybe a month.

But tweaking a package manager will be just about be the last thing I can do to show symlinks can work great without breaking things. It will take people like you (and your friends and coworkers) being vocal about the value to yarn, node, and wherever, to actually get the needed changes into a shipping version of node. Your continued support is greatly appreciated.

whmountains commented 7 years ago

@ghost I'm behind you 100% on this. Keep it up 👍 and let me know if something happens.

ScottFreeCode commented 7 years ago

Is there anywhere that the discussion formerly at https://github.com/yarnpkg/rfcs/issues/18 could still be viewed? It's linked from one of the Node PRs but apparently that repo has turned off issues (and GitHub retroactively hides existing issues when that happens, I guess).

bestander commented 7 years ago

screencapture-github-yarnpkg-rfcs-issues-18-1496310591605

bestander commented 7 years ago

Attached a screenshot of it

bestander commented 7 years ago

I think at this point we don't want to symlink/hardlink from cache.

However Yarn should be more open to such experiments and allow third party code to override the linking phase with plugins, e.g. replace copy operations with linking, or JS copy commands with Native copy commands or some smarter hoisting algorithms.

If anyone wants to lead this effort, speak up and send an RFC.

ScottFreeCode commented 7 years ago

Thanks for the screenshot!

Daniel15 commented 7 years ago

apparently that repo has turned off issues (and GitHub retroactively hides existing issues when that happens, I guess).

Yeah, this is a very annoying behaviour of GitHub. The RFC repo was never supposed to have issues enabled (RFCs are only submitted via pull requests), but issues were accidentally enabled in the beginning. We gave people time to create new PRs based on issues before disabling the issue tracker. It would have been nice for GitHub to keep read-only access to the existing issues. Oh well.

yarnpkg / yarn

Looking for brilliant yarn member who has first-hand knowledge of prior issues with symlinking modules #1761

Operating system differences

Tooling not supporting file system cycles

Poor support for file watching

Tooling relying on node_modules hierarchy

OS File System Linking Differences

Tooling not supporting file system cycles

Poor support for file watching

Tooling relying on node_modules hierarchy

Advantages of a different way of symlinking

I know of a way to reduce the total overall module storage on a given developer's machine by at least 10x's (50GB to 5GB) or more, and install times by at least 50x's (50 sec to 1 sec) or more

To summarize, assuming best case answers to above questions

and have it only take 2 seconds instead of 2 minutes.

Tooling relying on `node_modules` hierarchy

Tooling relying on `node_modules` hierarchy