[Feature] Explicitly allow or disallow licenses

Describe the user story

I'm a project manager and I want to prevent my employees from adding to our codebase new code distributed under untested licenses (ex WTFPL).

Describe the solution you'd like

I'd like to set a list of accepted (or rejected) SPDX license strings in my .yarnrc that would cause a validation error at install time. For example:

valid-licenses:
  - MIT
  - BSD

Note that this would require to also validate transitive dependencies.

Describe the drawbacks of your solution

While it might be important for various entities, I don't think it would belong to the "core". In that sense I'm pretty sure it should be distributed as a separate plugin (either developed on this repository or by an interested third-party).

It would also require a better plugin workflow than what we currently have (we would definitely need this yarn plugin add command!).

Describe alternatives you've considered

The exact validation mechanism is TBD as while there are multiple ways to achieve a similar result, all have their own drawbacks. In particular:

We could pass a validate named parameter to the resolvers that they would have the responsibility to call themselves before returning, passing the manifest of the package they're currently handling. This validate function wouldn't be set in most cases except for yarn add which would implement it by calling the project validation hook (I think I like this option the most).
We could simply make it an extra command whose sole purpose would be to check the versions. I think it makes sense to do that regardless of the case (it might be especially important when changing the set of accepted dependencies, since in those cases the "validateNewPackage" hook wouldn't trigger), but the overall experience isn't great since it requires an extra command to be run after each install.

Click here to see a bunch of other possible solutions

- We could implement it as a validation hook that would be called from `Project` with the `Package` value returned by the resolvers, but that would require us to add the license field to the `Package` type - and thus serialize it within `yarn.lock`. It's not clear how scalable this is - if we need to add all the fields from the manifests to the `yarn.lock` it kinda defeats the purpose *(this might be alleviated if the plugins were able to specify a list of fields that need to be persisted, but even then it's dubious)*. - We could store the `Manifest` instance within the returned `Package` (at least for every resolver where it makes sense) and instruct the lockfile serializer to not save this field (then we would still have this validation hook I mentioned in `Project` that would be able to access all values from the manifest). But while it would solve the lockfile format scalability issue that would make the behavior different from one execution to the next (sometimes the manifest for a package would be there and sometimes it wouldn't), and I don't like that very much. - We could add a validation hook to the cache ("if a file should be added, first validate it"), but that wouldn't work well with the global cache approach. At the same time, we don't want to validate the packages *every time* as that would be a waste of resources. - We could make the `resolveEverything` function accept a list argument that would be populated with the list of locators that couldn't be resolved from the lockfile (this is a bit tricky because the resolvers don't return this information at the moment). Then after calling `fetchEverything` we would iterate over those new locators and trigger a validation hook. The downside is that rejected packages (+ dependencies + dependents) will still have reached the cache, unless we somehow manage to remove them. - There's a reasonable case that this could be implemented through the constraints engine by exposing all the packages in the dependency tree (rather than just the workspaces). In practice I have some concerns it might grow the fact list exponentially, although I don't have numbers. It would also be an after-the-fact validation, which I'd like to avoid (it wouldn't be a good experience to use a package then at PR-time you figure out you can't use it).

Additional context

I received the stats from the npm survey and interestingly enough licenses were proeminently featured:

I went a bit overboard with the "alternative" analysis because this issue is mostly about "how do we validate packages before adding them", which is an important security question even if not for the license check itself.

Would it make sense to provide two sets of whitelists/blacklists: runtime dependencies and build-time dependencies? An example use case: the GPL is not allowed as runtime dependency of a closed source package, but it's fine as build-time dependency.

A more complex question: Can I apply these rules only for certain packages? I've got a library at work that's divided into categories of packages:

demo/
core/
  package-a/
  package-b/
  ...
angular/
  package-x/
  package-y/
  ...
devkit/
  package-0/
  package-1/
  ...

The devkit packages are never distributed: they're build-time dependencies. I would want to impose different license restrictions to these groups of packages, e.g. the GPL is okay for devkit packages but not for others. Does berry take .yarnrc files in subdirectories into account? That would be a solution for this use case. It would probably create problems though, e.g. what happens if I define a registry in .yarnrc in a subfolder? Is that simply ignored, is that an error, or is that somehow used?

At the time the resolvers are called they can't know yet what are the dependents of the requested package. A logic like the one you described ("only certain packages can have GPL licenses") wouldn't be possible under the proposed resolver-time validate function.

Similarly, resolvers can't be made aware of the reason why a dependency is needed (regular dependency or dev dependency) because the results are cached and thus need to be the same in both cases. At the same time I don't think this use case is needed if you can specify the enforced licenses based on the dependency dependents (because then you can manually configure the package lists).

Maybe something we could do:

Add a hook that would get triggered after the fetch step but before the build step.
If the hook throws an exception, optionally (configurable behavior, default true) rollback the cache.

This way you could just make a contrib plugin that would add a validation hook that would fetch the manifests from all the dependencies of all the workspaces for which some licenses are enforced (it wouldn't be through the .yarnrc files though - more likely through Manifest, and potentially by accessing the raw field since that would be a plugin-specific field).

It might be a bit slow if multiple plugins are doing this, though, since each would hydrate the zip archives again unless we implement some kind of cache ... 🤔

I have the need to enforce license checking.

This needs enforced on users who are not well versed in licenses (ie, not a developer by trade).
After a very cursory search it has become clear that it is not unusual for smaller packages to have incompatible licenses with their dependencies

There's a reasonable case that this could be implemented through the constraints engine by exposing all the packages in the dependency tree (rather than just the workspaces).

I already use the constraints engine in CI/CD already for verifying version consistency and outlawing certain packages or versions, its great but its also lacking - specifically in being able to identify violations in the dependency tree (e.g, outlawing jQuery from appearing in bundled code, or the versions of fastest-validator with broken GUID handling). So exposing the lot sounds like a win to me.

In practice I have some concerns it might grow the fact list exponentially, although I don't have numbers.

Sure. In terms of dependencies a small production repo of ours has nearly 500 entries in .yarn/cache, and our largest production repos top out at 1200. As soon as every key in the manifest gets added to the fact list it does get pretty nasty.

It would also be an after-the-fact validation, which I'd like to avoid (it wouldn't be a good experience to use a package then at PR-time you figure out you can't use it).

I'd love to have the version constraints also show up at install time and in commands like up and upgrade-interactive. Rolling back a bunch of repos to v3.9 typescript was a pain.

That all said. I can work around my issues in CI/CD with a lot of grumbling. The user-experience side of it I can't improve without contributing to yarn (which I'd love to do, but I'm doubtful contributing to this issue would lead to a merge).

It strikes me that this feature is fringe - but of clear interest to the kinds of people that care about zero-config installs, ie ops leaning.

We might be interested in this as well. With NPM/Classic Yarn, we were using license-checker to export a list of licenses that our dependencies and subdependencies used, and then we'd manually go over the list. license-checker isn't compatible with Yarn 2's PnP feature, and Yarn 2 doesn't support the yarn licenses command, either, so life has been made slightly difficult for us 🙂

yarnpkg / berry

[Feature] Explicitly allow or disallow licenses #89