yeoman / yeoman

Yeoman - a set of tools for automating development workflow
http://yeoman.io
10.07k stars 735 forks source link

Automatically validate our project dependencies licenses #1523

Open SBoudrias opened 9 years ago

SBoudrias commented 9 years ago

It's been a couple time we've been pinged about license issues with our dependencies.

With npm starting to enforce valid spdx license naming, I think it's time we automate our validation process.

This feature could be added within generator-node. Then we just need to start using generator-node on all our Yeoman projects (FWIW, that's in the pipeline. From my side, I'm just ironing the last couple issues in the generator)

This check should be done at least as a pre-publish step - and maybe as part as our testing phase.

Any tool suggestion to automate this process is welcomed! A quick search brought up node-license-validator from @myndzi as a potential candidate.

myndzi commented 9 years ago

I wrote that tool for exactly that reason, but I'm currently struggling with a few problems with nlf and the spdx database in general, as an alternate to nlf's file matching. Please see this issue for details: https://github.com/iandotkelly/nlf/issues/27

node-license-validator does currently depend on nlf to determine licenses; the issue above was filed after I had created it and was beginning to use it, and noticed some strange results. I think it's still useful, but it's not as reliable as I'd like it to be. I'm unaware of an alternative.

If you have any ideas on how to approach the problem I'd be happy to incorporate them!

iandotkelly commented 9 years ago

What are the requirements for the validation process?

myndzi commented 9 years ago

Uncertain what you mean? for node-license-validator, you specify up front what licenses are allowed, and, in the case of specific package exceptions (perhaps to deal with licenses that aren't detected correctly), which packages are allowed. It checks dependencies, and either succeeds or fails based on whether the dependencies meet those criteria.

Checks are only done one dependency deep; our legal team determined that was acceptable, and the onus of ensuring that sub-dependency licenses are compatible is on the dependency that includes them.

iandotkelly commented 9 years ago

I meant what are the requirements for the yeoman project specifically.

SBoudrias commented 9 years ago

Here are the restrictions I'd like to see:

  1. Every dependency have a valid SPDX "license" in their package.json (e.g. https://github.com/yeoman/generator/blob/master/package.json#L5)
  2. Every dependency have no limitation license (free use for anything and however we wish)
  3. We need to go down the tree. We need to ensure any transitive dependency is also valid

We don't want to fuzzy match on readmes or license text. We want clear SPDX values from the package.json with the packages published to npm.

myndzi commented 9 years ago

One problem you will encounter immediately is that many published packages do not have valid SPDX licenses in their package.json, even if they have valid SPDX licenses.

Deeper validation should be an easy change; I've got a few simple extensions planned for node-license-validator including depth and whether to verify development / optional dependencies, etc.

Define "fuzzy match"; the SPDX project defines matching criteria that they have determined to be acceptable when matching a potential license against their database. This is not something like an edit distance under a threshold, but specific rules that normalize the data to be matched into a similar format without altering its meaning. Is such a matching scheme acceptable in this case?

SBoudrias commented 9 years ago

@myndzi If we can't get a PR fixing one of our dependencies invalid license out, then I don't think it's a package we want to use.

The opinion of other team member might differ.

That being said, if the criterias to validate a package license are clear and reliable, then I'm okay with not solely relying on a valid SPDX license in package.json.

sindresorhus commented 9 years ago

If we can't get a PR fixing one of our dependencies invalid license out, then I don't think it's a package we want to use.

:+1:

myndzi commented 9 years ago

If we can't get a PR fixing one of our dependencies invalid license out, then I don't think it's a package we want to use.

I can't quite understand what you mean here. Do you mean the case where you have a dependency that doesn't have a valid SPDX identifier as its license, and you submit a PR to that dependency to correct this, and nobody merges the request?

I guess you can be that rigid, but there are a lot of useful modules that fall into this category. Presumably most of the useful/active ones would accept a PR, but just for an example, here's the packages from the top ~ 160 dependencies on NPM that don't have a valid SPDX identifier in their package.json:

myndzi commented 9 years ago

(This list does not include dependents of these modules, of which there are many)

iandotkelly commented 9 years ago

@myndzi - FYI you don't have to take nlf's file results - if you look at the data it produces, it separates results from package.json from results from looking at files.

You can then compare those to the SPDX list and reject those that are not valid SPDX identifiers.

I personally think that trying to implement SPDX matching criteria is a long shot - so if you need completely accurate results, we need to encourage people to use package.json properly.

myndzi commented 9 years ago

I'm using nlf primarily in order to take advantage of its file results; my spdx matcher was not usable at the time, and still really isn't, and I was frustrated with spdx matching in general. I didn't really look too close at how nlf accomplished this goal until later ;)

SBoudrias commented 9 years ago

@myndzi What is broke in your module? I don't really feel like writing a new one completely. We might be able to help fixing the issues in it.

myndzi commented 9 years ago

It's not broken per se, it just doesn't offer as solid of a guarantee as I'd like, given the behavior of its dependency nlf. We're discussing what might be done about that on the nlf issue tracker right now, and I spent some time implementing a stricter license checker that I could switch it over to. I mostly popped in here since I was pinged in the OP to mention this caveat and show that I'm active and willing to work with you guys if you need it :)

SBoudrias commented 9 years ago

That's great @myndzi, we'll obviously prefer to work with an existing project and an active maintainer.

So what are your though and status for our requirements then? I see two main things currently:

  1. Inspecting the whole dependency chain
  2. Stricter license match (maybe we could have warnings about licenses where we're not 100% confident but not fail the build or something similar)
myndzi commented 9 years ago

I can implement deep inspection easily (it's just an argument I'm passing to nlf anyway)

As for 2, I can implement the ability to specify what kind of matches are allowed. You would maybe specify something like --match-level=1, and it would only inspect package.json values. After some kind of SPDX content-matching is available, the behavior should remain the same for new versions of node-license-validator unless you remove or modify the --match-level argument.

I'll try to push some new code with the both of those in the next few days.


I'm forming a plan with regards to confidence level of licenses; I think that we can "rank" matches in order from most confidence to least confidence:

It should be noted that multiple possible licenses can be discerned from most of these steps, and my behavior so far has been to return all found licenses as a set; the validator itself performs license matching as set membership; that is, if a package returns the licenses ['MIT', 'GPL-3.0'], and you allow 'MIT', that would be a success (interpreted as (MIT OR GPL-3.0)).

From the comments above, I think you would accept the first two of these, and reject the rest. For my own usage, I would probably reject the last item, and for iandotkelly's usage, he would want all of them.

myndzi commented 9 years ago

How do you guys feel about development dependencies? I'm doing some testing, and I find that, recursively, one of my largest projects has 397 dependencies if I count only direct dependencies. With the dev, optional, peer, bundled dependency keys involved, I have 920 dependencies... that can be resolved. There are 782 more (likely with lots of overlap) that are not installed, since npm does not install development dependencies by default. This makes license analysis for such a case tricky, since the actual data to analyze is not present. It's possible to either fetch the packages or just some of their info from NPM, but I'm not sure that's actually worthwhile. It does make the addition of a switch to analyze dev dependencies pretty worthless, though.

Edit: I'm talking about development dependencies of dependencies of the root package, e.g. devDependencies of express if you use express in the project being analyzed.

SBoudrias commented 9 years ago

I'm not sure if we want to analyze dev dependencies, but I believe that should be possible in the analyzing tool. I'm sure some people will want to analyze the dev dependencies.

iandotkelly commented 9 years ago

The default setting in nlf is to include all dependencies, but you can exclude dev dependencies with a configuration property. As you say, npm won't install them by default, but you do have your project's immediate dependencies.

I think the development dependencies of your project's production dependencies are irrelevant - if they are not included in your application for it to be built or to run in production then their license is not needed. Your immediate development dependencies should be an option as @SBoudrias says.

tl;dr

I think that its not required to dig out the licenses for what npm does not install.

myndzi commented 9 years ago

Makes sense, and is the only sensible solution.

xizhao commented 6 years ago

A bit of a gravedig, but I wanted to this up: http://fossa.io/ (I work here)

If you're still looking to do something, we'd love offer free license scanning / certifications for Yeoman (we use this ourselves!)

We've done the same for the teams at Webpack, Mocha, ESLint, etc... Some screenshots from their readmes (we're the [ license scan | passing ] badge)

image

image

Feel free to reach out to kevin@fossa.io and we'll upgrade your account if interested.

P.S. We also have a completely open source tool to help with this too if you prefer avoiding a service:

https://github.com/fossas/fossa-cli (~450 stars on GH)

SBoudrias commented 6 years ago

@xizhao Hey, that's an amazing product. I'd love to see this integrated on the main Yeoman repos!