Add the registry of each pulled package to `yarn install` for audit purposes

BrianMcBrayer commented 2 years ago

[x] I'd be willing to implement this feature (contributing guide)
[x] This feature is important to have in this repository; a contrib plugin wouldn't do

Describe the user story

As a secdevops engineer, I want my CI runs to output where each package is pulled from as the packages are being pulled. This is important as the industry moves towards a secure supply chain by default. Currently, there does not seem to be true solution to this that happens during the build as the packages are being pulled.

See discussion at https://github.com/yarnpkg/berry/discussions/3744 .

I would be happy to implement this (I think), but I don't want to put the effort in unless it will likely be accepted.

Describe the solution you'd like

In the log output given by https://github.com/yarnpkg/berry/blob/717c48c4162db6b405bc30405dba167cab38d791/packages/plugin-npm/sources/NpmHttpFetcher.ts#L33 I would like the full URL to the package to be outputted. This could maybe change based on the verbosity level.

Describe the drawbacks of your solution

I believe that some people will not like the extra verbosity, so perhaps we should only output this if the log level is set to info (or some other solution).

Describe alternatives you've considered

I considered making a plugin, but that would be a plugin to add a single logging change to something that industry is going to want (I think).

We could consider adding the --verbose flag back to yarn install to output things like this.

We could also output this information to yarn.lock -- that would be my preference because pulling the same package from a different registry could end up being a different package. That would involve a sophisticated hash collision, but... nation states are fully capable of doing this.

paul-soporan commented 2 years ago

Adding the URL at the end of the cache-miss messages seems reasonable to me, especially since they are always truncated when progress bars are enabled (so most users won't see it on their development machines but they'll see it on CI).

In the log output given by berry/packages/plugin-npm/sources/NpmHttpFetcher.ts

That's the wrong one, NpmHttpFetcher is the one that handles unconventional tarball URLs (the ones we have to store the URL in the lockfile for - in that case the URL gets displayed in the message since it's part of the stringified locator's __archiveUrl query parameter). NpmSemverFetcher is the one that handles the conventional tarball URLs.

I believe that some people will not like the extra verbosity, so perhaps we should only output this if the log level is set to info (or some other solution).

We currently don't have verbosity levels more granular than info, warning, and error. Also (like I said above), those lines are truncated anyways when progress bars are enabled so most users won't see it unless they have a really wide terminal.

We could also output this information to yarn.lock

Removing hostnames from the lockfile was decided in https://github.com/yarnpkg/yarn/issues/5892 (a heavily upvoted issue).

that would be my preference because pulling the same package from a different registry could end up being a different package.

:thinking: I don't think we invalidate the currently cached package after changing registries, but it would probably be a good idea to do it.

BrianMcBrayer commented 2 years ago

Thank you for your detailed response.

I'm going to try to code this. The yarn codebase looks really well architected.

After looking through the code, I see that at the point that we log the cache-miss (https://github.com/yarnpkg/berry/blob/a2b645c6a8e75ae771b10adfe918a1cdb570bbb3/packages/plugin-npm/sources/NpmSemverFetcher.ts#L35), we have not yet determined which registry will be used to pull the package. In fact... we might try a few registries before we find the right one (in the case of multiple registries), right?

It doesn't seem quite right to try to derive the registry up in the fetch method, since we would only use it for logging and also that could diverge from the logic used at the time of making the network request.

One option that I didn't like as much would be to have npmHttpUtils return the URL that it resolved. That might involve touching a number of files.

A better option (I think?) is... I am wondering if we could instead log the get requests in npmHttpUtils? That file doesn't seem to have any logging yet, but if we were to pass, say, logInfo down into the get call via the options, then we could log the actual fetch. That way we log each attempt at accessing a registry, which would be helpful when translating output for consumption for tools like in-toto.

That is a bigger change though, and while I've gotten it working I wanted to discuss before making someone review my code or adding tests. If it would be better for me to just create the PR and we can review there, I'm happy to do that though.

BrianMcBrayer commented 2 years ago

I went ahead an put up a reference PR (very rough -- just to show a general idea of what I was thinking) to see if yall like the approach.

https://github.com/yarnpkg/berry/pull/3768

yarnpkg / berry

Add the registry of each pulled package to `yarn install` for audit purposes #3765