ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

package manager #943

Closed andrewrk closed 1 year ago

andrewrk commented 6 years ago

Latest Proposal


Zig needs to make it so that people can effortlessly and confidently depend on each other's code.

~Depends on #89~

phase commented 6 years ago

My thoughts on Package Managers:


Thoughts that might not be great for Zig:

andrewchambers commented 6 years ago

This is a good reference for avoiding the complexity of package managers like cargo. Minimal version selection is a unique approach that avoids lockfiles, and .modverify prevents deps from being changed out from under you.

https://research.swtch.com/vgo

The features around verifiable builds and library verification are also really neat. Also around staged upgrades of libraries and allowing multiple major versions of the same package to be in a program at once.

bnoordhuis commented 6 years ago

Packages should be immutable in the package repository (so the NPM problem doesn't arise).

I assume you mean authors can't unpublish without admin intervention. True immutability conflicts with the hoster's legal responsibilities in most jurisdictions.

minimal version selection

I'd wait a few years to see how that pans out for Go.

andrewchambers commented 6 years ago

Note that by minimal, they mean minimal that the authors said was okay. i.e. the version they actually tested. The author of the root module is always free to increase the minimum. It is just that the minimum isn't some arbitrary thing that changes over time when other people make releases.
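To make the point above concrete, here is a minimal Python sketch of minimal version selection. The module names and the requirement graph are hypothetical, and this is a simplification of what vgo does: every reachable requirement is collected, and for each module the highest of the stated minimums wins. A newer release that nobody asked for is never picked up.

```python
from collections import deque

# Hypothetical requirement graph: (module, version) -> the minimum versions
# that this module version declares it was tested with.
REQUIREMENTS = {
    ("app", (1, 0, 0)): [("json", (1, 1, 0)), ("http", (1, 0, 0))],
    ("http", (1, 0, 0)): [("json", (1, 2, 0))],
    ("json", (1, 1, 0)): [],
    ("json", (1, 2, 0)): [],
    ("json", (1, 3, 0)): [],  # released later, but no one requires it
}


def build_list(root):
    """Return {module: version}: the highest stated minimum for each module."""
    selected = {}
    seen = {root}
    queue = deque([root])
    while queue:
        module, version = queue.popleft()
        # Keep the largest minimum seen so far for this module.
        if version > selected.get(module, (-1,)):
            selected[module] = version
        for dep in REQUIREMENTS[(module, version)]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return selected


chosen = build_list(("app", (1, 0, 0)))
print(chosen)
```

In this sketch json resolves to 1.2.0 because http asked for at least that, and the result only changes when some author in the graph raises a minimum.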

BraedonWooding commented 6 years ago

My top three things are;

A good package manager can make or break a language, which is one of the reasons why Go has ditched at least one of its official package managers and completely redone it (it may even be two; I haven't kept up to date with that scene).

andrewrk commented 6 years ago

The first thing I'm going to explore is a decentralized solution. For example, this is what package dependencies might look like:

const Builder = @import("std").build.Builder;
const builtin = @import("builtin");

pub fn build(b: &Builder) void {
    const mode = b.standardReleaseOptions();

    var exe = b.addExecutable("tetris", "src/main.zig");
    exe.setBuildMode(mode);

    exe.addGitPackage("clap", "https://github.com/Hejsil/zig-clap",
        "0.2.0", "76c50794004b5300a620ed71ef58e4444455fd72e7f7e8f70b7d930a040210ff");
    exe.addUrlPackage("png", "http://example.com/zig-png.tar.gz",
        "00e27a29ead4267e3de8111fcaa59b132d0533cdfdbdddf4b0604279acbcf4f4");

    b.default_step.dependOn(&exe.step);
}

Here we provide a mapping of a name and a way for zig to download or otherwise acquire the source files of the package to depend on.

Since the build system is declarative, zig can run it and query the set of build artifacts and their dependencies, and then fetch them in parallel.

Dependencies are even stricter than version locking - they are source-locked. In both examples we provide a SHA-256 hash, so that even a compromised third party provider cannot compromise your build.
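As a rough illustration of source-locking (not the Zig implementation, which did not exist at the time of this comment), the check amounts to hashing the downloaded bytes and refusing anything that differs from the pinned value. The function name and the sample archive bytes below are made up for the sketch.

```python
import hashlib


def verify_package(data: bytes, expected_sha256: str) -> bytes:
    """Return data unchanged if its SHA-256 matches the pinned hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError(f"hash mismatch: expected {expected_sha256}, got {actual}")
    return data


# Stand-in for a downloaded archive; a real build would read it from the network.
archive = b"pretend this is zig-png.tar.gz"
pinned = hashlib.sha256(archive).hexdigest()

verify_package(archive, pinned)  # accepted: the bytes match the pin

try:
    verify_package(archive + b" tampered", pinned)
    tampered_accepted = True
except ValueError:
    tampered_accepted = False  # a modified archive is rejected
```

Because the pin lives in the depending project's build file, a compromised host can at worst make the build fail; it cannot substitute different code.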

When you depend on a package, you trust it. Zig will run zig build on the dependency to recursively find all of its dependencies, and so on. However, by providing a hash, you trust only the version you intend to; if the author updates the code and you want the updates, you will have to update the hash and potentially the URL.

Running zig build on dependencies is desirable because it provides a package the ability to query the system, depend on installed system libraries, and potentially run the C/C++ compiler. This would allow us to create Zig package wrappers for C projects, such as ffmpeg. You would even potentially use this feature for a purely C project - a build tool that downloads and builds dependencies for you.

ghost commented 6 years ago

and potentially run the C/C++ compiler

See cmd/go: arbitrary code execution during “go get” #23672

although you might argue

Dependencies are even stricter than version locking - they are source-locked. In both examples we provide a SHA-256 hash, so that even a compromised third party provider cannot compromise your build.

in that case you'd have to check all the deps of all your deps recursively (manually?) on each hash change though to be really sure

andrewrk commented 6 years ago

in that case you'd have to check all the deps of all your deps recursively (manually?) on each hash change though to be really sure

This is already true about all software dependencies.

BraedonWooding commented 6 years ago

I've been considering how one could do this for the past few days. Here is what I generally came up with (this is based off @andrewrk's idea). I've left out hashes to make it easier; I'm talking more about architecture than implementation details here:

This would also solve the issue of security fixes as most users would keep the second option which is intended for small bug fixes that don't introduce any new things, whereas the major version is for breaking changes and the minor is for new changes that are typically non-breaking.

Your build file would have something like this in your 'build' function;

...
builder.addDependency(builder.Dependency.Git, "github.com.au", "BraedonWooding", "ZigJSON", builder.Versions.NonMajor);
// Or maybe
builder.addDependency(builder.Dependency.Git, "github.com.au/BraedonWooding/ZigJSON", builder.Versions.NonMajor);
// Or just
builder.addGitDependency("github.com.au/BraedonWooding/ZigJSON", builder.Versions.NonMajor);
...

Keep in mind that svn and mercurial (as well as plenty more) are also used quite a bit :). We could either use just a folder naming scheme to detect what we have downloaded, or have a simple file storing information about all the files downloaded (note: NOT a lock file, just a file with information on what has been downloaded). We would use tags to determine versions, but could also have a simple central repository of versions linking to locations, as I believe other package managers do.

isaachier commented 6 years ago

How would you handle multiple definitions of the same function? I find this to be the most difficult part of C/C++ package management. Or does Zig use some sort of package name prefixing?

BraedonWooding commented 6 years ago

@isaachier Well you can't have multiple definitions of a function in Zig, function overloads aren't a thing (intended).

You would import a package like;

const Json = @import("JSON/index.zig");

pub fn main() void {
    Json.parse(...);
    // And whatever
}

When you 'include' things in your Zig source file, they exist under a variable, kind of like a namespace (but simpler), which means you should generally never run into multiple definitions :). If you want to 'use' an import, like using in C++, you can write something like use Json;, which lets you use the contents without referring to Json. In the example above it would just be parse(...) instead of Json.parse(...). Even with use, you still can't use private functions.

If for some reason you 'use' two 'libraries' that have a dual function definition you'll get an error and will most likely have to put one under a namespace/variable, very rarely should you use use :).

isaachier commented 6 years ago

I don't expect a clash in the language necessarily, but in the linker aren't there duplicate definitions for parse if multiple packages define it? Or is it automatically made into Json_parse?

Hejsil commented 6 years ago

@isaachier If you don't define your functions as export fn a() void, then Zig is allowed to rename the functions to avoid collisions.

isaachier commented 6 years ago

OK that makes sense. About package managers, I'm sure I'm dealing with experts here 😄, but wanted to make sure a few points are addressed for completeness.

andrewrk commented 6 years ago

These are important questions.

The first question brings up an even more fundamental question which we have to ask ourselves if we go down the decentralized package route: how do you even know that a given package is the same one as another version?

For example, if FancyPantsJson library is mirrored on GitHub and BitBucket, and you have this:

// in main package
exe.addGitPackage("fancypantsjson", "https://github.com/mrfancypants/zig-fancypantsjson",
    "1.0.1", "76c50794004b5300a620ed71ef58e4444455fd72e7f7e8f70b7d930a040210ff");

// in a nested package
exe.addGitPackage("fancypantsjson", "https://bitbucket.org/mirrors-r-us/zig-fancypants.git",
    "1.0.1", "76c50794004b5300a620ed71ef58e4444455fd72e7f7e8f70b7d930a040210ff");

Here, we know that the library is the same because the sha-256 matches, and that means we can use the same code for both dependencies. However, consider if one was on a slightly newer version:

// in main package
exe.addGitPackage("fancypantsjson", "https://github.com/mrfancypants/zig-fancypantsjson",
    "1.0.2", "dea956b9f5f44e38342ee1dff85fb5fc8c7a604a7143521f3130a6337ed90708");

// in a nested package
exe.addGitPackage("fancypantsjson", "https://bitbucket.org/mirrors-r-us/zig-fancypants.git",
    "1.0.1", "76c50794004b5300a620ed71ef58e4444455fd72e7f7e8f70b7d930a040210ff");

Because this is decentralized, the name "fancypantsjson" does not uniquely identify the package. It's just a name mapped to code so that you can do @import("fancypantsjson") inside the package that depends on it.

But we want to know if this situation occurs. Here's my proposal for how this will work:

comptime {
    // these are random bytes to uniquely identify this package
    // developers compute these once when they create a new package and then
    // never change it
    const package_id = "\xfb\xaf\x7f\x45\x86\x08\x10\xec\xdb\x3c\xea\xb4\xb3\x66\xf9\x47";

    const package_info = @declarePackage(package_id, builtin.SemVer {
        .major = 1,
        .minor = 0,
        .revision = 1,
    });

    // these are the other packages that were not even analyzed because they
    // called @declarePackage with an older, but API-compatible version number.
    for (package_info.superseded) |ver| {
        @compileLog("using 1.0.1 instead of", ver.major, ver.minor, ver.revision);
    }

    // these are the other packages that have matching package ids, but
    // will additionally be compiled in because they do not have compatible
    // APIs according to semver
    for (package_info.coexisting) |pkg| {
        @compileLog("in addition to 1.0.1 this version is present",
            pkg.sem_ver.major, pkg.sem_ver.minor, pkg.sem_ver.revision);
    }
}

The prototype of this function would be:

// these structs are declared in @import("builtin");
pub const SemVer = struct {
    major: @typeOf(1),
    minor: @typeOf(1),
    revision: @typeOf(1),
};
const Namespace = @typeOf(this);
pub const Package = struct {
    namespace: Namespace,
    sem_ver: SemVer,
};
pub const PackageUsage = struct {
    /// This is the list of packages that have declared an older,
    /// but API-compatible version number. So zig stopped analyzing
    /// these versions when it hit the @declarePackage.
    superseded: []SemVer,

    /// This is the list of packages that share a package id, but
    /// due to incompatible versions, will coexist with this version.
    coexisting: []Package,
};

@declarePackage(comptime package_id: [16]u8, comptime version: &const SemVer) PackageUsage

Packages would be free to omit a package declaration. In this case, multiple copies of the package would always coexist, and zig package manager would be providing no more than automatic downloading of a resource, verification of its checksum, and caching.

Multiple package declarations would be a compile error, as well as @declarePackage somewhere other than the first Top Level Declaration in a Namespace.
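The superseded/coexisting split described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the proposed compiler behavior. It assumes that two versions are API-compatible exactly when their major numbers match (a simplification of semver that ignores 0.x rules), and the function name and data layout are invented for the example.

```python
from collections import defaultdict


def resolve_versions(declared):
    """declared: list of (package_id, (major, minor, revision)).

    Returns {package_id: {major: {"used": version, "superseded": [versions]}}}.
    Different majors under one package id are the versions that coexist.
    """
    by_id = defaultdict(lambda: defaultdict(set))
    for package_id, version in declared:
        by_id[package_id][version[0]].add(version)

    result = {}
    for package_id, majors in by_id.items():
        result[package_id] = {}
        for major, versions in majors.items():
            ordered = sorted(versions)
            # The newest compatible version wins; older ones are not analyzed.
            result[package_id][major] = {
                "used": ordered[-1],
                "superseded": ordered[:-1],
            }
    return result


PACKAGE_ID = bytes.fromhex("fbaf7f45860810ecdb3ceab4b366f947")
plan = resolve_versions([
    (PACKAGE_ID, (1, 0, 1)),  # nested package's dependency
    (PACKAGE_ID, (1, 0, 2)),  # main package's dependency
    (PACKAGE_ID, (2, 0, 0)),  # another dependent that needs the new major
])
print(plan[PACKAGE_ID])
```

Here 1.0.2 supersedes 1.0.1, while 2.0.0 is compiled in alongside it.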

Let us consider for a moment, that one programmer could use someone else's package id, and then use a minor version greater than the existing one. Via indirect dependency, they could "hijack" the other package because theirs would supersede it.

At first this may seem like a problem, but consider:

Really, I think this is a benefit of a decentralized approach.

Going back to the API of @declarePackage, here's an example of power this proposal gives you:

const encoding_table = blk: {
    const package_id = "\xfb\xaf\x7f\x45\x86\x08\x10\xec\xdb\x3c\xea\xb4\xb3\x66\xf9\x47";

    const package_info = @declarePackage(package_id, builtin.SemVer {
        .major = 2,
        .minor = 0,
        .revision = 0,
    });

    for (package_info.coexisting) |pkg| {
        if (pkg.sem_ver.major == 1) {
            break :blk pkg.namespace.FLAC_ENCODING_TABLE;
        }
    }

    break :blk @import("flac.zig").ENCODING_TABLE;
};

// ...

pub fn lookup(i: usize) u32 {
    return encoding_table[i];
}

Here, even though we have bumped the major version of this package from 1 to 2, we know that the FLAC ENCODING TABLE is unchanged, and perhaps it is 32 MB of data, so best to not duplicate it unnecessarily. Now even versions 1 and 2 which coexist, at least share this table.

You could also use this to do something such as:

if (package_info.coexisting.len != 0) {
    @compileError("this package does not support coexisting with other versions of itself");
}

And then users would be forced to upgrade some of their dependencies until they could all agree on a compatible version.

However, for this particular use case it would usually be recommended not to do this, since there would be a general Zig command line option to make all coexisting libraries a compile error, for those who want a squeaky clean dependency chain. ReleaseSmall would probably turn this flag on by default.


As for your second question,

Are the packages downloaded independently for each project or cached on the local disk (like maven and Hunter)? In the latter case, you have to consider the use of build flags and their effect on the shared build.

Package caching will happen like this:

Caching is an important topic in the near future of zig, but it does not yet exist in any form. Rest assured that we will not get caching wrong. My goal is: 0 bugs filed in the lifetime of zig's existence where the cause was a false positive cache usage.

andrewrk commented 6 years ago

One more note I want to make:

In the example above I have:

exe.addGitPackage("fancypantsjson", "https://github.com/mrfancypants/zig-fancypantsjson",
    "1.0.2", "dea956b9f5f44e38342ee1dff85fb5fc8c7a604a7143521f3130a6337ed90708");

Note however that the "1.0.2" only tells Zig how to download from a git repository ("download the commit referenced by 1.0.2"). The actual version you are depending on is the one that is set with @declarePackage in the code that matches the SHA-256.

So the package dependency can be satisfied by any semver-compatible version indirectly or directly depended on.

With that in mind, this decentralized strategy with @declarePackage even works if you do any of the following things:

You can also force your dependency's dependency's dependency (and so on) to upgrade, simply by adding a direct dependency on the same package id with a minor or revision bump.

And to top it off you can purposefully inject code into your dependency's dependency's dependency (and so on), by:

This strategy could be used, for example, to add @optimizeFor(.Debug) in some tricky areas you're trying to troubleshoot in a third party library, or perhaps you found a bottleneck in a third party library and you want to add @optimizeFor(.ReleaseFast) to disable safety in the bottleneck. Or maybe you want to apply a patch while you're waiting for upstream to review and accept it, or a patch that will be coming out in the next version but isn't released yet.

andrewrk commented 6 years ago

Another note: this proposal does not actually depend on the self hosted compiler. There is nothing big blocking us from starting to implement it. It looks like:

clownpriest commented 6 years ago

maybe worth considering p2p distribution and content addressing with ipfs?

see https://github.com/whyrusleeping/gx for example

just a thought

costincaraivan commented 6 years ago

One important thing to note, especially for adoption by larger organizations: think about a packaging format and a repo structure that is proxy/caching/mirroring friendly and that also allows an offline mode.

That way the organization can easily centralize their dependencies instead of having everyone going everywhere on the internet (a big no-no for places such as banks).

Play around a bit with Maven and Artifactory/Nexus if you haven't already 😉

andrewrk commented 6 years ago

The decentralized proposal I made above is especially friendly to p2p distribution, ipfs, offline modes, mirroring, and all that stuff. The sha-256 hash ensures that software is built according to expectations, and the matter of where to fetch the resources can be provided by any number of "plugins" for how to download something:
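A small Python sketch of why the pinned hash makes the download source interchangeable: each "plugin" is modeled here as a plain callable returning bytes, and the first source whose content matches the hash wins. The source names and the callable-based plugin shape are assumptions for the example, not part of the proposal.

```python
import hashlib


def fetch_verified(sources, expected_sha256):
    """Try each (name, fetch) pair in order; return the first payload that matches."""
    for name, fetch in sources:
        try:
            data = fetch()
        except OSError:
            continue  # source unreachable, fall through to the next one
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return name, data
        # Wrong content (stale or malicious): ignore it and keep looking.
    raise LookupError("no source produced content matching the pinned hash")


good = b"package contents"
pinned = hashlib.sha256(good).hexdigest()


def offline_mirror():
    raise OSError("mirror is down")


sources = [
    ("corporate-mirror", offline_mirror),
    ("compromised-host", lambda: b"malicious contents"),
    ("upstream", lambda: good),
]
winner, data = fetch_verified(sources, pinned)
print(winner)
```

An organization could put its own mirror first in the list and still get exactly the bytes the project author pinned.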

costincaraivan commented 6 years ago

Looks good but I'd have to try it out in practice before I can say for sure 😄

I'd have one suggestion: for naming purposes, maybe it would be a good idea to also have a "group" or "groupId" concept?

In many situations it's useful to see the umbrella organization from which the dependency comes. Made up Java examples:

  1. group: org.apache, name: httpclient.
  2. group: org.apache, name: regexutils.

Otherwise what happens is that people basically overload the name to include the group, everyone in their own way (apache-httpclient, regexutils-apache). Or they just don't include it and you end up with super generic names (httpclient).

It also prevents or minimizes "name squatting". I.e. the first comers get the best names and then they abandon them...

isaachier commented 6 years ago

Structs provide the encapsulation you are looking for @costincaraivan. They seem to act as namespaces would in C++.

demircancelebi commented 6 years ago

I agree with @costincaraivan. npm has scoped packages for example: https://docs.npmjs.com/getting-started/scoped-packages.

In addition to minimizing name squatting and its practical usefulness (being able to more easily depend on a package if it is coming from an established organization or a well-known developer), honoring the creators of a package besides their creation sounds more respectful in general, and may incentivize people to publish more of their stuff :).

On the other hand, generic package names also come in handy because there is one less thing to remember when installing them.

costincaraivan commented 6 years ago

I didn't want to clutter the issue anymore but just today I bumped into something which is in my opinion relevant for the part I posted about groups (or scoped packages in NPM parlance):

http://bitprophet.org/blog/2012/06/07/on-vendorizing/

Look at their dilemma regarding the options, one of the solutions is forking the library:

Fork and release our own package on PyPI as e.g. fluidity-invoke.

  • This works, but has many of the drawbacks of the vendorizing option and offers few of the benefits.
  • It also confuses things re: project ownership and who should receive/act on bug reports. Users new to the space might focus on your fork instead of upstream, forcing you to either handle their problems, or redirect them.

This would be easily solvable with another bit of metadata, the group. In Java world their issue would be solved by forking the library and then publishing it under the new group. Because of the group it's immediately obvious that the library was forked. Even easier to figure out in a repository browser of sorts since the original version would have presumably many versions while the fork will probably have 1 or 2.

thejoshwolfe commented 6 years ago

Importers provide the name of the package that they will use to import the package. It's ok to have everyone try to name their module httpclient. When you want to import the module, give it whatever identifier you want. There are no name collisions unless you do it to yourself.

Name squatting is not meaningful in a distributed package manager situation. There is no central registry of names. Even in an application, there's no central registry of names. Each package has its own registry of names that it has full control over.

The only collisions possible in this proposal are collisions on the package id, which is a large randomly generated number used to identify if one package dependency is an updated version of another. You can only get collisions on package id if someone deliberately does so.
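For a sense of scale, a package id in the proposal is 16 random bytes, so there are 2^128 possible values. A hedged Python sketch of generating one and printing it in the escaped form used in the earlier snippets (the helper names are invented here):

```python
import secrets


def new_package_id() -> bytes:
    """Generate 16 cryptographically random bytes, done once per new package."""
    return secrets.token_bytes(16)


def as_zig_literal(package_id: bytes) -> str:
    """Format the id as a string literal of \\xNN escapes."""
    return '"' + "".join(f"\\x{b:02x}" for b in package_id) + '"'


package_id = new_package_id()
print(as_zig_literal(package_id))
```

Two authors running this independently will not realistically produce the same id, which is why a collision implies someone copied an id on purpose.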

costincaraivan commented 6 years ago

A package manager cannot be detached from social issues. Yes, technically things would ideally be fully distributed, you would pull files from everywhere. But in real life, let's take the 3 most popular distributed protocols on the net:

All of them have a higher level that effectively centralizes them, or at least makes some nodes in the decentralized system much, much "stronger" than the average node, thereby centralizing the system to a great degree.

Email: Gmail, Microsoft, Yahoo. Probably 80+% of public mail goes through a handful of email hosters.

Bittorrent: torrent trackers, see the outcry when The Pirate Bay went down.

Git: Github 😃 Gitlab, Bitbucket.

A package name tells me what the thing is. Generally it isn't unique, sometimes it's even non-descriptive (utils...). A hash is very precise, but far from human friendly. Any kind of other metadata I can get from the source is greatly appreciated.

What I'm saying is: make the package collection fully distributed but have provisions in the package format for centralization. It will happen anyway if the language becomes popular (Maven, npm.js, Pypi, etc.).

thejoshwolfe commented 6 years ago

make the package collection fully distributed but have provisions in the package format for centralization.

That's already in the proposal.

I'll work on some more clear documentation on how packages will work in Zig, because there seems to be a lot of confusion and misunderstanding here.

magicgoose commented 6 years ago

I think it could be a good thing to also support digital signatures in addition to hashes.
For example, Bob might trust Alice for some reason but not have time to read every diff of Alice's package; in that situation Bob may add a requirement that Alice's package be signed with a key with a specific fingerprint.

419928194516 commented 6 years ago

Hey @andrewrk, I just watched your localhost talk (and backed you, good luck!). You centered the talk around a notion of making perfect software possible. I agree with this sentiment. However, relying on others' work in the way you propose (without additional constraints) leads away from that goal. It seems you've focused primarily on the "how do we have a decentralized store of packages" part of packages, and less on "what packages are", and what that means for creating stable software.

The assumption seems to be "semver and a competent maintainer will prevent incompatibilities from arising". I am asserting that this is incorrect. Even the most judicious projects break downstream software with upgrades. These "upgrades" that fail are a source of fear and frustration for programmers and laymen alike. (See also: the phenomenon of single-file, no-dep C libs, and other languages' dependency-free libraries and projects.)

When you talk about a package:

exe.addGitPackage("fancypantsjson", "https://github.com/mrfancypants/zig-fancypantsjson", "1.0.1", "76c50794004b5300a620ed71ef58e4444455fd72e7f7e8f70b7d930a040210ff");

You have decided that the identity of a package is:
    ID == (name: str, url: url, version: semver, id: package_id, sha: hash) + other metadata

As the consumer of a package, the identity of the package is relevant only to find the package. When working with the package, what matters is only the public API it exposes. For example:

API|1.0.0: {
    const CONST: 1234
    frobnicate: fn(usize) -> !usize  // throws IO
    unused: fn(u8): u8
}

Let's imagine my project only relies on frobnicate and CONST. It follows that I only care about these two declarations. No other information about the version, url, or name matters in the slightest. Whether an upgrade is safe can be determined by observing the signatures of the things that I rely on. (Ideally we'd rely not on the signatures, but on "the exact same behavior given the same inputs", but solving the halting problem is out of scope.)

Some time later, the author of the package releases:

API|1.1.0: {
    const CONST: 1234
    frobnicate: fn(usize) -> !usize // throws IO // now 2x faster
    unused: fn(u8): u8
}
API|1.2.0: { // oops breaking minor version bump, but nobody used frobnicate.. right?
    const CONST: 1234
    unused: fn(u8): u8
}
API|1.3.0: { // added frobnicate back
    const CONST: 1234
    frobnicate: fn(usize) -> !usize // throws IO + BlackMagicError
    unused: fn(u8): u8
}

I cannot safely use API 1.2.0 or API 1.3.0. 1.2.0 breaks the API contract with the omission of frobnicate; 1.3.0 breaks the contract by adding an error that my project (maybe) doesn't know it needs to handle.

Your note here:

// these are the other packages that have matching package ids, but
// will additionally be compiled in because they do not have compatible
// APIs according to semver

implies that I can trust the library author to not make mistakes when evaluating how their library upgrades will proceed in my project. They cannot know that. They should not have to know that. It is the job of the compiler/package manager to understand the relationship between what a package provides and what a project requires. API 1.2.0 and 1.3.0 might as well be completely alien packages from the perspective of frobnicate; the functions just happen to share a name. However, if I only relied on CONST, all upgrades would have been safe.

What I am proposing is that package upgrading should be a deterministic process. I should be able to ask the compiler: "Will this upgrade succeed without modification to my codebase?" I should also be able to ask the compiler "What was incompatible?" to understand the impact of a breaking upgrade before biting the bullet. The compiler needs to look at more than the pointer (id + metadata); it must also look at the value of the package as determined by its API. This is not the check that I want:

Author determined API 1.2.0 supersedes API 1.1.0:
    all OK using API 1.2.0

This is:

{CONST,frobnicate: fn(usize) -> !usize + throws IO} != {CONST,frobnicate: fn(usize) -> !usize + throws IO + BlackMagicError}
    API 1.3.0 returns a new unhandled error type BlackMagicError which results in (trace of problem), do you wish to proceed? (y/N)
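The check being asked for can be sketched as a comparison restricted to the symbols a project actually uses. In this Python illustration each API is a hypothetical mapping from symbol name to a signature string (error sets included), mirroring the API|1.x.0 listings above. A real tool would compare types from the compiler rather than strings.

```python
def incompatibilities(used, old_api, new_api):
    """List problems for the symbols in `used` when moving old_api -> new_api.

    Assumes every name in `used` exists in old_api.
    """
    problems = []
    for name in sorted(used):
        if name not in new_api:
            problems.append(f"{name}: removed")
        elif new_api[name] != old_api[name]:
            problems.append(f"{name}: {old_api[name]} -> {new_api[name]}")
    return problems


# Signature strings are made up to mirror the example APIs in this comment.
API_1_0 = {"CONST": "1234", "frobnicate": "fn(usize) !usize {IO}", "unused": "fn(u8) u8"}
API_1_1 = dict(API_1_0)  # only faster, same signatures
API_1_2 = {"CONST": "1234", "unused": "fn(u8) u8"}
API_1_3 = {**API_1_0, "frobnicate": "fn(usize) !usize {IO, BlackMagicError}"}

mine = {"CONST", "frobnicate"}
for label, api in [("1.1.0", API_1_1), ("1.2.0", API_1_2), ("1.3.0", API_1_3)]:
    print(label, incompatibilities(mine, API_1_0, api) or "safe")
```

A project that only used CONST would see every one of these upgrades reported as safe, which is the distinction a single version number cannot express.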

TL;DR:

tiehuis commented 6 years ago

@419928194516 See also #404.

thejoshwolfe commented 6 years ago

If you're proposing only checking the subset of compatibility that is knowable at comptime, then that sounds like #404. If you're proposing a general distrust of software changes, you can control all the dependencies that go into your application, and only upgrade packages when you choose.

419928194516 commented 6 years ago

@thejoshwolfe I'm basically proposing what Andrew mentioned on #404 an hour after your comment here.

for example when deciding to upgrade you could query the system to see what API changes - that you actually depend on - happened between your current version and the latest.

Major version bump enforcement just means that the API broke for somebody maybe. And it prevents a certain class of error, but crudely. What's relevant to the consumer of the library is what changed for them. And that is inexpressible as a version number, but could be part of a package management system.

Edit: I do mean that subset, and yes I do generally distrust software and people, however well intentioned. If rules are not enforced, they will be broken, and their brokenness will become an unassailable part of the system, barring serious effort. Rust seems to be doing an ok job at undoing mistakes without major breakage, but most other projects and languages don't. See also: the linux kernel's vow to never break userspace, with the attending consequences (positive and negative). Edit2: weird double paste of Edit1? removed.

renatoathaydes commented 6 years ago

I think I've seen this discussion before (just joking, but it's a little bit similar) here :D

Given Zig's goal of allowing programmers to write reliable software, I agree with @419928194516's thoughts... I wrote a little bit about the version problem myself, though my own thoughts were and still are rather unpolished, to be honest... anyway, it seems a lot of good ideas coming from many different people and communities are converging... especially the idea that a version number is really not a good way to handle software evolution (though it still makes sense from a pure "marketing" perspective). I would +1 a proposal to automatically handle version updates and have the compiler (or a compiler plugin?) check that automatically (like japicmp does for Java APIs). This, together with the hash checks, makes Zig capable of offering something quite unique: perfect software evolution ;)

binary132 commented 6 years ago

In case this has not been mentioned yet, I strongly recommend reading this blog series on a better dependency management algorithm for Go.

isaachier commented 6 years ago

Related to @binary132's earlier post, one of the Go package manager developers posted on Medium about his advice for implementing a package manager: https://medium.com/@sdboyer/so-you-want-to-write-a-package-manager-4ae9c17d9527. Old article, but still has some interesting insights.

ghost commented 6 years ago

So sdboyer is actually, as far as I followed the discussion, the developer of dep (which is not the official Go package manager), and if you look at some really long thread, he disagrees with the now-accepted vgo and minimal version selection from Russ, which is now becoming the official Go version manager in Go 1.11.

Anyway, it's probably worth seeing both sdboyer's and Russ's arguments https://sdboyer.io/blog/vgo-and-dep/ although I found sdboyer's hard to follow at times.

xtian commented 6 years ago

Is there an idea of how package discovery would work with this decentralized model? One of the benefits of a centralized system is having a single source for searching packages, accessing docs, etc.

andrewrk commented 6 years ago

Is there an idea of how package discovery would work with this decentralized model? One of the benefits of a centralized system is having a single source for searching packages, accessing docs, etc.

I agree that this is the biggest downside of a decentralized system. It becomes a community effort to solve the problem of package discovery. But maybe it's not so bad. If a popular package discovery solution emerges from the community, people will use it, but won't be locked in to it.

I can imagine, for example, one such implementation where each package is audited by hand for security issues and quality assurance. So you know you're getting a certain level of quality if you search this third party index. At the same time, maybe there's another third party package repository that accepts anything, and so it's a good place to look for more obscure implementations of things.

And you could include dependencies from both at the same time in your zig project, no problem.

Anyway, I at least think it's worth exploring a decentralized model.

binary132 commented 6 years ago

I don't think a centralized model is a good idea. Imagine if C had implemented a centralized model in the 1970's or 1980's.

rishavs commented 6 years ago

One suggestion, the compiler itself should be a package in the repository so that updating the language is as simple as zig update zig. Haxe does this and I love their implementation.

jedahan commented 5 years ago

New to the project, so forgive me if I am missing a ton of context.

tl;dr Instead of add{Git,Http[s]}Package, how about resolve(URI) since URI is reasonably flexible, and make it easy for people to register resolvers for URIs?


I have seen lots of differing requirements about source code management in general, and most of them are completely valid but conflict with one another.

Are hashes necessary to be part of the spec? What about exposing some convenience functions around uri resolvers, and have resolvers (or the caching layer) decide how to handle integrity?

For example with ipfs, hashes are part of the address, but for https, maybe someone will add subresource integrity like so: https+sri://example-repository.com/package/v1.4/package-debug.zig#dea956b9f5f44e38342ee1dff85fb5fc8c7a604a7143521f3130a6337ed90708.

Or maybe someone will add package.lock-like functionality to their own https resolver.

Or maybe someone will register a stackoverflow:// resolver that just searches for the first .zig with the keywords provided in stackoverflow://how+to+quick+sort.

By providing good hooks into resolving URIs, and managing the artifacts, people can implement exactly the semantics they want for different projects, which they would otherwise be hacking around whatever decisions zig makes now.

There are downsides of course, but even if zig does coalesce around a single way of managing external code, building an api around uri resolution and file/code management will help with building that package manager.
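To make the resolver idea concrete, here is a minimal sketch in Python of a scheme-based registry. Everything in it is an assumption for illustration: the names `register_resolver`, `resolve`, and `with_sri`, the in-memory `mem` scheme standing in for a real network fetch, and the convention that the URI fragment carries a SHA-256 hex digest. None of this is an existing Zig API.

```python
import hashlib
from urllib.parse import urlsplit

# Hypothetical registry mapping a URI scheme to a resolver function.
_resolvers = {}


def register_resolver(scheme, resolver):
    """Register a callable that takes a parsed URI and returns the bytes it names."""
    _resolvers[scheme] = resolver


def resolve(uri):
    """Dispatch to whichever resolver is registered for the URI's scheme."""
    parts = urlsplit(uri)
    if parts.scheme not in _resolvers:
        raise LookupError(f"no resolver registered for scheme {parts.scheme!r}")
    return _resolvers[parts.scheme](parts)


# In-memory stand-in for a remote host, so the sketch runs without a network.
FAKE_STORE = {"/package/v1.4/package-debug.zig": b"pub fn main() void {}\n"}


def memory_resolver(parts):
    """Look the path up in the fake store; the host part is ignored."""
    return FAKE_STORE[parts.path]


def with_sri(inner):
    """Wrap a resolver so the URI fragment is checked as a SHA-256 hex digest."""
    def resolver(parts):
        data = inner(parts)
        if hashlib.sha256(data).hexdigest() != parts.fragment:
            raise ValueError(f"integrity check failed for {parts.path}")
        return data
    return resolver


register_resolver("mem", memory_resolver)
register_resolver("mem+sri", with_sri(memory_resolver))
```

In this sketch the integrity policy lives entirely in the resolver: `mem://` performs no check, while `mem+sri://` rejects content whose digest does not match the fragment, which mirrors the `https+sri://` example above.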

andrewrk commented 5 years ago

Another idea I had, as an alternative to random ids for decentralized package management:

Packages could be signed, and contain the public key of who signed them. Then instead of the package having a random id, it has a name that the maintainer chooses. The public key acts as a namespace, so the name only has to be unique to their own public key.

Then there is no concept of "hijacking". Only the package author would be able to publish new versions to supersede old ones.

Third party sites can do code audits and keep track of the "trust" and "quality" of public keys. Zig could support a configuration text file which is a blacklist of known-to-be-compromised public keys, and a third party service could keep this text file updated. Auditing tools could display the dependency tree of a Zig project and break it down by author (pubkey) potentially augmented with data from third party sites. This data could be things like: which pub keys are verified by the third party service, which pub keys are known to have high quality code, which pub keys are known to have a large number of other packages depending on them.

Third party sites could support "log in" by users signing their login token with their private key and then they know you own that pub key, and you could edit metadata, verify your public key, etc.

When upgrading dependencies, you would notice when the public key changes, because that means the package changed authors. Better double check that package before upgrading. When a package changes authors, it would be considered by Zig to be an entirely different package. This would be equivalent to a major version bump - each package that depends on it would have to explicitly publish a new version depending on the new package with the different public key, or otherwise would continue to get the old version, even if another package in the dependency tree explicitly depended on the new version.

This also enables packages to have some metadata, which is verified to be believed by the author, because of the signature. Such metadata might be:

The package signature would take the place of the sha256 hash, and would be stored in the declarative lock data, whatever that ends up being. Once this is in place, package downloads could take place over insecure connections such as HTTP or FTP, because the signature would verify that the contents were not tampered with. Even if an author published a new package without bumping a version number, projects that depend on it would detect the problem when Zig notices multiple packages with the same name/version and different signatures.
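The identity rules in this proposal can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Package` record and the helper names are hypothetical, public keys and signatures are plain strings, and no real cryptographic verification is performed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    pubkey: str     # stand-in for the author's public key
    name: str       # only has to be unique within one public key
    version: str
    signature: str  # stand-in for a signature over the package contents


def identity(pkg):
    """The public key acts as a namespace, so identity is the (pubkey, name) pair."""
    return (pkg.pubkey, pkg.name)


def is_same_package(old, new):
    """A change of public key makes it an entirely different package."""
    return identity(old) == identity(new)


def find_republished(packages):
    """Return (pubkey, name, version) triples seen with more than one signature."""
    seen = {}
    for pkg in packages:
        key = (pkg.pubkey, pkg.name, pkg.version)
        seen.setdefault(key, set()).add(pkg.signature)
    return sorted(key for key, sigs in seen.items() if len(sigs) > 1)
```

An upgrade tool built on these rules would treat `is_same_package(...) == False` as a prompt to double check the new author, and a non-empty result from `find_republished` as a sign that a version was republished without a version bump.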

jedahan commented 5 years ago

I love that this idea requires almost zero support on Zig's side. Even the shipped blocklist of keys can just be part of whatever audit tool a third party creates.

However, the package signature taking the place of the sha256 hash requires people to buy in to self-identification through public keys. It would be great if I could just send someone a URI to something and have them be able to import it without depending on key-signing.

Any audit or linting tool can warn or forbid unsigned code as necessary, and a templating tool for new projects could encourage signature checking.

andrewrk commented 5 years ago

I think that's a good point.

Some use cases I want to support are:

It should be possible for a package to be bundled in all of these ways simultaneously, and have the same signature in all of them. So you could resolve, for example, a website giving a 404 for a project's dependency, by replacing the http URL with a git tag, and leaving the signature the same, and then everything is fixed. You could also provide multiple ways a package could be fetched, e.g. mirrors.
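The mirror fallback described above could look roughly like this Python sketch. It assumes a SHA-256 digest as a stand-in for the package signature, and the function name `fetch_first_valid` and the `(label, fetcher)` pairs are hypothetical.

```python
import hashlib


def fetch_first_valid(sources, expected_sha256):
    """Try each (label, fetcher) pair in order.

    Returns (label, data) for the first source whose content matches the
    expected digest. A source that raises OSError (for example a 404) or
    returns mismatching content is skipped.
    """
    errors = []
    for label, fetch in sources:
        try:
            data = fetch()
        except OSError as exc:
            errors.append(f"{label}: {exc}")
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return label, data
        errors.append(f"{label}: content does not match expected digest")
    raise RuntimeError("all sources failed: " + "; ".join(errors))
```

Because the expected digest stays the same no matter where the bytes come from, swapping a dead HTTP URL for a git tag or a mirror only changes the `sources` list.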

I believe I have realistic expectations of what people are willing to do in practice. For example, with this signing idea, it would only work if there was a single, clear, discoverable command, e.g. zig publish, that "just worked" including generating key pairs and putting the pubkey and signature in the correct places. Of course it would be configurable for those who read the docs and wanted more control over how it worked, but the defaults have to make it the laziest, easiest way to share software with each other.

Note that it will always be possible to give someone a URL and they can use it as a package with no formality - they could simply download the file(s) and then add a package dependency on the main file, either with --pkg-begin ... --pkg-end command line args, or with the Zig Build System. We already have support for this today. You can always bypass the package manager. Reasons to use the package manager are:

The only language support for the package manager is a compiler builtin function that resolves the situation when one file should be superseded by another (see description above).

The other possibility that this facilitates is the idea of "trusted public keys" from which you would be willing to use precompiled binaries. Then a third party service could pre-build packages for various combinations of targets and zig versions, and when using the package manager, if a match is found, you can save time in debug builds by using the pre-compiled artifacts. We're a long way away from something like this being practical, but it's important to consider what will and won't be possible.

Meai1 commented 5 years ago

I want to deposit my long time issues with package managers that I feel are hugely important but always forgotten:

  1. I need to be able to easily switch from a binary-only dependency to debugging and changing that dependency's source code. It is such a pain in the ass to go from using a .jar to debugging that .jar's source code, modifying it, recompiling it. Same story with nuget. And when I say easily, I mean truly easily. I don't want to extract a .zip somewhere or copy it into my project folder. I don't want to have to do anything except press F5 in my IDE to debug and then I can step into the library code. (F5 of course automatically invoking the zig build, which in turn should be running the package manager.) I don't even want to modify my build configuration. It should probably be the default to use source code level debugging for all dependencies anyway. As far as I know golang always uses source level dependencies. I never ever want to do any extra work in gdb, for example telling gdb which directories contain new sources it should look for. That's the job of the package manager to make all this automatic, I don't want to care about that.

  2. I do not like symlinking; it never works properly with networked drives or shared folders, and it causes numerous other issues, for example with copy and paste. Please don't consider doing that in the package manager. I would rather have multiple libraries duplicated on my drive than have them all symlinked somewhere.

  3. I do not believe that a package manager will ever have security and comfort without a social aspect to it, probably in the form of star based ratings. Andrew mentioned that already, there has to be some kind of trust established socially via ratings. Objectively, if I didn't write the code and I didn't review it line by line with 100% confidence, then I'm installing somebody's code blind and that requires trust. It's how all marketplaces (e.g. Amazon, eBay, App Store, Play Store) work: through ratings, and you hope that people would make a big fuss if a developer/seller is releasing broken, dangerous products.

jedahan commented 5 years ago

For some interesting work being done on social trust, signing, code review, etc, check out https://github.com/dpc/crev

BenoitJGirard commented 5 years ago

Seems to me this CppCon talk, from video game industry veteran Scott Wardle, is relevant to this discussion: https://www.youtube.com/watch?v=NlyDUQS8OcQ .

andrewrk commented 5 years ago

@BenoitJGirard in https://github.com/ziglang/zig/issues/855#issuecomment-464392748 you mentioned "Policing the dependency graph for large projects is one of the headaches we have at my day job". I'm very interested in hearing about use cases like this. Can you share more about your pain points?

That goes for everybody in this thread. I want to hear about everybody's real world use cases of other package managers, and in what ways the experience is positive or negative.

Everybody has an opinion on how package managers should work, and not everybody can get their way, but it is certainly my goal that everyone can have their use cases solved reasonably well.

BenoitJGirard commented 5 years ago

Sure, here goes.

At my day job, we have a large code base (millions of lines of code), C++ and C#, all in one repository. The code is divided into packages ("projects", in Microsoft Visual Studio parlance) and the direct dependencies of a package are fairly clear.

It's the indirect (transitive) dependencies that are a pain.

Among our regular problems is someone changing code in package P without realizing that package is used, indirectly, in obscure and infrequently tested application A and breaking a key behavior in that app.

Another is a programmer adding a new dependency and indirectly breaking a poorly-written installer, which does not realize that now an extra library is needed for its application to work.

Compounding this problem, there is no friction to adding dependencies between packages. After all, code is in packages to be reused! So dependency graphs grow and do not shrink.

Not to go too far in imagining solutions, but if we kept track of the full, transitive list of dependencies of each application, we could have the build system emit a warning (or error!) when a new direct or indirect dependency is added to an application, or emit a warning (or error!) if, say, the commit message does not mention all affected applications as having been tested.
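A minimal sketch of that kind of check, in Python, assuming the dependency graph is available as a plain mapping from each package to its direct dependencies. The function names and the graph shape are hypothetical, not part of any existing build system.

```python
def transitive_deps(graph, root):
    """Return every package reachable from root, excluding root itself."""
    seen = set()
    stack = list(graph.get(root, ()))
    while stack:
        pkg = stack.pop()
        if pkg not in seen:
            seen.add(pkg)
            stack.extend(graph.get(pkg, ()))
    seen.discard(root)  # a cycle back to root should not list root as its own dependency
    return seen


def new_dependencies(old_graph, new_graph, app):
    """Direct or indirect dependencies of app that the old graph did not have."""
    return sorted(transitive_deps(new_graph, app) - transitive_deps(old_graph, app))


def affected_apps(graph, apps, changed_pkg):
    """Applications that depend, directly or indirectly, on changed_pkg."""
    return sorted(
        app for app in apps
        if app == changed_pkg or changed_pkg in transitive_deps(graph, app)
    )
```

A build step could call `new_dependencies` on the graphs before and after a change and warn (or fail) on a non-empty result, and compare `affected_apps` against the applications named in the commit message.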

BenoitJGirard commented 5 years ago

Two other pain points from the day job, about NuGet, the C# package manager from Microsoft; these are keeping us from fully embracing NuGet.

  1. It's possible to (mis)configure NuGet so it always downloads the "latest and greatest" referenced packages, and their dependencies, on each build. This leads to the horror of non-deterministic builds, where there is no guarantee that syncing the code to a given change will give you the same executables every time.
  2. If a blocking bug is found in a NuGet package, how do we fix it quickly? The repository for that package has to be found, branched or forked, cloned locally, the code hacked to point to this local copy while we debug, the code changed in the branch/fork, then somehow we must create a new NuGet package from that branch/fork and use that. We find it less troublesome to just have the code of every package directly in our main repository.