ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

new package hash format: `$name-$semver-$hash` #20178

Open andrewrk opened 1 month ago

andrewrk commented 1 month ago

Make package hashes generally more user-friendly, so that it is more practical to interact with package directories on the file system, as well as interact with stack traces, debuggers, and other tooling that uses source code paths.

The current hash format is a hex-encoded multihash SHA-256.

It looks like this:

andy@bark ~> ls ~/.cache/zig/p/
122004fa7e2ff0b3d472049743358f8fdf065cdf63bc0e5e3d54c6bb8d81d93e40da
1220060f743248be7cb57396b491a92e63403afb1d28fff6d1ff5fb06124b008a25e
1220138f4aba0c01e66b68ed9e1e1e74614c06e4743d88bc58af4f1c3dd0aae5fea7
122032707cdf94da394e309978146ee33c61a285300eeb916928af376ec1638a95f1
122048db601b6da2c69d0d783b0b19ff132e9a6d69b77820351d11c6e57553ac9433
12204cfebcccb9fb8a5c7b4a6ec663aea691d180f7d346d36f213b4e154a6be1f823
122050e58ca4d57f5e2cc3d6404691d3040bbe41e76e4ef93b52f2105f1157f7d429
122074e0bf09c3622780e697c11c6744e763dd63777e480baf2b583ee3ab6a02ff14
12207c40cefa38fe90e4230dfba2e5c76b37e1ee36602512cad8ff0501f892002a65
12207d353609d95cee9da7891919e6d9582e97b7aa2831bd50f33bf523a582a08547
1220884c1636f0e6dc92b6e74b97a2d25fe240a77bab9fed3af3e1581f80c3e7256f
12208b3c98b4dcc88608e65889abc853f625a06edbb835da90d902bed1ade4da0ac8
12209083b0c43d0f68a26a48a7b26ad9f93b22c9cff710c78ddfebb47b89cfb9c7a4
1220958bf550739591e62cd55fcd2009e72f9bd6c8168ceb7ad7dd8f92dda0b58a4d
122098b31c5b4412780898de969f7014f5c7d693f10acc8168bff86a811061d829da
1220b3e1fb33317c92f9ead09630f6b4be59e80d0a8780754f8aa4ee7da61cb7b47a
1220bee0fcf98bf6ad75b7bb09ff1f873ca38547a15b1e7a4532d20d94107d8d330a
1220c4a15f871f0784113c34e92e57b2862e7f678a467e5d246a6f2ebfadfca8d116
1220d9c400445c9c3ed46f71ebdbc364b7b349473231884c2f6e540817d7b68553ae
1220db11bb50364857ec6047cfcdf0938dea6af3f24d360c6b6a6103364c8e353679
1220dd6f0bbf4614f338d632473e4b0a879ec26eca445ed305dcdbc6b5cb6405e3cd
1220e783088aadba2eb7324e8dce8c6146c888a6835148dbbdc017ec2b6996a7dab8
1220e920d74980c0794a969e1fc0647c863023acbe935ed244a79ff8ec65f2875023
1220f9bd108d1e7097b27d388a7a65effd503598df61e34a2af02be00b22af567fc7

After this proposal, it would look like this instead:

andy@bark ~> ls ~/.cache/zig/p/
nasm-2.16.1-2-BWdcABvF_jM1
libsoundio-2.0.1-7-BmEKAAr47fud
zlib-1.3.1-3-IQwAAPXlgi9M
libffmpeg-7.0.1-3-ReEHBD4IapnL
StaticHttpFileServer-0.0.0-iDYAACr46GhU
mime-1.0.0-TSAAAANL2H_R
EiBQ5Yyk1X9eLMPWQEaR0wQLvkHnbk75O1LyEF8RV_fU
libvorbis-1.3.8-3-NQ8kAD5eWxrE
openssl-3.3.1-1-KLdkAMs-vt5n
EiB9NTYJ2VzunaeJGRnm2Vgul7eqKDG9UPM79SOlgqCF
libvorbis-1.3.8-2-Hw8kALYtGBJ0
mime-2.0.0-jiAAAL-BobCs
mime-2.0.1-hCAAAC1FfNe4
StaticHttpFileServer-0.0.0-ozYAAOnhf9Zq
cpython-3.11.4-cW8vBMOZSHPt
EiCz4fszMXyS-erQljD2tL5Z6A0Kh4B1T4qk7n2mHLe0
EiC-4Pz5i_atdbe7Cf8fhzyjhUehWx56RTLSDZQQfY0z
pulseaudio-16.1.1-2-kVA2ABuZh0op
EiDZxABEXJw-1G9x69vDZLezSUcyMYhML25UCBfXtoVT
StaticHttpFileServer-0.0.0-eTYAAFRBXp0H
libffmpeg-7.0.1-3-7dgHBCZFa3DD
libsoundio-2.0.1-7-7VwKAIRNMw_X
EiDpINdJgMB5SpaeH8BkfIYwI6y+k17SRKef+Oxl8odQ
nasm-2.16.1-2-J2lcAPu-2VWT

This proposal is to change the hash format to $name-$semver-$sizedhash where:

Package names gain new rules:

The version field gains new rules:

Packages gain new rules:

Packages which lack a build.zig.zon file will have a $hashiname-P-$sizedhash scheme instead:

The hash is broken up this way so that "sizedhash" can be calculated exactly the same way in both cases, and so that "name" and "hashiname" can be used interchangeably in both cases.

Related Future Work

Compatibility

Let's try to keep compatibility with the old hash format for at least 1 release cycle, so that there is 1 release cycle that supports both the old and new format at the same time.

nektro commented 1 month ago
  • - bytes not allowed

this is very common in the existing ecosystem and I'd recommend using _ or -- for the path instead

nektro commented 1 month ago

re the 16-byte name limit, https://github.com/nektro/zig-iso-3166-countrys and https://github.com/nektro/zig-iso-639-languages use package names that are both 17 in length

andrewrk commented 1 month ago

I edited the proposal with these changes:

BratishkaErik commented 1 month ago
  • - bytes not allowed

this is very common in the existing ecosystem and I'd recommend using _ or -- for the path instead

I would suggest separating the components with:

| (vertical bar or pipe)

instead, since all three components (name, SemVer if I understand the spec correctly, and sized-hash) already disallow it. That way - can be allowed in names. Example:

openssl-lib|3.3.1-1|KLdkAMs-vt5n

IMHO it's also easier for both humans and machines to read.

andrewrk commented 1 month ago

| is not allowed in Windows file names. Please see "filesystem-safe name required" in the list above.

BratishkaErik commented 1 month ago

| is not allowed in Windows file names. Please see "filesystem-safe name required" in the list above.

That's what I'm reading. If the | character is disallowed inside all three components on all platforms (IIUC), then surely it can safely act as the separator between the components themselves? Did I miss something here?

nektro commented 1 month ago

the separator needs to be filesystem safe too because this scheme ends up as the name of a folder

alexrp commented 1 month ago

Did I miss something here?

The fact that the directory name would be $name$sep$semver$sep$hash. If $sep is defined to be |, you now have an invalid directory name on Windows.
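To make the constraint concrete, here is a minimal Python sketch that checks a candidate directory name against the characters Windows reserves in file names (`< > : " / \ | ? *` plus ASCII control characters). The helper name `is_windows_safe` is made up for illustration and is not part of Zig. It deliberately ignores other Windows rules, such as trailing dots or spaces.

```python
# Characters Windows does not allow anywhere in a file or directory name.
WINDOWS_RESERVED_CHARS = set('<>:"/\\|?*')


def is_windows_safe(component: str) -> bool:
    """Return True if no character in `component` is reserved on Windows.

    Hypothetical helper: it only checks individual characters, so the
    separator counts just as much as the name, version, and hash do.
    """
    return not any(
        ch in WINDOWS_RESERVED_CHARS or ord(ch) < 32 for ch in component
    )


# The whole directory name is what hits the file system, separators included.
print(is_windows_safe("openssl-lib|3.3.1-1|KLdkAMs-vt5n"))
print(is_windows_safe("openssl-3.3.1-1-KLdkAMs-vt5n"))
```

The first call rejects the `|`-separated form and the second accepts the `-`-separated form, which is the point made above: the separator has to pass the same test as the components.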

BratishkaErik commented 1 month ago

Of course! Thanks to you all! How ignorant of me 🤦, to read and immediately forget such basics. I apologise for the nonsensical message.

squeek502 commented 1 month ago

Note that - is an allowed character within semver, and a version can technically have an arbitrary number of - characters:

A pre-release version MAY be denoted by appending a hyphen and a series of dot separated identifiers immediately following the patch version. Identifiers MUST comprise only ASCII alphanumerics and hyphens [0-9A-Za-z-]. Identifiers MUST NOT be empty. Numeric identifiers MUST NOT include leading zeroes. Pre-release versions have a lower precedence than the associated normal version. A pre-release version indicates that the version is unstable and might not satisfy the intended compatibility requirements as denoted by its associated normal version. Examples: 1.0.0-alpha, 1.0.0-alpha.1, 1.0.0-0.3.7, 1.0.0-x.7.z.92, 1.0.0-x-y-z.--.

I'll throw ~ into the mix as a possible separator.
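A short Python sketch of the point about hyphens in versions. The pattern follows the grammar quoted above (numeric core, optional pre-release, optional build metadata); the function name `prerelease` is chosen here for illustration only.

```python
import re

# One pre-release identifier: a number without leading zeroes, or an
# alphanumeric/hyphen identifier containing at least one non-digit.
_ID = r"(?:0|[1-9]\d*|\d*[A-Za-z-][0-9A-Za-z-]*)"

SEMVER = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"      # major.minor.patch
    rf"(?:-({_ID}(?:\.{_ID})*))?"                    # optional pre-release
    r"(?:\+([0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?",   # optional build metadata
    re.ASCII,
)


def prerelease(version: str):
    """Return the pre-release part of a semver string, or None if absent."""
    m = SEMVER.fullmatch(version)
    if m is None:
        raise ValueError(f"not a valid semver string: {version!r}")
    return m.group(4)


print(prerelease("1.0.0-x-y-z.--"))  # pre-release itself contains hyphens
print(prerelease("2.16.1-2"))        # version taken from the nasm example
print(prerelease("1.0.0"))
```

Because the pre-release part may contain any number of `-` characters, splitting a `$name-$semver-$hash` string on `-` from the left cannot tell where the version ends without extra rules.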

sno2 commented 1 month ago
  • - bytes not allowed

this is very common in the existing ecosystem and I'd recommend using _ or -- for the path instead

I disagree with allowing - even if a workaround is used to fix the issue that squeek describes. Only one obvious separator should be allowed in package names.

andrewrk commented 1 month ago

Regarding the name length limit, thanks @shadeops for doing a bit of legwork:

Pypi has all their package metadata in BigQuery, so [here is the] number of bytes for their package names.

[image: chart of PyPI package name lengths]

The x axis is the number of bytes; the y axis is the number of packages whose name has that number of bytes. The source is Google Cloud's BigQuery:

SELECT BYTE_LENGTH(name) AS bytes_per_name, COUNT(*) AS name_count
FROM (SELECT DISTINCT name FROM `bigquery-public-data.pypi.distribution_metadata`)
GROUP BY bytes_per_name

ianprime0509 commented 1 month ago

I like this proposal. I have a few miscellaneous thoughts:

  1. Since the $sizedhash input is always 9 bytes, it will always encode to 12 bytes of base64, so there shouldn't be any ambiguity in recovering it even if the name and version contain -s. Relying on this does make the format less flexible, though.
  2. For packages without build.zig.zon, the current proposal of a 2-byte header and 31-byte truncated SHA-256 means that the package size information wouldn't be available in the hash. To more closely unify the hash formats for these cases, what if the -$sizedhash part is kept as-is, and the $name-$semver part is replaced with a base64 encoding of the remaining 27 bytes of the SHA-256 digest (36 bytes encoded)?
  3. The other idea for naming tarballs based on their root directory would definitely make them nicer to work with, but since the root directory component is stripped while unpacking, the hash could no longer be calculated from the package contents on disk (unless the current hash is allowed to be used as a "base" to recover that information when recalculating it).
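The byte counts in points 1 and 2 can be checked with a few lines of Python. This is only a sketch: the URL-safe base64 alphabet is an assumption based on the `-` and `_` characters visible in the example listing, and `encoded_len` and `split_sized_hash` are hypothetical helper names.

```python
import base64

SIZED_HASH_CHARS = 12  # base64 length of the 9-byte sized hash


def encoded_len(n_bytes: int) -> int:
    """Length of the base64 text produced for `n_bytes` bytes of input."""
    return len(base64.urlsafe_b64encode(bytes(n_bytes)))


def split_sized_hash(dirname: str):
    """Split off the fixed-width trailing hash, ignoring any '-' before it."""
    head = dirname[: -SIZED_HASH_CHARS - 1]
    sep = dirname[-SIZED_HASH_CHARS - 1 : -SIZED_HASH_CHARS]
    if sep != "-" or not head:
        raise ValueError(f"no 12-character hash suffix in {dirname!r}")
    return head, dirname[-SIZED_HASH_CHARS:]


print(encoded_len(9), encoded_len(27), encoded_len(33))  # 12 36 44
print(split_sized_hash("openssl-3.3.1-1-KLdkAMs-vt5n"))

# The old hex form starts with the multihash header 0x12 0x20, which is why
# the re-encoded legacy entries in the listing all begin with "EiB".
print(base64.urlsafe_b64encode(bytes.fromhex("122050e58ca4")))  # b'EiBQ5Yyk'
```

Taking the last 12 characters recovers the hash even though `KLdkAMs-vt5n` itself contains a `-`. The remaining `openssl-3.3.1-1` still has to be split into name and version by some other rule.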

ifreund commented 1 month ago

I currently use - in both the package name and version of zig-wayland, zig-wlroots, zig-xkbcommon, and zig-pixman. The semantic version for an untagged git master commit of these packages has the form 0.1.0-dev or similar. Only commits that have a git tag get a version without the -dev suffix.

I also think that using _ for the separator would be preferable to - due to the fact that semver allows - and due to the subjective higher prevalence of - in existing package names in the zig ecosystem compared to _.

castholm commented 1 month ago
  • Filesystem-safe name required.

The rules for legal tokens in paths on Windows are (unfortunately) more complicated than just those characters:

  • Do not use the following reserved names for the name of a file:

    CON, PRN, AUX, NUL, COM0, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, COM¹, COM², COM³, LPT0, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9, LPT¹, LPT², and LPT³. Also avoid these names followed immediately by an extension; for example, NUL.txt and NUL.tar.gz are both equivalent to NUL.

Paraphrasing, this means that path components starting with any of the above names immediately followed by a dot are forbidden on Windows, which means that package names like con.zig (console library?) and aux.zig (audio library?) will fail to create their corresponding directories on Windows. Whether Zig should encode these edge cases in the rules for legal package names or consider this an unfixable Windows quirk/bug I don't have any strong opinions on, but it's at least good to be aware of these limitations.
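As a rough illustration of the quoted rule, the Python sketch below flags path components whose part before the first dot is one of the reserved device names. It applies the rule literally as quoted and does not model how any particular Windows version behaves; `is_reserved_on_windows` is a hypothetical name.

```python
# Reserved device names from the quoted list, including the superscript forms.
_DEVICES = {"CON", "PRN", "AUX", "NUL"}
_DEVICES |= {f"{prefix}{d}" for prefix in ("COM", "LPT") for d in "0123456789¹²³"}


def is_reserved_on_windows(component: str) -> bool:
    """True if `component` is a reserved name, optionally followed by an extension."""
    stem = component.split(".", 1)[0]  # everything before the first dot
    return stem.upper() in _DEVICES


for candidate in ("con.zig-1.0.0-AAAAAAAAAAAA", "aux.zig", "con-1.0.0-AAAAAAAAAAAA"):
    print(candidate, is_reserved_on_windows(candidate))
```

Under this reading, a package named `con.zig` is a problem because the directory name starts with `con.`, while a package named plain `con` is not, since the component becomes `con-1`, which is not a device name.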

Related, I wonder if there's much of a point in allowing such a large set of characters in package names. Maybe it would be easier for both the package manager implementation and the users if the set of allowed names was restricted to just "legal unquoted identifier in Zig source code (approximately /[A-Za-z_][A-Za-z0-9_]*/) no longer than 32 characters" or something similarly restrictive.
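A minimal Python sketch of that stricter rule, using the approximate identifier pattern and the 32-character limit suggested above. The names `MAX_NAME_LEN` and `is_valid_package_name` are illustrative, and the check does not exclude Zig keywords.

```python
import re

MAX_NAME_LEN = 32  # limit floated in the comment above, not a settled value
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_package_name(name: str) -> bool:
    """True if `name` looks like an unquoted Zig identifier of bounded length."""
    return len(name) <= MAX_NAME_LEN and _IDENTIFIER.fullmatch(name) is not None


for candidate in ("zig_wlroots", "zig-wlroots", "StaticHttpFileServer", "9lives"):
    print(candidate, is_valid_package_name(candidate))
```

With a rule like this, `-` can never appear in a name, so the only hyphens left to disambiguate are the separators and those inside the version.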

I also want to mention ` ` (space) as a potential separator which won't collide with semver components, though it would also mean that paths like `.cache/zig/p/StaticHttpFileServer 0.0.0 ozYAAOnhf9Zq/build.zig` would need to be quoted in shell scripts and the terminal if that is a concern.

ikskuh commented 1 month ago

Limited to 32 bytes

Another datapoint taken from my projects:

so i guess the bsp here is borderline, but it would still fit

neurocyte commented 1 month ago

Why not just use directories?

i.e. $name/$semver/$sizedhash

That would avoid adding new restrictions (assuming most package names are already valid file names) and provide a much cleaner and easier to browse ~/.cache/zig/p directory.

andrewrk commented 1 month ago

I edited the proposal with these changes:

andrewrk commented 1 month ago

I currently use - in both the package name and version of zig-wayland, zig-wlroots, zig-xkbcommon, and zig-pixman. The semantic version for an untagged git master commit of these packages has the form 0.1.0-dev or similar. Only commits that have a git tag get a version without the -dev suffix.

@ifreund I think you should remove "zig-" from those package names. It's redundant information. No Zig package should have "zig" in the name.

ifreund commented 1 month ago

@ifreund I think you should remove "zig-" from those package names. It's redundant information. No Zig package should have "zig" in the name.

I do not see zig- as redundant information for the project name of, for example, zig-wlroots. The project provides idiomatic Zig bindings for wlroots and Zig is a critical enough part of its identity to be in the project's name. The same basic naming scheme is used for all projects providing wlroots bindings for other languages and I see no reason to deviate. (go-wlroots, wlroots-ocaml, chicken-wlroots, wlroots-rs, hsroots, clwlroots, ...).

I also have no plans to change the name of my git repositories on online code forges to something other than zig-wlroots. The repository name should match the project name.

My intuition tells me that it is least confusing if the package's name matches the name of the repository and the name of the project. Perhaps I am wrong about this but I don't think the decision is as obvious as "It's redundant information."

I do use the plain wlroots name for the module exposed by the zig-wlroots package. This means there is no redundancy in consuming zig code. Users write @import("wlroots") as one would expect.

In any case, I don't see any technical benefit to disallowing - in package names. I see such a change as unnecessary and undesirable ecosystem churn.

Aesthetically, I personally quite like @neurocyte's proposal of using sub directories instead, i.e. .cache/zig/p/zig-wlroots/0.17.0/$HASH/.

That proposal does have complexity tradeoffs though. I also quite like @squeek502's proposal of ~ as a separator and think that forbidding it in package name would cause significantly less churn than forbidding -. I subjectively find the format pleasing as well: zig-wlroots~0.17.0~THISISAHASH.

alexrp commented 1 month ago

@ifreund

The repository name should match the project name.

Is there a compelling reason for this? It seems to me that zig-wlroots as repository name and wlroots as project name in build.zig.zon would be reasonable. The repository name has to disambiguate itself from other wlroots-related projects, but that isn't an issue once we get down to the Zig project level.

ifreund commented 1 month ago

I edited the proposal with these changes:

  • Incorporate @castholm's suggestion to make package names required to be valid Zig identifiers...

This doesn't yet handle the presence of - in semver versions as allowed by the semver spec and used in practice by existing zig projects.

I do agree that requiring package names to be valid zig identifiers is a nice property despite the fact that I'm not excited about dealing with the churn of renaming zig-wlroots, zig-wayland, zig-xkbcommon, and zig-pixman to zig_wlroots, zig_wayland, zig_xkbcommon, and zig_pixman.

The part of this proposal that feels a bit strange to me is using enum literal syntax in the build.zig.zon but disallowing identifiers created with the (valid) .@"zig-wlroots" syntax. This is unexpected behavior IMO given knowledge of the zig language's semantics.

@alexrp I think you have confused "project name" with "package name" in my comment.

alexrp commented 1 month ago

@ifreund Just replace "project name" in my comment with "package name". To be clear, I'm suggesting that naming the repository zig-wlroots and the package just wlroots (matching the module name) would make sense to me. It's worth noting that e.g. wlroots-rs is just wlroots on crates.io, so there is at least some precedent there.

BratishkaErik commented 1 month ago

@ifreund Just replace "project name" in my comment with "package name". To be clear, I'm suggesting that naming the repository zig-wlroots and the package just wlroots (matching the module name) would make sense to me. It's worth noting that e.g. wlroots-rs is just wlroots on crates.io, so there is at least some precedent there.

TLDR: maybe we don't want the package name of the original wlroots project and the package name of the wlroots bindings to clash or be confused with each other.

I think it still makes sense to name the package "zig-wlroots" and not "wlroots": AFAIK, unlike cargo and other language-specific package managers, having C projects with no Zig code use Zig's build system and package manager is one of the main priorities. Put another way, projects like wlroots hypothetically have a much higher chance of adopting build.zig(.zon) than build.rs etc.

If at some point in future SDL or wlroots (or other library) are brought to Zig package manager, IMHO it would be much less awkward to have "wlroots" package name for upstream project and "zig-wlroots" for bindings, rather than both of them having "wlroots" package.

ifreund commented 1 month ago

Using enum literals for zig package names would go along well with another proposal I recall from some time ago (but can't find a link for):

// Imports of zig files use a string literal argument to `@import()`.
const foo = @import("foo.zig");
const bar = @import("foo/bar.zig");

// Imports of packages use an enum literal argument to `@import()`.
const std = @import(.std);
const wlroots = @import(.wlroots);

This would have the advantage of removing some current ambiguity. What if there is both a file called wlroots and a package called wlroots or a file called foo.zig and a package called foo.zig?

BratishkaErik commented 1 month ago

Using enum literals for zig package names would go along well with another proposal I recall from some time ago (but can't find a link for):

This? https://github.com/ziglang/zig/issues/6279#issuecomment-688524037 https://github.com/ziglang/zig/issues/2206#issuecomment-692607482

andrewrk commented 1 month ago

I do not see zig- as redundant information for the project name of, for example, zig-wlroots. The project provides idiomatic Zig bindings for wlroots and Zig is a critical enough part of its identity to be in the project's name. The same basic naming scheme is used for all projects providing wlroots bindings for other languages and I see no reason to deviate. (go-wlroots, wlroots-ocaml, chicken-wlroots, wlroots-rs, hsroots, clwlroots, ...).

I also have no plans to change the name of my git repositories on online code forges to something other than zig-wlroots. The repository name should match the project name.

I think you are getting the project name mixed up with the Zig package name. I am not suggesting to rename your source code repository. I think zig-wlroots is the best name for the source repository.

The prefix "zig-" in the name field of build.zig.zon is, however, entirely redundant and should be omitted. This is so blindingly obvious to me that I'm finding it difficult to even express any reasoning for it.

Can you give a single example for when "zig-" in the zig package name would disambiguate anything?

mlugg commented 1 month ago

To be honest, I would also consider zig-wlroots to be a better package name than wlroots. The name makes it clear that this is a set of bindings to an existing library, rather than the library itself being implemented in Zig. This difference is key enough that I think it's worth making obvious. Naming the package wlroots, in my eyes, implies it is in some sense "authoritative", i.e. that it is the upstream wlroots implementation.

It also better handles the case of multiple sets of competing bindings existing for the same library. e.g. if I create a competing set of wlroots bindings exposing a different API, I perhaps name it something like zlroots (the name being different to avoid any potential confusion with the existing bindings); but if the existing bindings package were named wlroots, that wrongly implies it to be "more official" than mine.

I don't think any value comes from having the project name differ from the Zig package name; this seems to me like nothing but a potential avenue for confusion. (Indeed, our original proposed terminology surrounding the package manager called a "package" a "project"; while we changed this nomenclature for good reason, I think the idea it communicates is still valid, that the project and what we now call the package are the same thing.)

nektro commented 1 month ago

I do use the plain wlroots name for the module exposed by the zig-wlroots package. This means there is no redundancy in consuming zig code. Users write @import("wlroots") as one would expect.


repo name / package name / import name

these three are also independent from what a consumer chooses as the dependency name


the package name isnt used for much afaik (this proposal is the first explicit use im aware of) so it makes sense to me that there's some differences in what people align it with, and I can totally understand why someone might go either way

andrewrk commented 1 month ago

The name makes it clear that this is a set of bindings to an existing library, rather than the library itself being implemented in Zig.

If you're trying to indicate that the package has to do with bindings, then put the word "bindings" in the name.

Or, just keep it bare, to leave room for the fact that you might choose to expose both bindings, and a method of building the library from source with a future version.

In node.js land the convention is to use "node-foo" for the repo name and "foo" for the npm package name. Lots of people redundantly put "node-" also in the npm package name and it was redundant then, too, also, as well.

source: https://docs.npmjs.com/cli/v10/configuring-npm/package-json

Don't put "js" or "node" in the name. It's assumed that it's js, since you're writing a package.json file

silversquirl commented 1 month ago
  • Limited to total file bytes of 4 GiB or less
    • ...or, should the size field saturate for packages bigger than this?

I would be in favour of saturating the size field for larger packages. It's conceivable that Zig packages may need to ship large binaries to avoid compiling things that take a long time (LLVM, Dawn, etc.), or to provide access to libraries that are closed source.

The Zig package manager can also be used to fetch non-code artifacts, such as texture or model data for a game for instance, which obviously can be quite large.
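For reference, saturating could look like the Python sketch below. It assumes the size field is an unsigned 32-bit integer (implied by the 4 GiB limit) stored little-endian; the byte order and the name `saturating_size_field` are assumptions made for illustration, not details taken from the proposal.

```python
U32_MAX = 2**32 - 1  # largest value a 32-bit size field can hold


def saturating_size_field(total_bytes: int) -> bytes:
    """Encode a package size as 4 bytes, clamping anything above U32_MAX."""
    if total_bytes < 0:
        raise ValueError("size cannot be negative")
    # Little-endian is an assumption made for this sketch.
    return min(total_bytes, U32_MAX).to_bytes(4, "little")


print(saturating_size_field(1).hex())          # 01000000
print(saturating_size_field(5 * 2**30).hex())  # ffffffff (5 GiB clamps)
```

The tradeoff is that every package at or above the limit shares the same size value, so for those packages the field stops carrying distinguishing information and only the hash bytes do.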

BratishkaErik commented 1 month ago

In node.js land the convention is to use "node-foo" for the repo name and "foo" for the npm package name. Lots of people redundantly put "node-" also in the npm package name and it was redundant then, too, also, as well.

source: https://docs.npmjs.com/cli/v10/configuring-npm/package-json

Don't put "js" or "node" in the name. It's assumed that it's js, since you're writing a package.json file

But you don't put entire wlroots library into npm ecosystem, like you (potentially) do in zig's ecosystem. They can safely assume "muhaha" package contains only JS bindings for "muhaha" library, again, unlike Zig PM where both "muhaha" library package and bindings package can coexist. Most of the time node ecosystem can't have this name conflict.

I agree that it's redundant to add a "zig-" prefix for Zig projects (like river or libxev), since it's not ambiguous at this level. My disagreement is only about supplementary Zig code for C projects.