zarf-dev / zarf

DevSecOps for Air Gap & Limited-Connection Systems. https://zarf.dev/
Apache License 2.0

Improve SBOM storage/format #3068

Open marshall007 opened 3 weeks ago

marshall007 commented 3 weeks ago

Describe what should be investigated or refactored

Currently the sboms.tar layer contains both JSON documents and generated HTML for an "SBOM viewer" page for each of the images in the Zarf package. The current approach has several downsides:

  1. adds non-trivial overhead to the size of every Zarf package stored in OCI
  2. downloading individual SBOMs (ex. for a particular image) is impossible
  3. the JSON documents are in a Syft-specific JSON format (not SPDX or CycloneDX) and thus not consumable by other tooling, like Trivy
  4. there is no way to tell what specific tooling/version was used to generate the SBOMs

$ oras blob fetch ghcr.io/defenseunicorns/packages/uds/gitlab@sha256:3269b4c33b0d452e6935fc7782dbf64da197b2e6225eb0ba7699831a6cabe877 --output sboms.tar
$ ls -lah sboms.tar
-rw-rw-r-- 1 marshall007 marshall007  77M Oct  3 15:07 sboms.tar

For comparison, compressed tarballs that contain only the JSON documents are less than a tenth of the size:

$ ls -lah sboms.tar
-rw-rw-r-- 1 marshall007 marshall007  77M Oct  3 15:07 sboms.tar
-rw-rw-r-- 1 marshall007 marshall007 5.4M Oct  3 15:09 sboms.tar.gz
-rw-rw-r-- 1 marshall007 marshall007 1.7M Oct  3 15:09 sboms.tar.xz
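
A rough way to reproduce this kind of comparison for any extracted sboms.tar is sketched below. It assumes the SBOM documents inside the archive end in ".json"; the function name json_only_size is hypothetical and not part of Zarf.

```shell
# Sketch: compare an SBOM tarball with a gzip-compressed, JSON-only repack.
# Assumption: SBOM documents inside the archive end in ".json".
json_only_size() {
  tarball="$1"
  workdir="$(mktemp -d)"
  mkdir "$workdir/unpacked"
  tar -xf "$tarball" -C "$workdir/unpacked"
  # Repack only the JSON documents, gzip-compressed.
  (cd "$workdir/unpacked" && find . -type f -name '*.json' -print0 \
    | tar --null -czf "$workdir/json-only.tar.gz" -T -)
  # Print "<original bytes> <json-only compressed bytes>".
  echo "$(( $(wc -c < "$tarball") )) $(( $(wc -c < "$workdir/json-only.tar.gz") ))"
  rm -rf "$workdir"
}
```

Running json_only_size sboms.tar prints the two byte counts side by side, which makes it easy to check how much of a given layer is HTML rather than SBOM data.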

Proposed solution

  1. adopt standard SPDX JSON format
  2. store SBOM documents as OCI artifacts using in-toto attestations
  3. if we wish to keep the HTML "SBOM viewer", consider baking it into the CLI tool (i.e. start a web server that looks at local or remote SPDX JSON documents)

Racer159 commented 3 weeks ago

Currently Zarf has prioritized an agnostic format for SBOMs to capture the maximum amount of data that Syft (the tool Zarf uses under the hood) can give Zarf. The Syft JSON files can be downconverted to other formats and conversion is covered in the latter half of this docs section: https://docs.zarf.dev/ref/sboms/#extracting-a-packages-sbom

AustinAbro321 commented 3 weeks ago

For the tooling/version used are you looking to see Zarf or Syft?

As of v0.41.0, the Syft json has .descriptor.name and .descriptor.version, which evaluate to Zarf and Zarf version respectively. Additionally, under the .schema field there's the schema version of the Syft json.
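
A quick way to inspect these fields across extracted SBOM files is sketched below. It requires jq, and it assumes the schema version lives at .schema.version; the function name sbom_generator_info is hypothetical.

```shell
# Sketch: print file, generator name, generator version, and schema version
# (tab-separated) for each Syft JSON document given as an argument.
# Requires jq. Assumption: the schema version is stored at .schema.version.
# Missing or empty fields are printed as "<unset>".
sbom_generator_info() {
  for f in "$@"; do
    jq -r --arg file "$f" '
      def show: if . == null or . == "" then "<unset>" else tostring end;
      [$file,
       (.descriptor.name | show),
       (.descriptor.version | show),
       (.schema.version | show)]
      | @tsv' "$f"
  done
}
```

For example, sbom_generator_info *.json in a directory of extracted SBOMs prints one line per document, so any with an unset generator version stand out.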

marshall007 commented 2 weeks ago

Thanks guys, the Syft JSON makes sense. I'm sold.

> For the tooling/version used are you looking to see Zarf or Syft?

I think I'd expect to see Syft, but maybe this is not so important after all. I'm still looking into it, but maybe all that matters is the schema version. I need to see if different syft versions produce different results (outside of schema version changes).

> As of v0.41.0, the Syft json has .descriptor.name and .descriptor.version, which evaluate to Zarf and Zarf version respectively.

I see these fields, but Zarf is failing to populate .descriptor.version in the SBOMs I've looked at so far.


Another thing I discovered today is that Zarf is not preserving the original manifest digests in the generated SBOM. Here is the diff between the .source section of an SBOM in the Zarf package vs what I get from scanning with syft directly:

[image: diff of the .source section, SBOM from the Zarf package vs. a direct syft scan]

AustinAbro321 commented 1 week ago

This is good to know, thanks for doing this analysis. The media types changing makes sense. I'm not sure why syft writes the media types of layers in that format, even when the manifest in the registry is already using the newer vnd.oci.image.layer.v1.

Losing the architecture, manifest, and manifest digests is a bit concerning, though. The differences likely have to do with the fact that we're using the equivalent of syft scan oci-dir under the hood. It looks like it's missing a lot of information that should be in the image manifest in Zarf; however, it does have the layers, which it must get from the manifest.

marshall007 commented 1 week ago

@AustinAbro321 something we're wrestling with on the sec-hub implementation is what to do when multiple Zarf packages include the same container images, but with slightly different SBOMs. We're noticing that syft does not really guarantee deterministic output even for the same syft-json schema version.

A good example is comparing these two SBOMs between 10.6.0-uds.0-upstream and 10.6.0-uds.1-upstream (which, by definition, include identical images):

# download artifacts
crane blob ghcr.io/defenseunicorns/packages/uds/sonarqube:10.6.0-uds.0-upstream@sha256:3c3c927030c26b05efa1c0504ed79722f747d46d4b671bd99c870ad5f7d72e42 | tar -Ox docker.io_library_sonarqube_10.6.0-community.json > sonarqube-uds.0.json
crane blob ghcr.io/defenseunicorns/packages/uds/sonarqube:10.6.0-uds.1-upstream@sha256:d0e3ff0e4e26779571a30e0633cede9e6692179e817c573da320fc09d6a0fcea | tar -Ox docker.io_library_sonarqube_10.6.0-community.json > sonarqube-uds.1.json

# structural diff
jd -color -mset -setkeys "artifacts.purl" sonarqube-uds.{0..1}.json

[image: structural diff output for the two SonarQube SBOMs]

I thought that this was the result of bumping the syft dependency, but it turns out both packages were built using the same Zarf version (v0.36.1).
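
As a coarser complement to the structural diff, the sketch below lists package URLs that appear in only one of two SBOMs. It requires jq and assumes each document has an .artifacts array whose entries carry a .purl field; the function name purl_diff is hypothetical.

```shell
# Sketch: list package URLs present in only one of two Syft JSON SBOMs.
# Requires jq. Assumption: each document has .artifacts[] entries with .purl.
purl_diff() {
  a="$(mktemp)"; b="$(mktemp)"
  jq -r '.artifacts[]?.purl // empty' "$1" | sort -u > "$a"
  jq -r '.artifacts[]?.purl // empty' "$2" | sort -u > "$b"
  # comm -3 hides lines common to both inputs:
  # unindented lines are only in the first SBOM, tab-indented lines only in the second.
  comm -3 "$a" "$b"
  rm -f "$a" "$b"
}
```

For the files downloaded above, this would be invoked as purl_diff sonarqube-uds.0.json sonarqube-uds.1.json. It only shows packages added or dropped between scans, not changes in per-package metadata.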


tl;dr: I think this is all good evidence suggesting we should store SBOMs as attestations and not bundle them with the Zarf package. We should be periodically rescanning packages so we can provide richer SBOMs (more metadata, up-to-date syft-json schema version, new/improved catalogers, etc).