youki-dev / youki

A container runtime written in Rust
https://youki-dev.github.io/youki/
Apache License 2.0

Kubernetes node e2e tests fail while deleting a container #730

Open harche opened 2 years ago

harche commented 2 years ago

I replaced the runc binary with youki to run the Kubernetes node e2e tests against youki. Deleting a container fails: youki cannot load the container state and reports a missing `ociVersion` field.

I0223 05:59:31.852925   40220 kubelet.go:2138] "SyncLoop (housekeeping) end"
E0223 05:59:31.873594   40220 remote_runtime.go:510] "RemoveContainer from runtime service failed" err=<
    rpc error: code = Unknown desc = failed to delete container f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b: `/usr/local/bin/runc --root /run/runc delete --force f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b` failed: [DEBUG crates/youki/src/main.rs:92] 2022-02-23T05:59:31.857998519+00:00 started by user 0 with ArgsOs { inner: ["/usr/local/bin/runc", "--root", "/run/runc", "delete", "--force", "f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b"] }
    [DEBUG crates/youki/src/commands/delete.rs:8] 2022-02-23T05:59:31.858176980+00:00 start deleting f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b
    Error: could not load state for container f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b

    Caused by:
        missing field `ociVersion` at line 1 column 14569
      (exit status 1)
 > containerID="f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b"
E0223 05:59:31.873957   40220 kuberuntime_gc.go:146] "Failed to remove container" err=<
    rpc error: code = Unknown desc = failed to delete container f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b: `/usr/local/bin/runc --root /run/runc delete --force f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b` failed: [DEBUG crates/youki/src/main.rs:92] 2022-02-23T05:59:31.857998519+00:00 started by user 0 with ArgsOs { inner: ["/usr/local/bin/runc", "--root", "/run/runc", "delete", "--force", "f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b"] }
    [DEBUG crates/youki/src/commands/delete.rs:8] 2022-02-23T05:59:31.858176980+00:00 start deleting f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b
    Error: could not load state for container f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b

    Caused by:
        missing field `ociVersion` at line 1 column 14569
      (exit status 1)
 > containerID="f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b"

Let me know if you would like to see the complete journal logs.

utam0k commented 2 years ago

@harche Thanks for your report. Could you tell me which commands I can use to reproduce this?

harche commented 2 years ago

After cloning kubernetes and bringing up crio, run:

sudo make test-e2e-node RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/crio/crio.sock" FOCUS="\[NodeConformance\]|\[NodeFeature:.+\]" SKIP="\[Flaky\]|\[Slow\]|\[Serial\]" TEST_ARGS='--kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}"'

This works on Fedora CoreOS, but you can also run these tests with the CRI implementation of your choice.

harche commented 2 years ago

We run these tests in upstream k8s CI (with runc and crio) - https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-node-e2e-conformance

You can click on an individual green box to get the test report, then click Raw Build-log.txt to see how the test job gets initialized.

harche commented 2 years ago

Another pointer - https://github.com/kubernetes/kubernetes/blob/master/hack/e2e-node-test.sh

harche commented 2 years ago

But eventually it boils down to this command,

/usr/local/bin/runc --root /run/runc delete --force f6b72c56564b7cc16dfc7492cda08f1624035114212ab27b28668be0b052ea4b

So you may not actually have to deal with k8s node e2e to reproduce this. @utam0k
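To narrow this down without the e2e suite, it may help to check which state files under the runtime root lack the field named in the error. The sketch below is only an illustration: it assumes each container's state is stored at <root>/<container-id>/state.json, and the helper name states_missing_field is hypothetical. A state file written by a different runtime than the one loading it would be one possible way to end up with a missing top-level `ociVersion`, but that is an assumption, not something confirmed in this thread.

```python
import json
from pathlib import Path


def states_missing_field(root, field="ociVersion"):
    """Return container IDs under `root` whose state.json is unreadable
    or has no top-level `field`.

    Assumes the layout <root>/<container-id>/state.json (an assumption,
    not taken from the issue).
    """
    flagged = []
    for state_file in sorted(Path(root).glob("*/state.json")):
        container_id = state_file.parent.name
        try:
            state = json.loads(state_file.read_text())
        except (OSError, json.JSONDecodeError):
            # A state file that cannot be read or parsed is flagged as well.
            flagged.append(container_id)
            continue
        if not isinstance(state, dict) or field not in state:
            flagged.append(container_id)
    return flagged
```

For the root used in the failing command, this would be called as states_missing_field("/run/runc"), most likely as root since the directory is usually not world-readable.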

laupse commented 1 year ago

Hello, I'm bumping this since I ended up in the same spot. Since I saw that #968 got merged, I replaced runc with youki (cp youki /usr/sbin/runc) on the worker nodes. But pods are not starting, and fail with Error: failed to create containerd task: failed to create shim: OCI runtime create failed: runc did not terminate successfully: exit status 1: unknown

And if I list the containers using runc -r /var/run/containerd/runc/k8s.io/ list, I get:

[ERROR crates/youki/src/main.rs:138] 2023-02-28T12:21:15.767166815+00:00 error in executing command: missing field `ociVersion` at line 1 column 9558
Error: missing field `ociVersion` at line 1 column 9558

utam0k commented 1 year ago

I'm fixing this in https://github.com/containers/youki/pull/1884