woodpecker-ci / woodpecker

Woodpecker is a simple, yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0

Cannot run pipeline on Kubernetes: CreateContainerError #2510

Open dominic-p opened 1 year ago

dominic-p commented 1 year ago

Component

agent

Describe the bug

I'm trying to get started with the Kubernetes backend (installed via the official helm chart). When I try to run a pipeline, the container gets stuck with CreateContainerError.

Error: container create failed: chdir: No such file or directory

After some debugging, it seems like the issue is related to the spec.workingDir being set to a directory that doesn't exist at the time that the container is starting. I'm not exactly an expert here, but maybe we could leave the working dir unset and then change directories after the repo is cloned?
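For reference, the generated step Pod looks roughly like the fragment below (container/volume names and the image are illustrative, not the exact generated ones); the PVC is mounted at /woodpecker, but workingDir points at a subdirectory that nothing has created yet when the runtime tries to start the container:

spec:
  containers:
    - name: wp-step
      image: public.ecr.aws/docker/library/alpine
      # fails at container create time: this directory does not exist yet on the empty volume
      workingDir: /woodpecker/src/git-repo
      volumeMounts:
        - name: workspace
          mountPath: /woodpecker
  volumes:
    - name: workspace
      persistentVolumeClaim:
        claimName: wp-workspace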

I can provide more debugging information, if that would be helpful.

System Info

{"source":"https://github.com/woodpecker-ci/woodpecker","version":"next-24204ecdeb"}

This is a vanilla Kubernetes 1.28.2 cluster installed via kubeadm. The nodes are Debian 12. The container runtime is CRI-O which was installed following the install guide here.

Additional context

I'm running CRI-O (I'm not sure if that's relevant), and I discussed this issue a bit on the CRI-O repo here.

I'm using Forgejo as my forge. If I comment out the spec.workingDir in the Pod yaml and attempt to run the Pod manually I get an error like:

+ git fetch --no-tags --depth=1 --filter=tree:0 origin +master:
error: RPC failed; HTTP 403 curl 22 The requested URL returned error: 403
fatal: expected flush after ref listing
exit status 128

I'm not sure if that's another issue or somehow related, but I thought it was worth mentioning.


zc-devs commented 1 year ago

Is there a /woodpecker folder? What are the permissions (ls -Z for SELinux)? The same questions for /woodpecker/src/git-repo.

if I override the entrypoint to just get a shell, it drops me in the root dir

Can you create some folder under /woodpecker manually? Under /woodpecker/src/git-repo?

Test pipeline below, please:

skip_clone: true
steps:
  test:
    image: alpine
    commands:
      - echo Hello from test

Does it work?

Applied AppArmor profile crio-default to container

What if you disable AppArmor?

dominic-p commented 1 year ago

Thanks for looking into this! Ok, I tried using the given test pipeline, and I get the same error. My Pod events look like this (note that I had to switch to the AWS registry as I got rate limited by docker hub):

Events:
  Type     Reason   Age                From     Message
  ----     ------   ----               ----     -------
  Normal   Pulling  13s (x2 over 13s)  kubelet  Pulling image "public.ecr.aws/docker/library/alpine"
  Normal   Pulled   13s                kubelet  Successfully pulled image "public.ecr.aws/docker/library/alpine" in 205ms (205ms including waiting)
  Normal   Pulled   13s                kubelet  Successfully pulled image "public.ecr.aws/docker/library/alpine" in 171ms (171ms including waiting)
  Warning  Failed   12s (x2 over 13s)  kubelet  Error: container create failed: chdir: No such file or directory

Since the container is failing at such an early stage (the container never gets created), it's really difficult for me to get a shell to check what the file system looks like. The approach I wound up taking was this:

  1. Start the pipeline
  2. Dump the YAML for the generated Pod and PVC via kubectl get -o yaml
  3. Delete the failing Pod
  4. Modify the dumped YAML to remove extraneous status and annotations that will cause it to fail
  5. Modify the Pod yaml to set spec.workingDir to /
  6. Apply both the Pod and PVC yaml (the container state goes to "Completed" quickly)
  7. Create a debug copy of the completed pod: kubectl debug pod-name -it --copy-to=my-debug --container=container-name -n namespace -- /bin/sh

I wanted to show the process to see if you think this is a valid approach to debugging and also as a reference for me since it took me a while to figure it out. :)
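Roughly, the commands involved (the pod/PVC names and the namespace below are placeholders for whatever the agent actually generated):

# 1-2. dump the generated objects while the step pod is failing
kubectl -n woodpecker get pod wp-step-pod -o yaml > pod.yaml
kubectl -n woodpecker get pvc wp-workspace -o yaml > pvc.yaml
# 3. delete the failing pod
kubectl -n woodpecker delete pod wp-step-pod
# 4-5. edit pod.yaml/pvc.yaml by hand: drop status and problematic annotations, set the container workingDir to /
# 6. recreate both objects
kubectl -n woodpecker apply -f pvc.yaml -f pod.yaml
# 7. get a shell in a copy of the completed pod
kubectl -n woodpecker debug wp-step-pod -it --copy-to=my-debug --container=wp-step -- /bin/sh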

Now that I have a shell on the debug pod I can answer a couple of your questions:

  1. There is a /woodpecker folder. It is empty.
  2. Yes, I can manually create files and folders inside /woodpecker without issues.
  3. Here's the output of ls -al. If I run ls -Z I get "ls: unrecognized option: Z"
# whoami
root

# ls -al /woodpecker
total 8
drwxrwsrwx    2 root     root          4096 Oct  5 06:24 .
  4. I tried disabling AppArmor with a Pod annotation, but it didn't make a difference. The container still fails to create unless I remove the workingDir config (or set the working dir to /).
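(For anyone following along: disabling AppArmor for a single container is normally done with a per-container annotation on the Pod, along the lines of the snippet below, where the container name is a placeholder for the generated step container.)

metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/wp-step: unconfined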

By the way, if I look at the logs from the completed container it does appear to have run successfully:

+ echo Hello from test
Hello from test

Let me know if there's any other debugging I can do on my end. As far as I can tell the working dir needs to exist before the container is started and on my system it doesn't. I'm not sure what's responsible for creating it, but maybe an initContainer could be used to run something like mkdir -p /woodpecker/src/git-repo before the main container is started?
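Sketching what I have in mind (names and the image are illustrative, and this is not what the backend generates today): the init container starts from the image's default directory, so it can run even though the workspace path doesn't exist yet, and it creates the directory on the shared volume before the step container needs it.

spec:
  initContainers:
    - name: init-workspace
      image: public.ecr.aws/docker/library/alpine
      # no workingDir set, so it can start; it just creates the workspace path on the shared volume
      command: ["mkdir", "-p", "/woodpecker/src/git-repo"]
      volumeMounts:
        - name: workspace
          mountPath: /woodpecker
  containers:
    - name: wp-step
      image: public.ecr.aws/docker/library/alpine
      workingDir: /woodpecker/src/git-repo
      volumeMounts:
        - name: workspace
          mountPath: /woodpecker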

zc-devs commented 1 year ago

I wanted to show the process

Nice debug approach, thanks for sharing 👍 Could you update System Info with OS information and add a link to the installation manual for CRI-O on that distro?

I tried disabling AppArmor with a Pod annotation

I don't think anymore that AppArmor is the cause of the error, but I meant disabling it completely, temporarily. When you use the local path provisioner with SELinux, there have to be some policies in place, and K3s had issues with that.

Besides, I've found a similar issue in Podman. Taking that bug into account, please test the pipeline below, which overrides the workspace:

workspace:
  base: "/woodpecker"
  path: "/"

steps:
  test:
    image: alpine
    commands:
      - echo Hello from test

It should generate a Pod like:

  containers:
    - name: wp-01hbzthp86cx3k762kvyv0007e-0-clone
      image: docker.io/woodpeckerci/plugin-git:2.1.1
      workingDir: /woodpecker
      env:
        - name: CI_WORKSPACE
          value: /woodpecker


I'm not sure what's responsible for creating it

If I understand correctly, it depends on the container runtime. I guess CRI-O (like Podman) just throws an error, while containerd creates the subfolder. BTW, workingDir is set here by the Agent.

initContainer could be used to run something like mkdir -p /woodpecker/src/git-repo before the main container is started?

Good idea! Try this one also.

dominic-p commented 1 year ago

Yahtzee! The workspace config did the trick. I didn't bother testing an initContainer given that the initial approach worked, but I imagine that would solve the problem as well. I also updated the OP as requested with more system info.

I'm not sure if this should be considered a documentation issue (e.g. CRI-O users need to configure the workspace manually), or if some change can/should be made to the Kubernetes backend. Out-of-the-box CRI-O support would be nice, but, unfortunately, I'm not a Go programmer, so I wouldn't be able to help much with a PR.


Secondary question: Now that I'm unstuck, I'm running into a new error trying to get my buildah container to run. I think it might be due to the securityContext of the container. I tried configuring it like this (building off of the docs here), but it doesn't seem to be working. If I dump the Pod YAML there is no securityContext config.

workspace:
  base: "/woodpecker"
  path: "/"

steps:

  # We'll remove this eventually, but for now it's nice just to make sure that the most basic
  # pipeline step works
  test:
    image: public.ecr.aws/docker/library/alpine
    commands:
      - echo Hello from test

  # The real work is done here. Build and push the container image
  build:
    image: quay.io/buildah/stable:v1.31
    commands:
      - /bin/sh ./build.sh
    backend_options:
      kubernetes:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1000
          runAsGroup: 1000
          fsGroup: 1000
          capabilities:
            add: ["SETFCAP"]

Is it not possible to set the security context via the pipeline config? Or am I just doing it wrong? I can also open a separate issue about this if that would be preferable.

zc-devs commented 1 year ago

some change can/should be made to the kubernetes backend

Run an InitContainer with mkdir -p $CI_WORKSPACE? :) As we found a workaround and there are only a few users of crun (maybe you're even the first), I would close the issue. But I would like to mention the "Podman does not create working directory from image" issue here.

if this should be considered a documentation issue (e.g. CRI-O users need to configure the workspace manually)

If you write an installation manual, then a link in Posts & tutorials would be the right place, I think. Anyway, this issue looks like documentation by itself.


Should be asked in Discussions or Matrix/Discord ;)

error trying to get my buildah container to run

🤣 You're not the first and not the last, I believe. As always, I push people to use kaniko 😄

Is it not possible to set the security context via the pipeline config?

No (almost): only Resources, serviceAccountName, nodeSelector, tolerations, and Volumes for now. But you can run in privileged mode.
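If I remember correctly, privileged mode is controlled on the server side by listing the image in the escalation list, roughly like this (check the exact variable name and format in the server configuration docs before relying on it):

# server environment, e.g. via the Helm chart values
WOODPECKER_ESCALATE: "plugins/docker,plugins/gcr,plugins/ecr,quay.io/buildah/stable"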

dominic-p commented 1 year ago

Ok, I experimented a bit more with buildah, and I have a working test implementation running on my cluster. I would just need a couple of additional configuration options to make it work with Woodpecker. I opened #2545 to discuss that.

I have looked at kaniko a few times, but I really like that buildah lets me use a regular shell script to build my container images. I don't want to learn/workaround all of the gotchas that come with the Dockerfile format.

I'll leave this open for the time being in case you do want to implement an initContainer on the clone step. It seems like a good idea to me as it can't really hurt the current user base, and it would make CRI-O work out of the box. But, as you say, this isn't really a huge userbase right now.

For the time being I opened a PR to add a bit of documentation about this to the main docs website.

zc-devs commented 1 year ago

initContainer on the clone step

The problem is that the clone step doesn't always exist. If we run the init container only for the clone step and skip_clone: true is set, then you'll get the same issue in the first step. Should we run an init container for all steps?

Or should we account for the skip_clone option?

InitContainer approach

  1. skip_clone: false
     1.1 For the init container in the clone step, set workingDir to /woodpecker and create the subdirectory src/git-repo.
     1.2 For the clone container in the clone step, set workingDir to /woodpecker/src/git-repo.
     1.3 Run the next steps with workingDir=/woodpecker/src/git-repo.

  2. skip_clone: true
     2.1 Run all steps with workingDir=/woodpecker.

Clone plugin approach

  1. skip_clone: false
     1.1 For the clone step, set workingDir to /woodpecker; the plugin creates the subdirectory src/git-repo, then clones the repo.
     1.2 Run the next steps with workingDir=/woodpecker/src/git-repo.

  2. skip_clone: true
     2.1 Run all steps with workingDir=/woodpecker.


Then there is WorkingDir: step.WorkingDir, which is a single string: in order to implement the first point, we would have to supply the workspace instead of the working dir in pod.go. Probably not only in the Kubernetes backend, but in the others too.

2.1. Run all steps with workingDir=/woodpecker

What if I have a custom workspace?

workspace:
  base: "/woodpecker"
  path: "/subdir"

"Why does my container run in /woodpecker, not in /woodpecker/subdir?" would be the user's question (and an issue here :)

Another concern is that the issue with the working dir not being created is not unique to Kubernetes; it also exists in Podman at least. And if we implement a Podman backend (a PR or FR exists here), should we duplicate that tricky logic?


The best solution would be some option on the crun side (Podman, CRI-O), I believe.

dominic-p commented 1 year ago

That does get a bit more complicated than I originally thought. That said, at the end of the day the problem is that we need to make sure the workingDir exists before the Pod is started. If we can't rely on the clone step always running, could we insert some kind of "init" step that always runs? Its job would be to just run mkdir -p $CI_WORKSPACE. That way we sidestep all of the complexity of whether or not skip_clone is set or whether the user configured a custom workspace, and we don't tax performance on every step. A similar approach could probably be taken with the Podman backend as well.
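To make the idea concrete, I'd imagine the injected container looking something like this in the generated Pod (names are illustrative, and this is only a sketch of the proposal, not existing behavior); the important part is that its workingDir is the volume mount itself, which kubelet always creates, so it can start even when $CI_WORKSPACE doesn't exist yet:

  - name: wp-init-workspace
    image: public.ecr.aws/docker/library/alpine
    # the volume mount point always exists, unlike the nested workspace path
    workingDir: /woodpecker
    command: ["sh", "-c", "mkdir -p $CI_WORKSPACE"]
    env:
      - name: CI_WORKSPACE
        value: /woodpecker/src/git-repo
    volumeMounts:
      - name: workspace
        mountPath: /woodpecker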

I think the problem with implementing a fix at the crun level is that Podman and CRI-O can also use runC (I think they do by default on some distros). So, if CRI-O or Podman still choke when the working dir doesn't exist, adding an option to crun will only solve the problem for a portion of the userbase.

zc-devs commented 1 year ago

As I understand it, the problem is that the crun developers made this behavior intentional, and then made an exception for Dockerfiles.

What runtime do you use? I use the runc runtime via containerd. And the vast majority of users are on that side, I expect, because I don't see a flurry of issues.

adding an option to crun will only solve the problem for a portion of the userbase

Yep, the portion that is broken. The other part (runc) works just fine.

could we insert some kind of "init" step that does always run?

Nothing (almost) is impossible. Let's hear from other developers.

dominic-p commented 1 year ago

Good point. I am using crun, so that could be the issue. I'm not sure if it's CRI-O or crun (or both) that doesn't like the non-existent working dir. If CRI-O doesn't care, and crun wants to change its behavior to match the rest of the ecosystem, Bob's your uncle.

It would be a pretty big project for me to configure a cluster to run CRI-O with runC just to test this, but it might be worth it.

dominic-p commented 1 year ago

Interesting update from the CRI-O devs here. It appears that this behavior is not by design and they would be open to a PR to fix it. Again, Go isn't my thing, so this might be out of reach for me. But, at the very least, we know a fix would be welcome.