woodpecker-ci / woodpecker

Woodpecker is a simple, yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0

Agent stops taking jobs after server throws 5XX errors #4446

Open aaronriedel opened 5 hours ago

aaronriedel commented 5 hours ago

Component

agent

Describe the bug

When the server (running in Kubernetes) restarts, my Docker agent refuses to take new jobs until the agent itself is restarted. The agent logs show several 5XX errors while the server reboots. After that, the agent shows as online in the UI but does not take jobs.

Agent logs: See below

Steps to reproduce

  1. Install the Woodpecker server in Kubernetes
  2. Install an agent on a separate server using Docker
  3. Kill the server so that it is recreated
  4. Trigger a pipeline that would use the Docker agent
  5. See that it stays pending

Expected behavior

The agent should properly reconnect to the Server via gRPC after the server restarts.

System Info

Server: {"source":"https://github.com/woodpecker-ci/woodpecker","version":"2.7.3"}

Helm values:

---
server:
  ingress:
    # -- Enable the ingress for the server component
    enabled: true
    # -- Add annotations to the ingress
    annotations:
      # kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: "true"
    hosts:
      - host: woodpecker.example.com
        paths:
          - path: /
            backend:
              serviceName: woodpecker-svc
              servicePort: 80
    tls:
      - hosts:
          - woodpecker.example.com
        secretName: woodpecker-tls-key
  statefulSet:
    replicaCount: 1
  env:
    WOODPECKER_ADMIN: 'aaron'
    WOODPECKER_HOST: 'https://woodpecker.example.com'
    WOODPECKER_OPEN: true
    WOODPECKER_FORGEJO: true
    WOODPECKER_FORGEJO_URL: 'https://git.example.com'
    WOODPECKER_LOG_LEVEL: "error"
  extraSecretNamesForEnvFrom:
    - woodpecker-forgejo

gRPC Ingress:

---
apiVersion: v1
kind: Service
metadata:
  name: woodpecker-grpc
  namespace: woodpecker
  annotations:
    traefik.ingress.kubernetes.io/service.serversscheme: h2c
spec:
  selector:
    app.kubernetes.io/instance: woodpecker
    app.kubernetes.io/name: server
  ports:
    - name: grpc
      protocol: TCP
      port: 9000
      targetPort: grpc
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/tls-acme: "true"
    traefik.ingress.kubernetes.io/loadbalancer.server.scheme: h2c
    traefik.ingress.kubernetes.io/service.serversscheme: h2c
  name: woodpecker-grpc
  namespace: woodpecker
spec:
  rules:
    - host: "woodpecker-grpc.apps.example.com"
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: woodpecker-grpc
                port:
                  name: grpc
  tls:
    - hosts:
        - woodpecker-grpc.apps.example.com
      secretName: woodpecker-grpc-tls-key

docker-compose config for agent:

services:
  woodpecker-agent-1:
    container_name: woodpecker-agent-1
    image: woodpeckerci/woodpecker-agent:latest
    command: agent
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - WOODPECKER_SERVER=woodpecker-grpc.apps.example.com:443
      - WOODPECKER_AGENT_SECRET=${WOODPECKER_AGENT_SECRET}
      - WOODPECKER_MAX_WORKFLOWS=4
      - WOODPECKER_FILTER_LABELS="backend=docker"
      - WOODPECKER_BACKEND_DOCKER_ENABLE_IPV6=true
      - WOODPECKER_GRPC_SECURE=true
      - WOODPECKER_GRPC_VERIFY=true
    labels:
      - "com.centurylinklabs.watchtower.enable=true"

Additional context

Agent logs:

{"level":"info","time":"2024-11-23T08:44:52Z","message":"starting Woodpecker agent with version '2.7.3' and backend 'docker' using platform 'linux/amd64' running up to 4 pipelines in parallel"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:26:59Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:00Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:01Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:02Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:04Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:06Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:12Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:19Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:24Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:34Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:39Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:53Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:00Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:15Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:29Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:40Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:54Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:29:02Z","message":"grpc error: report_health(): code: Unavailable"}
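The logs above are easier to read when tallied by message. The following is a minimal sketch, not part of Woodpecker; `logLine` and `countMessages` are hypothetical names, and the sample in `main` is an abbreviated excerpt with the `error` field left out. In the full log, `runner done with error` appears four times at 14:27:21, which matches `WOODPECKER_MAX_WORKFLOWS=4`, and only `report_health()` warnings follow. One possible reading, not confirmed in this thread, is that all four runner loops exited while the health reporter kept going.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logLine mirrors the fields visible in the agent's JSON log output.
type logLine struct {
	Level   string `json:"level"`
	Time    string `json:"time"`
	Message string `json:"message"`
}

// countMessages tallies log lines by their "message" field.
// Lines that are not valid JSON are skipped.
func countMessages(logs string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var l logLine
		if err := json.Unmarshal(sc.Bytes(), &l); err != nil {
			continue
		}
		counts[l.Message]++
	}
	return counts
}

func main() {
	// Abbreviated excerpt of the agent log (the "error" field is omitted).
	logs := `{"level":"error","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"warn","time":"2024-11-23T14:27:24Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","time":"2024-11-23T14:27:34Z","message":"grpc error: report_health(): code: Unavailable"}`

	for msg, n := range countMessages(logs) {
		fmt.Printf("%d x %s\n", n, msg)
	}
}
```

Piping the complete agent log through a tally like this makes it quick to check whether any `next()` call is logged again after the server is back.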

zc-devs commented 4 hours ago

Does it work if you deploy an agent in Kubernetes (direct Agent-Server connection, not via Traefik)?

JFYI, that is my IngressRoute, which worked a couple of months ago:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: woodpecker-server
spec:
  entryPoints:
  - websecure
  routes:
  - kind: Rule
    match: Host(`wp.domain.tld`)
    services:
    - name: woodpecker-server
      port: http
  - kind: Rule
    match: Host(`wp.domain.tld`) && Headers(`Content-Type`, `application/grpc`)
    services:
    - name: woodpecker-server
      port: grpc
      scheme: h2c

However, I didn't restart the server, if I remember correctly.

aaronriedel commented 3 hours ago

The Kubernetes agents work fine and are not affected by the problem. It is very likely that most of the 5XX errors come from Traefik. However, I would also expect the agent not to poop itself when there are errors for a few seconds.

Matching on the `application/grpc` content type is a good hint, I might implement this. I currently don't use IngressRoute objects and instead configure normal Ingresses with annotations.

zc-devs commented 3 hours ago

"received unexpected content-type \"text/plain; charset=utf-8\"" errors come from Traefik

I think so and I had this.
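The two gRPC codes in the agent logs line up with the HTTP status of the non-gRPC responses: 503 is reported as `Unavailable` and 500 as `Unknown`. This follows the HTTP-to-gRPC status code mapping published in the gRPC documentation, which clients apply when a response carries no gRPC status. The sketch below restates that table; `grpcCodeForHTTPStatus` is a hypothetical helper written for illustration, not a function from grpc-go.

```go
package main

import (
	"fmt"
	"net/http"
)

// grpcCodeForHTTPStatus returns the gRPC status code name that a client
// synthesizes for an HTTP response without a gRPC status, following the
// mapping table in the gRPC documentation.
func grpcCodeForHTTPStatus(status int) string {
	switch status {
	case http.StatusBadRequest: // 400
		return "Internal"
	case http.StatusUnauthorized: // 401
		return "Unauthenticated"
	case http.StatusForbidden: // 403
		return "PermissionDenied"
	case http.StatusNotFound: // 404
		return "Unimplemented"
	case http.StatusTooManyRequests, // 429
		http.StatusBadGateway,         // 502
		http.StatusServiceUnavailable, // 503
		http.StatusGatewayTimeout:     // 504
		return "Unavailable"
	default:
		return "Unknown"
	}
}

func main() {
	// The two statuses seen in the agent logs.
	for _, s := range []int{http.StatusServiceUnavailable, http.StatusInternalServerError} {
		fmt.Printf("HTTP %d -> %s\n", s, grpcCodeForHTTPStatus(s))
	}
}
```

So the codes alone do not say whether the server or the proxy in front of it produced the error; the `text/plain` content type is the stronger hint that the response did not come from a gRPC endpoint.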

The agent should properly reconnect

{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:24Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:34Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:39Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:53Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:00Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:15Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:29Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:40Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:54Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:29:02Z","message":"grpc error: report_health(): code: Unavailable"}

It seems to be trying.

Do you have two ingresses: one for HTTP, another for gRPC? Could you show the HTTP one?