woodpecker-ci / woodpecker

Woodpecker is a simple, yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0

Inefficient handling of verbose log output #3999

Open hg opened 1 month ago

hg commented 1 month ago

Component

server, agent

Describe the bug

Hello. I recently migrated some CI jobs from Drone and noticed their running time increasing from 4 seconds to about two minutes.

The reason seems to be rather inefficient handling of long job output. Moving it from PostgreSQL into JSON files (WOODPECKER_LOG_STORE) improved running time by ~10%, but it's still much slower than it should be.

The agent is running on an adjacent VM, there's pretty much no network latency between the server and the agent.

Disabling detailed logging reduced their running time to the same few seconds. The actual jobs run rsync --itemize-changes on large PHP projects (~15-20k files each, one line of output per file) or run tests with very detailed logging, but it's easier to reproduce with a synthetic benchmark:

# woodpecker
when:
  event: push

steps:
  log:
    image: alpine:3.20
    commands:
      - seq 0 50000
# drone
kind: pipeline
type: docker
name: test

trigger:
  event: push

steps:
  - name: log
    image: alpine:3.20
    commands:
      - seq 0 50000

On my CI server, Drone runs this in less than two seconds, while Woodpecker finishes in just under five minutes. Both use the same VM and the same Docker instance.

As Woodpecker is much nicer overall and is actually FOSS, I have no desire to go back. Any way we can fix this?

I don't write Go daily, so instead of profiling properly I simply checked whether the bottleneck is in the storage procedure. While my change helped with a synthetic benchmark:

BenchmarkLogAppend-16  269067   4411 ns/op   // before
BenchmarkLogAppend-16  2715338  442.3 ns/op  // after

the more realistic test described previously was sped up by only a few seconds.
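
For what it's worth, this is roughly the shape of that kind of micro-benchmark (saved as a `*_test.go` file and run with `go test -bench .`). It does not use Woodpecker's real store; it only contrasts writing one log line per call with writing the same lines in batches. Compare the MB/s figures rather than ns/op, since the batched benchmark writes 100 lines per operation:

```go
// Illustrative only: NOT Woodpecker's storage code.
package logstore

import (
	"os"
	"path/filepath"
	"testing"
)

var line = []byte("2024-07-01T12:00:00Z some fairly typical log line\n")

func BenchmarkAppendPerLine(b *testing.B) {
	f, err := os.Create(filepath.Join(b.TempDir(), "step.log"))
	if err != nil {
		b.Fatal(err)
	}
	defer f.Close()
	b.SetBytes(int64(len(line)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := f.Write(line); err != nil { // one write call per line
			b.Fatal(err)
		}
	}
}

func BenchmarkAppendBatched(b *testing.B) {
	f, err := os.Create(filepath.Join(b.TempDir(), "step.log"))
	if err != nil {
		b.Fatal(err)
	}
	defer f.Close()
	batch := make([]byte, 0, 100*len(line))
	for i := 0; i < 100; i++ { // 100 lines written in a single call
		batch = append(batch, line...)
	}
	b.SetBytes(int64(len(batch)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := f.Write(batch); err != nil {
			b.Fatal(err)
		}
	}
}
```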

Any thoughts? Maybe I'm doing something very wrong and missed some knob in the docs?

There was a similar problem in #2064 and it's marked as "fixed".


server:

  woodpecker:
    image: woodpeckerci/woodpecker-server
    restart: always
    environment:
      - WOODPECKER_OPEN=true
      - WOODPECKER_HOST=https://woodpecker.example
      - WOODPECKER_GITEA=true
      - WOODPECKER_GITEA_URL=https://gitea.example
      - WOODPECKER_GITEA_CLIENT=xxx
      - WOODPECKER_GITEA_SECRET=xxx
      - WOODPECKER_AGENT_SECRET=xxx
      - WOODPECKER_DATABASE_DRIVER=postgres
      - WOODPECKER_DATABASE_DATASOURCE=postgres://woodpecker:xxx@db:5432/woodpecker?sslmode=disable
      - WOODPECKER_LOG_STORE=file
      - WOODPECKER_LOG_STORE_FILE_PATH=/var/log/ci-logs
      - WOODPECKER_ADMIN=xxx
      - WOODPECKER_GRPC_SECRET=xxx
    networks:
      - woodpecker
      - db
    volumes:
      - ./data/woodpecker/log:/var/log/ci-logs
      - ./data/woodpecker/data:/var/lib/woodpecker
    ports:
      - 10.24.0.1:9000:9000

agent:

docker run --detach \
  --restart=always \
  --volume=/var/run/docker.sock:/var/run/docker.sock \
  --env=WOODPECKER_SERVER=10.24.0.1:9000 \
  --env=WOODPECKER_AGENT_SECRET=xxx \
  --name=woodpecker \
  --volume=woodpecker-agent-config:/etc/woodpecker \
  woodpeckerci/woodpecker-agent

Steps to reproduce

See repro above.

Expected behavior

Simple jobs with tens of thousands of lines of output should finish in 1-2 seconds.

System Info

- 2.7.0, the latest `next`, and built from trunk
- both agent and server are running under Docker 20.10.24
- Debian 12 amd64

Additional context

No response


zc-devs commented 1 month ago

Confirmed on next-2d039f85cb in Kubernetes.

Pipeline:

skip_clone: true
steps:
  log:
    image: alpine:3.20
    commands:
      - seq 0 50000

The Pod succeeds in a few seconds and I can see the end of the logs in OpenLens. Then it waits about 6 minutes for termination while the Agent uploads the logs to the Server. During this time the logs window in the UI is empty. Then the pipeline finishes, the Pod is deleted, and the logs show up in the UI.

agent.log pod.yaml ui.jpg

Edit 1

Confirmed on 2.6.1. In both cases Postgres 16 is used as the database. While the logs are uploading, I see load like this:

> kubectl top pods -n devops
NAME                                 CPU(cores)   MEMORY(bytes)
gitea-0                              3m           142Mi
woodpecker-agent-68ccbc4d87-wjb4v    142m         17Mi
woodpecker-server-8546fd8f89-xgj47   337m         62Mi

> kubectl top pods -n database
NAME         CPU(cores)   MEMORY(bytes)
postgres-0   260m         124M
hg commented 1 month ago

I don't have much experience with Go, but I poked around the codebase and hacked this together:

Web clients are still being fed logs piecemeal. This does not seem to have much of an effect though, at least on decent hardware, so I have not changed it.

The 50k line job mentioned above now finishes in 2 seconds. If no one experienced with the project picks up this problem, maybe I'll try to have a proper go at it in a few weeks.


Edit: I guess there should also be a counter for how much log data the agent has queued for batching, to avoid consuming all available memory with logs from a runaway CI job.
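
For what it's worth, here is a minimal sketch of that agent-side idea, assuming nothing about Woodpecker's actual code: buffer log lines, upload them in batches, and track how many bytes are queued so a runaway job cannot grow the buffer without bound. All names (LogEntry, upload, maxBytes) are made up for the example:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type LogEntry struct {
	StepUUID string
	Line     string
}

type batcher struct {
	mu       sync.Mutex
	pending  []LogEntry
	queued   int                    // bytes currently buffered
	maxBytes int                    // flush immediately past this point
	upload   func([]LogEntry) error // stand-in for the call to the server
}

func newBatcher(maxBytes int, upload func([]LogEntry) error) *batcher {
	b := &batcher{maxBytes: maxBytes, upload: upload}
	go func() {
		for range time.Tick(time.Second) { // periodic flush for quiet jobs
			b.flush()
		}
	}()
	return b
}

func (b *batcher) Add(e LogEntry) {
	b.mu.Lock()
	b.pending = append(b.pending, e)
	b.queued += len(e.Line)
	full := b.queued >= b.maxBytes
	b.mu.Unlock()
	if full {
		b.flush() // back-pressure: drain before accepting more
	}
}

func (b *batcher) flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending, b.queued = nil, 0
	b.mu.Unlock()
	if len(batch) == 0 {
		return
	}
	// One upload per batch instead of one call per line.
	if err := b.upload(batch); err != nil {
		fmt.Println("upload failed:", err)
	}
}

func main() {
	b := newBatcher(1<<20, func(batch []LogEntry) error {
		fmt.Printf("uploading %d lines\n", len(batch))
		return nil
	})
	for i := 0; i < 50000; i++ {
		b.Add(LogEntry{StepUUID: "step-1", Line: fmt.Sprintf("line %d", i)})
	}
	b.flush() // final flush when the step ends
}
```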

zc-devs commented 1 month ago

> cache the "step UUID → step ID" lookup instead of repeating it for every line. I think the real solution is to rewrite that code to work on many lines at a time

I think a cache is a good solution, but it should be a "production"-grade cache with eviction.
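
Something along these lines, perhaps: a small bounded LRU keyed by step UUID, where a miss falls back to the database lookup and the oldest entry is evicted once the cache is full. This is only a sketch (not Woodpecker's code, and not concurrency-safe); loadStepID stands in for the real lookup, and an existing LRU package would do just as well:

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	uuid string
	id   int64
}

type stepIDCache struct {
	cap   int
	order *list.List               // most recently used at the front
	items map[string]*list.Element // uuid -> element holding an *entry
	load  func(uuid string) (int64, error)
}

func newStepIDCache(capacity int, load func(string) (int64, error)) *stepIDCache {
	return &stepIDCache{
		cap:   capacity,
		order: list.New(),
		items: make(map[string]*list.Element),
		load:  load,
	}
}

func (c *stepIDCache) Get(uuid string) (int64, error) {
	if el, ok := c.items[uuid]; ok {
		c.order.MoveToFront(el) // refresh recency on a hit
		return el.Value.(*entry).id, nil
	}
	id, err := c.load(uuid) // miss: fall back to the expensive lookup
	if err != nil {
		return 0, err
	}
	c.items[uuid] = c.order.PushFront(&entry{uuid: uuid, id: id})
	if c.order.Len() > c.cap { // evict the least recently used entry
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).uuid)
	}
	return id, nil
}

func main() {
	lookups := 0
	loadStepID := func(uuid string) (int64, error) {
		lookups++ // pretend this is a SELECT against the steps table
		return 42, nil
	}
	cache := newStepIDCache(128, loadStepID)
	for i := 0; i < 50000; i++ {
		cache.Get("step-uuid-1") // 50k log lines from a single step
	}
	fmt.Println("database lookups:", lookups) // 1 instead of 50000
}
```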

fernandrone commented 1 month ago

To add some context: we just updated from 2.3 to 2.7, and before that we didn't have any issues with verbose logs (we had different issues, such as https://github.com/woodpecker-ci/woodpecker/issues/2253, which seems to be fixed).

Now we're hitting very serious performance problems with verbose logs. For example, a somewhat misbehaved static analysis step on a Java project that generated ~66k lines of log output used to take 3 minutes; now it was stuck for an hour and never finished.

zc-devs commented 1 month ago

On next-289f530b2b the same pipeline takes 3 minutes instead of 6, but the behavior and load are the same.

Edit: Forgot to mention that this is a single-node installation, so there's no network latency between the components.

zc-devs commented 1 month ago

Maybe a message broker, for example Kafka, would help decouple the Agent and the Server where logs are concerned.
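
As a rough sketch of that decoupling idea (in-process, with a bounded channel standing in for a broker like Kafka): whatever receives log lines only enqueues them, and a separate consumer persists them in batches, so a slow store does not stall the producer. Names and batch sizes here are made up, not taken from Woodpecker:

```go
package main

import (
	"fmt"
	"time"
)

type LogEntry struct {
	StepID int64
	Line   string
}

func main() {
	queue := make(chan LogEntry, 10000) // bounded buffer between producer and consumer
	done := make(chan struct{})

	// Consumer: drain the queue, persisting up to 500 lines at a time.
	go func() {
		defer close(done)
		batch := make([]LogEntry, 0, 500)
		flush := func() {
			if len(batch) == 0 {
				return
			}
			fmt.Printf("persisting %d lines\n", len(batch)) // stand-in for the DB write
			batch = batch[:0]
		}
		ticker := time.NewTicker(time.Second) // flush partial batches periodically
		defer ticker.Stop()
		for {
			select {
			case e, ok := <-queue:
				if !ok {
					flush()
					return
				}
				batch = append(batch, e)
				if len(batch) == cap(batch) {
					flush()
				}
			case <-ticker.C:
				flush()
			}
		}
	}()

	// Producer: the code path that receives lines from the agent just enqueues.
	for i := 0; i < 50000; i++ {
		queue <- LogEntry{StepID: 1, Line: fmt.Sprintf("line %d", i)}
	}
	close(queue)
	<-done
}
```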

achristianson commented 5 days ago

We're running 2.7.0, installed via Helm, with a basic k8s configuration. We're also seeing extremely slow logs. Notably, the k8s build pod starts and finishes, while the Woodpecker UI slowly shows the logs line by line. It finishes many minutes or hours after the actual build, depending on the quantity of log output.