woodpecker-ci / woodpecker

Woodpecker is a simple, yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0
4.16k stars 360 forks source link

Improve step logging #3722

Closed anbraten closed 3 months ago

anbraten commented 4 months ago

Some testing similar to #2072

closes #2064 closes #3468 closes #2253 closes #3716 related #2072

Enhancements

Some inspiration from

woodpecker-bot commented 4 months ago

Deployment of preview was successful: https://woodpecker-ci-woodpecker-pr-3722.surge.sh

anbraten commented 4 months ago

@zc-devs @fernandrone @eliasscosta @praveen-livspace @davidhazell-tst (Hope all of you are actually interested in this topic, sorry if not)

I did some refactoring and hopefully achieved some improvements regarding the log streaming. Using the below workflow for testing I haven't received any more errors now. Could you verify that the :pull_3722 image solves your issues as well.

skip_clone: true

steps:
    - name: hello
      image: alpine
      commands: 
        - echo "hello world"
        - date +%s%N
        - apk add openssl
        - openssl rand 10000000 | base64 -w 0
        - date +%s%N
        - echo done

CC @6543 Thanks for your initial work in #2072

zc-devs commented 4 months ago

It does not always work for me.

  1. Often, the end of log doesn't appear until page refreshing Screenshot 2024-05-24 121548 F5 Screenshot 2024-05-24 121632 Even though page refreshes automatically a couple of times (step list and logs components disappear/appear).
  2. Sometimes step hasn't started yet Screenshot 2024-05-24 123628 F5 Screenshot 2024-05-24 123702
  3. Non-UTF stuff doesn't work (openssl rand 1000) Screenshot 2024-05-24 125549 Cancel Screenshot 2024-05-24 125704 agent.log

1, 2 probably are not related here, cause it is UI-Server interaction. 3 is mentioned in the PR, but I don't know if you actually want to fix it in this PR though.


limit max size per log line to 3mb

Seems, it works. Previously it was over 400 lines. Currently it is near 20:

woodpecker=# SELECT t.line, left(convert_from(t.data, 'UTF-8'), 30), octet_length(t.data::text) FROM log_entries AS t WHERE step_id = 535;

 line |              left              | octet_length
------+--------------------------------+--------------
    0 | + echo "hello world"           |           42
    1 | hello world                    |           24
    2 | + date +%s%N                   |           26
    3 | 1716544320                     |           22
    4 | + apk add openssl              |           36
    5 | fetch https://dl-cdn.alpinelin |          156
    6 | fetch https://dl-cdn.alpinelin |          166
    7 | (1/1) Installing openssl (3.3. |           72
    8 | Executing busybox-1.36.1-r28.t |           74
    9 | OK: 9 MiB in 15 packages       |           50
   10 | + openssl rand 10000000 | base |           76
   11 | nR/CT1Cy5ikpYMdOX9fqHrR6JNxX9V |      6291458
   12 | 6L0fZCCf5fufzoxtxGarbwstEyL1UH |      6291458
   13 | zHy74s1DcMm6SCTUuMC5D3/qC/38e1 |      6291458
   14 | SFgfKlnu3/WJxs+XCAZcZLpfLk24bh |      6291458
   15 | BYJdQCUwsmpImaAbz4Mwdzk4OoNWYk |      1500874
   16 | 1716544324                     |           22
   17 | + echo done                    |           24
   18 | done                           |           10
(19 rows)
anbraten commented 4 months ago

Thanks a lot for the testing and detailed reporting.

Often, the end of log doesn't appear until page refreshing

Strange. I haven't noticed such mismatches in the UI (like showing logs and the step hasn't started yet at the same time) yet. It seems that it is somehow missing data send via the SSE stream. Could you check if the browser console shows any errors?

The bug that previous logs are not in the list when reloading with f5 is still a thing I have to fix (maybe in another PR).

Non-UTF stuff doesn't work (openssl rand 1000)

Will check that one again.

anbraten commented 4 months ago

Changing grpc log messages to use bytes instead of string seems to have fixed openssl rand 1000 now as well.

zc-devs commented 4 months ago

Yeeaa, it works now! Thank you!

Screenshot 2024-05-24 211854 woodpecker-agent.log

zc-devs commented 4 months ago

Could you check if the browser console shows any errors?

2 is partially my fault.

I open the test repo in WP and then manually run the pipeline. To gather clean logs, I often restart the Server and Agent pods before run the pipeline. Then download the logs from OpenLens.

So, I ran the pipeline a couple of times. All went well, there were no errors in console. Then I restarted the Server and got them. Then I manually ran the pipeline and got step hasn't started yet. Screenshot 2024-05-24 213033 Screenshot 2024-05-24 213135 The event call was 404'ed and didn't try to reconnect. After refreshing the page all goes OK again.

However, look at four failed api/stream/logs/... requests. This is from ran the pipeline a couple of times mentioned above. Seems, streams do not close after leaving pipeline execution page (to repo page). Edit: and even to repo list page - Screenshot 2024-05-24 221819.

1 reproduces sometimes with 10M log. I run pipeline. Some piece of logs displayed, fill the display at least. Then step list and logs components disappear/appear again. Then I scroll down to the end - log is trimmed. There are no failed HTTP requests, but an error in the console unexpected non-whitespace character after JSON data Screenshot 2024-05-24 221205

anbraten commented 4 months ago

The event call was 404'ed and didn't try to reconnect. After refreshing the page all goes OK again.

I think that is basically how SSE works. I will probably have a look at the migration to a websocket at some point.

Seems, streams do not close after leaving pipeline execution page (to repo page).

That should be fixed now.

1 reproduces sometimes with 10M log. I run pipeline. Some piece of logs displayed, fill the display at least. Then step list and logs components disappear/appear again. Then I scroll down to the end - log is trimmed. There are no failed HTTP requests, but an error in the console unexpected non-whitespace character after JSON data

Might have fixed that one now.

davidhazell-tst commented 4 months ago

@anbraten I tried upgrading my agents to :pull_3722 and am seeing a GRPC version mismatch error:

{"level":"info","time":"2024-05-24T18:32:23Z","caller":"/src/shared/logger/logger.go:101","message":"log level: debug"}
{"level":"error","error":"GRPC version mismatch","time":"2024-05-24T18:32:23Z","caller":"/src/cmd/agent/core/agent.go:143","message":"server version 2.4.1 does report grpc version 7 but we only understand 8"}
{"level":"fatal","error":"GRPC version mismatch","time":"2024-05-24T18:32:23Z","caller":"/src/cmd/agent/core/run.go:51","message":"error running agent"}

Do I need to also update my server version to :pull_3722?

I am currently running server version docker.io/woodpeckerci/woodpecker-server:v2.4.1

anbraten commented 4 months ago

@davidhazell-tst Yes I later on updated some grpc related code and you therefore need to update the server as well. (It only has a tiny change, so there wont be any data migration and you can simply switch back if wanted)

zc-devs commented 4 months ago

docker.io/woodpeckerci/woodpecker-agent@sha256:328e3662285ce545e4081ef30299c039af72cd09a269bc642c24a51fc479acb2

streams do not close after leaving pipeline execution page (to repo page)

still reproduces Screenshot 2024-05-28 183709

Might have fixed that one now

As well as 1 Screenshot 2024-05-28 184941 two times out of five it fails Screenshot 2024-05-28 185715

Besides, I've got This step hasn't started yet at sixth run without restarting the server Screenshot 2024-05-28 190140

woodpecker-server.log

davidhazell-tst commented 4 months ago

@anbraten We applied the new builds to both the server and agent and our issue appears to be resolved.

anbraten commented 4 months ago

I still had no success replicating the mentioned issues with streams being not stopped, re-connected, etc. Might that even be related to some reverse-proxy setup 🤔 However as the general feedback seems that the changes are already quite an improvement I would move further improvements to new PRs to merge this one.

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 22.38267% with 215 lines in your changes missing coverage. Please review.

Project coverage is 25.84%. Comparing base (044c3a6) to head (041c2c3). Report is 5 commits behind head on main.

Files Patch % Lines
pipeline/rpc/mocks/peer.go 7.78% 151 Missing and 3 partials :warning:
pipeline/rpc/proto/woodpecker_grpc.pb.go 0.00% 22 Missing :warning:
pipeline/log/utils.go 67.74% 7 Missing and 3 partials :warning:
agent/logger.go 0.00% 7 Missing :warning:
cli/exec/line.go 0.00% 7 Missing :warning:
server/api/stream.go 0.00% 7 Missing :warning:
pipeline/log/line_writer.go 90.32% 2 Missing and 1 partial :warning:
cli/exec/exec.go 0.00% 2 Missing :warning:
pipeline/rpc/proto/woodpecker.pb.go 0.00% 2 Missing :warning:
server/grpc/rpc.go 0.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #3722 +/- ## ========================================== + Coverage 25.81% 25.84% +0.03% ========================================== Files 362 365 +3 Lines 26725 26913 +188 ========================================== + Hits 6899 6956 +57 - Misses 19293 19417 +124 - Partials 533 540 +7 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.