Closed · anbraten closed this 3 months ago
Deployment of preview was successful: https://woodpecker-ci-woodpecker-pr-3722.surge.sh
@zc-devs @fernandrone @eliasscosta @praveen-livspace @davidhazell-tst (Hope all of you are actually interested in this topic, sorry if not)
I did some refactoring and hopefully achieved some improvements regarding the log streaming. Using the workflow below for testing, I haven't received any more errors. Could you verify that the `:pull_3722` image solves your issues as well?
```yaml
skip_clone: true

steps:
  - name: hello
    image: alpine
    commands:
      - echo "hello world"
      - date +%s%N
      - apk add openssl
      - openssl rand 10000000 | base64 -w 0
      - date +%s%N
      - echo done
```
CC @6543 Thanks for your initial work in #2072
It does not always work for me.

1. "step hasn't started yet": [Screenshot 2024-05-24 123628]; after F5: [Screenshot 2024-05-24 123702]
2. Non-UTF stuff doesn't work (`openssl rand 1000`): [Screenshot 2024-05-24 125549]
3. Cancel: [Screenshot 2024-05-24 125704]

[agent.log]

1, 2 are probably not related here, since it is UI-Server interaction. 3 is mentioned in the PR, but I don't know if you actually want to fix it in this PR though.
> limit max size per log line to 3mb
Seems it works. Previously it was over 400 lines; currently it is near 20:
```sql
woodpecker=# SELECT t.line, left(convert_from(t.data, 'UTF-8'), 30), octet_length(t.data::text) FROM log_entries AS t WHERE step_id = 535;
 line |              left              | octet_length
------+--------------------------------+--------------
    0 | + echo "hello world"           |           42
    1 | hello world                    |           24
    2 | + date +%s%N                   |           26
    3 | 1716544320                     |           22
    4 | + apk add openssl              |           36
    5 | fetch https://dl-cdn.alpinelin |          156
    6 | fetch https://dl-cdn.alpinelin |          166
    7 | (1/1) Installing openssl (3.3. |           72
    8 | Executing busybox-1.36.1-r28.t |           74
    9 | OK: 9 MiB in 15 packages       |           50
   10 | + openssl rand 10000000 | base |           76
   11 | nR/CT1Cy5ikpYMdOX9fqHrR6JNxX9V |      6291458
   12 | 6L0fZCCf5fufzoxtxGarbwstEyL1UH |      6291458
   13 | zHy74s1DcMm6SCTUuMC5D3/qC/38e1 |      6291458
   14 | SFgfKlnu3/WJxs+XCAZcZLpfLk24bh |      6291458
   15 | BYJdQCUwsmpImaAbz4Mwdzk4OoNWYk |      1500874
   16 | 1716544324                     |           22
   17 | + echo done                    |           24
   18 | done                           |           10
(19 rows)
```
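As a sanity check on those numbers: `octet_length(t.data::text)` measures the hex encoding of the `bytea` (two characters per byte plus a `\x` prefix), so 6291458 corresponds to exactly the 3 MiB (3145728-byte) per-line cap. A minimal Go sketch of such per-line chunking (`splitLine` is a hypothetical helper for illustration, not the actual Woodpecker code):

```go
package main

import "fmt"

// maxLineSize mirrors the PR's 3 MiB cap per stored log line.
const maxLineSize = 3 * 1024 * 1024

// splitLine chunks an oversized log line into pieces of at most max bytes.
func splitLine(line []byte, max int) [][]byte {
	var chunks [][]byte
	for len(line) > max {
		chunks = append(chunks, line[:max])
		line = line[max:]
	}
	return append(chunks, line)
}

func main() {
	// base64 of 10,000,000 random bytes is ceil(10000000/3)*4 = 13333336
	// characters on a single line; splitting at 3 MiB yields 5 entries,
	// matching the five blob rows (line 11-15 in the query output).
	blob := make([]byte, 13333336)
	fmt.Println(len(splitLine(blob, maxLineSize))) // 5
}
```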
Thanks a lot for the testing and detailed reporting.
> Often, the end of log doesn't appear until page refreshing
Strange. I haven't noticed such mismatches in the UI (like showing logs while the step supposedly hasn't started yet). It seems that it is somehow missing data sent via the SSE stream. Could you check if the browser console shows any errors?
The bug that previous logs are not in the list when reloading with F5 is still something I have to fix (maybe in another PR).
> Non-UTF stuff doesn't work (`openssl rand 1000`)
Will check that one again.

Changing the grpc log messages to use `bytes` instead of `string` seems to have fixed `openssl rand 1000` now as well.
Yeeaa, it works now! Thank you!
> Could you check if the browser console shows any errors?
2 is partially my fault. I open the test repo in WP and then manually run the pipeline. To gather clean logs, I often restart the Server and Agent pods before running the pipeline, then download the logs from OpenLens. So, I ran the pipeline a couple of times. All went well, there were no errors in the console. Then I restarted the Server and got them. Then I manually ran the pipeline and got "step hasn't started yet".
[Screenshot 2024-05-24 213033] [Screenshot 2024-05-24 213135]
The `event` call was 404'ed and didn't try to reconnect. After refreshing the page, all goes OK again. However, look at the four failed `api/stream/logs/...` requests. These are from the "ran the pipeline a couple of times" mentioned above. Seems streams do not close after leaving the pipeline execution page (to the repo page).
Edit: and even to the repo list page: [Screenshot 2024-05-24 221819].
1 reproduces sometimes with the 10M log. I run the pipeline. Some piece of the logs is displayed, filling the screen at least. Then the step list and logs components disappear/appear again. Then I scroll down to the end: the log is trimmed. There are no failed HTTP requests, but there is an error in the console: `unexpected non-whitespace character after JSON data`.

[Screenshot 2024-05-24 221205]
> The event call was 404'ed and didn't try to reconnect. After refreshing the page all goes OK again.
I think that is basically how SSE works. I will probably have a look at migrating to a websocket at some point.
> Seems, streams do not close after leaving pipeline execution page (to repo page).
That should be fixed now.
> 1 reproduces sometimes with 10M log. I run pipeline. Some piece of logs displayed, fill the display at least. Then step list and logs components disappear/appear again. Then I scroll down to the end - log is trimmed. There are no failed HTTP requests, but an error in the console `unexpected non-whitespace character after JSON data`
Might have fixed that one now.
@anbraten I tried upgrading my agents to `:pull_3722` and am seeing a GRPC version mismatch error:
```json
{"level":"info","time":"2024-05-24T18:32:23Z","caller":"/src/shared/logger/logger.go:101","message":"log level: debug"}
{"level":"error","error":"GRPC version mismatch","time":"2024-05-24T18:32:23Z","caller":"/src/cmd/agent/core/agent.go:143","message":"server version 2.4.1 does report grpc version 7 but we only understand 8"}
{"level":"fatal","error":"GRPC version mismatch","time":"2024-05-24T18:32:23Z","caller":"/src/cmd/agent/core/run.go:51","message":"error running agent"}
```
Do I need to also update my server version to `:pull_3722`? I am currently running server version `docker.io/woodpeckerci/woodpecker-server:v2.4.1`.
@davidhazell-tst Yes, I later updated some grpc-related code, so you need to update the server as well. (It is only a tiny change, so there won't be any data migration and you can simply switch back if wanted.)
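For anyone hitting the same error: the agent compares a protocol version number reported by the server against its own at startup, roughly like this (an illustrative sketch; the names are not the actual Woodpecker symbols):

```go
package main

import (
	"errors"
	"fmt"
)

// agentGrpcVersion is what this build of the agent speaks; the PR bumped
// the number, so a v2.4.1 server (reporting version 7) is rejected by a
// :pull_3722 agent (expecting 8) until the server is updated too.
const agentGrpcVersion = 8

var errVersionMismatch = errors.New("GRPC version mismatch")

// checkGrpcVersion refuses to run against a server speaking a different
// gRPC protocol version.
func checkGrpcVersion(serverVersion int) error {
	if serverVersion != agentGrpcVersion {
		return errVersionMismatch
	}
	return nil
}

func main() {
	fmt.Println(checkGrpcVersion(7)) // GRPC version mismatch
	fmt.Println(checkGrpcVersion(8)) // <nil>
}
```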
`docker.io/woodpeckerci/woodpecker-agent@sha256:328e3662285ce545e4081ef30299c039af72cd09a269bc642c24a51fc479acb2`
> streams do not close after leaving pipeline execution page (to repo page)

Still reproduces: [Screenshot 2024-05-28 183709]
> Might have fixed that one now

As well as 1: [Screenshot 2024-05-28 184941]. Two times out of five it fails: [Screenshot 2024-05-28 185715]
Besides, I've got "This step hasn't started yet" at the sixth run without restarting the server: [Screenshot 2024-05-28 190140]
@anbraten We applied the new builds to both the server and agent and our issue appears to be resolved.
I still had no success replicating the mentioned issues with streams not being stopped, re-connecting, etc. Might that even be related to some reverse-proxy setup 🤔 However, as the general feedback is that the changes are already quite an improvement, I would move further improvements to new PRs to get this one merged.
Attention: Patch coverage is 22.38267% with 215 lines in your changes missing coverage. Please review.

Project coverage is 25.84%. Comparing base (044c3a6) to head (041c2c3). Report is 5 commits behind head on main.
Some testing similar to #2072
closes #2064 closes #3468 closes #2253 closes #3716 related #2072
Enhancements

- `writer.Write` is called once per line

Some inspiration from