web-platform-tests / wpt.fyi

web-platform-tests dashboard
https://wpt.fyi/
Other
188 stars 88 forks source link

GitHub Actions results not appearing on the dashboard #4020

Closed gsnedders closed 4 days ago

gsnedders commented 3 weeks ago

From https://github.com/web-platform-tests/wpt/issues/48223#issuecomment-2375384661:

So:

This seems surprising that it succeeded to upload, and yet the upload doesn't appear as pending, invalid, or show up as complete.

We seem to have logging going into Google Cloud Logging, but I don't have any access to that.

cc @past @DanielRyanSmith @jcscottiii @KyleJu

KyleJu commented 3 weeks ago

Checking the log, I can confirm that /api/checks/github-actions/ was called with run_id=11039052681 and returned 200. Also I see the logging

Adding %!s(*string=0xc0002990a0) for web-platform-tests/wpt run 11039052681 to upload...

from this line, but it didn't log properly:

log.Infof("Adding %s for %s/%s run %v to upload...", artifact.Name, owner, repo, runID)

It should be *artifact.Name instead.

If I have to guess, it could be an issue with the uploader. Is there a link that I can inspect the wpt results from the artifacts?

gsnedders commented 3 weeks ago

@KyleJu https://github.com/web-platform-tests/wpt/actions/runs/11039052681 has downloads links for all the artifacts.

gsnedders commented 3 weeks ago

I guess I'm more interested in any logging from the request we issue for https://wpt.fyi/api/results/create? Because that's presumably working too and giving 200 back.

https://github.com/web-platform-tests/wpt.fyi/blob/a24fc662deecac9bab58d5e6b0a240bff3a3cc17/api/receiver/create_run.go#L115-L117 is one case where we could silently fail?

But it seems like it should presumably be managing to schedule the task, it's just for whatever reason it never completes.

gsnedders commented 3 weeks ago

I guess also logging from /api/results/process?

Because something somewhere is clearly failing.

KyleJu commented 3 weeks ago

Confirmed that /api/results/create has no error log around the time this run was created.

Looking at /api/checks/github-actions/ again, I noticed successful logging for 11039052681: Screenshot 2024-09-26 11 51 44 AM

but 10 minutes before that, it also logged failures:

Screenshot 2024-09-26 11 42 42 AM

But it seems like it should presumably be managing to schedule the task, it's just for whatever reason it never completes.

I would agree with this. A job could be scheduled and returned 200, but the task might be failing in the Task Queue. There seems to be lots failures around the same timestamps:

Screenshot 2024-09-26 11 56 07 AM

gsnedders commented 3 weeks ago

but 10 minutes before that, it also logged failures:

Can you find what run ID this was? I wonder if that's the point at which I tried to submit all the Safari runs, but I can't find any that gave 400 all the way back to the initial request, which I would've expected?

gsnedders commented 3 weeks ago

Is there any way to find any logging from /api/results/process (i.e., from the results-processor) when the task runs? Because it's kinda surprising they're failing in the results processor.

KyleJu commented 3 weeks ago

but 10 minutes before that, it also logged failures:

Can you find what run ID this was? I wonder if that's the point at which I tried to submit all the Safari runs, but I can't find any that gave 400 all the way back to the initial request, which I would've expected?

I cannot find any logging for run_id and the payload is empty as well, but the user agent of that request is curl. Maybe that rings a bell? Screenshot 2024-09-26 4 12 32 PM

After digging through logs in results-precossor, I found these around the same timestamp (note that the severity level isn't labelled properly in processor): Screenshot 2024-09-26 4 18 19 PM

e.g.

ERROR:processor:Failed to fetch (403): https://api.github.com/repos/web-platform-tests/wpt/actions/artifacts/1978878338/zip

in the url (https://github.com/web-platform-tests/wpt/actions/runs/11039052681) you showed me above, it should be

https://github.com/web-platform-tests/wpt/actions/runs/11039052681/artifacts/1978878338

Would that be the issue?

gsnedders commented 3 weeks ago

It'll be the difference between what it gives as the URL via the API versus via the web interface; I guess we're getting 403 (Forbidden) because we're not passing a GitHub API token.

We'd have to notice that we're fetching api.github.com from the results-processor and then fetch the GitHub token from the datastore.

That said, the even more significant failing here is that they aren't showing up as invalid runs.

gsnedders commented 3 weeks ago

From https://cloud.google.com/appengine/docs/standard/services/taskqueue/push/retrying-tasks?tab=python:

If a handler fails to execute a task (and therefore returns any HTTP status code outside of the range 200–299), App Engine retries the task until it succeeds.

So are we just retrying these indefinitely?

Because I don't think we're catching the error anywhere, so it should just bubble up to the Flask endpoint and then Flask will turn it into a 500 Internal Server Error.

gsnedders commented 3 weeks ago

Reading the code, I think we're not actually retrying here:

https://github.com/web-platform-tests/wpt.fyi/blob/5d65d4e5ebc70d13d4b0c8e0943b599ef63a58d7/results-processor/processor.py#L329-L333

We should just be setting the status as EMPTY, and manually hitting https://wpt.fyi/api/status/empty confirms this.

gsnedders commented 1 week ago

4025 should fix this, but given this is about API tokens it's hard to verify this is actually fixed aside from waiting for the next deployment.

gsnedders commented 1 week ago

OK, even after #4025, https://staging.wpt.fyi/status is showing all the runs stuck as pending. Can someone take a look at the logs for the staging instance?

gsnedders commented 1 week ago

Ah, after #4025 we're getting:

Traceback (most recent call last):
  File "/env/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
    response = self.full_dispatch_request()
  File "/env/lib/python3.9/site-packages/flask/app.py", line 882, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/env/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/env/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/app/main.py", line 69, in decorated_func
    return func(*args, **kwargs)
  File "/app/main.py", line 52, in decorated_func
    return func(*args, **kwargs)
  File "/app/main.py", line 114, in task_handler
    resp = processor.process_report(task_id, flask.request.form)
  File "/app/processor.py", line 357, in process_report
    p.download(results, screenshots, archives)
  File "/app/processor.py", line 211, in download
    self._download_archive(archive_url)
  File "/app/processor.py", line 185, in _download_archive
    artifact = self._download_http(archive_url)
  File "/app/processor.py", line 142, in _download_http
    'Authorization': 'Bearer ' + self._github_token,
  File "/app/processor.py", line 125, in _github_token
    return self._secret('github-wpt-fyi-bot-token')
  File "/app/processor.py", line 121, in _secret
    return self.datastore.get(key)['secret']
KeyError: 'secret'

…which is surprising given that's what wpt.fyi itself is reading. So somehow we don't have the secret accessible?

KyleJu commented 6 days ago

Ah, after #4025 we're getting:

Traceback (most recent call last):
  File "/env/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
    response = self.full_dispatch_request()
  File "/env/lib/python3.9/site-packages/flask/app.py", line 882, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/env/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/env/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/app/main.py", line 69, in decorated_func
    return func(*args, **kwargs)
  File "/app/main.py", line 52, in decorated_func
    return func(*args, **kwargs)
  File "/app/main.py", line 114, in task_handler
    resp = processor.process_report(task_id, flask.request.form)
  File "/app/processor.py", line 357, in process_report
    p.download(results, screenshots, archives)
  File "/app/processor.py", line 211, in download
    self._download_archive(archive_url)
  File "/app/processor.py", line 185, in _download_archive
    artifact = self._download_http(archive_url)
  File "/app/processor.py", line 142, in _download_http
    'Authorization': 'Bearer ' + self._github_token,
  File "/app/processor.py", line 125, in _github_token
    return self._secret('github-wpt-fyi-bot-token')
  File "/app/processor.py", line 121, in _secret
    return self.datastore.get(key)['secret']
KeyError: 'secret'

…which is surprising given that's what wpt.fyi itself is reading. So somehow we don't have the secret accessible?

https://github.com/web-platform-tests/wpt.fyi/pull/4057. I think this must be the cause

gsnedders commented 6 days ago

We seem to be downloading a bunch of artifacts, with no errors (because otherwise we'd get an ERROR logged from requests), and then still hitting "ERROR:processor:No results successfully downloaded": downloaded-logs-20241015-141730.json.

past commented 6 days ago

The capitalization fix is now deployed to production.

gsnedders commented 5 days ago

https://staging.wpt.fyi/results/?run_id=6174988006588416 successfully made it up, after #4061 was fixed (though still only on staging).

That said, it does seem to be not labelled with master, which is somewhat problematic.

past commented 4 days ago

https://staging.wpt.fyi/results/?run_id=6174988006588416 successfully made it up, after #4061 was fixed (though still only on staging).

4065 should have gotten the fix on production as well.

gsnedders commented 4 days ago

Let's just treat #4063 as a separate follow-up and resolve this. 🎉