Open craig-willis opened 6 years ago
@Xarthisius This has been running on staging for several days now, and I'm seeing at least one error per day during the instance create call: occasionally a 400 or 401, and more often a 500. Looking at the staging Girder error log, it's not immediately clear what the cause is:
[2018-06-26 14:17:30,663] ERROR: 500 Error
Traceback (most recent call last):
File "/girder/girder/api/rest.py", line 620, in endpointDecorator
val = fun(self, args, kwargs)
File "/girder/girder/api/rest.py", line 1204, in POST
return self.handleRoute(method, path, params)
File "/girder/girder/api/rest.py", line 947, in handleRoute
val = handler(**kwargs)
File "/girder/girder/api/access.py", line 63, in wrapped
return fun(*args, **kwargs)
File "/girder/girder/api/describe.py", line 702, in wrapped
return fun(*args, **kwargs)
File "/girder/plugins/wholetale/server/rest/instance.py", line 166, in createInstance
save=True)
File "/girder/plugins/wholetale/server/models/instance.py", line 147, in createInstance
volume = volumeTask.get(timeout=TASK_TIMEOUT)
File "/usr/local/lib/python3.5/dist-packages/celery/result.py", line 191, in get
on_message=on_message,
File "/usr/local/lib/python3.5/dist-packages/celery/backends/async.py", line 190, in wait_for_pending
return result.maybe_throw(callback=callback, propagate=propagate)
File "/usr/local/lib/python3.5/dist-packages/celery/result.py", line 296, in maybe_throw
self.throw(value, self._to_remote_traceback(tb))
File "/usr/local/lib/python3.5/dist-packages/celery/result.py", line 289, in throw
self.on_ready.throw(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/vine/promises.py", line 217, in throw
reraise(type(exc), exc, tb)
File "/usr/local/lib/python3.5/dist-packages/vine/five.py", line 179, in reraise
raise value
celery.backends.base.ValueError
Additional info:
Request URL: POST https://girder.stage.wholetale.org/api/v1/instance
Query string: taleId=59f0b91584b7920001b46f2e
Remote IP: 10.0.0.2
I've seen alerts at 4:19 AM and 9:18 AM today.
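For anyone following along, the 500 surfaces because the instance endpoint dispatches the volume creation to a Celery task and blocks on its result, so an exception raised in the worker is re-raised inside the Girder request. A minimal sketch of that pattern (simplified; `TASK_TIMEOUT` and the exact call structure are assumptions here, the real code is in the wholetale plugin's instance model):

```python
# Minimal sketch of the dispatch/wait pattern (not the actual plugin code).
from gwvolman.tasks import create_volume

TASK_TIMEOUT = 60  # assumed value; the real constant lives in the plugin


def create_instance_volume(payload):
    volume_task = create_volume.apply_async(args=[payload])
    # .get() blocks until the worker finishes and re-raises any remote
    # exception here, which is where the ValueError above comes from.
    return volume_task.get(timeout=TASK_TIMEOUT)
```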
The Girder error logs also have a surprising number of these:
[2018-06-26 13:17:12,728] ERROR: INFO:girder:Booting server at [https://tmp-rs9nxumk3tuf.stage.wholetale.org/?token=9e9e0a4366fc4131adf9fddc811f9ad8], getting HTTP status [404]
and these:
[2018-06-26 11:00:22,466] ERROR: INFO:girder:Running DM file GC
But since they are INFO level, maybe not real errors?
9:18 AM CDT? Those two "errors" can be ignored. I don't know why they're flagged as errors...
Sorry, yes 9:18 AM CDT was the alert time from check_mk.
Looks like the ValueError is raised while parsing `girder_token`:
[2018-06-26 14:17:30,537: INFO/MainProcess] Received task: gwvolman.tasks.create_volume[1389d137-cec7-4619-aec6-293f255def26]
[2018-06-26 14:17:30,540: WARNING/ForkPoolWorker-1] [2018-06-26 14:17:30,540] WARNING: No jobInfoSpec. Setting job_manager to None.
[2018-06-26 14:17:30,540: WARNING/ForkPoolWorker-1] No jobInfoSpec. Setting job_manager to None.
[2018-06-26 14:17:30,661: WARNING/ForkPoolWorker-1] Bad gider token
[2018-06-26 14:17:30,664: ERROR/ForkPoolWorker-1] Task gwvolman.tasks.create_volume[1389d137-cec7-4619-aec6-293f255def26] raised unexpected: ValueError()
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 375, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/girder_worker/task.py", line 148, in __call__
results = super(Task, self).__call__(*_t_args, **_t_kwargs)
File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 632, in __protected_call__
return self.run(*args, **kwargs)
File "gwvolman/tasks.py", line 35, in create_volume
gc, user, tale = _parse_request_body(payload)
File "gwvolman/utils.py", line 88, in _parse_request_body
raise ValueError
ValueError
Relevant code:
https://github.com/whole-tale/gwvolman/blob/master/gwvolman/utils.py#L80-L86
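For reference, a rough sketch of what that parsing step does (a hypothetical simplification of the linked gwvolman code; names and structure may not match exactly): it builds a Girder client from the payload, sets the token, and looks up the current user, raising the bare ValueError seen above if that lookup fails.

```python
# Hypothetical simplification of gwvolman/utils.py::_parse_request_body.
from girder_client import GirderClient


def _parse_request_body(payload):
    gc = GirderClient(apiUrl=payload['apiUrl'])
    gc.token = payload['girder_token']
    try:
        user = gc.get('user/me')
    except Exception:
        user = None
    if user is None:
        print('Bad gider token')  # matches the warning in the worker log
        raise ValueError          # propagated back to Girder as the 500
    tale = gc.get('tale/{}'.format(payload['taleId']))
    return gc, user, tale
```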
Either the payload was mangled somehow (expired token?) or the worker failed to connect back to Girder to get the user info. I set up Flower so that we can see what's being passed as the payload into the tasks and rule out the former.
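As an aside, one way to pull the recorded task arguments out of Flower programmatically is its HTTP API; something along these lines (the URL and filter parameters are assumptions for illustration, and the API must be enabled):

```python
# Hypothetical example: list recent create_volume tasks and their arguments
# via Flower's HTTP API.
import requests

FLOWER_URL = 'http://localhost:5555'  # assumed Flower address

resp = requests.get(
    FLOWER_URL + '/api/tasks',
    params={'taskname': 'gwvolman.tasks.create_volume', 'limit': 20})
for task_id, info in resp.json().items():
    print(task_id, info.get('state'), info.get('args'))
```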
Thank you. This happens so infrequently that we probably won't see anything until tomorrow AM.
The problem does correct itself (i.e., during the next check, things are OK). The check-tale code exchanges an API key for a token during each call, so it seems unlikely that the token would expire, but maybe I've done something wrong.
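For context, the exchange the check does is roughly this (a sketch assuming the check uses girder_client; the actual check-tale script may do it differently):

```python
# Sketch of an API-key-based check (assumed structure, not the actual
# check-tale code). The key is exchanged for a fresh token on every run,
# so the token itself should never be stale.
from girder_client import GirderClient

gc = GirderClient(apiUrl='https://girder.stage.wholetale.org/api/v1')
gc.authenticate(apiKey='MONITORING_API_KEY')  # placeholder key
print(gc.get('user/me')['email'])
```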
Got one. From check_mk:
| Host | wt-stage-00 (wt-stage-00) |
|---|---|
| Service | Tale |
| Event | OK → CRITICAL |
| Address | 149.165.168.151 |
| Date / Time | Wed Jun 27 18:51:30 CDT 2018 |
| Plugin Output | CRIT - CRITICAL - error creating tale |
| Host Metrics | rta=10.073ms;200.000;500.000;0; pl=0%;80;100;; rtmax=10.236ms;;;; rtmin=9.955ms;;;; |
| Service Metrics | status=500;;;; |
Looking at the Flower dashboard, the payload looks OK:
[{
  'apiUrl': 'https://girder.stage.wholetale.org/api/v1',
  'taleId': '59f0b91584b7920001b46f2e',
  'workspaceId': '59f0b91584b7920001b46f2f',
  'girder_token': 'TOKEN',
  'api_version': '2.0'
}]
I was able to curl with the token:
curl -H "Girder-Token: TOKEN " https://girder.stage.wholetale.org/api/v1/user/me
{
"_accessLevel": 2,
"_id": "5ade46db715d3400011e9524",
"_modelType": "user",
"email": "monitoring@wholetale.org",
"firstName": "Monitoring",
...
}
So this seems to point to a possible networking problem, which is unsurprising. I'm going to reduce the check interval to 10 minutes to see how pervasive this is.
Also received a 401 on POST /instance at ~5 AM CDT this morning.
This is the kind of error that's reported most often.