whole-tale / terraform_deployment

Terraform deployment setup for WT prod
BSD 3-Clause "New" or "Revised" License

Expand monitoring to include Tale launching #7

Open craig-willis opened 6 years ago

craig-willis commented 6 years ago

This is the kind of error that's reported most often.

craig-willis commented 6 years ago

See https://github.com/whole-tale/girder_wholetale/blob/master/plugin_tests/instance_test.py#L108
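The linked test shows the REST calls involved in launching an instance. A hypothetical probe along those lines (this is not the actual check-tale script; the status values, cleanup behaviour, and helper names are my assumptions) might look like:

```python
# Hypothetical sketch of a Tale-launch probe (not the actual check-tale code);
# endpoint paths match the ones seen in this thread, but instance status values
# and cleanup behaviour are assumptions.
import time
import requests

API = "https://girder.stage.wholetale.org/api/v1"

def check_tale_launch(token, tale_id, timeout=300):
    headers = {"Girder-Token": token}

    # Ask WholeTale to launch an instance of the tale; this is the call that
    # intermittently returns 400/401/500 below.
    resp = requests.post(API + "/instance", params={"taleId": tale_id}, headers=headers)
    if resp.status_code != 200:
        return 2, "CRITICAL - error creating tale (HTTP %d)" % resp.status_code
    instance = resp.json()

    try:
        # Poll until the instance leaves the "launching" state or we give up.
        deadline = time.time() + timeout
        while time.time() < deadline:
            state = requests.get(API + "/instance/" + instance["_id"], headers=headers).json()
            if state.get("status") == 1:  # 1 assumed to mean "running"
                return 0, "OK - tale launched"
            time.sleep(10)
        return 2, "CRITICAL - tale did not start within %ds" % timeout
    finally:
        # Always remove the probe instance so checks don't pile up.
        requests.delete(API + "/instance/" + instance["_id"], headers=headers)
```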

craig-willis commented 6 years ago

@Xarthisius This has been running on staging for several days now and I'm seeing at least one error per day during the create call: occasionally a 400 or 401, and more often a 500. Looking at the staging Girder error log, it's not immediately clear what the cause is:

[2018-06-26 14:17:30,663] ERROR: 500 Error
Traceback (most recent call last):
  File "/girder/girder/api/rest.py", line 620, in endpointDecorator
    val = fun(self, args, kwargs)
  File "/girder/girder/api/rest.py", line 1204, in POST
    return self.handleRoute(method, path, params)
  File "/girder/girder/api/rest.py", line 947, in handleRoute
    val = handler(**kwargs)
  File "/girder/girder/api/access.py", line 63, in wrapped
    return fun(*args, **kwargs)
  File "/girder/girder/api/describe.py", line 702, in wrapped
    return fun(*args, **kwargs)
  File "/girder/plugins/wholetale/server/rest/instance.py", line 166, in createInstance
    save=True)
  File "/girder/plugins/wholetale/server/models/instance.py", line 147, in createInstance
    volume = volumeTask.get(timeout=TASK_TIMEOUT)
  File "/usr/local/lib/python3.5/dist-packages/celery/result.py", line 191, in get
    on_message=on_message,
  File "/usr/local/lib/python3.5/dist-packages/celery/backends/async.py", line 190, in wait_for_pending
    return result.maybe_throw(callback=callback, propagate=propagate)
  File "/usr/local/lib/python3.5/dist-packages/celery/result.py", line 296, in maybe_throw
    self.throw(value, self._to_remote_traceback(tb))
  File "/usr/local/lib/python3.5/dist-packages/celery/result.py", line 289, in throw
    self.on_ready.throw(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/vine/promises.py", line 217, in throw
    reraise(type(exc), exc, tb)
  File "/usr/local/lib/python3.5/dist-packages/vine/five.py", line 179, in reraise
    raise value
celery.backends.base.ValueError
Additional info:
  Request URL: POST https://girder.stage.wholetale.org/api/v1/instance
  Query string: taleId=59f0b91584b7920001b46f2e
  Remote IP: 10.0.0.2

I've seen alerts at 4:19AM and 9:18AM today.
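For context, the traceback comes out of the createInstance code path, which dispatches the gwvolman create_volume task to a Celery worker and then blocks on the result; whatever exception the worker raises is re-raised at the `.get()` call and surfaces as a 500 on the REST request. A minimal, self-contained illustration of that propagation (a toy Celery app in eager mode, not the WholeTale code):

```python
# Minimal illustration of the failure mode in the traceback above: a Celery
# task that raises inside the worker re-raises at result.get() in the caller,
# which is what turns the worker-side ValueError into a 500 in Girder.
# Toy app run in eager mode so no broker is needed.
from celery import Celery

app = Celery("demo")
app.conf.task_always_eager = True        # execute tasks in-process for this demo
app.conf.task_eager_propagates = False   # store the exception like a real worker would

@app.task
def create_volume(payload):
    if not payload.get("girder_token"):
        raise ValueError  # mirrors gwvolman's bare `raise ValueError`
    return {"volume": "ok"}

result = create_volume.apply_async(args=[{"girder_token": ""}])
try:
    # In the wholetale plugin this is `volumeTask.get(timeout=TASK_TIMEOUT)`;
    # the worker's exception is re-raised here and bubbles up as the 500 Error.
    result.get(timeout=5)
except ValueError:
    print("worker-side ValueError re-raised in the caller")
```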

The Girder error logs also have a surprising number of these:

[2018-06-26 13:17:12,728] ERROR: INFO:girder:Booting server at [https://tmp-rs9nxumk3tuf.stage.wholetale.org/?token=9e9e0a4366fc4131adf9fddc811f9ad8], getting HTTP status [404]

and these

[2018-06-26 11:00:22,466] ERROR: INFO:girder:Running DM file GC

But since they are INFO level, maybe not real errors?

Xarthisius commented 6 years ago

9:18 AM CDT? Those two "errors" can be ignored. I don't know why they're flagged as errors...

craig-willis commented 6 years ago

Sorry, yes 9:18 AM CDT was the alert time from check_mk.

Xarthisius commented 6 years ago

Looks like the ValueError is raised while parsing the girder_token:

[2018-06-26 14:17:30,537: INFO/MainProcess] Received task: gwvolman.tasks.create_volume[1389d137-cec7-4619-aec6-293f255def26]  
[2018-06-26 14:17:30,540: WARNING/ForkPoolWorker-1] [2018-06-26 14:17:30,540] WARNING: No jobInfoSpec. Setting job_manager to None.
[2018-06-26 14:17:30,540: WARNING/ForkPoolWorker-1] No jobInfoSpec. Setting job_manager to None.
[2018-06-26 14:17:30,661: WARNING/ForkPoolWorker-1] Bad gider token
[2018-06-26 14:17:30,664: ERROR/ForkPoolWorker-1] Task gwvolman.tasks.create_volume[1389d137-cec7-4619-aec6-293f255def26] raised unexpected: ValueError()
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 375, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/girder_worker/task.py", line 148, in __call__
    results = super(Task, self).__call__(*_t_args, **_t_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 632, in __protected_call__
    return self.run(*args, **kwargs)
  File "gwvolman/tasks.py", line 35, in create_volume
    gc, user, tale = _parse_request_body(payload)
  File "gwvolman/utils.py", line 88, in _parse_request_body
    raise ValueError
ValueError

relevant code:

https://github.com/whole-tale/gwvolman/blob/master/gwvolman/utils.py#L80-L86

Either the payload was mangled somehow (expired token?) or the worker failed to connect back to Girder to get the user info. I've set up Flower so that we can see what's being passed as the payload into the tasks and rule out the former.
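For reference, a hedged paraphrase of what that utility does (field names taken from the traceback and payload; the real code at the link may differ in details):

```python
# Hedged paraphrase of _parse_request_body in gwvolman/utils.py (see link above);
# the shape of the failure is that anything which stops girder_token from
# resolving to a user ends in a bare `raise ValueError`, which is what the
# worker log and traceback show.
import logging
import girder_client

def _parse_request_body(payload):
    gc = girder_client.GirderClient(apiUrl=payload["apiUrl"])
    gc.token = payload.get("girder_token")
    try:
        # Round-trip to Girder to turn the token into a user; a stale/invalid
        # token *and* a transient network failure both land in the same branch.
        user = gc.get("user/me")
    except Exception:
        user = None
    if user is None:
        logging.warning("Bad gider token")  # message (typo included) seen in the worker log
        raise ValueError
    tale = gc.get("tale/%s" % payload["taleId"])
    return gc, user, tale
```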

craig-willis commented 6 years ago

Thank you. This happens so infrequently that we probably won't see anything until tomorrow AM.

The problem does correct itself (i.e., during the next check, things are OK). The check-tale code exchanges an API key for a token during each call, so it seems unlikely that the token would expire, but maybe I've done something wrong.
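For the record, that exchange is roughly the following (a sketch assuming Girder's standard `POST /api_key/token` endpoint; the parameter names and response shape are quoted from memory and worth double-checking against the Girder docs):

```python
# Sketch of the key-for-token exchange the check performs on every run.
import requests

API = "https://girder.stage.wholetale.org/api/v1"

def get_token(api_key, duration_days=1):
    resp = requests.post(
        API + "/api_key/token",
        data={"key": api_key, "duration": duration_days},
    )
    resp.raise_for_status()
    # A fresh token is minted for each check, so token expiry between checks
    # should not explain the intermittent 401s.
    return resp.json()["authToken"]["token"]
```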

craig-willis commented 6 years ago

Got one. From check_mk:

Host wt-stage-00 (wt-stage-00)
Service Tale
Event OK → CRITICAL
Address 149.165.168.151
Date / Time Wed Jun 27 18:51:30 CDT 2018
Plugin Output CRIT - CRITICAL - error creating tale
Host Metrics rta=10.073ms;200.000;500.000;0; pl=0%;80;100;; rtmax=10.236ms;;;; rtmin=9.955ms;;;;
Service Metrics status=500;;;;

Looking at the Flower dashboard, the payload looks OK:

[{
  'apiUrl': 'https://girder.stage.wholetale.org/api/v1',
  'taleId': '59f0b91584b7920001b46f2e',
  'workspaceId': '59f0b91584b7920001b46f2f',
  'girder_token': 'TOKEN',
  'api_version': '2.0'
}]

I was able to curl with the token:

curl -H "Girder-Token: TOKEN" https://girder.stage.wholetale.org/api/v1/user/me

{
   "_accessLevel": 2, 
   "_id": "5ade46db715d3400011e9524", 
   "_modelType": "user",
   "email": "monitoring@wholetale.org", 
   "firstName": "Monitoring", 
   ...
}

So this seems to point to a possible networking problem, which is unsurprising. I'm going to reduce the check interval to every 10 minutes to see how pervasive this is.

craig-willis commented 6 years ago

Also received a 401 on POST /instance at ~5 AM CDT this morning.