weaveworks / weave

Simple, resilient multi-host container networking and more.
https://www.weave.works
Apache License 2.0
6.62k stars 670 forks

show root cause when weave infrastructure containers die on startup #1280

Open rade opened 9 years ago

rade commented 9 years ago

The weave and weaveproxy containers can die on startup, e.g. when invalid options are specified. Currently this shows up as a generic `The weave container has died. Consult the logs with 'docker logs weave' for further details.` error. It would be rather more helpful to show the actual error.

Perhaps just tail a few stderr lines from the logs.

rade commented 9 years ago

See #956 for a special case.

abuehrle commented 9 years ago

Why not just show what it died on instead of having to go to the docker logs. Anyway, I thought you wanted to get rid of those messages so I noted it.

rade commented 9 years ago

> Why not just show what it died on instead of having to go to the docker logs.

That's what we are trying to do here. It's not easy.

rade commented 9 years ago

> Perhaps just tail a few stderr lines from the logs.

Alternatively, make sure all common startup errors get logged with some easily recognisable pattern that we can grep for in the script, falling back on the existing generic error message if the grep fails.
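A minimal sketch of that grep-with-fallback idea in shell, assuming errors are logged with the `FATA:` prefix; the helper name is hypothetical:

```shell
# extract_fatal: read container log text on stdin and print the last
# line carrying the grep-able FATA: prefix; if no such line exists,
# fall back to the existing generic error message.
extract_fatal() {
    local line
    line=$(grep '^FATA:' | tail -n 1)
    if [ -n "$line" ]; then
        echo "$line"
    else
        echo "The weave container has died. Consult the container logs for further details."
    fi
}
```

The script could then run something like `docker logs weave 2>&1 | extract_fatal` after detecting the dead container.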

tomwilkie commented 9 years ago

If you were to assume the most common errors are cmd line parsing, we could do a dummy run of weave (add a `--dummy` flag) which exits 0 if the command line parses and is sane, 1 otherwise. If we ran it with `-ti`, the output would be seen by the user.

rade commented 9 years ago

> If you were to assume the most common errors are cmd line parsing

It's more than that. The referenced #956 is an extreme case, since that is raised very late during startup, after the router is running.

tomwilkie commented 9 years ago

Is #956 due to unresolvable hostnames? We could resolve them before returning with --dummy. Probably not that simple though.


rade commented 9 years ago

It would actually be quite trivial to log all the errors with a grep-able pattern. We'd simply need to replace all the `Log.Fatal` invocations in `main.go` with a function that invokes `Log.Fatal` with a suitable prefix.

bboreham commented 9 years ago

To extend the logging idea, `weave status` could look for relevant log lines from the last run, as an improvement on saying "weave is not running".

It can't be that hard to find all the places where the router decides to quit, and the panic log is also recognisable.

rade commented 9 years ago

> To extend the logging idea

Separate issue; let's not pile extra features into this one.

rade commented 8 years ago

The "grep the logs" idea is flawed since docker logging can be configured such that container logs go elsewhere and are not available via `docker logs`. (This, after several users ran into it, prompted us to make the error message the more generic "Consult the container logs for further details.")

I suppose we could try grep-ing `docker logs`, and if that fails revert to the generic error.

rade commented 8 years ago

> It would actually be quite trivial to log all the errors with a grep-able pattern. We'd simply need to replace all the `Log.Fatal` invocations in `main.go` with a function that invokes `Log.Fatal` with a suitable prefix.

The `Log.Fatal` output is already quite grep-able...

```
$ weave launch --iface=foo
The weave container has died. Consult the container logs for further details.
$ docker logs weave |& grep "^FATA:"
FATA: 2016/04/12 22:29:23.399524 At most one of --datapath and --iface must be specified.
FATA: 2016/04/12 22:29:57.706913 At most one of --datapath and --iface must be specified.
```

(NB: there are two errors here because these days we recycle containers, so this is something we need to watch out for. We could just take the last line regardless.)

The errors from the flag parser come out differently though. The logging there is configurable, but in strange ways that actually alter the behaviour.

bboreham commented 8 years ago

How about writing the cause of death to a file, which we could `docker cp` out of the container and then cat to stderr?

rade commented 8 years ago

Seems overkill. `docker logs --tail=1 weave` will do the right thing in most cases.
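In the launch script that might look like the sketch below (the function name is hypothetical; it falls back to the generic message when `docker logs` yields nothing, e.g. when logging is redirected elsewhere):

```shell
# show_death_reason: try to surface the last log line of a dead
# container; fall back to the existing generic message when docker
# logging is configured to send container logs elsewhere.
show_death_reason() {
    local last
    last=$(docker logs --tail=1 "$1" 2>&1) || last=""
    if [ -n "$last" ]; then
        echo "The $1 container has died: $last" >&2
    else
        echo "The $1 container has died. Consult the container logs for further details." >&2
    fi
}
```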

bboreham commented 7 years ago

Kubernetes has "termination reason": basically you write the thing we've been discussing to a file, `/dev/termination-log`, and Kubernetes pulls it out.
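That convention can be sketched as a tiny shell helper; the path is the Kubernetes default, made overridable here only so the sketch is runnable outside a pod:

```shell
# die: record the cause of death where Kubernetes will pick it up
# (shown as the container's termination message), then print it to
# stderr and exit non-zero.
TERMINATION_LOG="${TERMINATION_LOG:-/dev/termination-log}"

die() {
    echo "$1" > "$TERMINATION_LOG"
    echo "$1" >&2
    exit 1
}
```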