weaveworks / integrations

Weave Third-Party Integrations
Apache License 2.0
26 stars 22 forks source link

Race condition between ecs-agent and weaveproxy #116

Closed errordeveloper closed 4 years ago

errordeveloper commented 7 years ago
[root@ip-172-31-0-151 /]# cat /var/log/ecs/ecs-init.log.2016-11-23-06
2016-11-23T06:55:07Z [ERROR] dial unix /var/run/weave/weave.sock: connect: no such file or directory
[root@ip-172-31-0-151 /]# ls -la /var/run/weave/weave.sock
srw-rw---- 1 root docker 0 Nov 23 06:55 /var/run/weave/weave.sock
[root@ip-172-31-0-151 /]# docker inspect weaveproxy | jq '.[0].State'
{
  "Status": "running",
  "Running": true,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 6197,
  "ExitCode": 0,
  "Error": "",
  "StartedAt": "2016-11-23T06:55:17.340425735Z",
  "FinishedAt": "0001-01-01T00:00:00Z"
}

This doesn't happen very often, something like 1 out of six attempts. With microservices demo it manifests itself with Service arn:aws:ecs:*-FrontEndService-* did not stabilize.

And this can be seen in the event log for FrontEnd:

service test-30a-FrontEndService-3X845SSWK0JK was unable to place a task because no container instance met all of its requirements. The closest matching container-instance 60fb37c6-e85c-4217-b44c-bae88c2a3a7c is already using a port required by your task. For more information, see the Troubleshooting section.

As the demo currently tries to get front-end to run on all 3 hosts, it fails as it doesn't have one of the hosts registered...

errordeveloper commented 7 years ago

I'd think that this may be to do with weave service not being entierly ready, while upstarts thinks it has "started", but it turns out we have extra checks in place already...

timeout 120 bash -c 'while [ "$(docker inspect -f '"'"'{{.State.Running}}'"'"' weaveproxy 2> /dev/null )" != true ]; do sleep 1; done' || { stop ; exit 0; }

Perhaps we should also check if the socket exists.

errordeveloper commented 7 years ago

Ideally, I would rather see weave setup, weave launch-proxy and weave launch-router as separate services.

2opremio commented 7 years ago

Perhaps we should also check if the socket exists.

Agreed

errordeveloper commented 7 years ago

I am looking at the upstart config once again, and wondering why it doesn't work...

[root@ip-172-31-0-151 etc]# cat /etc/init/ecs.override
# Copyright 2014-2015 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the
# "License"). You may not use this file except in compliance
# with the License. A copy of the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
# CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and
# limitations under the License.

start on started weave
stop on stopping weave

env DOCKER_HOST=unix:///var/run/weave/weave.sock

pre-start script
  # Wait for weave-proxy to be ready (for a maximum of two minutes to avoid an
  # Upstart deadlock if Weave failed to start).
  # This checks that the container is running but also implicitly checks that it's ready to
  # receive requests since DOCKER_HOST=unix:///var/run/weave/weave.sock
  timeout 120 bash -c 'while [ "$(docker inspect -f '"'"'{{.State.Running}}'"'"' weaveproxy 2> /dev/null )" != true ]; do sleep 1; done' || { stop ; exit 0; }
  exec /usr/libexec/amazon-ecs-init pre-start
end script
[root@ip-172-31-0-151 etc]#

If you think about it, we set DOCKER_HOST, and thereby any Docker client calls should fail until the socket exists... am I missing something?

2opremio commented 7 years ago

If you think about it, we set DOCKER_HOST, and thereby any Docker client calls should fail until the socket exists... am I missing something?

No, the comment is crystal clear. I had forgotten about it. Could it be that the proxy gets restarted for some reason?

errordeveloper commented 7 years ago

I had forgotten about it.

Yeah, I should read comment sometimes too...

Could it be that the proxy gets restarted for some reason?

Not according to Docker:

[root@ip-172-31-0-151 etc]# docker inspect weaveproxy | jq '.[0].RestartCount'
0
[root@ip-172-31-0-151 etc]#

...not sure if there is anywhere else I can check, e.g. if it got restarted by other means.

2opremio commented 7 years ago

Can you compare the start times of the ecs-agent and weaveproxy containers?

errordeveloper commented 7 years ago

Yeah, that's essentially in the issue description already - Docker say it started weaveproxy container 10s after the ecs-agent logged the error.

brb commented 7 years ago

Could you check the start time of proxy listening on unix://... in docker logs weaveproxy?

errordeveloper commented 7 years ago

@brb ah, great idea!


[root@ip-172-31-0-151 /]# docker logs weaveproxy
INFO: 2016/11/23 06:55:17.373438 weave proxy 1.8.0
INFO: 2016/11/23 06:55:17.373870 Command line arguments: --hostname-from-label com.amazonaws.ecs.container-name -H unix:///var/run/weave/weave.sock
INFO: 2016/11/23 06:55:17.377239 Docker API on unix:///var/run/docker.sock: &[Os=linux Arch=amd64 KernelVersion=4.4.30-32.54.amzn1.x86_64 Version=1.11.2 ApiVersion=1.23 GitCommi
t=b9f10c9/1.11.2 GoVersion=go1.5.3]
INFO: 2016/11/23 06:55:17.380006 proxy listening on unix:///var/run/weave/weave.sock
INFO: 2016/11/23 06:55:17.381918 Attaching weave router container: f1befdcd4ebf9774c4b37a2083a688465d76c19cd7293a58e2fc10ff67b5e441
[root@ip-172-31-0-151 /]#

Just a few moments after what StartedAt says...

errordeveloper commented 7 years ago

I've modified the upstart config and the main part of it looks like this now:

pre-start script
  # Wait for weave-proxy to be ready (for a maximum of two minutes to avoid an
  # Upstart deadlock if Weave failed to start).
  # This checks that the container is running but also implicitly checks that it's ready to
  # receive requests since DOCKER_HOST=unix:///var/run/weave/weave.sock
  echo "Started $(date)" >> /var/log/ecs-weave
  timeout 120 bash -c 'while [ "$(docker inspect -f '"'"'{{.State.Running}}'"'"' weaveproxy 2> /dev/null )" != true ]; do sleep 1; echo "Sleeping $(date)" >> /var/log/ecs-weave;
 done' || { stop ; exit 0; }
  echo "Checker exited $(date)" >> /var/log/ecs-weave
  exec /usr/libexec/amazon-ecs-init pre-start
end script

I have rebooted the instance and this is what I found in my logfile:

[root@ip-172-31-0-151 /]# cat /var/log/ecs-weave
Started Wed Nov 23 12:34:25 UTC 2016
Sleeping Wed Nov 23 12:34:26 UTC 2016
Started Wed Nov 23 12:34:27 UTC 2016
Started Wed Nov 23 12:34:56 UTC 2016
Sleeping Wed Nov 23 12:34:57 UTC 2016
Checker exited Wed Nov 23 12:34:58 UTC 2016
[root@ip-172-31-0-151 /]#

It's a bit odd, we cannot hit the 120s timeout, so we must be getting restarted by something else?

errordeveloper commented 7 years ago

I've also amended the following line, clear the log and rebooted.

... || { stop ; echo "Bailing $(date)" >> /var/log/ecs-weave; exit 0; }

Here is what I get:

[root@ip-172-31-0-151 /]# cat /var/log/ecs-weave

Started Wed Nov 23 13:00:27 UTC 2016
Sleeping Wed Nov 23 13:00:28 UTC 2016
Sleeping Wed Nov 23 13:00:29 UTC 2016
Sleeping Wed Nov 23 13:00:30 UTC 2016
Started Wed Nov 23 13:00:31 UTC 2016
Sleeping Wed Nov 23 13:00:32 UTC 2016
Started Wed Nov 23 13:00:33 UTC 2016
Started Wed Nov 23 13:01:01 UTC 2016
Sleeping Wed Nov 23 13:01:03 UTC 2016
Checker exited Wed Nov 23 13:01:03 UTC 2016
[root@ip-172-31-0-151 /]#

I struggle to understand what's going, especially because there is no log file to confirm the boot sequence.

2opremio commented 7 years ago

I struggle to understand what's going

Welcome to Upstart :)

initctl list should tell you what's going on, I think you can also get a stream with the events.

errordeveloper commented 7 years ago

I don't know what does stop mean in the upstart context, I'm gonna move my echo call before stop and see what happens.

Welcome to Upstart :)

I've used it before, but on Ubuntu and not on an RHEL-fork. On Ubuntu all tasks log to a file in /var/log/upstart.

initctl list

Thanks, forgot about this one... it's super basic :(

errordeveloper commented 7 years ago

Ok, so I believe stop effectively stops the current job.

And even with ... || { echo "Bailing $(date)" >> /var/log/ecs-weave; stop; exit 0; } now I'm able to confirm that this job gets stopped by something external, as I'm not seeing that line in my log on yet another reboot.

errordeveloper commented 7 years ago

I could make change to how this check is done and do something like:

until test \ 
  "$(docker -H unix:///var/run/docker.sock ps -q -f name=weaveproxy | wc -l)" -eq 1 \
  -a -S /var/run/weave/weave.sock; do \
sleep 1; ...

may be even get rid of timeout wrapping and use a counter...

But it'd be good understand what's going wrong exactly with the current version of the upstart config.