ECS Cluster Scaling (part 2)

5290charlie commented 7 years ago

TemplateID: ecs/cluster Region: us-west-2

This is an update from the my previously closed issue: https://github.com/widdix/aws-cf-templates/issues/112

Here is a quote summarizing the previous issue:

I just encountered a strange state in which an ECS service 
(based on the service-cluster-alb template) was being updated 
at the same time the cluster ASG was scaling down. Somehow a 
task was started on an instance as it was being terminated, 
and the entire stack update process stalled out (update 
rollback is currently still stalled also).

It has happened again, and I have gathered more evidence this time. I think I'm hot on the trail of figuring out why this is happening!

Screen shots and descriptions hosted on imgur

Here is a timeline of the events and a summary of my ideas:

Timeline:

16:34:31 - ECS starts new task as part of service update
16:35:47 - ASG launches a new instance trigged by container shortage alarm
16:36:20 - ASG completes launch of new cluster instance
16:36:22 - ECS successfully stops the previous service task after service update
16:41:46 - ASG begins instance termination
16:47:59 - EC2 instance transitions state and shuts down before being terminated
16:48:09 - ECS attempts to stop the task
16:48:22 - Web browser receives 503 response from the service (via HTTP)
16:48:33 - ECS registers new task in target group
16:49:21 - ASG completes instance termination

Ideas:

Not high enough value set for DrainingTimeoutInSeconds (maxWaitInSeconds)
- This defaults to 60s, but the cluster/service didn't even attempt to stop the task until after the instance was being terminated...
Possible issue with the SQS queue used to facilitate the lifecycle hook events
- After exploring the SQS dashboard to investigate these queues, I noticed that the DLQ had quite a few messages
- I am not very familiar with SQS, but my understanding is that messages that are received and not processed (deleted) more than the maximum (set to 5 in the template) are placed in the DLQ.
- Maybe the termination event hook messages aren't being received properly?
- Here is the message body for the instance lifecycle event in question:
```
{
"LifecycleHookName": "bb-cluster-AutoScalingGroupTerminatingLifecycleHook-ZTJDYU4RWMZQ",
"AccountId": "xxxxxxxxxxxx",
"RequestId": "db764c16-e5ac-4f45-b464-ccd66d391109",
"LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
"AutoScalingGroupName": "bb-cluster-AutoScalingGroup-Y6ZKBYUJ8UA4",
"Service": "AWS Auto Scaling",
"Time": "2017-10-05T22:46:57.817Z",
"EC2InstanceId": "i-0683de66000c40660",
"LifecycleActionToken": "942e7c78-d88a-423b-866b-52792dc6f292"
}
```

Still not 100% sure what the issue is, but hopefully this is enough information this time and we can figure it out together! :)

sandrokeil commented 7 years ago

Which Queue type is used, because “Standard queues guarantee that a message is delivered at least once.”. Maybe that can be an issue too? @michaelwittig

michaelwittig commented 7 years ago

@5290charlie can you check what's in /var/log/messages on all EC2 instances. Should be in CloudWatch logs. If there is something in the DLQ that's bad :)

@sandrokeil multiple deliveries shouldn't be a problem

5290charlie commented 7 years ago

@michaelwittig Searched through the /var/log/messages streams for the cluster instances, didn't see anything out of the ordinary ... Here are the poller entries from the messages stream for the instance I have been referencing. (I think these are associated to the script installed on the cluster instances, right?)

Oct  5 20:07:38 ip-10-1-34-23 poller[3215]: lifecycle transition for EC2 instance i-0de007d4007676246
Oct  5 20:07:38 ip-10-1-34-23 poller[3215]: lifecycle transition for container instance f48126ca-94d8-459c-81e7-bcb24fc3c7cb
Oct  5 20:08:39 ip-10-1-34-23 poller[3215]: container instance f48126ca-94d8-459c-81e7-bcb24fc3c7cb is not idle, but wait time elapsed
Oct  5 20:08:53 ip-10-1-34-23 poller[5369]: poller started
Oct  5 20:09:38 ip-10-1-34-23 poller[5369]: lifecycle transition for EC2 instance i-0de007d4007676246
Oct  5 20:09:59 ip-10-1-34-23 poller[6402]: poller started
Oct  5 20:10:38 ip-10-1-34-23 poller[6402]: lifecycle transition for EC2 instance i-0de007d4007676246
Oct  5 20:11:05 ip-10-1-34-23 poller[7454]: poller started
Oct  5 20:11:38 ip-10-1-34-23 poller[7454]: lifecycle transition for EC2 instance i-0de007d4007676246
Oct  5 20:11:41 ip-10-1-34-23 poller[8015]: poller started
Oct  5 22:47:57 ip-10-1-34-23 poller[8015]: lifecycle transition for EC2 instance i-0683de66000c40660
Oct  5 22:47:58 ip-10-1-34-23 poller[8015]: lifecycle transition for container instance 8eb998c8-c58a-46c1-a818-600d038fed3d

The only real errors I see in streams related to this instance are in the cloud-init.log stream:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 644, in runparts
    subp(prefix + [exe_path], capture=False, shell=True)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1615, in subp
    cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/part-001']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Oct 05 19:46:17 cloud-init[2782]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Oct 05 19:46:17 cloud-init[2782]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Oct 05 19:46:17 cloud-init[2782]: util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/stages.py", line 660, in _run_modules
    cc.run(run_name, mod.handle, func_args, freq=freq)
  File "/usr/lib/python2.7/dist-packages/cloudinit/cloud.py", line 63, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python2.7/dist-packages/cloudinit/helpers.py", line 197, in run
    results = functor(*args)
  File "/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.py", line 38, in handle
    util.runparts(runparts_path)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 651, in runparts
    % (len(failed), len(attempted)))
RuntimeError: Runparts: 1 failures in 1 attempted commands

Not sure if that is relevant, could that have something to do with the way the "user-data" is configuring the cluster instances at startup?

kbcaleb commented 7 years ago

@5290charlie Checkout the /var/log/cfn-init-cmd.log and /var/log/cloud-init-output.log in CloudWatch to see what it exited on.

5290charlie commented 7 years ago

@kbcaleb the cfn-init-cmd.log looks like all success messages, I do see these errors/warnings at the bottom of the cloud-init-output.log which seem related:

ValidationError: Stack arn:aws:cloudformation:us-west-2:223016015200:stack/bb-cluster/c96ab0d0-9fbc-11e7-a30c-503aca41a0c5 is in UPDATE_COMPLETE state and cannot be signaled
Oct 05 19:46:17 cloud-init[2782]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
Oct 05 19:46:17 cloud-init[2782]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Oct 05 19:46:17 cloud-init[2782]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed

michaelwittig commented 7 years ago

the "UPDATE_COMPLETE state and cannot be signaled" shouldn't be a problem. another EC2 instance was faster than this one.

Can you post some of the messages in the dlq?
messages are moved to the dlq if they can not be processed successfully after 5 retires. so something must be wrong with them. but usually this would appear as an error of the poller. but the poller runs on all ec2 instances that form the cluster. so you have to search /var/log/messages of all ec2 instances

michaelwittig commented 6 years ago

I fixed something in this area: a7557d68fb1784bdc53b08971f326e9d45e92903

widdix / aws-cf-templates

ECS Cluster Scaling (part 2) #127

Timeline:

Ideas: