widdix / aws-cf-templates

Free Templates for AWS CloudFormation
https://templates.cloudonaut.io/
Apache License 2.0

ECS Cluster Scaling (part 2) #127

Closed · 5290charlie closed this issue 6 years ago

5290charlie commented 7 years ago

TemplateID: ecs/cluster Region: us-west-2

This is an update to my previously closed issue: https://github.com/widdix/aws-cf-templates/issues/112

Here is a quote summarizing the previous issue:

I just encountered a strange state in which an ECS service 
(based on the service-cluster-alb template) was being updated 
at the same time the cluster ASG was scaling down. Somehow a 
task was started on an instance as it was being terminated, 
and the entire stack update process stalled out (update 
rollback is currently still stalled also).

It has happened again, and I have gathered more evidence this time. I think I'm hot on the trail of figuring out why this is happening!

Screenshots and descriptions hosted on imgur

Here is a timeline of the events and a summary of my ideas:

Timeline:

Ideas:

Still not 100% sure what the issue is, but hopefully this is enough information this time and we can figure it out together! :)

sandrokeil commented 7 years ago

Which queue type is used? "Standard queues guarantee that a message is delivered at least once." Maybe that could be an issue too? @michaelwittig

michaelwittig commented 7 years ago

@5290charlie can you check what's in /var/log/messages on all EC2 instances? It should be in CloudWatch Logs. If there is something in the DLQ, that's bad :)

@sandrokeil multiple deliveries shouldn't be a problem
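
To illustrate why, here is a minimal Python sketch of idempotent lifecycle-message handling. The template's actual poller is a shell script; this only shows the idea, and the assumption that the rejected duplicate surfaces as a ValidationError is noted in the code:

# Minimal sketch, not the template's actual poller: a duplicate delivery of
# a lifecycle message fails the CompleteLifecycleAction call cleanly, so it
# can simply be dropped instead of piling up in the DLQ.
import json

import boto3
from botocore.exceptions import ClientError

autoscaling = boto3.client('autoscaling')

def handle_lifecycle_message(message):
    body = json.loads(message['Body'])
    if body.get('LifecycleTransition') != 'autoscaling:EC2_INSTANCE_TERMINATING':
        return  # e.g. test notifications carry no lifecycle action
    try:
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=body['LifecycleHookName'],
            AutoScalingGroupName=body['AutoScalingGroupName'],
            LifecycleActionToken=body['LifecycleActionToken'],
            LifecycleActionResult='CONTINUE',
        )
    except ClientError as e:
        # An already-completed action is rejected by the API; treat the
        # message as handled. (Error code assumed to be ValidationError.)
        if e.response['Error']['Code'] != 'ValidationError':
            raise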

5290charlie commented 7 years ago

@michaelwittig I searched through the /var/log/messages streams for the cluster instances and didn't see anything out of the ordinary. Here are the poller entries from the messages stream for the instance I have been referencing. (I think these are associated with the script installed on the cluster instances, right?)

Oct  5 20:07:38 ip-10-1-34-23 poller[3215]: lifecycle transition for EC2 instance i-0de007d4007676246
Oct  5 20:07:38 ip-10-1-34-23 poller[3215]: lifecycle transition for container instance f48126ca-94d8-459c-81e7-bcb24fc3c7cb
Oct  5 20:08:39 ip-10-1-34-23 poller[3215]: container instance f48126ca-94d8-459c-81e7-bcb24fc3c7cb is not idle, but wait time elapsed
Oct  5 20:08:53 ip-10-1-34-23 poller[5369]: poller started
Oct  5 20:09:38 ip-10-1-34-23 poller[5369]: lifecycle transition for EC2 instance i-0de007d4007676246
Oct  5 20:09:59 ip-10-1-34-23 poller[6402]: poller started
Oct  5 20:10:38 ip-10-1-34-23 poller[6402]: lifecycle transition for EC2 instance i-0de007d4007676246
Oct  5 20:11:05 ip-10-1-34-23 poller[7454]: poller started
Oct  5 20:11:38 ip-10-1-34-23 poller[7454]: lifecycle transition for EC2 instance i-0de007d4007676246
Oct  5 20:11:41 ip-10-1-34-23 poller[8015]: poller started
Oct  5 22:47:57 ip-10-1-34-23 poller[8015]: lifecycle transition for EC2 instance i-0683de66000c40660
Oct  5 22:47:58 ip-10-1-34-23 poller[8015]: lifecycle transition for container instance 8eb998c8-c58a-46c1-a818-600d038fed3d
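
For reference, the "not idle, but wait time elapsed" line suggests a drain-and-wait step before the lifecycle hook is continued. A rough Python sketch of that logic, assuming the poller drains the container instance first (the template's actual poller is a shell script; CLUSTER and MAX_WAIT_SECONDS are placeholders):

# Rough sketch of the drain-and-wait behaviour the log above suggests;
# not the template's actual implementation.
import time

import boto3

ecs = boto3.client('ecs')
CLUSTER = 'my-cluster'   # placeholder
MAX_WAIT_SECONDS = 60    # placeholder; roughly the gap visible in the log

def drain_and_wait(container_instance_arn):
    # Stop new task placements on the instance that is about to terminate.
    ecs.update_container_instances_state(
        cluster=CLUSTER,
        containerInstances=[container_instance_arn],
        status='DRAINING',
    )
    deadline = time.time() + MAX_WAIT_SECONDS
    while time.time() < deadline:
        desc = ecs.describe_container_instances(
            cluster=CLUSTER,
            containerInstances=[container_instance_arn],
        )
        if desc['containerInstances'][0]['runningTasksCount'] == 0:
            return True   # idle: safe to continue the lifecycle hook
        time.sleep(5)
    # The "is not idle, but wait time elapsed" case: tasks are still
    # running, but the hook is continued anyway after the max wait.
    return False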

The only real errors I see in streams related to this instance are in the cloud-init.log stream:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 644, in runparts
    subp(prefix + [exe_path], capture=False, shell=True)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1615, in subp
    cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/part-001']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Oct 05 19:46:17 cloud-init[2782]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Oct 05 19:46:17 cloud-init[2782]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Oct 05 19:46:17 cloud-init[2782]: util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/stages.py", line 660, in _run_modules
    cc.run(run_name, mod.handle, func_args, freq=freq)
  File "/usr/lib/python2.7/dist-packages/cloudinit/cloud.py", line 63, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python2.7/dist-packages/cloudinit/helpers.py", line 197, in run
    results = functor(*args)
  File "/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.py", line 38, in handle
    util.runparts(runparts_path)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 651, in runparts
    % (len(failed), len(attempted)))
RuntimeError: Runparts: 1 failures in 1 attempted commands

Not sure if that is relevant. Could it have something to do with the way the user data is configuring the cluster instances at startup?

kbcaleb commented 7 years ago

@5290charlie Check out /var/log/cfn-init-cmd.log and /var/log/cloud-init-output.log in CloudWatch to see what it exited on.

5290charlie commented 7 years ago

@kbcaleb the cfn-init-cmd.log shows all success messages, but I do see these errors/warnings at the bottom of the cloud-init-output.log which seem related:

ValidationError: Stack arn:aws:cloudformation:us-west-2:223016015200:stack/bb-cluster/c96ab0d0-9fbc-11e7-a30c-503aca41a0c5 is in UPDATE_COMPLETE state and cannot be signaled
Oct 05 19:46:17 cloud-init[2782]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
Oct 05 19:46:17 cloud-init[2782]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Oct 05 19:46:17 cloud-init[2782]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
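
For context, that ValidationError is what cfn-signal surfaces from the SignalResource API. A minimal boto3 sketch of the same call; the stack name and instance ID are taken from the logs above, and the logical resource ID is a placeholder:

# Minimal sketch of the SignalResource call that cfn-signal makes; once the
# stack has left its in-progress state, the call is rejected with the
# ValidationError quoted above, the user-data script exits non-zero, and
# cloud-init logs the scripts-user warnings.
import boto3
from botocore.exceptions import ClientError

cf = boto3.client('cloudformation', region_name='us-west-2')

try:
    cf.signal_resource(
        StackName='bb-cluster',
        LogicalResourceId='AutoScalingGroup',   # placeholder
        UniqueId='i-0de007d4007676246',         # the signaling instance
        Status='SUCCESS',
    )
except ClientError as e:
    print(e.response['Error']['Message'])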

michaelwittig commented 7 years ago

the "UPDATE_COMPLETE state and cannot be signaled" shouldn't be a problem. another EC2 instance was faster than this one.

  1. Can you post some of the messages in the DLQ?

  2. Messages are moved to the DLQ if they cannot be processed successfully after 5 retries, so something must be wrong with them. Usually this would appear as an error from the poller, but the poller runs on all EC2 instances that form the cluster, so you have to search /var/log/messages on all of them.
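
Both checks can be scripted; a rough sketch, assuming boto3 and placeholder queue/log-group names:

# Rough sketch of the two checks above; the DLQ URL and log group name are
# placeholders for the ones the ecs/cluster stack creates.
import boto3

sqs = boto3.client('sqs', region_name='us-west-2')
logs = boto3.client('logs', region_name='us-west-2')

# 1. Peek at messages in the DLQ. Received messages are not deleted and
#    become visible again after the visibility timeout.
resp = sqs.receive_message(
    QueueUrl='https://sqs.us-west-2.amazonaws.com/123456789012/my-dlq',  # placeholder
    MaxNumberOfMessages=10,
    WaitTimeSeconds=5,
)
for msg in resp.get('Messages', []):
    print(msg['Body'])

# 2. Search /var/log/messages across the streams of all instances for
#    poller output instead of opening each stream by hand.
for event in logs.filter_log_events(
    logGroupName='my-cluster',   # placeholder
    filterPattern='poller',
)['events']:
    print(event['logStreamName'], event['message'])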

michaelwittig commented 6 years ago

I fixed something in this area: a7557d68fb1784bdc53b08971f326e9d45e92903