Which queue type is used? "Standard queues guarantee that a message is delivered at least once." Maybe that can be an issue too? @michaelwittig
@5290charlie can you check what's in /var/log/messages on all EC2 instances? It should be in CloudWatch Logs. If there is something in the DLQ, that's bad :)
@sandrokeil multiple deliveries shouldn't be a problem
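For what it's worth, at-least-once delivery is only a problem if the consumer isn't idempotent. A minimal sketch of an idempotent lifecycle consumer (the queue URL is a placeholder, and I'm assuming the lifecycle notification arrives as the raw message body; this is not the template's actual script):

```python
import json

import boto3
from botocore.exceptions import ClientError

sqs = boto3.client("sqs", region_name="us-west-2")
autoscaling = boto3.client("autoscaling", region_name="us-west-2")

QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/lifecycle-queue"  # placeholder

def poll_once():
    """Handle one batch of lifecycle notifications; duplicates are harmless."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        if body.get("LifecycleTransition") == "autoscaling:EC2_INSTANCE_TERMINATING":
            try:
                autoscaling.complete_lifecycle_action(
                    LifecycleHookName=body["LifecycleHookName"],
                    AutoScalingGroupName=body["AutoScalingGroupName"],],
                    LifecycleActionToken=body["LifecycleActionToken"],
                    LifecycleActionResult="CONTINUE",
                )
            except ClientError:
                # A duplicate delivery tries to complete an action that is
                # already completed; the resulting error can be ignored.
                pass
        # Delete only after successful handling; an unprocessed message
        # becomes visible again and is retried.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```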
@michaelwittig Searched through the /var/log/messages streams for the cluster instances and didn't see anything out of the ordinary ... Here are the poller entries from the messages stream for the instance I have been referencing. (I think these are associated with the script installed on the cluster instances, right?)
Oct 5 20:07:38 ip-10-1-34-23 poller[3215]: lifecycle transition for EC2 instance i-0de007d4007676246
Oct 5 20:07:38 ip-10-1-34-23 poller[3215]: lifecycle transition for container instance f48126ca-94d8-459c-81e7-bcb24fc3c7cb
Oct 5 20:08:39 ip-10-1-34-23 poller[3215]: container instance f48126ca-94d8-459c-81e7-bcb24fc3c7cb is not idle, but wait time elapsed
Oct 5 20:08:53 ip-10-1-34-23 poller[5369]: poller started
Oct 5 20:09:38 ip-10-1-34-23 poller[5369]: lifecycle transition for EC2 instance i-0de007d4007676246
Oct 5 20:09:59 ip-10-1-34-23 poller[6402]: poller started
Oct 5 20:10:38 ip-10-1-34-23 poller[6402]: lifecycle transition for EC2 instance i-0de007d4007676246
Oct 5 20:11:05 ip-10-1-34-23 poller[7454]: poller started
Oct 5 20:11:38 ip-10-1-34-23 poller[7454]: lifecycle transition for EC2 instance i-0de007d4007676246
Oct 5 20:11:41 ip-10-1-34-23 poller[8015]: poller started
Oct 5 22:47:57 ip-10-1-34-23 poller[8015]: lifecycle transition for EC2 instance i-0683de66000c40660
Oct 5 22:47:58 ip-10-1-34-23 poller[8015]: lifecycle transition for container instance 8eb998c8-c58a-46c1-a818-600d038fed3d
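If I'm reading the script right, these lines correspond to a drain-then-wait loop, roughly like the sketch below (Python/boto3, not the template's actual script; `drain_and_wait` and its parameter are hypothetical names mirroring the template's maxWaitInSeconds):

```python
import time

import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

def drain_and_wait(cluster, container_instance_arn, max_wait_in_seconds=60):
    """Set the container instance to DRAINING, then wait until it is idle
    or the wait budget is spent."""
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )
    deadline = time.time() + max_wait_in_seconds
    while time.time() < deadline:
        desc = ecs.describe_container_instances(
            cluster=cluster, containerInstances=[container_instance_arn]
        )["containerInstances"][0]
        if desc["runningTasksCount"] == 0:
            return True  # idle: safe to let the ASG terminate the instance
        time.sleep(5)
    # Matches the log line "... is not idle, but wait time elapsed":
    # termination proceeds even though tasks are still running.
    return False
```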
The only real errors I see in streams related to this instance are in the cloud-init.log stream:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 644, in runparts
    subp(prefix + [exe_path], capture=False, shell=True)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1615, in subp
    cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/part-001']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Oct 05 19:46:17 cloud-init[2782]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Oct 05 19:46:17 cloud-init[2782]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Oct 05 19:46:17 cloud-init[2782]: util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/stages.py", line 660, in _run_modules
    cc.run(run_name, mod.handle, func_args, freq=freq)
  File "/usr/lib/python2.7/dist-packages/cloudinit/cloud.py", line 63, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python2.7/dist-packages/cloudinit/helpers.py", line 197, in run
    results = functor(*args)
  File "/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.py", line 38, in handle
    util.runparts(runparts_path)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 651, in runparts
    % (len(failed), len(attempted)))
RuntimeError: Runparts: 1 failures in 1 attempted commands
Not sure if that is relevant. Could that have something to do with the way the "user-data" is configuring the cluster instances at startup?
@5290charlie Check out /var/log/cfn-init-cmd.log and /var/log/cloud-init-output.log in CloudWatch to see what it exited on.
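If it helps, a quick way to pull those streams without the console is boto3 (the log group name here is a placeholder; use whatever your stack creates):

```python
import boto3

logs = boto3.client("logs", region_name="us-west-2")

# Placeholder log group; the stream prefix matches the log file name.
events = logs.filter_log_events(
    logGroupName="bb-cluster",
    logStreamNamePrefix="cloud-init-output",
    filterPattern="?ERROR ?WARNING ?Failed",  # only likely failure lines
)
for event in events["events"]:
    print(event["message"])
```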
@kbcaleb The cfn-init-cmd.log looks like all success messages. I do see these errors/warnings at the bottom of the cloud-init-output.log, which seem related:
ValidationError: Stack arn:aws:cloudformation:us-west-2:223016015200:stack/bb-cluster/c96ab0d0-9fbc-11e7-a30c-503aca41a0c5 is in UPDATE_COMPLETE state and cannot be signaled
Oct 05 19:46:17 cloud-init[2782]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
Oct 05 19:46:17 cloud-init[2782]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Oct 05 19:46:17 cloud-init[2782]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
the "UPDATE_COMPLETE state and cannot be signaled" shouldn't be a problem. another EC2 instance was faster than this one.
Can you post some of the messages in the DLQ?
Messages are moved to the DLQ if they cannot be processed successfully after 5 retries, so something must be wrong with them. Usually this would appear as an error from the poller, but the poller runs on all EC2 instances that form the cluster, so you have to search /var/log/messages on all EC2 instances.
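To peek at the messages without consuming them, something like this works (the DLQ URL is a placeholder; take the real one from the stack's resources):

```python
import boto3

sqs = boto3.client("sqs", region_name="us-west-2")

DLQ_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/bb-cluster-dlq"  # placeholder

resp = sqs.receive_message(
    QueueUrl=DLQ_URL,
    MaxNumberOfMessages=10,
    VisibilityTimeout=0,  # peek only: messages stay visible to other readers
    WaitTimeSeconds=1,
    AttributeNames=["ApproximateReceiveCount"],
)
for msg in resp.get("Messages", []):
    print(msg["Attributes"]["ApproximateReceiveCount"], msg["Body"])
```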
I fixed something in this area: a7557d68fb1784bdc53b08971f326e9d45e92903
TemplateID: ecs/cluster
Region: us-west-2
This is an update to my previously closed issue: https://github.com/widdix/aws-cf-templates/issues/112
Here is a quote summarizing the previous issue:
It has happened again, and I have gathered more evidence this time. I think I'm hot on the trail of figuring out why this is happening!
Screenshots and descriptions are hosted on imgur.
Here is a timeline of the events and a summary of my ideas:
Timeline:
- 16:34:31 - ECS starts new task as part of service update
- 16:35:47 - ASG launches a new instance triggered by container shortage alarm
- 16:36:20 - ASG completes launch of new cluster instance
- 16:36:22 - ECS successfully stops the previous service task after service update
- 16:41:46 - ASG begins instance termination
- 16:47:59 - EC2 instance transitions state and shuts down before being terminated
- 16:48:09 - ECS attempts to stop the task
- 16:48:22 - Web browser receives 503 response from the service (via HTTP)
- 16:48:33 - ECS registers new task in target group
- 16:49:21 - ASG completes instance termination
Ideas:
- DrainingTimeoutInSeconds (maxWaitInSeconds) is 60s, but the cluster/service didn't even attempt to stop the task until after the instance was being terminated...
- Messages that fail processing repeatedly (5 retries in the template) are placed in the DLQ.
Still not 100% sure what the issue is, but hopefully this is enough information this time and we can figure it out together! :)