weaveworks / scope

Monitoring, visualisation & management for Docker & Kubernetes
https://www.weave.works/oss/scope/
Apache License 2.0

ECS reporter throttled by AWS API #2050

Open 2opremio opened 7 years ago

2opremio commented 7 years ago

This is the Sock Shop, run with the CloudFormation template (3x m4.xlarge instances) in Weave Cloud

2opremio commented 7 years ago

It seems AWS is throttling us:

<probe> WARN: 2016/11/30 19:23:17.144169 Error listing ECS services, ECS service report may be incomplete: ThrottlingException: Rate exceeded
        status code: 400, request id: 7210cbfd-b732-11e6-a879-0b8af0abb45a
<probe> ERRO: 2016/11/30 19:23:17.154211 error applying tagger: ThrottlingException: Rate exceeded
        status code: 400, request id: 721252a0-b732-11e6-a879-0b8af0abb45a
<probe> WARN: 2016/11/30 19:23:18.549526 Error describing some ECS services, ECS service report may be incomplete: ThrottlingException: Rate exceeded

2opremio commented 7 years ago

Also, we should correct the printf format string of some warnings:

<probe> WARN: 2016/11/30 20:54:25.202746 Failed to describe ECS task %!s(*string=0xc42121bf10), ECS service report may be incomplete: %!s(*string=0xc42121bf30)
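
For reference, %!s(*string=0x...) means a *string was passed straight to a %s verb; the AWS SDK returns its string fields as pointers. A minimal sketch of the fix, assuming the values come from an ecs.Failure (the exact call site is a guess, not the actual Scope code):

package example

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// logDescribeFailure shows only the formatting fix; the real call site may differ.
func logDescribeFailure(failure *ecs.Failure) {
	// Broken: failure.Arn and failure.Reason are *string, so %s prints "%!s(*string=0x...)".
	//   log.Printf("Failed to describe ECS task %s, ...: %s", failure.Arn, failure.Reason)
	// Fixed: dereference with aws.StringValue, which also handles nil safely.
	log.Printf("Failed to describe ECS task %s, ECS service report may be incomplete: %s",
		aws.StringValue(failure.Arn), aws.StringValue(failure.Reason))
}
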
2opremio commented 7 years ago

First 1000 lines of the logs: http://sprunge.us/SNIb

ekimekim commented 7 years ago

My short-term thoughts on a long-term solution: we may need to get clever here with caching and careful use of immutable fields. For example, StartedBy for a task isn't going to change, which means we don't need to call DescribeTasks every time - except this means any other metadata we may want to collect will also go stale. :S
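
A rough sketch of that caching idea, using the aws-sdk-go ECS client with hypothetical names (not the actual Scope implementation): store only the immutable fields, and call DescribeTasks solely for task ARNs we haven't seen before.

package example

import (
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// taskInfo holds only fields that never change for a given task ARN.
type taskInfo struct {
	startedBy         string // deployment ID that started the task
	taskDefinitionArn string
}

type taskCache struct {
	mu    sync.Mutex
	tasks map[string]taskInfo // task ARN -> immutable metadata
}

func newTaskCache() *taskCache {
	return &taskCache{tasks: map[string]taskInfo{}}
}

// describeNewTasks calls DescribeTasks only for ARNs not yet in the cache.
func (c *taskCache) describeNewTasks(client *ecs.ECS, cluster string, arns []string) error {
	c.mu.Lock()
	var missing []*string
	for _, arn := range arns {
		if _, ok := c.tasks[arn]; !ok {
			missing = append(missing, aws.String(arn))
		}
	}
	c.mu.Unlock()
	if len(missing) == 0 {
		return nil // everything already cached; no API call needed
	}
	resp, err := client.DescribeTasks(&ecs.DescribeTasksInput{
		Cluster: aws.String(cluster),
		Tasks:   missing, // note: DescribeTasks accepts at most 100 ARNs per call
	})
	if err != nil {
		return err
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, t := range resp.Tasks {
		c.tasks[aws.StringValue(t.TaskArn)] = taskInfo{
			startedBy:         aws.StringValue(t.StartedBy),
			taskDefinitionArn: aws.StringValue(t.TaskDefinitionArn),
		}
	}
	return nil
}

The trade-off noted above remains: anything beyond the truly immutable fields would still need to be refreshed separately.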

2opremio commented 7 years ago

I've worked around it for now by creating the cluster in a separate region (AWS applies rate limits per region).

2opremio commented 7 years ago

A robust way to fix this would be to use the ECS event stream: https://aws.amazon.com/blogs/compute/monitor-cluster-state-with-amazon-ecs-event-stream/

However, I am not sure whether or how easily we can plug Scope in.
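
For context, the event stream is delivered via CloudWatch Events, and the linked post routes events to an SQS queue. A hedged sketch of what a consumer could look like in Go, assuming such a queue already exists (queueURL, the CloudWatch Events rule, and any wiring into Scope are all hypothetical):

package example

import (
	"encoding/json"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// pollECSEvents long-polls an SQS queue that a CloudWatch Events rule feeds
// with ECS state-change events, instead of re-describing the whole cluster.
func pollECSEvents(queueURL string) {
	svc := sqs.New(session.Must(session.NewSession()))
	for {
		out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(10),
			WaitTimeSeconds:     aws.Int64(20), // long polling keeps request volume low
		})
		if err != nil {
			log.Printf("receiving ECS events: %v", err)
			time.Sleep(time.Second)
			continue
		}
		for _, msg := range out.Messages {
			var event struct {
				DetailType string          `json:"detail-type"` // e.g. "ECS Task State Change"
				Detail     json.RawMessage `json:"detail"`      // the task/instance payload
			}
			if err := json.Unmarshal([]byte(aws.StringValue(msg.Body)), &event); err != nil {
				continue
			}
			// Here the local view of the cluster would be updated from event.Detail.
			log.Printf("got event: %s", event.DetailType)
			if _, err := svc.DeleteMessage(&sqs.DeleteMessageInput{
				QueueUrl:      aws.String(queueURL),
				ReceiptHandle: msg.ReceiptHandle,
			}); err != nil {
				log.Printf("deleting ECS event message: %v", err)
			}
		}
	}
}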

ekimekim commented 7 years ago

I think it's too much to ask the user to create CloudWatch rules and an SQS queue as part of setup - or, even if we go that route later, it'd be nice if it were optional. So for now I'm moving forward with caching. My thoughts so far:

Taken together, these improvements will cut at least 50% of all queries, and likely more in most situations (since there are more tasks than services, and we won't be fetching services that aren't present on the machine).

ekimekim commented 7 years ago

We could cut down on requests further by allowing our data to go stale, up to some refresh interval (say, 1 minute), but still doing a shortcut refresh when needed to find the correct task for a service. But I'd like to avoid stale data in the details panel if at all possible - even a single instance of that can undermine user confidence in its accuracy in all cases.
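
Purely as an illustration of that trade-off (hypothetical names, not Scope code): cached results younger than the staleness bound are returned as-is, while a caller that needs fresh data, e.g. to resolve the task for a service, can force a refresh.

package example

import (
	"sync"
	"time"
)

type staleCache struct {
	mu        sync.Mutex
	fetched   time.Time
	data      map[string]string // e.g. task ARN -> service name (illustrative)
	maxStale  time.Duration     // e.g. 1 * time.Minute
	refreshFn func() (map[string]string, error)
}

// get returns cached data unless it is older than maxStale or force is set.
func (c *staleCache) get(force bool) (map[string]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !force && time.Since(c.fetched) < c.maxStale {
		return c.data, nil // stale-but-acceptable data; no API call
	}
	data, err := c.refreshFn() // hits the ECS API
	if err != nil {
		return c.data, err // fall back to whatever we have
	}
	c.data, c.fetched = data, time.Now()
	return c.data, nil
}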

pidster commented 7 years ago

I think it's too much to ask the user to create CloudWatch rules and an SQS queue as part of setup - or, even if we go that route later, it'd be nice if it were optional.

Would this be something for the launch-generator to take care of? cc @lukemarsden @errordeveloper

2opremio commented 7 years ago

Would this be something for the launch-generator to take care of?

I don't see how, at least not in the way we are currently using the launch-generator. You cannot create CloudWatch rules from cluster resources (be it Kubernetes, ECS or what have you).

pidster commented 7 years ago

The AWS Blox project purportedly provides a CFN template for doing this. The launch-generator could do the same, no? Or at least provide a fragment.

2opremio commented 7 years ago

Sure, we could create the AWS resources through a CFN template but I don't think the launch-generator would be involved.

Scope could detect whether the ECS SQS queue is available at start through the presence of a parameter in /etc/weave.scope.config (e.g. the SQS credentials) and otherwise fall back to using the AWS API directly.
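
A minimal sketch of that start-up decision; the key=value format and the ECS_SQS_QUEUE_URL name are assumptions for illustration, not an existing Scope convention:

package main

import (
	"bufio"
	"log"
	"os"
	"strings"
)

// sqsQueueFromConfig returns the configured queue URL, or "" if none is set.
func sqsQueueFromConfig(path string) string {
	f, err := os.Open(path) // e.g. /etc/weave.scope.config
	if err != nil {
		return ""
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, "ECS_SQS_QUEUE_URL=") {
			return strings.TrimPrefix(line, "ECS_SQS_QUEUE_URL=")
		}
	}
	return ""
}

func main() {
	if queueURL := sqsQueueFromConfig("/etc/weave.scope.config"); queueURL != "" {
		log.Printf("using ECS event stream from %s", queueURL)
		// subscribe to the SQS queue (see the event-stream sketch above)
	} else {
		log.Printf("no SQS queue configured; polling the ECS API directly")
		// fall back to the existing List/Describe polling, with caching
	}
}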

In order to propagate the SQS credentials to Scope, I guess we could:

@errordeveloper Does this make sense? If it does, let's create separate issues for it here and in https://github.com/weaveworks/integrations (we still need a minimally performant solution when SQS is not available, and I would like to use this issue for that).

2opremio commented 7 years ago

A user is experiencing this even after #2065 (Scope 1.2) in a 5-node cluster: https://weaveworks.slack.com/archives/weave-users/p1486634036001678 . Reopening.

pidster commented 7 years ago

See also: https://github.com/prometheus/prometheus/pull/2309

pecigonzalo commented 7 years ago

@2opremio I really like the idea of using CloudWatch to keep the state of the resources if an SQS parameter is provided, and otherwise falling back to API + cache. Maybe something like:

  1. An initial scan to get state and clear the SQS queue (since, e.g. after a reboot, it could contain outdated events)
  2. Subscribe to the SQS queue
  3. Periodically query the API directly and clean up SQS, to ensure reconciliation (a rough sketch of this loop follows the list).
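
A rough sketch of that loop (all names hypothetical, just to make the flow concrete):

package example

import "time"

type ecsEvent struct{} // parsed ECS event from the SQS subscription

type clusterState struct{} // cached view of tasks and services

func (s *clusterState) fullScan()         {} // list + describe via the ECS API
func (s *clusterState) apply(ev ecsEvent) {} // incremental update from one event

func runECSStateLoop(state *clusterState, events <-chan ecsEvent, stop <-chan struct{}) {
	state.fullScan() // 1. initial scan (the SQS queue would be drained of stale events first)
	reconcile := time.NewTicker(5 * time.Minute)
	defer reconcile.Stop()
	for {
		select {
		case ev := <-events: // 2. events from the SQS subscription keep state fresh
			state.apply(ev)
		case <-reconcile.C: // 3. periodic full re-scan reconciles anything missed
			state.fullScan()
		case <-stop:
			return
		}
	}
}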

Some things to keep in mind:

bboreham commented 6 years ago

Is every probe polling the API and getting the same information? If so, could we configure just one probe to poll?

errordeveloper commented 6 years ago

@bboreham I'd agree; however, currently nothing stops you from running Scope probes in different clusters, in which case how can we tell which probe is in which cluster?

It'd make a lot of sense to externalise the Kubernetes and ECS code into plugins, and deploy just one of those per cluster.

bboreham commented 6 years ago

By “just one” I meant one per cluster.

rade commented 6 years ago

could we configure just one probe to poll?

And if the node on which that runs gets removed, we'd have to enable the polling on another one. Not easy.

errordeveloper commented 6 years ago

And if the node on which that runs gets removed, we'd have to enable the polling on another one. Not easy.

The only sensible thing I can imagine would be to move these integrations outside of the probe process, run them as containers and let the orchestrator take care of where they run. It's probably a little easier than doing some kind of election among probes, but a big change nevertheless, although it could help the plugin story.

pidster commented 6 years ago

And if the node on which that runs gets removed, we'd have to enable the polling on another one. Not easy.

Can Kubernetes run this as a deployment of 1? Or do the plugins need to be sidecars?

Obvs, not going to work for ECS...

errordeveloper commented 6 years ago

Can Kubernetes run this as a deployment of 1? Or do the plugins need to be sidecars?

Yes it can, as long as there is also a probe pod on the same node (which should be the case under normal conditions).

Obvs, not going to work for ECS...

I think it could be made to work...