weaveworks / scope

Monitoring, visualisation & management for Docker & Kubernetes
https://www.weave.works/oss/scope/
Apache License 2.0

ECS reporter throttled by AWS API #2050

Open 2opremio opened 7 years ago

2opremio commented 7 years ago

This is the Sock Shop, run with the CloudFormation template (3x m4.xlarge instances) in Weave Cloud

2opremio commented 7 years ago

It seems AWS is throttling us:

<probe> WARN: 2016/11/30 19:23:17.144169 Error listing ECS services, ECS service report may be incomplete: ThrottlingException: Rate exceeded
        status code: 400, request id: 7210cbfd-b732-11e6-a879-0b8af0abb45a
<probe> ERRO: 2016/11/30 19:23:17.154211 error applying tagger: ThrottlingException: Rate exceeded
        status code: 400, request id: 721252a0-b732-11e6-a879-0b8af0abb45a
<probe> WARN: 2016/11/30 19:23:18.549526 Error describing some ECS services, ECS service report may be incomplete: ThrottlingException: Rate exceeded

2opremio commented 7 years ago

Also, we should correct the printf format strings of some warnings:

<probe> WARN: 2016/11/30 20:54:25.202746 Failed to describe ECS task %!s(*string=0xc42121bf10), ECS service report may be incomplete: %!s(*string=0xc42121bf30)

2opremio commented 7 years ago

First 1000 lines of the logs: http://sprunge.us/SNIb

ekimekim commented 7 years ago

my short-term thoughts on a long-term solution: we may need to get clever here with caching and careful use of immutable fields. For example, StartedBy for a task isn't going to change, which means we don't need to DescribeTasks every time - except this means any other metadata we may want to collect will also get stale. :S
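
A minimal sketch of that kind of cache, assuming aws-sdk-go and hypothetical taskInfo/taskCache names (not Scope's actual types): immutable fields such as StartedBy are stored per task ARN, so DescribeTasks is only issued for ARNs we have not seen before.

```go
package ecsreporter

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// taskInfo holds fields that never change after a task is created.
type taskInfo struct {
	StartedBy string // who/what started the task
	Group     string // "service:<name>" for service-launched tasks
}

// taskCache is a hypothetical cache keyed by task ARN.
type taskCache struct {
	client *ecs.ECS
	tasks  map[string]taskInfo
}

func newTaskCache(client *ecs.ECS) *taskCache {
	return &taskCache{client: client, tasks: map[string]taskInfo{}}
}

// get returns the immutable info for the given task ARNs, calling
// DescribeTasks only for ARNs that are not cached yet.
func (c *taskCache) get(cluster string, arns []string) (map[string]taskInfo, error) {
	var missing []*string
	for _, arn := range arns {
		if _, ok := c.tasks[arn]; !ok {
			missing = append(missing, aws.String(arn))
		}
	}
	if len(missing) > 0 {
		out, err := c.client.DescribeTasks(&ecs.DescribeTasksInput{
			Cluster: aws.String(cluster),
			Tasks:   missing,
		})
		if err != nil {
			return nil, err
		}
		for _, t := range out.Tasks {
			c.tasks[aws.StringValue(t.TaskArn)] = taskInfo{
				StartedBy: aws.StringValue(t.StartedBy),
				Group:     aws.StringValue(t.Group),
			}
		}
	}
	result := make(map[string]taskInfo, len(arns))
	for _, arn := range arns {
		if info, ok := c.tasks[arn]; ok {
			result[arn] = info
		}
	}
	return result, nil
}
```

As noted above, this only helps for fields that are genuinely immutable; any mutable metadata would still need some periodic refresh.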

2opremio commented 7 years ago

I've worked around it for now by creating the cluster in a separate region (AWS rate limits per region)

2opremio commented 7 years ago

A robust way to fix this would be to use the ECS event stream: https://aws.amazon.com/blogs/compute/monitor-cluster-state-with-amazon-ecs-event-stream/

However, I am not sure whether or how easily we can plug Scope in.

ekimekim commented 7 years ago

I think it's too much to ask the user to create CloudWatch rules and an SQS queue as part of setup - or even if we go that route later, it'd be nice if it were optional. So for now I'm moving forward with caching. My thoughts so far:

Taken together, these improvements will cut at least 50% of all queries, and likely more in most situations (since there are more tasks than services, and we won't be fetching services that aren't present on the machine).
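
One way to do the "don't fetch services that aren't present on the machine" part, sketched with a hypothetical helper and relying on ECS setting a service-launched task's Group to "service:<service-name>":

```go
package ecsreporter

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// servicesForLocalTasks is a hypothetical helper: given the tasks already
// known to run on this instance, return the names of the services that own
// them, so DescribeServices (at most 10 names per call) is only issued for
// services actually present here.
func servicesForLocalTasks(tasks []*ecs.Task) []string {
	seen := map[string]bool{}
	var services []string
	for _, t := range tasks {
		group := aws.StringValue(t.Group)
		name := strings.TrimPrefix(group, "service:")
		if name == group || seen[name] {
			continue // not started by a service, or already recorded
		}
		seen[name] = true
		services = append(services, name)
	}
	return services
}
```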

ekimekim commented 7 years ago

We could cut down on requests further by allowing our data to be stale up to some refresh interval (say, 1 minute), but still doing a shortcut refresh when needed to find the correct task for a service. But I'd like to avoid stale data in the details panel if at all possible - even a single instance of that can undermine user confidence in its accuracy in all cases.
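
A sketch of that trade-off, with hypothetical names: reads tolerate data up to a configured age, but a caller that needs authoritative data (say, to resolve the task for a service shown in the details panel) can force a refresh.

```go
package ecsreporter

import (
	"sync"
	"time"
)

// staleCache wraps a fetch function (which hits the AWS API) and serves
// cached results until they are older than maxAge, unless forced.
type staleCache struct {
	mu      sync.Mutex
	maxAge  time.Duration // e.g. 1 * time.Minute
	fetched time.Time
	data    map[string]string
	fetch   func() (map[string]string, error)
}

func (c *staleCache) get(force bool) (map[string]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if force || c.data == nil || time.Since(c.fetched) > c.maxAge {
		data, err := c.fetch()
		if err != nil {
			return c.data, err // keep serving the stale copy on API errors
		}
		c.data, c.fetched = data, time.Now()
	}
	return c.data, nil
}
```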

pidster commented 7 years ago

I think it's too much to ask the user to create CloudWatch rules and an SQS queue as part of setup - or even if we go that route later, it'd be nice if it were optional.

Would this be something for the launch-generator to take care of? cc @lukemarsden @errordeveloper

2opremio commented 7 years ago

Would this be something for the launch-generator to take care of?

I don't see how, at least not in the way we are currently using the launch-generator. You cannot create CloudWatch rules from cluster resources (be it Kubernetes, ECS or what have you).

pidster commented 7 years ago

The AWS Blox project purportedly provides a CFN template for doing this. The launch-generator could do the same, no? Or at least provide a fragment.

2opremio commented 7 years ago

Sure, we could create the AWS resources through a CFN template but I don't think the launch-generator would be involved.

Scope could detect whether the ECS SQS queue is available at start through the presence of a parameter in /etc/weave.scope.config (e.g. the SQS credentials) and otherwise fall back to using the AWS API directly.
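
A sketch of that detection step; the SCOPE_ECS_SQS_QUEUE_URL key and the key=value file format are illustrative assumptions, not an existing Scope option:

```go
package ecsreporter

import (
	"bufio"
	"os"
	"strings"
)

// sqsQueueFromConfig looks for a hypothetical SCOPE_ECS_SQS_QUEUE_URL=...
// line in the given config file. An empty return value means "no queue
// configured": fall back to polling the AWS API directly.
func sqsQueueFromConfig(path string) string {
	f, err := os.Open(path)
	if err != nil {
		return "" // no config file, use direct API polling
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, "SCOPE_ECS_SQS_QUEUE_URL=") {
			return strings.TrimPrefix(line, "SCOPE_ECS_SQS_QUEUE_URL=")
		}
	}
	return ""
}
```

If this returns a non-empty URL, the probe would start an event-stream consumer; otherwise it keeps the current polling reporter.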

In order to propagate the SQS credentials to Scope, I guess we could:

@errordeveloper Does this make sense? If it does, let's create separate issues for it here and in https://github.com/weaveworks/integrations (we still need a minimally performant solution when SQS is not available, and I would like to use this issue for that).

2opremio commented 7 years ago

A user is experiencing this even after #2065 (Scope 1.2) in a 5-node cluster: https://weaveworks.slack.com/archives/weave-users/p1486634036001678 . Reopening.

pidster commented 7 years ago

See also: https://github.com/prometheus/prometheus/pull/2309

pecigonzalo commented 7 years ago

@2opremio I really like the idea of using CloudWatch to keep track of resource state if an SQS parameter is provided, otherwise falling back to API + cache. Maybe something like:

  1. An initial scan to get the current state and clear the SQS queue (e.g. after a reboot it could contain outdated events)
  2. Subscribe to the SQS queue
  3. Periodically query the API directly and clean up SQS, to ensure reconciliation (see the sketch below)
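
A compressed sketch of that loop using aws-sdk-go; fullRescan and applyEvent are hypothetical hooks into the reporter's state, and the 5-minute reconciliation interval is arbitrary:

```go
package ecsreporter

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// runECSEventLoop is a sketch only: a full scan seeds the state, the SQS
// queue then delivers incremental CloudWatch-routed ECS events, and a
// periodic re-scan reconciles anything missed.
func runECSEventLoop(queueURL string, fullRescan func() error, applyEvent func(body string)) error {
	client := sqs.New(session.Must(session.NewSession()))

	// 1. Initial scan to seed state; outdated events already sitting in the
	//    queue are tolerable because the periodic rescan reconciles anyway.
	if err := fullRescan(); err != nil {
		return err
	}

	reconcile := time.NewTicker(5 * time.Minute)
	defer reconcile.Stop()

	for {
		select {
		case <-reconcile.C:
			// 3. Periodic reconciliation against the API.
			if err := fullRescan(); err != nil {
				return err
			}
		default:
			// 2. Long-poll the queue for ECS events.
			out, err := client.ReceiveMessage(&sqs.ReceiveMessageInput{
				QueueUrl:            aws.String(queueURL),
				MaxNumberOfMessages: aws.Int64(10),
				WaitTimeSeconds:     aws.Int64(20),
			})
			if err != nil {
				return err
			}
			for _, m := range out.Messages {
				applyEvent(aws.StringValue(m.Body))
				// Delete errors are ignored in this sketch; redelivered
				// events are harmless because state is reconciled anyway.
				_, _ = client.DeleteMessage(&sqs.DeleteMessageInput{
					QueueUrl:      aws.String(queueURL),
					ReceiptHandle: m.ReceiptHandle,
				})
			}
		}
	}
}
```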

Some things to keep in mind:

bboreham commented 6 years ago

Is every probe polling the API and getting the same information? If so, could we configure just one probe to poll?

errordeveloper commented 6 years ago

@bboreham I'd agree; however, currently nothing stops you from running Scope probes in different clusters, in which case how can we tell which probe is in which cluster?

it'd make a lot of sense to externalise Kubernetes and ECS code into plugins, and deploy just one of those per cluster.

bboreham commented 6 years ago

By “just one” I meant one per cluster.

rade commented 6 years ago

could we configure just one probe to poll?

And if the node on which that runs gets removed, we'd have to enable the polling on another one. Not easy.

errordeveloper commented 6 years ago

And if the node on which that runs gets removed, we'd have to enable the polling on another one. Not easy.

The only sensible thing I can imagine would be to run these integrations outside of the probe process, as containers, and let the orchestrator take care of where they run. It's probably a little easier than doing some kind of election among probes, but a big change nevertheless, although it could help the plugin story.

pidster commented 6 years ago

And if the node on which that runs gets removed, we'd have to enable the polling on another one. Not easy.

Can Kubernetes run this as a deployment of 1? Or do the plugins need to be sidecars?

Obvs, not going to work for ECS...

errordeveloper commented 6 years ago

Can Kubernetes run this as a deployment of 1? Or do the plugins need to be sidecars?

Yes it can, as long as there is also a probe pod on the same node (which should be the case under normal conditions).

Obvs, not going to work for ECS...

I think it could be made to work...