Expose a public API to support networking chaos engineering scenarios

Lawouach commented 6 years ago

Hi all,

The idea of fault injection or resiliency testing is becoming quite useful to find out how your system handles adverse conditions. This is often packaged under the phrase Chaos Engineering.

On my Kubernetes cluster, I have been interested in learning from degraded conditions to help me confirm or refute hunches about how bad behavior impacts my system's availability.

As a Weave user, I am interested in discussing a dedicated so-called Chaos API to change the conditions of my environment.

At a high-level:

- A way to prevent Weave Cloud from operating normally (like not releasing a new tag, so I can see whether I create alerts for it and perform the ops manually)
- Various network conditions (jitter, delay, packet loss, faulty DNS, network encryption issues...) to understand how Weave, and my applications, degrade when the network is poor
- Perhaps even entry points to make some of the Weave components fail entirely

As this is the Weave Net project, I'll focus on the second use-case only.

What I would like to do is tell Weave Net something like: "jitter ingress network of pod X for 45 seconds" or "send egress from service Y to dev null". Or even maybe, "do not respond to DNS queries from service Z for 30 seconds".

I am a big fan of a REST API for things like this, vs a Go package, as it means I can operate it from anywhere.

Is this something worth discussing, you reckon?

Thanks

Lawouach commented 6 years ago

Just an FYI, this is quite neat: https://github.com/weaveworks-plugins/scope-traffic-control

brb commented 6 years ago

@Lawouach Thanks for the issue!

As mentioned on Slack, we'd indeed be interested in defining and developing such an API.

To begin with, do you have any existing API consumer (aka chaos monkey) in mind?

Lawouach commented 6 years ago

Hi @brb, apologies for the delay.

In regards to an API, there are two (non competitive) ways of seeing it:

- a simple API that says "start running adversarial conditions in my system" (such as Azure Service Fabric https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-controlled-chaos). In this case, the platform more or less makes the decision about the chaos it runs
- an API that puts the operator in charge of deciding which chaos he/she wishes to run, and when that should occur

I don't think the two are mutually exclusive; in fact I would welcome both. The former is interesting for the providers (meaning Weaveworks here) because of the control that remains in their hands. However, the latter is also quite interesting for the consumer because it means a user can craft more complex scenarios.

With the Chaos Toolkit, which I would use to speak with any of those APIs, I tend to favour the latter because I can really tailor a chaos engineering experiment to a particular hypothesis. As I think not everything is about breaking stuff in chaos engineering, I like the opportunity for a more subtle API :)

The plugin I linked above is an interesting example because it does expose the tooling I could be interested in but:

I have no idea how to access it programmatically anyway (is there an API to converse with plugins in Weave Scope or is it UI only?)
It does so with its own "agenda", meaning I don't have much control over the parameters of the chaos operations it exposes, from its README:

The hourglass buttons control the latency, from left to right they set: 2000ms, 1000ms, and 500ms. The scissor button controls the packet loss, it sets a 10% packet loss. The circled cross button clear any previous settings.

What if I want more than 10%?

This is why a fine grained API is powerful as a consumer while a higher level API remains convenient but constrained. Both seem useful but I favour the former personally :p

brb commented 6 years ago

Sorry for the delay.

My perspective is that Weave Net could provide a (fine-grained) API to introduce faults (jitter, delay, packet loss), and then a consumer (e.g. the Chaos Toolkit) would perform various chaos tests by using the API.

Internally, from implementation POV, with the help of tc and cgroups it should be trivial to implement faults. We just need to maintain a mapping between Kubernetes objects (Pods, Namespaces, Services) and network namespaces managed by Weave Net. But we already do it in Network Policy Controller (weave-npc).
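To make the tc part a bit more concrete, here is a minimal sketch of how a fault could be translated into a `tc ... netem` invocation run inside a pod's network namespace through `nsenter`. The function names and parameters are hypothetical and nothing here is Weave Net code; it only builds the command line and executes nothing.

```python
from typing import List, Optional


def netem_command(
    device: str,
    delay_ms: Optional[int] = None,
    jitter_ms: Optional[int] = None,
    loss_percent: Optional[float] = None,
    netns_pid: Optional[int] = None,
) -> List[str]:
    """Build the argv for `tc qdisc add ... netem` (hypothetical helper)."""
    if delay_ms is None and loss_percent is None:
        raise ValueError("at least one of delay_ms or loss_percent is required")
    if jitter_ms is not None and delay_ms is None:
        raise ValueError("jitter_ms requires delay_ms")

    argv = ["tc", "qdisc", "add", "dev", device, "root", "netem"]
    if delay_ms is not None:
        argv += ["delay", f"{delay_ms}ms"]
        if jitter_ms is not None:
            # netem takes the jitter as a second value after the delay
            argv.append(f"{jitter_ms}ms")
    if loss_percent is not None:
        argv += ["loss", f"{loss_percent:g}%"]
    if netns_pid is not None:
        # Run tc inside the network namespace of the given process
        argv = ["nsenter", "-t", str(netns_pid), "-n"] + argv
    return argv


def netem_clear_command(device: str, netns_pid: Optional[int] = None) -> List[str]:
    """Build the argv that removes the root qdisc again (hypothetical helper)."""
    argv = ["tc", "qdisc", "del", "dev", device, "root"]
    if netns_pid is not None:
        argv = ["nsenter", "-t", str(netns_pid), "-n"] + argv
    return argv


if __name__ == "__main__":
    # The PID is an assumption; in practice it would come from the pod mapping
    print(" ".join(netem_command("eth0", delay_ms=500, jitter_ms=50, netns_pid=1234)))
    print(" ".join(netem_clear_command("eth0", netns_pid=1234)))
```

Resolving a Pod or Service to the right namespace PID and device is the mapping mentioned above and is left out of the sketch.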

With the Chaos Toolkit, which I would use to speak with any of those APIs

Is the Chaos Toolkit able to speak to such APIs today? Asking, as it would help us define the API and its granularity.

Lawouach commented 6 years ago

No problem. There is no rush.

It's good news you're willing to give it a spin.

The toolkit has no weave driver yet as we usually wait for the API to be defined before adding support for it. I will give some thoughts about how it could look this week however.

Would a REST HTTP endpoint be something you'd offer? Or a different protocol? What would be the auth you'd require (if any)? Considering the toolkit is often run outside of the cluster itself (but can obviously run from inside).

brb commented 6 years ago

I will give some thoughts about how it could look this week however.

Nice, looking forward! As a pointer, you might want to check https://kubernetes.io/docs/concepts/services-networking/network-policies/.

Would a REST HTTP endpoint be something you'd offer? Or a different protocol?

Yes - a REST HTTP endpoint, as the existing Weave Net API is implemented as REST over HTTP. E.g. https://github.com/weaveworks/weave/blob/master/ipam/http.go

What would be the auth you'd require (if any)?

We do not provide any auth. I think auth is not a crucial part at this stage, and we can discuss it later if needed.

brb commented 6 years ago

Lawouach commented 6 years ago

Hey,

Regarding API examples, do you mean to say you expect a Kubernetes extension API? Or an ad hoc Weave REST API endpoint?

In that latter case, would something such as this make sense to you?

POST https://weave-endpoint/fault/network

{
    "type": "latency",
    "target": {
        "id": "...",
        "name": "...",
        "label": "..."
    },
    "device": "eth0|ip|cidr",
    "delay": {
        "distribution": "normal",
        "min": "10ms",
        "max": "500ms"
    },
    "duration": "30s"
}

This would start adding latency to the given target, which could be identified by a container id/name or a pod name/label. The latency would occur only on the given device, identified by its name, IP or CIDR. Finally, a distribution would be applied to vary the latency within the range (this could start with a single hard value).

The duration would indicate how long this should run for. However, I would not mind if we could then run a DELETE to cancel it, either before the duration elapses or in case no upper duration limit was set.

This would look similar for all sorts of fault injections.
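From the consumer side, a rough sketch of how a client could build those calls with only the Python standard library might look as follows. The URL and payload fields come from the example above and are purely hypothetical; no such endpoint exists in Weave Net today. How a DELETE would identify the fault to cancel has not been discussed, so the sketch simply assumes a DELETE on the same URL.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical endpoint taken from the proposal above
FAULT_URL = "https://weave-endpoint/fault/network"


def latency_fault(
    target: dict,
    min_delay: str,
    max_delay: str,
    duration: Optional[str] = None,
    distribution: str = "normal",
) -> dict:
    """Build the proposed latency payload; omit duration for an open-ended fault."""
    body = {
        "type": "latency",
        "target": target,
        "delay": {"distribution": distribution, "min": min_delay, "max": max_delay},
    }
    if duration is not None:
        body["duration"] = duration
    return body


def start_request(fault: dict, url: str = FAULT_URL) -> urllib.request.Request:
    """Prepare (but do not send) the POST that would start the fault."""
    return urllib.request.Request(
        url,
        data=json.dumps(fault).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def cancel_request(url: str = FAULT_URL) -> urllib.request.Request:
    """Prepare (but do not send) the DELETE that would cancel the fault."""
    # Assumption: the fault is cancelled with a DELETE on the same URL
    return urllib.request.Request(url, method="DELETE")


if __name__ == "__main__":
    fault = latency_fault({"label": "app=web"}, "10ms", "500ms", duration="30s")
    req = start_request(fault)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Sending would be: urllib.request.urlopen(req)
```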

brb commented 6 years ago

Regarding API examples, do you mean to say you expect a Kubernetes extension API?

I'd not bother with extending the Kubernetes API at this moment.

In that latter case, would something such as this make sense to you?

IMO, it's too fine grained. As an end-user, you probably don't want to mess with such details as "device". Instead, you want to select pods (e.g. with namespaceSelector and podSelector as in NetworkPolicy, see my link above) between which a fault should be injected. Does it make sense to you?
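As an illustration of the selector idea, here is a small sketch of resolving `podSelector` and `namespaceSelector` to a set of pods. It assumes only the `matchLabels` form used by NetworkPolicy (`matchExpressions` is ignored), and the data shapes and function names are hypothetical.

```python
from typing import Dict, List


def matches(selector: dict, labels: Dict[str, str]) -> bool:
    """True if every matchLabels pair is present in labels; an empty selector matches all."""
    wanted = selector.get("matchLabels", {})
    return all(labels.get(key) == value for key, value in wanted.items())


def select_pods(
    pods: List[dict],
    namespaces: Dict[str, Dict[str, str]],
    pod_selector: dict,
    namespace_selector: dict,
) -> List[str]:
    """Return "namespace/name" for each pod matched by both selectors."""
    selected = []
    for pod in pods:
        ns_labels = namespaces.get(pod["namespace"], {})
        if matches(namespace_selector, ns_labels) and matches(pod_selector, pod["labels"]):
            selected.append(f'{pod["namespace"]}/{pod["name"]}')
    return selected


if __name__ == "__main__":
    # Made-up inventory, for illustration only
    namespaces = {"prod": {"env": "prod"}, "dev": {"env": "dev"}}
    pods = [
        {"name": "web-1", "namespace": "prod", "labels": {"app": "web"}},
        {"name": "db-1", "namespace": "prod", "labels": {"app": "db"}},
        {"name": "web-2", "namespace": "dev", "labels": {"app": "web"}},
    ]
    print(select_pods(
        pods,
        namespaces,
        pod_selector={"matchLabels": {"app": "web"}},
        namespace_selector={"matchLabels": {"env": "prod"}},
    ))
```

With this shape the consumer never names a device; the implementation would map the selected pods to their network namespaces.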

However, I would not mind if we could then run a DELETE to cancel

That's what I prefer.

Lawouach commented 6 years ago

Much agreed with using the pod+ns selector indeed!

However, I'd still be interested in tuning the fault itself, even if I am happy living with default settings. So for instance, a fairly basic "add Xms latency to all egress of that pod", but also something a little richer. But again, I could live with the former as long as it offers some basic tuning.

brb commented 6 years ago

but also something a little richer

Please elaborate more!

Lawouach commented 6 years ago

Right.

In the case of the latency, it can be interesting to apply a distribution, not just a static delay. Or some jittering to the delay at least.

But to be honest, I do not have all the use cases in my head, so maybe I'm making this more complicated than it needs to be.
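To illustrate the difference between a static delay and a distribution, here is a tiny sketch that samples per-packet delays from a normal distribution clamped to a min/max range. Treating the range as roughly three standard deviations either side of the mean is my own assumption, not something agreed here. On Linux, tc's netem can already do something comparable with a delay, a jitter value and `distribution normal`.

```python
import random


def sample_delay_ms(min_ms: float, max_ms: float, rng: random.Random) -> float:
    """Draw one delay from a normal distribution clamped to [min_ms, max_ms]."""
    if min_ms > max_ms:
        raise ValueError("min_ms must not exceed max_ms")
    mean = (min_ms + max_ms) / 2
    # Assumption: the range covers about three standard deviations each side
    stddev = (max_ms - min_ms) / 6
    return min(max_ms, max(min_ms, rng.gauss(mean, stddev)))


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so that runs are repeatable
    delays = [sample_delay_ms(10, 500, rng) for _ in range(5)]
    print([f"{d:.0f}ms" for d in delays])
    # A static delay is the degenerate case where min equals max
    print(sample_delay_ms(100, 100, rng))
```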

brb commented 6 years ago

In the case of the latency, it can be interesting to apply a distribution, not just a static delay. Or some jittering to the delay at least.

OK, I see. Of course, we could implement it, but I'd like to start with an MVP.

brb commented 6 years ago

Lawouach commented 6 years ago

Indeed, you have also https://github.com/alexei-led/pumba

Though it hasn't exposed an API yet.

weaveworks / weave

Expose a public API to support networking chaos engineering scenarios #3274