zalando / skipper

An HTTP router and reverse proxy for service composition, including use cases like Kubernetes Ingress
https://opensource.zalando.com/skipper/

Wrong traffic ratio to a route with Traffic() predicate while traffic switching a RouteGroup #1464

Open szuecs opened 4 years ago

szuecs commented 4 years ago

Describe the bug

For example, you have a RouteGroup that shadows 10% of the traffic, and later you traffic switch v1/v2 with 80%/20%. This will send more than the requested 10% of the traffic to the shadow backend.

spec:
  backends:
  - name: my-backend-v1
    serviceName: my-svc-v1
    servicePort: 80
    type: service
  - name: my-backend-v2
    serviceName: my-svc-v2
    servicePort: 80
    type: service
  - name: shadow-backend
    serviceName: shadow-service
    servicePort: 80
    type: service
  defaultBackends:
  - backendName: my-backend-v1
    weight: 80
  - backendName: my-backend-v2
    weight: 20
  hosts:
  - api.example.org
  routes:
  - pathSubtree: /
  - filters:
    - teeLoopback("shadow-example")
    pathSubtree: /
    predicates:
    - Traffic(.1)
  - backends:
    - backendName: shadow-backend
    pathSubtree: /
    predicates:
    - Tee("shadow-example")
    - True()

This will create the following eskip routes:

# 80% traffic to v1
kube_rg__default__rg_shadow__all__0_0: Host(/^(api[.]example[.]org)$/) && PathSubtree("/") && Traffic(0.8)
  -> "http://10.2.3.115:9090";
# 20% traffic (rest) to v2
kube_rg__default__rg_shadow__all__0_1: Host(/^(api[.]example[.]org)$/) && PathSubtree("/")
  -> "http://10.2.6.130:9090";

# duplicated route while traffic split 10% to shadow (v1 backend) (8% shadow)
kube_rg__default__rg_shadow__all__1_0: Host(/^(api[.]example[.]org)$/) && PathSubtree("/") && Traffic(0.1) && Traffic(0.8)
  -> teeLoopback("shadow-example")
  -> "http://10.2.3.115:9090";

# duplicated route while traffic split 10% to shadow (v2 backend) (10% shadow)
kube_rg__default__rg_shadow__all__1_1: Host(/^(api[.]example[.]org)$/) && PathSubtree("/") && Traffic(0.1)
  -> teeLoopback("shadow-example")
  -> "http://10.2.6.130:9090";

kube_rg__default__rg_shadow__all__2_0: Host(/^(api[.]example[.]org)$/) && PathSubtree("/") && Tee("shadow-example") && True()
  -> "http://10.2.6.173:9090";

So while the traffic switch is ongoing, the two shadow routes tee 8% and 10% of the traffic, respectively.

Expected behavior

10% traffic goes to shadow backend

Observed behavior

N% of the traffic goes to the shadow backend, where 10% < N < 18%

aryszka commented 4 years ago

During the first step, kube_rg__default__rg_shadow__all__1_0 is tested, and it matches with a 0.1 * 0.8 = 0.08 chance. In the next step, both kube_rg__default__rg_shadow__all__0_0 and kube_rg__default__rg_shadow__all__1_1 can match the remaining 0.92, with either 0.8 * 0.92 or 0.1 * 0.92. This makes it a bit more complicated, and the total is not exactly 0.18, but somewhat less. Though this doesn't change the fact that we have this bug.
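Spelled out as a back-of-the-envelope calculation (assuming each Traffic(p) check is an independent random draw, and that the unspecified ordering between the two equally weighted routes can go either way):

# kube_rg__default__rg_shadow__all__1_1 checked before __0_0:
P(tee) = 0.1 * 0.8 + 0.92 * 0.1       = 0.172

# kube_rg__default__rg_shadow__all__0_0 checked before __1_1:
P(tee) = 0.1 * 0.8 + 0.92 * 0.2 * 0.1 = 0.0984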

What makes things worse, however, is that after the loopback, which happens with a chance different from 0.1, both kube_rg__default__rg_shadow__all__1_0 and kube_rg__default__rg_shadow__all__2_0 can match the teed copy, and because kube_rg__default__rg_shadow__all__1_0 tees another copy back into the routing table each time it matches, we can loop until maxloops is reached. This problem also needs to be considered.

aryszka commented 4 years ago

Ran an experiment.

Routing:

v1_80: Traffic(0.8)
  -> status(200)
  -> <shunt>;

v2_20: *
  -> status(200)
  -> <shunt>;

v1_10_shadow: Traffic(0.1) && Traffic(0.8)
  -> teeLoopback("shadow-example")
  -> status(200)
  -> <shunt>;

v2_10_shadow: Traffic(0.1)
  -> teeLoopback("shadow-example")
  -> status(200)
  -> <shunt>;

shadow: Tee("shadow-example") && True()
  -> status(200)
  -> <shunt>;

Requests (1000 total):

for i in {1..1000}; do curl localhost:9090; done

Repeated 5 times, restarting Skipper before each run, so that the unspecified priority between v1_10_shadow and shadow could take effect. Results:

skipper_filter_all_request_duration_seconds_count{route="shadow"} 107
skipper_filter_all_request_duration_seconds_count{route="shadow"} 176
skipper_filter_all_request_duration_seconds_count{route="shadow"} 158
skipper_filter_all_request_duration_seconds_count{route="shadow"} 165
skipper_filter_all_request_duration_seconds_count{route="shadow"} 92

Notice that sometimes the chance to hit the shadow route was ~10%, and sometimes it was ~16%.
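One possible explanation of the two clusters is the ordering of the equally weighted single-predicate routes v1_80 and v2_10_shadow. A minimal simulation sketch (not Skipper's actual matcher; it assumes each Traffic(p) evaluation is an independent random draw, routes are tried in order of decreasing predicate count, and the tie between v1_80 and v2_10_shadow can be broken either way):

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const n = 1_000_000
	for _, shadowFirst := range []bool{true, false} {
		teed := 0
		for i := 0; i < n; i++ {
			switch {
			// v1_10_shadow: Traffic(0.1) && Traffic(0.8) is checked first,
			// so it tees with a 0.08 chance
			case rand.Float64() < 0.1 && rand.Float64() < 0.8:
				teed++
			// tie broken in favor of v2_10_shadow: Traffic(0.1)
			case shadowFirst && rand.Float64() < 0.1:
				teed++
			// tie broken in favor of v1_80: Traffic(0.8) takes the request,
			// v2_10_shadow only sees the remaining 20%
			case !shadowFirst && rand.Float64() >= 0.8 && rand.Float64() < 0.1:
				teed++
			}
		}
		fmt.Printf("v2_10_shadow wins the tie: %v, teed: %.1f%%\n",
			shadowFirst, 100*float64(teed)/n)
	}
}

This prints roughly 17.2% and 9.8%, close to the two clusters in the measurements above.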

szuecs commented 4 years ago

Thanks @aryszka, yes, it's not as bad as I described, but we do have a bug. Maybe it's not that important, because the shadow backend should not cause harm if it dies, and the bug sends more traffic than expected rather than less.