zalando / skipper

An HTTP router and reverse proxy for service composition, including use cases like Kubernetes Ingress
https://opensource.zalando.com/skipper/
Other
3.11k stars 351 forks source link

Use exponentially decaying sample for metrics #428

Closed aermakov-zalando closed 7 years ago

aermakov-zalando commented 7 years ago

In 739f413, the metrics were changed to use a UniformSample with a hard-coded 1024 reservoir size. This makes latency metrics for services with non-uniform load very skewed. For example, if our service is under a very high load (100 rps) due to a misconfiguration in another service, and then the load drops to normal (1-2 rps) after the misconfiguration is fixed, it can take hours, days or even weeks for the metrics to go back to normal. Would you consider switching back to the default exponentially decaying sample? I can send a PR if necessary.

szuecs commented 7 years ago

@lmineiro Can you please check this? Thanks!

aermakov-zalando commented 7 years ago

363 is most likely the same issue, by the way.

szuecs commented 7 years ago

Maybe it's better to have a parameter that can change the default from UniformSample to expDecay and also a parameter to set the reservoir size for UniformSample.

lmineiro commented 7 years ago

sgtm

aermakov-zalando commented 7 years ago

Benchmark: create N histograms with either a Uniform or ExpDecay sample, then update them all once per benchmark iteration. Results:

goos: darwin
goarch: amd64
pkg: github.bus.zalan.do/teapot/metrics-bench
BenchmarkUniform100-4             500000          9120 ns/op           0 B/op          0 allocs/op
BenchmarkUniform1000-4             50000         91968 ns/op           0 B/op          0 allocs/op
BenchmarkUniform10000-4            10000        903244 ns/op           0 B/op          0 allocs/op
BenchmarkUniform100000-4            1000       6693171 ns/op           0 B/op          0 allocs/op
BenchmarkExpDecay100-4            200000         22861 ns/op           0 B/op          0 allocs/op
BenchmarkExpDecay1000-4            10000        487840 ns/op           0 B/op          0 allocs/op
BenchmarkExpDecay10000-4            2000       5104737 ns/op           0 B/op          0 allocs/op
BenchmarkExpDecay100000-4            200      33383230 ns/op           0 B/op          0 allocs/op
PASS
ok      github.bus.zalan.do/teapot/metrics-bench    56.337s

Memory use: ~8272 bytes/histogram for UniformSample, ~16545 bytes/histogram for ExpDecaySample.

lmineiro commented 7 years ago

Can you do this with N updates, where N is as close as possible to the reservoir capacity?

aermakov-zalando commented 7 years ago

@lmineiro Updated 2048 times with random data (same for both benchmarks, though).

$ go test -bench=. -benchtime 10s -benchmem
goos: darwin
goarch: amd64
pkg: github.bus.zalan.do/teapot/metrics-bench
BenchmarkUniform100-4           1000      18655357 ns/op           0 B/op          0 allocs/op
BenchmarkExpDecay100-4           300      44768498 ns/op           0 B/op          0 allocs/op
PASS
ok      github.bus.zalan.do/teapot/metrics-bench    38.606s
lmineiro commented 7 years ago

These benchmarks seem to confirm what we experienced before. The memory overhead and performance of the exponential decay are worse. It was for that reason that we went with the uniform. The implemented option seems reasonable and allows users to select. I'd definitely include these number this time in the docs so that users will know what they're getting into.

aermakov-zalando commented 7 years ago

I'd say the overhead is negligible for both implementations, though, and having better metrics is worth a few more nano seconds per request. I recommend switching the default implementation after we're satisfied with the testing results.

lmineiro commented 7 years ago

I disagree that is it negligible. If you have a closer look, a single proxied request keeps and updates a lot of histograms. Not only that, the memory footprint can become overwhelming, like we realized before https://github.com/zalando/skipper/pull/446

I'd be really interested in adding a new reservoir which would combine the efficiency of the Uniform reservoir while applying a sliding window. I believe @aryszka has been working on something like this.

aryszka commented 7 years ago

I think you refer to https://github.com/zalando/skipper/blob/master/circuit/binarysampler.go

this sliding window sampler would cost too much for arbitrary values and/or too large window sizes, i believe. It's used in the circuit breaker, where we only track binary values (success vs failure), with relatively small window sizes, and so it's enough to store only bits.