feat: add retries to the promotion webhook

yitsushi commented 1 year ago

The retry package has one algorithm now (exponential), but it can be extended to have more options (like constant delay).

The the server has three new retry related CLI flags:

--promotion-retry-delay: Delay between promotion retries in seconds. (default 3)
--promotion-retry-max-delay: Maximum delay between promotion retries. (default 20)
--promotion-retry-threshold: How many times a promotion should be retried. (default 3)

Closes #58

Why didn't I use an existing library for retries

The main reason is, retry libraries are built around the concept of retrying on client side for example http requests. Exponential retry logic is good and I think that's the most reasonable one to use (for now at least), but exponential can grow... well exponentially, but we may want to limit that value, so the first N tries follow the exponential delay, but we can be sure we don't have 2 minutes long delay between retries, but we still want to retry N+M times, and we want to wait maximum X seconds between retries after N tries (for example: 2, 4, 8, 16, 20, 20, 20).

For example I tried avast/retry-go but it was a mess with DelayType to implement the logic described above.

References:

makkes commented 1 year ago

If I may I would like to state my concern here about us implementing the retry/backoff logic ourselves. There's plenty of libraries out there and IMHO we should only come up with our own implementation if we have very good reasons to skip any of the existing ones.

This is true for the retry package introduced here as well as the ratelimiter package introduced in #128.

enekofb commented 1 year ago

This is a good PR but, in my opinion, we don't need to expose this level of control to the end user. Our main goal with retries is to recover from transient failures that may occur when the chosen promotion strategy gets executed. In my opinion, the details of how this happens do not need to differ by CR, it should instead be a property/characteristic of the pipeline-controller itself. So perhaps I would move these from CR fields to CLI flags with sensible defaults.

@enekofb what is your opinion on the above?

Given that the semantics of the retry here is to best effort recover from transient failures, i agree with @yiannistri that might no be needed for a user to have control over the fact of retrying ...

however that given the retry would be happening for transient but might be also for non-transient scenarios, it feels that we should complement this feature (potentially other story) for setting timeouts that ensure the user is able to setup timeouts for user-specific knowledge like the time for cloning the configuration repo (that could be kbs, mbs or 100s mbs)

yitsushi commented 1 year ago

This is true for the retry package introduced here as well as the ratelimiter package introduced in https://github.com/weaveworks/pipeline-controller/pull/128.

For the rate limiting PR, half the PR description is about the reason.

With this PR I'll do the same, but the main reason is, retry libraries are built around the concept of retrying on client side for example http requests. Exponential retry logic is good and I think that's the most reasonable one to use (for now at least), but exponential can grow... well exponentially, but we may want to limit that value, so the first N tries follow the exponential delay, but we can be sure we don't have 2 minutes long delay between retries, but we still want to retry N+M times, and we want to wait maximum X seconds between retries after N tries (for example: 2, 4, 8, 16, 20, 20, 20).

For example I tried avast/retry-go but it was a mess with DelayType to implement the logic described above.

weaveworks / pipeline-controller

feat: add retries to the promotion webhook #133

Why didn't I use an existing library for retries