mattdodge opened 4 years ago
@mattdodge Hey! Thanks a lot for your detailed report.
This does seem like a race issue.
We currently rely solely on `wait_until_ready_selector` to select which pods are examined for readiness. I believe this can result in including pods from the old generation of the deployment.
A possible fix here would be to leverage the "revision" number of the deployment, which is inherited down to the ReplicaSet and pods.
But a straightforward implementation would require changing the configuration syntax, e.g. adding `wait_until_ready_deployment: DEPLOYMENT_NAME` and deprecating and removing the existing `wait_until_ready_selector`. I'd hope we can avoid that if possible.
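For context, the "revision" number here is a real Kubernetes mechanism: the Deployment controller stamps each rollout with a `deployment.kubernetes.io/revision` annotation, and the ReplicaSet created for that rollout carries the same annotation. A minimal sketch (the deployment name and revision value below are hypothetical):

```yaml
# Illustrative only: the revision annotation as the Deployment controller
# writes it. The same annotation appears on the current ReplicaSet, so a
# resource could match only pods owned by that ReplicaSet.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                              # hypothetical name
  annotations:
    deployment.kubernetes.io/revision: "3"  # bumped on every rollout
```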
@superbrothers WDYT? Did you have any reason to avoid relying on the "revision" number?
I would be in favor of a `wait_until_ready_deployment` (or something similar) option. It's not quite as flexible as the label selector, but it probably covers most of the common use cases.
If we're going that route, we could probably make use of the `kubectl rollout status` command too, rather than digging through revision numbers and all that. This command waits for the deployment to successfully roll out, or returns an error if the timeout is hit. This is actually how I'm getting around this race condition for now. I have a pipeline YAML that looks like this:
```yaml
- put: prod-kube
  params:
    kubectl: apply -f ymls/my-app-deployment.yml
- put: prod-kube
  params:
    kubectl: rollout status deployment/my-app --timeout 60s
```
I'm sorry for the late reply.
> But a straightforward implementation would require changing the configuration syntax, e.g. adding `wait_until_ready_deployment: DEPLOYMENT_NAME` and deprecating and removing the existing `wait_until_ready_selector`. I'd hope we can avoid that if possible.
Adding a `wait_until_ready_<resource>` option is not flexible, so I'd like to avoid it if possible.
> If we're going that route we could probably make use of the `kubectl rollout status` command too, rather than digging through revision numbers and all that.
Yes, I think this is the right way. However, `wait_until_ready` is a required parameter, so changing it to an optional parameter would be a breaking change. For now, I recommend setting `wait_until_ready` to `0`, which means don't wait. (I know this is too verbose...)
```yaml
- put: prod-kube
  params:
    kubectl: apply -f ymls/my-app-deployment.yml
    wait_until_ready: 0
- put: prod-kube
  params:
    kubectl: rollout status deployment/my-app --timeout 60s
    wait_until_ready: 0
```
I will consider deleting the `wait_until_ready` param and bumping up the major version.
Is this a BUG REPORT or FEATURE REQUEST?:
What happened: The put step with a `wait_until_ready_selector` is returning success immediately. It's almost too fast for its own good!
What you expected to happen: I expect the wait step to wait until the deployment update is complete.
How to reproduce it (as minimally and precisely as possible): Have a Kubernetes deployment with a normal RollingUpdate strategy. Use this kubernetes-resource to put changes to the deployment like so:
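The original snippet was not preserved here, but based on the discussion it would be a put step using `wait_until_ready_selector`; the resource, file, and selector names below are hypothetical, not taken from the report:

```yaml
# Hedged sketch of the reproducing step (names assumed for illustration).
- put: prod-kube
  params:
    kubectl: apply -f ymls/my-app-deployment.yml
    wait_until_ready_selector: app=my-app
```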
When this step runs I see this in the output:
This returns true immediately despite the fact that the new pod/replicaset hasn't actually spun up yet. It seems like the resource is checking the ready status before the new pod is even created. Likely some kind of race condition with Kubernetes.
Environment: