yext / edward

A tool for managing local microservice instances
http://engblog.yext.com/edward/
MIT License
171 stars 32 forks source link

Feature Request: Configurable grace period for stop before SIGKILL #173

Closed shousper closed 5 years ago

shousper commented 5 years ago

Could we make the interrupt grace period configurable per-service?

I found here its hard-coded to 5 seconds, but this isn't reasonable for services that have a few tasks to perform in order to shutdown gracefully.

I believe, for example, Docker's default is 10 seconds. This would be a much better default IMO, but ultimately I think a configurable grace period would interop better if users have services that might have variant shutdown periods.

As an aside, I'd also recommend either making the stop signal configurable (at the global level) or change it to SIGTERM instead of SIGINT, as this is also standard for Docker.

This 2 enhancements will help reduce the overhead to replicate production behaviours in, for example, AWS ECS or EKS/Kubernetes.

EDIT: Docker's default is 10 seconds, not 30. Updated. However ECS, EKS and Kubernetes default to 30.

theothertomelliott commented 5 years ago

Thanks for raising this issue, really appreciate the detail and recommendation!

Do you see it being more useful for your use case to have the grace period configurable in the service configuration, a command-line flag, or both?

shousper commented 5 years ago

No worries mate, thanks for building and maintaining such a useful tool for us!

I think both could be most helpful. We don't tend to use command line flags for edward at all, everything is configured via an edward.json for convenience and sharing. I'd say in priority order:

1) Service level override minimum would sort us out for a good while. 2) Top-level config override would be nice to have, but not absolutely required. 3) Command line flag for the top-level override might be handy, but also probably a nice to have.

theothertomelliott commented 5 years ago

Sounds good!

Let's focus on getting (1) done first. I can probably get some time to work on this over the next week, unless you or someone you know would be interested in putting together a PR.

shousper commented 5 years ago

That would be fantastic! We're all pretty flat out over here, but I'll ask around 👍

steveoc64 commented 5 years ago

@shousper - while you are there, do you have a suggestion for the syntax to add to the config ?

Im thinking something simple like this on the service definition

{
  "name": "my-awesome-service",
  "commands": {
     "launch": "my-awesome-service"
  },
  "terminate": { 
    "signal": "11",  // UNIX signal ID. Default = syscall.SIGINT
    "wait": "30s"   // use time.Parse() syntax. Default = 5s
  }
}

with the ability to add "terminate" to the group definition as well, and inherit that at each service level.

Thoughts ?

Another way to achieve the same thing - use the service::stop directive, and wrap it in a shell script that does what you need ?

theothertomelliott commented 5 years ago

Would the "signal" here be referring to the initial signal sent, or the one sent after the "wait" period?

shousper commented 5 years ago

Initial signal 👍

theothertomelliott commented 5 years ago

One snag with setting the initial signal sent is that the runner is signaled rather than the service itself.

Edward starts a separate runner process that maintains the service process as a child. If the runner receives an interrupt, it will attempt to stop the service as appropriate to the backend being used.

For the command-line backend, this is currently SIGINT. For the Docker backend, this is a call to the Docker API to stop the container.

The wait period is applicable to all backends, the signal itself is not, so they'd need to be separate in the config.

I'm working on a change that will add a "terminationTimeout" option (default 30s) to service config and set the default signal for stopping command-line services to SIGTERM. Will upload as a PR and link here.

theothertomelliott commented 5 years ago

I've uploaded a PR with a proposed implementation (see link above). I think the test might need some re-working, but it demonstrates the idea.

shousper commented 5 years ago

Hi @theothertomelliott, I just checked out your branch now for some testing and this looks great 👍 Very happy with the new defaults and configuration field 🎉 Stoked with how fast you did this! Thanks

theothertomelliott commented 5 years ago

My pleasure! Glad the changes fill the need.

I'm going to do a little cleanup on the test, but should be able to get that merged and drop out a minor release over the next few days.

shousper commented 5 years ago

Regarding:

If the runner receives an interrupt, it will attempt to stop the service as appropriate to the backend being used.

I'd say what you've done is sufficient. Once the "backends" feature reaches maturity and you find you need to move this timeout value to a command line backend specific property, I'd say that'd be fine too. Can always deprecate and move these things over time 😃.

As far as Docker, it has a Dockerfile command for the stop signal and you already support overriding config for this. So I'd say that one is already solid 👌

theothertomelliott commented 5 years ago

Deployed version 1.8.19 with the configurable timeout.