zhaofengli / colmena

A simple, stateless NixOS deployment tool
https://colmena.cli.rs
MIT License

Feature Request : healthchecks #104

Open mrVanDalo opened 2 years ago

mrVanDalo commented 2 years ago

I've never used morph, but one of its features caught my attention a while ago. I'm using colmena now (great work btw), and I would like to be able to run health checks on my instances after deployments and whenever I feel like it.

I actually like morph's design here: https://github.com/DBCDK/morph/blob/master/examples/healthchecks.nix#L22 A command-line argument (health-check) would be great as well.

At the moment I use flake apps, but that is neither convenient nor easy to read.

treed commented 2 years ago

This is also very interesting to me. I think it would also be nice to have healthcheck types that:

@mrVanDalo Could you please share what your current solution looks like? I'm curious how tenable it might be for my relatively small use case. (~20 VMs)

mrVanDalo commented 2 years ago

At the moment I have something like this :

...
apps.${system}.test = {
  foo = {
    type = "app";
    program = toString (pkgs.writers.writeBash "foo" ''
      echo "testing foo"
      set -e
      set -x
      curl --fail https://service.example.com
      set +x
      echo "All Good"
    '');
  };
  bar = {
    type = "app";
    program = toString (pkgs.writers.writeBash "bar" ''
      echo "testing bar"
      set -e
      set -x
      ssh bar.private "systemctl is-active --quiet smartd && echo Service is running"
      set +x
      echo "All Good"
    '');
  };
};

colmena = {
    meta = { nixpkgs = import nixpkgs { inherit system; }; };

    defaults = { name, pkgs, ... }: {
      deployment.buildOnTarget = true;
      ...
    };

    foo = { name, nodes, pkgs, ... }: {
       ... 
    };

    bar = { name, nodes, pkgs, ... }: {
       ... 
    };
};
...

And I trigger these tests using :

nix run ".#test.foo"
nix run ".#test.bar"

I think it's not comfy. It would be much better to put the test definition next to the service that is being tested, and to have the tests run right after the deployment (or with a small delay) to make sure everything is really working.
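As a stopgap for the "right after the deployment, or with a small delay" part, a small retry wrapper around the existing flake apps might be enough. This is only a sketch: `retry_check` is a hypothetical helper name, and the attempt count and delay are arbitrary values, not anything colmena provides.

```shell
# Hypothetical helper: run a check command until it succeeds or the
# attempts run out, sleeping between attempts so that freshly deployed
# services have time to come up.
# Usage: retry_check <attempts> <delay_seconds> <command> [args...]
retry_check() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (not executed here): deploy one node, then run its check,
# allowing up to 5 attempts with 3 seconds between them.
#   colmena apply --on foo && retry_check 5 3 nix run ".#test.foo"
```

This still keeps the check definitions away from the service modules, so it only addresses the timing half of the problem.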

treed commented 2 years ago

Mm, yeah. I see what you mean. Being able to put them in my service-centric modules would definitely be a lot nicer. At this point, I'm vaguely considering just making a bunch of shell scripts I can call in the meantime with a hostname.
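A rough sketch of what such hostname-driven scripts could look like, reusing the `systemctl is-active` and `curl --fail` checks from the flake apps above. The function names `check_unit` and `check_http` are hypothetical, and the timeouts are arbitrary assumptions.

```shell
# check_unit <host> <unit>
# Succeeds if the systemd unit is active on the host. Uses the user's own
# ssh, so ~/.ssh/config and any running ssh-agent apply as usual.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_unit() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" systemctl is-active --quiet "$2"
}

# check_http <url>
# Succeeds if the URL responds without an HTTP error status (curl --fail)
# within 10 seconds. The response body is discarded.
check_http() {
  curl --fail --silent --show-error --max-time 10 --output /dev/null "$1"
}

# Example (not executed here):
#   check_unit bar.private smartd && echo "smartd is running"
#   check_http https://service.example.com && echo "service is up"
```

Both functions return the exit status of the underlying command, so they compose with `&&`, `||`, and `set -e` in a larger script.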

treed commented 2 years ago

Okay, I decided to make something a bit more real than a pile of bash scripts.

I have a small Rust program I've put up at https://github.com/treed/colmena-health

(@zhaofengli please let me know if you'd like me to change the name away from using the name colmena)

It is currently very simple, but can be driven from config directly in the nix expressions (via colmena eval) to check HTTP endpoints and DNS resolution. I have a TODO list to add at least ssh as a check type and a few other things.

My intention is to treat this as a prototype to work out how healthchecks might work, with an eye towards eventually using it as the basis for implementing healthchecks directly in Colmena if that's acceptable.

There's a README that should help guide towards usage. Let me know if you have thoughts.

treed commented 2 years ago

Also it's maybe worth mentioning that I haven't used Rust in a few years, and notably haven't really used async Rust ever.

So if anyone has any code-feedback, that would also be welcome.

treed commented 2 years ago

Added an ssh check type that uses the user's own ssh because I realized that trying to make use of .ssh/config and ssh-agents with thrussh was going to be a pain. This way there shouldn't be any surprises in terms of how the ssh works compared to any other ssh done by the user.

I'll probably let this bake for a bit and start trying to add healthchecks to my stuff to test it out further.

Aidan-Chelig commented 1 month ago

once we have health checks colmena will be the ultimate deployment solution imo