zalora / deployix

zalora nix expressions library
MIT License

"native" deployment impl #17

Closed shlevy closed 6 years ago

shlevy commented 9 years ago

Currently defnix is implemented by mapping to nixos configs; we should add an implementation of a defnixos-only machine/network.

shlevy commented 9 years ago

Once we're no longer dependent on nixos to set up user ids, we should be able to avoid the import-from-derivation to calculate uids (as there's no real need to have them available at evaluation time).
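
A minimal sketch of the difference, with hypothetical names, just to illustrate why the uid isn't needed at evaluation time:

# today (roughly): import-from-derivation forces a build during evaluation
# uid = import (pkgs.runCommand "gitolite-uid" {} "echo 499 > $out");
#
# without the NixOS dependency, a service definition only needs the user name;
# the actual uid can be allocated when the machine is activated:
{
  user = "gitolite";   # symbolic name is enough at evaluation time
}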

shlevy commented 9 years ago

An idea about the interface (still to be implemented on top of nixos for now):

Each repo describes a set of functionalities (thanks @rickynils for the term). A functionality consists of at least a pure service. Additionally, implementations can add extra fields to a functionality: for example, a multi-machine deployment tool might add a way to specify details about the machine each service runs on, or an implementation that puts everything on a private ipsec network might add a way to specify the upstream-hosts each service needs to connect to. Finally, all functionalities that should be deployed together should be tied together in one central place.

More concretely, for our use case I propose that each repo export a set of functionalities, and that we have a single repo tying everything together for all of our deployments. An example of what this might look like:

zalora-git.git:

{
  git = {
    service = gitolite-service;
    machine.location = locations.singapore;
    fqdn = "git.zalora.com";
  };
}

zalora-hydra.git:

{
  hydra-web = {
    service = hydra-web-service;
    machine.location = locations.singapore;
    upstream-hosts = [ "hydra-db.zalora.com" ];
    fqdn = "hydra.zalora.com";
    load-balancer.backends = 2;
    load-balancer.source-port = 443;
    load-balancer.target-port = 3000;
  };

  hydra-evaluator = {
    service = hydra-evaluator-service;
    machine.location = locations.singapore;
    upstream-hosts = [ "hydra-db.zalora.com" "git.zalora.com" ];
  };

  hydra-queue-runner = {
    service = hydra-queue-runner-service;
    machine = same-as "hydra-evaluator";
  };

  hydra-db = {
    service = postgresql-service;
    machine.location = locations.singapore;
    fqdn = "hydra-db.zalora.com";
  };
}

zalora-deployment.git:

deploy {
  inherit functionalities;
  ca = ./ca.crt;
  default-backend = backends.ec2 {
    default-instance-type = "m3.medium";
    services-per-instance = 5;
  };
}

Thoughts on this? Obviously at some point we'll want a way to specify that a given functionality is part of an autoscaling group or whatever. @proger @rickynils @soenkehahn

proger commented 9 years ago

I don't get the upstream-hosts thing here, could you elaborate?

  hydra-evaluator = {
    service = hydra-evaluator-service;
    machine.location = locations.singapore;
    upstream-hosts = [ "hydra-db.zalora.com" "git.zalora.com" ];
  };

proger commented 9 years ago

The services-per-instance thing is too nominal and generic — it essentially tries to add automated scheduling using random assignment (if we want automatic scheduling we should start using mesos). We really want either non-loaded services that are scattered across a pool of three epsilons (with a deployment model where you don't deploy the whole box but rather a single service), or autoscale groups, which usually have AMIs that run a single service.

shlevy commented 9 years ago

@proger It should probably be better named, but upstream-hosts is the set of hosts that this machine needs to be able to talk to. Depending on how each service is distributed, that will mean either a) they end up on the same machine, b) they end up on the same VPC, or c) they have an ipsec transport set up.
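
A hypothetical helper (the name mkTransportConfig is illustrative, not part of the proposal) showing how a backend might consume upstream-hosts for case (c):

mkTransportConfig = functionality:
  if functionality ? upstream-hosts
  then { ipsec.peers = functionality.upstream-hosts; }  # set up ipsec transports to those hosts
  else { };  # same machine / same VPC: no extra transport needed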

proger commented 9 years ago

Having said this, I don't think it's a good idea to couple services and their hosts declaratively at all — I'd rather push this functionality to a runtime service (think "mini-PaaS").

ip1981 commented 9 years ago

Do I understand it right that everything but deploy is backend-agnostic?

shlevy commented 9 years ago

@ip1981 That's the goal, but of course we may end up having to depend on backend-specific stuff

shlevy commented 9 years ago

@proger Sorry, I don't quite get your latest comment, can you explain a bit more?

shlevy commented 9 years ago

@proger Also your services-per-instance comment :)

proger commented 9 years ago

I'm trying to express the idea that we shouldn't try to couple software to its infrastructure — this way we're making the interface look like nixops inside out (service-centric rather than infrastructure-centric).

I suggest we attempt to completely separate the specs of services (or things the company pays developers to do) and infrastructure (or things the company rents from AWS/MS/Google), rather than having them tightly coupled (defnix/nixops style).

There are currently three kinds of deployments that we have to cover:

  1. huge (S)POFs (mysql) that take the whole infrastructure unit
  2. epsilon (many tiny services on a huge box, mostly for cost-saving purposes)
  3. (almost) infinitely scalable using autoscaling groups, like thumbor

For 1 and 3 what we already have works pretty much OK (assuming upcast already has autoscaling support); but for 2 the deployments should usually be performed by the developers themselves, because those services are quite lean, developed/owned by a single person, and may even be non-business-critical. If we apply your model, we are adding a third-party component that has to operate on the state of the whole infrastructure at once when performing deployments, which means somebody has to be around when the whole deployment fails — and that pretty much defeats the purpose of relinquishing control over the whole deployment at once (even if we delegate it to some automated process like a bot reading git commits and deploying).

shlevy commented 9 years ago

Hmm, still not sure I completely understand. Having everything tied together in one place doesn't require literally manipulating the whole infrastructure; the deployment tool can (and nixops currently sort-of does) only actually deploy the services that have changed.

Can you maybe give an example of what you think the specifications should look like?

proger commented 9 years ago

literally manipulating the whole infrastructure; the deployment tool can (and nixops currently sort-of does) only actually deploy the services that have changed

It currently has to know the state of the whole system (a statefile) and can only deploy if everything builds (if an unrelated service fails to evaluate, the whole deployment is screwed).

shlevy commented 9 years ago

Right, but those are implementation bugs, not interface ones.

shlevy commented 9 years ago

BTW while I'm definitely in favor of having devs control the workflow for epsilon-type services, I don't think it can be completely decoupled, for at least two reasons:

proger commented 9 years ago

The interface I would ideally like to see is something like this:

% cat infra.nix
epsilon1 = ec2 { location = sg1; size = XXL; ami = defnix-platform; };
epsilon2 = ec2 { location = sg2; size = XXXL; ami = defnix-platform; };
scheduler = ec2-autoscale { location = sg; ami = scheduler-v1.0b154; elb = "scheduler.zalora.com"; };
% cat app.nix
{ defnix }:
defnix.run-periodically {
  service = (import ./harvest-spice { database = defnix.runtime-lookup "com.harkonnen.db.resources.spice42"; });
  when = every-minute;
};
% defnix deploy app.nix
stderr: ... using your gpg credentials from agent /tmp/gpg123.sock
stderr: achievement unlocked: speedy gonzalez (third app.nix deployment in a minute)
stderr: .... your app doesn't seem to require http
stderr: .... reserving container space at scheduler.zalora.com
stderr: deployed: ssh to your app's environment using:
stderr: % ssh root@app953.epsilon4.zalora.com

Assuming the whole state-tracking and scheduling of resources (picking which epsilon to use) happens on the scheduler.

shlevy commented 9 years ago

Hmm, how is your app.nix fundamentally different from my zalora-git.git?

shlevy commented 9 years ago

Where does infra.nix live? How does the defnix tool interact with it?

proger commented 9 years ago

Hmm, how is your app.nix fundamentally different from my zalora-git.git?

It's not different at all (I hope), except that it does not mention machine.location anywhere; it only specifies the service, which you can evaluate into a nixos generic-services module, a regular package, a systemd unit, a yaml file for Heroku, or even a dockerfile.

infra.nix just specifies the infrastructure that we deploy using nixops or upcast and that contains the defnix-platform; all mapping between an app and its potential infra happens without involving nix expressions, using some scheduler app (a clone of something like heroku).

shlevy commented 9 years ago

OK, so do I understand correctly that your objection is to zalora-deployment.git and not to the rest? Can you sketch out in a bit more detail how the deployment process would work given those individual functionality repos?

shlevy commented 9 years ago

Ah, you updated your comment, so let me respond to that: generally devs ask for a specific location, and often specific specs, so I thought I'd include that on the functionality side of things, but there's no reason that setting can't be optional. Do you agree that same-as (like I used for hydra-queue-runner) has to exist?

shlevy commented 9 years ago

Actually, same-as should be replaced by a service that takes a set of services and runs them all as one.
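
As a rough sketch (combine-services is a hypothetical combinator, names illustrative only), the two hydra functionalities above could then become:

  hydra-evaluator-and-queue-runner = {
    service = combine-services [ hydra-evaluator-service hydra-queue-runner-service ];
    machine.location = locations.singapore;
    upstream-hosts = [ "hydra-db.zalora.com" "git.zalora.com" ];
  };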

proger commented 9 years ago

Actually, same-as should be replaced by a service that takes a set of services and runs them all as one

Agreed on this one; there is no reason to overcomplicate the system at this point.

shlevy commented 9 years ago

OK, on reflection I think I agree with your point more. For reproducibility's sake, I'd like a way to take a state file and redeploy an identical VM/instance/docker container/etc. That means if we're abstracting stuff like epsilon through a heroku-like interface we need to have a way to reference the state of the full machine setup at the time of deployment. Thoughts?

shlevy commented 9 years ago

Also, can you clarify the services-per-instance comment? What should load-balancing/autoscaling look like?

proger commented 9 years ago

My idea being: let's just keep the LB configuration a part of the nixops-like configuration (i.e. decoupled from defnix services and untouched from the original implementation) and focus on the definition/deployment of standalone services. Same with autoscaling — AWS (or just requires

proger commented 9 years ago

interface we need to have a way to reference the state of the full machine setup at the time of deployment. Thoughts?

Why? It's not really what clouds (imo) are about, the app should simply discover it

proger commented 9 years ago

(i.e. at runtime)

shlevy commented 9 years ago

I think in the normal course of things, it shouldn't be used, and it will never be set manually by the user, but at some point it may become important to e.g. "clone epsilon as it was on 10/31/2014 to investigate this bug" and I'd like that to be possible.

shlevy commented 9 years ago

@proger Hm, so I define the hydra-web functionality in the hydra repo. Where and when do I specify "hey, I want two of these behind an LB"?

proger commented 9 years ago

I suggest, in the configuration of the LB :) Naturally, it is like you do with anything else — you can either have a stub defnix service that does haproxy, go to console.aws.amazon.com and create it yourself, provision it separately with upcast, etc.

proger commented 9 years ago

Or deploy it 10 times and assign those hydras a single DNS name (in some separate configuration, so it contains all 10 addresses altogether)

shlevy commented 9 years ago

So infra.nix contains an lb specifier, and at deploy time I say "use this LB"?

proger commented 9 years ago

So infra.nix contains an lb specifier, and at deploy time I say "use this LB"?

This type of configuration is still inside-out: you "physically" can't "use" the LB, since it's rather the LB (in something like infra.nix, or rather in a separate configuration like my-own-app-lbs.nix) that is configured/created to balance across the things with known dns names; or, in case the dns names aren't known in advance, there is a known function that can perform a lookup in some state file or a state http service.
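
A rough sketch of what such a separate LB configuration could look like (the attribute names here are only illustrative, following the my-own-app-lbs.nix idea above):

{
  hydra-lb = {
    dns-name = "hydra.zalora.com";
    backends = [ "hydra-web-1.zalora.com" "hydra-web-2.zalora.com" ];  # dns names known in advance
    # or, when the names aren't known in advance, a hypothetical lookup:
    # backends = lookup-backends "hydra-web";
  };
}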

shlevy commented 9 years ago

So to deploy a new instance into the LB, I first deploy the instance, then modify infra.nix to include it? Unless we have some way for instances to register themselves, in which case where should I specify which LB the instance should register itself to?

proger commented 9 years ago

So to deploy a new instance into the LB, I first deploy the instance, then modify infra.nix to include it?

Yes, this looks like the simplest plan. You likely wouldn't want to do anything smarter at first, since including the instance in the LB is the point at which things start affecting production, and you want that to be manually managed at first (looking at our primary use case now).
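
Following the LB sketch above (still with illustrative names), adding the new instance would just mean editing the backends list by hand:

  backends = [
    "hydra-web-1.zalora.com"
    "hydra-web-2.zalora.com"
    "hydra-web-3.zalora.com"  # the newly deployed instance, added manually
  ];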

shlevy commented 9 years ago

OK, that sounds good, I'll go with this for now.

ip1981 commented 9 years ago

Why? It's not really what clouds (imo) are about, the app should simply discover it

Who said zeroconf? :-)

proger commented 9 years ago

Zeroconf that works :)

@shlevy i'll work on elb updates in upcast then

shlevy commented 9 years ago

There is now a nixops-based impl (using NixOS under the hood, of course)