original-brownbear commented 8 years ago

Problem

We are currently trying to move to dynamic provisioning of Docker daemons for Rultor. This means that Rultor builds would stop sharing the same build cache, making builds potentially very slow, involving the download of the 1.5 GB Rultor image in many cases at present. Also we are starting to have the need for a private way of storing Docker images in some of our projects.

Solution

Provision an ECS instance with a private Docker registry backed by S3 and use it with Rultor. It should act as a passive cache/proxy in front of DockerHub as well as by used actively by the Rultor build

Concrete Implementation Required by this Issue

Registry

passive cache with dockerhub, see here on how to configure that: https://docs.docker.com/registry/mirror
Registry should only be available to Rultor via AWS private networking, not publicly exposed.
Rultor Build Runner
configure Docker daemon to use private registry
make Docker builds from a directory always pull from a repository in the private Docker registry before building, pushing the result to that same private repo after the build (we can simply use the GitHub repository name with a prefix in the private repo ). To ensure security here, we need to make sure that the here privately cached builds are not accessible by specifying them as image names in the .rultor.yml => this gives us a very smart cache, while allowing to still delete build form a directory right after the Rultor run and to save disk space.

original-brownbear commented 8 years ago

@yegor256 would you be on board with setting up a private Registry for us? I think it's an important step towards efficient dynamically provisioned Build runners and also would help us handle private images easier.

longtimeago commented 8 years ago

@original-brownbear :+1: sounds good!

original-brownbear commented 8 years ago

@alex-palevsky this is a bug.

original-brownbear commented 8 years ago

@alex-palevsky this is postponed.

alex-palevsky commented 8 years ago

@alex-palevsky this is a bug.

@original-brownbear I added bug tag to this ticket

alex-palevsky commented 8 years ago

@original-brownbear since there is no milestone yet I set it to "2.0"

alex-palevsky commented 8 years ago

@original-brownbear thanks for this report, I added 30 mins to your account, in transaction AP-0NX64527LY920661T

alex-palevsky commented 8 years ago

@alex-palevsky this is postponed.

@original-brownbear right, I added "postponed" label

alex-palevsky commented 8 years ago

@alex-palevsky this is postponed.

@original-brownbear someone else will help in this task, no problem at all

yegor256 commented 8 years ago

@original-brownbear hm... I'm not entirely sure I understand the concept here. my key questions: 1) why a new EC2 instance, why can't we use Docker Hub paid account? 2) why do we need private Docker images?

original-brownbear commented 8 years ago

@yegor256

1) why a new EC2 instance, why can't we use Docker Hub paid account?

Because if we use images, that are not build during the merge, then any change to the dependencies needs to go in two steps:

Merge changes for Dockerfile to master to trigger DockerHub update (while still building the old code that does not need the new dependencies)
Merge code change that needs that Dockerfile update

... this is a very risky process as we've seen in Rultor a few times (updating dependency versions, random Gemfile issues etc.), because you basically have to trust that the first step will work out. Right now I could add

exit 1;

to the Dockerfile in Rultor ( as part of CMD or ENTRYPOINT) and break the build for good until manual action is taken on master. Rultor cannot recover this situation. If on the other hand Rultor builds the Image and pushes it to a private registry we're good in that regard. We can never merge a broken Dockerfile then, if the building and pushing is part of the merge process. I understand this would be possible using Dockerhub too, but see last point ...

Why do we need private Docker images?

We simply have commercial projects that are now getting their own Dockerfiles. I don't think we want to expose those publicly.

Why Private Registry >> DockerHub

We get a fast shared cache between dynamically provisioned build runners, DockerHub pushing and pulling is way too slow and unstable for this.
- For on demand instances, using only DockerHub would mean every new instance of EC2/ECS starts from an empty build cache. This is a 1.5 GB download for every merge/deploy/release on Rultor for example.
- Not having this kind of cache would still leave us with having to keep on-demand instances running, still leaving the running out of diskspace issue to be resolved. If we can tear them down this issue goes away automatically. While Docker itself has tricky space handling, a Docker Repository has very strong capabilities in terms of cleaning up old images automatically
We don't have to setup DockerHub builds manually for all repos using their own Dockerfile
We don't need to grant DockerHub access to commercial projects on GitHub

Makes sense ?

yegor256 commented 8 years ago

@original-brownbear we can create AWS image, which will be used to create EC2 instances. That image will have a docker image pre-fetched. What about this?

original-brownbear commented 8 years ago

@yegor256 well this only solves/alleviates the issue for the Rultor image users. Also it requires us to use EC2 instead of ECS.

I think the clear downsides are still these:

Needs EC2 instead of ECS
- Slower boot
  - Just try and create a docker-machine host ... it's minutes in EC2 to boot this kind of thing, seconds on ECS
- Much more costly (load balancing this is really hard so we'll often pay for idle resources or have problems with RAM etc. still)
- Much harder to implement
- ECS also gives us a lot of room to simplify the architecture down the road (would go to far to list them here)
The AMI would hold a specific version of the Rultor image only!
- What about other projects with different images ?
- What would be the process when the Rultor image (or if we incorporate other images them too) changes? Will we manually rebuild the AMI, will this have to be scripted ? ( The former would be a waste of time, the latter would be super inefficient since we'd now be maintaining Docker images and the AMI in parallel)
Scripting this with some both moving parts dockerized allows proper testing:
- With putting all of this into EC2 and an AMI, we'll always only approximate when writing integration-tests and still have random risks from EC2.
- With the build runner being in ECS ( just a Docker in Docker container, explained in the issue on ECS) and the Docker Registry in ECS, we can simply create a 100% accurate integration test for random infrastructure issues (like the ones we're experiencing now)

=> I think my plan is far superior in the outcome + easier/safer to implement too since it can be accurately tested.

longtimeago commented 8 years ago

@original-brownbear @yegor256 I'm new to this type of question so may say something obvious but I have to say this. I took another look at this issue and want to rise one more question - security. If I understand correctly, adding possibility for repo owner (user) to specify the image to be used by Rultor to build the project implies that any user can execute almost any code in Rultor's docker environment. At least, this is a perfect way to DDoS the Rultor (Yes, I know about CPU, RAM, NET restrictions per) container. But at most, code executed in one container could affect other containers (sorry, can't find some proof link, but I've heard a lot on noise about container security, and none said it's safe to run arbitrary images on own platform). We should consider this issue carefully before doing something with it.

original-brownbear commented 8 years ago

@longtimeago Yea DDoS may be an issue, but really the same goes for Travis to a much larger extend. I think Rultor simply wouldn't be worth it (or even capable to really do any damage). I mean we wouldn't setup ECS to spawn off an unlimited number of builds :) (nor would Amazon give us that freedom in the first place).

About security between containers, past me took care of the situation a while back here #1008 :) This really is an issue of the past when it comes to Docker, so long as we don't give the container any privileges ( which we don't anymore :) ), we're good with that (excluding the possibility that someone knows some non-public exploit :P ).

longtimeago commented 8 years ago

@original-brownbear what about the possibility of building image and running container which starts email bots, bit-miners ... ?

UPD: Say it's possible right now :)

original-brownbear commented 8 years ago

@longtimeago well, in theory that's obviously possible :) But again ... You can in fact start a Docker instance in any Travis build too for example! I think an attacker would much rather do this, than use Rultor ? :) ... obviously not an argument :)

But I think Travis as well as Rultor are safeguarded by very limited resources and also GitHub here. The maximum runtime Rultor allows is 120minutes, and you only get one build/deploy per Github Repo simultaneously, so in order to exploit us here you'd have to do this for 120min of evil:

Create Github account
Create Repo
Add Rultor
Push your image
Run build

Then you also have to factor in, that we could simply set an alarm at a certain number of containers in Amazon and/or hard limit the maximum number of the at say 4 or 8 or so. Any attacker really wouldn't be getting much out of this.

email bots won't work => no way our IP would be accepted as sender by any email provider
Mining bitcoins ... in 2016 I'd be surprised if even a larger EC2 instance could mine more than $1 per day :)
DDoS ... well you may be able to kill some smallish webserver for a few minutes ... but that's about it

=> I really don't see what you could even accomplish with a malicious image. => Plus should this against all expectations turn out to be an issue, we could simply adjust the network settings and set a limit of like 50MB on the outbound traffic per build right ? :)

longtimeago commented 8 years ago

@original-brownbear Nice explanation, thanks! Actually, you've mentioned at least 2 preventive tasks to be done ;)

yegor256 commented 8 years ago

@original-brownbear shame on me, I haven't heard about EC2 Container Service (ECS) before. As I understand now (http://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html), it is just an EC2 instance with a Docker daemon installed there. All AWS is giving us is an ability to manage that containers via AWS API. I don't see any advantage of that, comparing to our own EC2 instance, which we have now and a plain simple SSH (which we do now). The only advantage AWS ECS is giving us is an ability to managed many EC2 instances through one entry point. Am I wrong?

original-brownbear commented 8 years ago

@yegor256

The only advantage AWS ECS is giving us is an ability to managed many EC2 instances through one entry point. Am I wrong?

Wrongish :), the thing isn't so much that ECS gives us the ability to manage multiple EC2 instances ( though this is nice too of cause ), but that it gives us a well designed scheduler for running containers. Instead of our current very naive and error prone ( look at some builds stability ) use of simply scheduling by CPU load on the EC2 instance, we'd get proper resource management out of the box. => waiting for CPU and RAM to become available, running task, queuing other tasks meanwhile and guaranteeing certain resources. (currently we simply guarantee a bunch of RAM via swapping and it's less than ideal, making something better ourselves though would be very tricky and hence expensive).

Please understand here too, that running the EC2 instances fully on-demand ( one instance per build) would be very very slow and also very expensive. Plus like all solutions revolving around just dynamic EC2 it requires maintaining an AMI.

If we keep an instance running and just reboot it when in trouble we gain nothing in terms of the build stability issues we have in some projects at the moment during peak hours. Also a bad solution.

So if we actually want to improve the stability issues, both from the EC2 outright dying as well as from load spikes we need scheduling. ECS simply gives us just that out of the box. You're given a flexible decision on how much money to spend on EC2 and unlike now the decision will only affect build time, not stability. Making something that keeps a dynamic or even static number of EC2 instances available to Rultor and then implementing the whole process of distributing builds among those instances would just be reinventing ECS.

Also in terms of hands on implementing this, I'm convinced ECS is our fastest route to proper on-demand provisioning: We don't need to change anything to the current SSH implementation (bad as it may be, it works for now). We can simply use ECS to dynamically provide us with Docker in Docker containers giving us the same environment we had before but with guaranteed resources per build (and allow this environment to be fully under Rultor's control, no randomness from some AMI). It simply decouples the whole EC2 cost and maintenance side from the side of simply running Rultor builds. We can set the ECS settings to whatever you see fit, Rultor will still always work and use the ECS api to schedule its builds.

Makes sense ?

yegor256 commented 8 years ago

@original-brownbear yes, it does make sense, thanks. OK, I'm in, let's use ECS

yegor256 / rultor

Let's Create a Private Registry for Rultor #1081

Problem

Solution

Concrete Implementation Required by this Issue

Registry

Rultor Build Runner