whole-tale / whole-tale

The Whole Tale - Merging Science and Cyberinfrastructure Pathway

Explore migration to Kubernetes for volumes #60

Open craig-willis opened 5 years ago

craig-willis commented 5 years ago

Problem:

Background: We are considering migrating from Docker Swarm to Kubernetes. Most existing capabilities will translate easily, but volume/storage management may be challenging.

Similar to Docker, Kubernetes has a standard storage driver abstraction with a variety of common volume types. Unfortunately, FUSE and DavFS do not appear to be readily supported.
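For illustration, here's a minimal (hypothetical) Pod spec showing the sort of volume types Kubernetes supports out of the box; there is no equivalent entry for a FUSE filesystem like girderfs (names and paths below are made up):

```yaml
# Hypothetical example: built-in Kubernetes volume types (no FUSE/DavFS analogue).
apiVersion: v1
kind: Pod
metadata:
  name: volume-types-example
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
        - name: node-dir
          mountPath: /node-dir
        - name: shared-nfs
          mountPath: /shared
  volumes:
    - name: scratch
      emptyDir: {}                    # ephemeral, per-Pod scratch space
    - name: node-dir
      hostPath:
        path: /var/data/example       # directory on the node (illustrative path)
    - name: shared-nfs
      nfs:
        server: nfs.example.org       # illustrative server
        path: /exports/data
```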

In WT today, we use a home-grown container volume management system. We could:

Other references:

Task constraints: This is an exploratory task to see what our options are -- time-boxed at ~3 weeks to determine whether it's worth pursuing. Please document findings as you go along.

craig-willis commented 5 years ago

This exploration, and much more, has been completed on the following branches:

The wt-kubernetes repo provides the following:

The gwvolman branch includes:

The primary goal of this issue was to provide a Kubernetes-native solution to the girderfs problem. Instead of the CSI/flexVolume approach discussed above, @hategan has proposed a much simpler sidecar container: https://github.com/whole-tale/gwvolman/blob/dev-kube-2.0/gwvolman/templates/tale-deployment.yaml#L48-L92. In short, for each Tale, a privileged sidecar container runs gwvolman to mount and unmount girderfs FUSE volumes via Pod lifecycle hooks.
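For readers who don't want to dig through the branch, here's a rough sketch of the pattern (not the actual tale-deployment.yaml; image names, commands, and paths are placeholders):

```yaml
# Sketch of the privileged "mounter" sidecar pattern; all details are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tale-example
spec:
  replicas: 1
  selector:
    matchLabels: {app: tale-example}
  template:
    metadata:
      labels: {app: tale-example}
    spec:
      containers:
        - name: tale                  # the Tale's own image
          image: example/tale-image
          volumeMounts:
            - name: mounts
              mountPath: /mnt/tale
              mountPropagation: HostToContainer    # sees FUSE mounts made by the sidecar
        - name: mounter               # privileged sidecar running gwvolman
          image: example/gwvolman
          securityContext:
            privileged: true          # needed to perform FUSE mounts
          volumeMounts:
            - name: mounts
              mountPath: /mnt/tale
              mountPropagation: Bidirectional      # propagate mounts back out of the sidecar
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh", "-c", "gwvolman-mount /mnt/tale"]    # placeholder command
            preStop:
              exec:
                command: ["/bin/sh", "-c", "gwvolman-unmount /mnt/tale"]  # placeholder command
      volumes:
        - name: mounts
          persistentVolumeClaim:
            claimName: tale-example-pvc            # placeholder claim name
```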

This solution seems great to me, and I think we can move forward with the formal transition from Swarm to Kubernetes. I've started putting together a design doc for comment:

https://docs.google.com/document/d/1WvlNF5wVDeaNvZSaEc022C1LZOtMnolnIo8ej7dNrno/edit

craig-willis commented 5 years ago

@hategan One minor question I have about your "mounter" container solution is whether you tried using an emptyDir volume instead of a real PersistentVolumeClaim. On the one hand, the PVC seems unnecessary -- why provision a volume just to mount? On the other hand, PVCs are problematic when we deploy on OpenStack, our primary platform. Dynamic volume provisioning is one of OpenStack's big weaknesses, and if we don't need a volume, things would be easier.
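For context, the two alternatives I'm comparing, in sketch form (just the `volumes:` stanza of the Pod spec; names are made up):

```yaml
# Option A: emptyDir -- nothing to provision, but node-local and ephemeral.
volumes:
  - name: tale-mounts
    emptyDir: {}
```

```yaml
# Option B: PersistentVolumeClaim -- requires a provisioned volume
# (e.g., Cinder on OpenStack), but is sized and persistent.
volumes:
  - name: tale-mounts
    persistentVolumeClaim:
      claimName: tale-example-pvc
```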

hategan commented 5 years ago

You need actual space to unpack the tale files (e.g., .ipynb, etc.) and for whatever temporary data might be generated by the tale. We could use emptyDir/hostPath for that, but that means we can't provision for particular space requirements (we'd be at the mercy of whatever space the node has, which is neither specified by Kubernetes nor controllable by the user, AFAIK). We also don't want this to be ephemeral storage, since we want to preserve changes to tale files if the pods are restarted/migrated. So, in my understanding, we need some kind of persistent volume for that.

Since the above already gives us a directory on the node, there is no point in creating another emptyDir/hostPath mount just for the FUSE mountpoints.

So it isn't so much that we need a PV for the FUSE stuff as that we need it for the tale; and since we have it, we might as well use it for the mountpoints.

craig-willis commented 5 years ago

Thanks, @hategan (/cc @Xarthisius). I must be misunderstanding something -- we mount the Tale workspace via girderfs, right? We shouldn't be downloading files outside of FUSE, and data generated by the Tale is stored in the FUSE-mounted workspace (unless they're writing somewhere else in the container, in which case it's still ephemeral).

hategan commented 5 years ago

@craig-willis not quite. One of the first steps that gwvolman takes is create_volume() (see tasks.py). This calls girder_client.downloadFolderRecursive() to get the tale files locally. So they are not part of a mount.

I don't know why this is done instead of mounting the folder (or maybe that's what ENABLE_WORKSPACES is about and the download is only the old way of doing things). I think @Xarthisius can provide some insight there. It may very well be that we could instead mount the tale folder.

However, we would still have a need to control the scratch space that tales get.

I would also like to mention that nothing here requires dynamic provisioning. While this is how it's done now, because it is easy, static provisioning can be done equally well with what I believe would be somewhat minor changes.
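For example, static provisioning would look roughly like this (a sketch with illustrative names and a hostPath-backed PV; the real storage backend would differ):

```yaml
# Admin pre-creates the PV; the Tale's PVC then binds to it (no dynamic provisioner involved).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tale-scratch-pv-01
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: tale-scratch          # matching class with no provisioner behind it
  hostPath:
    path: /var/wt/tale-scratch/01         # illustrative node directory
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tale-example-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: tale-scratch
  resources:
    requests:
      storage: 10Gi
```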

Xarthisius commented 5 years ago

> @craig-willis not quite. One of the first steps that gwvolman takes is create_volume() (see tasks.py). This calls girder_client.downloadFolderRecursive() to get the tale files locally. So they are not part of a mount.

Oh, that's the narrative: a concept that I very much like, but I'm the only one in the Universe who does. That code is dead, i.e., there are no "real" Tales that would ever set narrativeId to something non-null...

Bottom line: the directory we currently use for FUSE mount points is always empty.

hategan commented 5 years ago

@Xarthisius I see. So, to answer @craig-willis's original question: modulo scratch space, I think we can switch to emptyDir or hostPath.

Xarthisius commented 5 years ago

Yeah, and AFAIR the current setting for the girderfs cache is $TMPDIR/wtdms*, so it's stored in ephemeral space.

craig-willis commented 5 years ago

OK, this makes sense now. I also like the narrative concept but sadly never did anything with it...

The scratch space question is worth revisiting. We also need to think about this in the context of OpenStack (our primary platform) vs. GCE. We've never enforced quotas on tales, but it's something we'll eventually hit, and we'll always have an ephemeral-storage problem whether it's under /tmp, /var/lib/docker/, or /var/lib/kubelet.

Whether the storage quota is actually enforceable depends, I believe, on the underlying cloud infrastructure. Under OpenStack we don't want to allocate a Cinder volume per tale, since we'd quickly exhaust our volume quota. I don't know whether local volumes will enforce the storage capacity -- i.e., if I have 100GB of local scratch and give each tale 10GB via a local volume, will Kubernetes report an error when a tale exceeds its 10GB? For NFS I know this isn't the case.
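For concreteness, here's a sketch of the kind of local PV being discussed, with a declared 10Gi capacity (whether that capacity is actually enforced at write time is exactly the open question above; names and paths are illustrative):

```yaml
# Sketch of a 10GB local PV carved out of node-local scratch (illustrative names/paths).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tale-local-pv-01
spec:
  capacity:
    storage: 10Gi                        # declared capacity -- enforcement is the open question
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-scratch
  local:
    path: /mnt/scratch/tales/01          # pre-created directory on the node
  nodeAffinity:                          # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-1"]
```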