yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

Support use of K8S LoadBalancer type service for master traffic #4635

Open muff1nman opened 4 years ago

muff1nman commented 4 years ago

The multi-K8S example deployment on GKE relies on direct cross-pod traffic between clusters. This is atypical/nonstandard behavior for multiple K8S clusters. I would rather use a deployment with LoadBalancer type services to assign dedicated IPs to the endpoints that need to be exposed across clusters.

I have tried this with the following modified manifest (adjusted accordingly for each cluster):

---
# Source: yugabyte/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: "yb-masters"
  labels:
    app: "yb-master"
    heritage: "Helm"
    release: "blue"
    chart: "yugabyte"
    component: "yugabytedb"
spec:
  type: LoadBalancer
  loadBalancerIP: 172.16.0.34
  ports:
    - name: "rpc-port"
      port: 7100
    - name: "ui"
      port: 7000
  selector:
    app: "yb-master"
---
# Source: yugabyte/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: "yb-tservers"
  labels:
    app: "yb-tserver"
    heritage: "Helm"
    release: "blue"
    chart: "yugabyte"
    component: "yugabytedb"
spec:
  clusterIP: None
  ports:
    - name: "rpc-port"
      port: 7100
    - name: "ui"
      port: 9000
    - name: "yedis-port"
      port: 6379
    - name: "yql-port"
      port: 9042
    - name: "ysql-port"
      port: 5433
  selector:
    app: "yb-tserver"
---
# Source: yugabyte/templates/service.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: "yb-master"
  namespace: "yugabytedb"
  labels:
    app: "yb-master"
    heritage: "Helm"
    release: "blue"
    chart: "yugabyte"
    component: "yugabytedb"
spec:
  serviceName: "yb-masters"
  podManagementPolicy: Parallel

  replicas: 1

  volumeClaimTemplates:
    - metadata:
        name: datadir0
        annotations:
          volume.beta.kubernetes.io/storage-class: rook-ceph-block
        labels:
          heritage: "Helm"
          release: "blue"
          chart: "yugabyte"
          component: "yugabytedb"
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
    - metadata:
        name: datadir1
        annotations:
          volume.beta.kubernetes.io/storage-class: rook-ceph-block
        labels:
          heritage: "Helm"
          release: "blue"
          chart: "yugabyte"
          component: "yugabytedb"
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:

      partition: 0

  selector:
    matchLabels:
      app: "yb-master"
  template:
    metadata:

      labels:
        app: "yb-master"
        heritage: "Helm"
        release: "blue"
        chart: "yugabyte"
        component: "yugabytedb"
    spec:
      affinity:
        # Set the anti-affinity selector scope to YB masters.

        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - "yb-master"
              topologyKey: kubernetes.io/hostname
      containers:
      - name: "yb-master"
        image: "yugabytedb/yugabyte:2.1.6.0-b17"
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command:
                - "sh"
                - "-c"
                - >
                  mkdir -p /mnt/disk0/cores;
                  mkdir -p /mnt/disk0/yb-data/scripts;
                  if [ ! -f /mnt/disk0/yb-data/scripts/log_cleanup.sh ]; then
                    if [ -f /home/yugabyte/bin/log_cleanup.sh ]; then
                      cp /home/yugabyte/bin/log_cleanup.sh /mnt/disk0/yb-data/scripts;
                    fi;
                  fi
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        resources:

          limits:
            cpu: 2
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 1Gi

        command:

          - "/home/yugabyte/bin/yb-master"

          - "--fs_data_dirs=/mnt/disk0,/mnt/disk1"

          - "--server_broadcast_addresses=yb-master-blue.example.com:7100"

          - "--master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100"
          - "--replication_factor=3"

          - "--enable_ysql=true"
          - "--rpc_bind_addresses=0.0.0.0:7100"
          - "--metric_node_name=$(HOSTNAME)"
          - "--memory_limit_hard_bytes=1824522240"
          - "--stderrthreshold=0"
          - "--num_cpus=2"
          - "--undefok=num_cpus,enable_ysql"
          - "--default_memory_limit_to_ram_ratio=0.85"
          - "--leader_failure_max_missed_heartbeat_periods=10"
          - "--placement_cloud=AAAA"
          - "--placement_region=YYYY"
          - "--placement_zone=YYYY"

        ports:
          - containerPort: 7100
            name: "rpc-port"
          - containerPort: 7000
            name: "ui"
        volumeMounts:

          - name: datadir0
            mountPath: /mnt/disk0
          - name: datadir1
            mountPath: /mnt/disk1

      - name: yb-cleanup
        image: busybox:1.31
        env:
        - name: USER
          value: "yugabyte"
        command:
          - "/bin/sh"
          - "-c"
          - >
            mkdir /var/spool/cron;
            mkdir /var/spool/cron/crontabs;
            echo "0 * * * * /home/yugabyte/scripts/log_cleanup.sh" | tee -a /var/spool/cron/crontabs/root;
            crond;
            while true; do
              sleep 86400;
            done
        volumeMounts:
          - name: datadir0
            mountPath: /home/yugabyte/
            subPath: yb-data

      volumes:

        - name: datadir0
          hostPath:
            path: /mnt/disks/ssd0
        - name: datadir1
          hostPath:
            path: /mnt/disks/ssd1
---
# Source: yugabyte/templates/service.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: "yb-tserver"
  namespace: "yugabytedb"
  labels:
    app: "yb-tserver"
    heritage: "Helm"
    release: "blue"
    chart: "yugabyte"
    component: "yugabytedb"
spec:
  serviceName: "yb-tservers"
  podManagementPolicy: Parallel

  replicas: 1

  volumeClaimTemplates:
    - metadata:
        name: datadir0
        annotations:
          volume.beta.kubernetes.io/storage-class: rook-ceph-block
        labels:
          heritage: "Helm"
          release: "blue"
          chart: "yugabyte"
          component: "yugabytedb"
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
    - metadata:
        name: datadir1
        annotations:
          volume.beta.kubernetes.io/storage-class: rook-ceph-block
        labels:
          heritage: "Helm"
          release: "blue"
          chart: "yugabyte"
          component: "yugabytedb"
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:

      partition: 0

  selector:
    matchLabels:
      app: "yb-tserver"
  template:
    metadata:

      labels:
        app: "yb-tserver"
        heritage: "Helm"
        release: "blue"
        chart: "yugabyte"
        component: "yugabytedb"
    spec:
      affinity:
        # Set the anti-affinity selector scope to YB tservers.

        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - "yb-tserver"
              topologyKey: kubernetes.io/hostname
      containers:
      - name: "yb-tserver"
        image: "yugabytedb/yugabyte:2.1.6.0-b17"
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command:
                - "sh"
                - "-c"
                - >
                  mkdir -p /mnt/disk0/cores;
                  mkdir -p /mnt/disk0/yb-data/scripts;
                  if [ ! -f /mnt/disk0/yb-data/scripts/log_cleanup.sh ]; then
                    if [ -f /home/yugabyte/bin/log_cleanup.sh ]; then
                      cp /home/yugabyte/bin/log_cleanup.sh /mnt/disk0/yb-data/scripts;
                    fi;
                  fi
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        resources:

          limits:
            cpu: 2
            memory: 4Gi
          requests:
            cpu: 500m
            memory: 2Gi

        command:

          - "/home/yugabyte/bin/yb-tserver"
          - "--fs_data_dirs=/mnt/disk0,/mnt/disk1"
          - "--server_broadcast_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local:9100"
          - "--rpc_bind_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local"
          - "--cql_proxy_bind_address=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local"

          - "--enable_ysql=true"
          - "--pgsql_proxy_bind_address=$(POD_IP):5433"

          - "--tserver_master_addrs=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100"

          - "--metric_node_name=$(HOSTNAME)"
          - "--memory_limit_hard_bytes=3649044480"
          - "--stderrthreshold=0"
          - "--num_cpus=2"
          - "--undefok=num_cpus,enable_ysql"
          - "--leader_failure_max_missed_heartbeat_periods=10"
          - "--placement_cloud=AAAA"
          - "--placement_region=YYYY"
          - "--placement_zone=YYYY"
          - "--use_cassandra_authentication=false"

        ports:
          - containerPort: 7100
            name: "rpc-port"
          - containerPort: 9000
            name: "ui"
          - containerPort: 6379
            name: "yedis-port"
          - containerPort: 9042
            name: "yql-port"
          - containerPort: 5433
            name: "ysql-port"
        volumeMounts:

          - name: datadir0
            mountPath: /mnt/disk0
          - name: datadir1
            mountPath: /mnt/disk1

      - name: yb-cleanup
        image: busybox:1.31
        env:
        - name: USER
          value: "yugabyte"
        command:
          - "/bin/sh"
          - "-c"
          - >
            mkdir /var/spool/cron;
            mkdir /var/spool/cron/crontabs;
            echo "0 * * * * /home/yugabyte/scripts/log_cleanup.sh" | tee -a /var/spool/cron/crontabs/root;
            crond;
            while true; do
              sleep 86400;
            done
        volumeMounts:
          - name: datadir0
            mountPath: /home/yugabyte/
            subPath: yb-data

      volumes:

        - name: datadir0
          hostPath:
            path: /mnt/disks/ssd0
        - name: datadir1
          hostPath:
            path: /mnt/disks/ssd1

However, the master is unable to start up; here is a snippet from the logs:

I0601 17:10:41.096400    49 async_initializer.cc:74] Starting to init ybclient
I0601 17:10:41.097048    39 service_pool.cc:148] yb.master.MasterBackupService: yb::rpc::ServicePoolImpl created at 0x1592b40
W0601 17:10:41.097179    49 master.cc:186] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:6854): Node c70fb38b07564744846ec947dd8a846b peer not initialized.
I0601 17:10:41.097867    49 client-internal.cc:1847] New master addresses: [yb-master-black.example.com:7100,yb-master-blue.example.com:7100,yb-master-white.example.com:7100]
I0601 17:10:41.098279    39 service_pool.cc:148] yb.master.MasterService: yb::rpc::ServicePoolImpl created at 0x1592fc0
I0601 17:10:41.098922    39 service_pool.cc:148] yb.tserver.TabletServerService: yb::rpc::ServicePoolImpl created at 0x1593200
I0601 17:10:41.099171    39 thread_pool.cc:166] Starting thread pool { name: Master-high-pri queue_limit: 10000 max_workers: 1024 }
I0601 17:10:41.099233    39 service_pool.cc:148] yb.consensus.ConsensusService: yb::rpc::ServicePoolImpl created at 0x1593d40
I0601 17:10:41.099557    39 service_pool.cc:148] yb.tserver.RemoteBootstrapService: yb::rpc::ServicePoolImpl created at 0x1bd4000
I0601 17:10:41.099700    39 webserver.cc:148] Starting webserver on 0.0.0.0:7000
I0601 17:10:41.099741    39 webserver.cc:153] Document root: /home/yugabyte/www
I0601 17:10:41.100805    39 webserver.cc:240] Webserver started. Bound to: http://0.0.0.0:7000/
I0601 17:10:41.101158    39 service_pool.cc:148] yb.server.GenericService: yb::rpc::ServicePoolImpl created at 0x1bd4240
I0601 17:10:41.101905    39 rpc_server.cc:169] RPC server started. Bound to: 0.0.0.0:7100
E0601 17:10:41.104785    55 master.cc:266] Master@0.0.0.0:7100: Unable to init master catalog manager: Illegal state (yb/master/catalog_manager.cc:1273): Unable to initialize catalog manager: Failed to initialize sys tables async: None of the local addresses are present in master_addresses yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100.

Looking at the code (https://github.com/yugabyte/yugabyte-db/blob/e4f4e6f4f77d4507b3ec53c5b51cd4886bb7206b/src/yb/master/catalog_manager.cc#L1244-L1274), it looks like this is because the RPC address isn't in the list of resolved master addresses. However, this is expected for a LoadBalancer-type setup: the master addresses resolve to the LoadBalancer IP, while the container cannot listen on that IP address (hence why I've used 0.0.0.0:7100).

Can this check be optionally skipped? Will it impact logic further down the line?

schoudhury commented 4 years ago

@muff1nman Thanks for bringing your need to our attention. We have been discussing it internally but haven't found a solution yet -- every pod being able to discover every other pod in the same StatefulSet directly is how Kubernetes works, and that is exactly how we have designed YugabyteDB to work. Given that there is no global DNS (that can handle such pod-to-pod discovery) across multiple Kubernetes clusters, we had found that our documented approach was the simplest way to get such a global DNS.

Couple of questions:

  1. Are you running on GKE/Google Cloud or your own private data center?
  2. Would a VM based deployment be an option so that we can brainstorm a solution w/o Kubernetes as an additional layer of complexity?

cc @rkarthik007 @iSignal

rkarthik007 commented 4 years ago

Hi @muff1nman,

The two key requirements for running YugabyteDB across multiple k8s clusters are:

A few questions:

I would rather use a deployment with LoadBalancer type services to assign dedicated ips to endpoints that need to be exposed across clusters.

YugabyteDB expects almost all nodes (pods in the case of k8s) to be able to communicate with one another using IP addresses (over TCP). This is true for both master and tserver pods. Typically, we have observed that many deployments are not able to stand up that many load balancers (one LB per pod, exposing each pod in the cluster to the others). Is this something that you would be able to achieve in your environment?
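
For reference, exposing an individual StatefulSet pod through its own load balancer typically means one Service per pod, selecting on the statefulset.kubernetes.io/pod-name label that the StatefulSet controller adds automatically. A minimal sketch (the Service name and IP are illustrative, not taken from your manifest):

apiVersion: v1
kind: Service
metadata:
  name: "yb-master-0-lb"            # illustrative per-pod Service name
  namespace: "yugabytedb"
spec:
  type: LoadBalancer
  loadBalancerIP: 172.16.0.35       # illustrative pre-assigned address
  selector:
    # label added automatically by the StatefulSet controller to each pod
    statefulset.kubernetes.io/pod-name: "yb-master-0"
  ports:
    - name: "rpc-port"
      port: 7100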

Also, are you thinking about using the node port for this?

muff1nman commented 4 years ago

Are you running on GKE/Google Cloud or your own private data center?

Private datacenter on bare metal.

Would a VM based deployment be an option so that we can brainstorm a solution w/o Kubernetes as an additional layer of complexity?

VM deployments are not an option.

YugabyteDB expects almost all nodes (pods in the case of k8s) to be able to communicate with one another using IP addresses (over TCP). This is true for both master and tserver pods. Typically, we have observed many deployments are not able to stand up as many load balancers (one LB exposing each pod in the cluster to each other). Is this something that you would be able to achieve in your environment?

The number of load balancer IPs is not a limitation. For a minimal install, I figured at least three IPs were needed for the masters, and then potentially another three for the tservers.

Also, are you thinking about using the node port for this?

Node ports are not an option.

rkarthik007 commented 4 years ago

The amount of load balancer ips is not a limitation. For a minimal install, I figured at least three ips were needed for the masters, and then potentially another three for the tservers.

Got it, this makes sense and should work even with the current code. Could you please try using server_broadcast_addresses with both yb-master and yb-tserver?

The server_broadcast_addresses parameter is used to specify the public IP or DNS hostname of the server.

In the case of running inside k8s, this is the load balancer address. An example is shown in this blog post on using DNS names for communication between nodes/pods. The reference in docs is here.

Note that this requires you to know the LB IP address (or a DNS name that resolves to it) ahead of time, to enable stable identities for the various masters. Does this work?
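
In the setup from the manifests above, that would mean pre-assigning one load balancer address per cluster and pointing the external DNS names at them, roughly along these lines (only 172.16.0.34 appears in the posted manifest; the other two addresses are illustrative):

# illustrative external DNS records, one per k8s cluster
yb-master-black.example.com: 172.16.0.33   # assumed
yb-master-blue.example.com: 172.16.0.34    # loadBalancerIP from the posted manifest
yb-master-white.example.com: 172.16.0.35   # assumed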

muff1nman commented 4 years ago

Could you please try using server_broadcast_addresses with both yb-master and yb-tserver?

That is what I tried doing, as seen in the manifests above. The addresses resolved to the corresponding LB IPs. It does not work due to the aforementioned check that one of the local addresses appears in the resolved master addresses (which will not happen with LoadBalancer Services).

rkarthik007 commented 4 years ago

Got it, thanks for patiently working through this with me @muff1nman - I do see now that you were a bunch of steps ahead in your question :)

YB-Master

In this snippet you posted above:

          - "--server_broadcast_addresses=yb-master-blue.example.com:7100"
          - "--rpc_bind_addresses=0.0.0.0:7100"
          - "--master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100"

This is the right thing to do, because the rpc_bind_addresses parameter needs to resolve to a local network interface that the process can bind to and listen on for RPCs. If you want it to bind to a particular interface instead, you would need something like:

          - "--server_broadcast_addresses=yb-master-blue.example.com:7100"
          - "--rpc_bind_addresses=$(POD_IP):7100"
          - "--master_addresses=yb-master-black.example.com:7100, yb-master-blue.example.com:7100, yb-master-white.example.com:7100"

YB-TServer

          - "--server_broadcast_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local:9100"
          - "--rpc_bind_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local"
          - "--cql_proxy_bind_address=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local"

          - "--enable_ysql=true"
          - "--pgsql_proxy_bind_address=$(POD_IP):5433"

I believe that while the server_broadcast_addresses is specified correctly, the rpc_bind_addresses and cql_proxy_bind_address would need to be either 0.0.0.0:9100 or $(POD_IP):9100 depending on what you are trying to achieve.
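
For example, something along these lines for the $(POD_IP) variant (not tested, just illustrating the shape; 9100 is the tserver RPC port and 9042 the default CQL port):

          - "--server_broadcast_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local:9100"
          # bind RPCs to the pod IP rather than the (externally resolving) broadcast name
          - "--rpc_bind_addresses=$(POD_IP):9100"
          # bind the CQL proxy to the pod IP as well
          - "--cql_proxy_bind_address=$(POD_IP):9042"
          - "--enable_ysql=true"
          - "--pgsql_proxy_bind_address=$(POD_IP):5433"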

Please let me know if that works for you. Also, I do think this needs to be documented better. Could you please open a github issue for that?

cc @iSignal @bmatican

muff1nman commented 4 years ago

Yes, I imagine the tserver manifest needs some work; however, I couldn't even get the master pods to stay up and establish a quorum. I'm not very familiar with the architecture yet, but I imagine the masters being in quorum is a prerequisite for the tservers to function?

iSignal commented 4 years ago

@muff1nman : yes, we should focus on getting the master quorum to work before working on tservers.

Building on what Karthik suggested above, I would suggest the following command line flags for the master pods. Please let me know how this works:

  1. Set --rpc_bind_addresses=$(POD_NAME).yb-masters.$(NAMESPACE).svc.cluster.local:7100 (this is the default setting in our helm charts and yaml files; see https://raw.githubusercontent.com/yugabyte/yugabyte-db/master/cloud/kubernetes/yugabyte-statefulset.yaml). You can also use POD_IP - we typically don't because it is not stable across pod movements/restarts.

  2. Set --server_broadcast_addresses=load_balancer_ip:7100 (or DNS that resolves to it)

  3. Set --use_private_ip=zone. This assumes that your master pods are in multiple zones.

  4. Set --master_addresses="{private1:7100,public1:7100},{private2:7100,public2:7100},..." where private1 is the same as the rpc_bind_addresses value from step (1) and public1 is the load balancer address specified in step (2). (See the sketch after this list.)

  5. Make sure you delete PVCs associated with any older runs each time you try different values for these parameters.
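
Putting steps (1) through (4) together, a hedged sketch of what the yb-master command for the "blue" cluster might look like, reusing the example.com load balancer names from the manifests above (POD_NAME is the env var used in the default yaml linked in step (1); your manifest calls it HOSTNAME). The privateN entries are placeholders for each master's own rpc_bind_addresses value, not verified settings:

          - "/home/yugabyte/bin/yb-master"
          - "--fs_data_dirs=/mnt/disk0,/mnt/disk1"
          # (1) bind to the pod's stable in-cluster DNS name
          - "--rpc_bind_addresses=$(POD_NAME).yb-masters.$(NAMESPACE).svc.cluster.local:7100"
          # (2) advertise this cluster's load balancer address
          - "--server_broadcast_addresses=yb-master-blue.example.com:7100"
          # (3) masters in the same zone talk over their private addresses
          - "--use_private_ip=zone"
          # (4) each peer listed as {private,public}; privateN stands for that master's rpc_bind_addresses
          - "--master_addresses={private1:7100,yb-master-black.example.com:7100},{private2:7100,yb-master-blue.example.com:7100},{private3:7100,yb-master-white.example.com:7100}"
          - "--replication_factor=3"

And per step (5), delete the PVCs left over from earlier attempts before each retry.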

schoudhury commented 4 years ago

@muff1nman were you able to try the changes @iSignal suggested in the previous comment? Would be great if you could let us know if you are still blocked.