zalando / zalenium

A flexible and scalable container based Selenium Grid with video recording, live preview, basic auth & dashboard.
https://opensource.zalando.com/zalenium/

Parallel Zalenium in Kubernetes in AWS #953

Closed kubiksamek closed 5 years ago

kubiksamek commented 5 years ago

🐛 Bug Report

Hello, I am struggling to run parallel Zalenium in Kubernetes.

Our E2E testing stack: Ruby 2.5.1 with the Watir framework (version 6.16.0, http://watir.com/)

We want to run E2E tests in Kubernetes on AWS with autoscaling support. Our goal is to run 15 parallel executions (up to 5 browsers per test run).

We get errors like:

Can't open a browser in a node:

Error during browser launch, Trying to open browser again: unexpected response, code=502, content-type="text/html"
Error during browser launch, Trying to open browser again: Error forwarding the new session Error forwarding the request Failed to connect to /100.112.166.7:40000 (org.openqa.grid.common.exception.GridException)
Error during browser launch, Trying to open browser again: Net::ReadTimeout

Can't connect to a node:

cannot forward the request Failed to connect to /100.110.111.151:40000 (org.openqa.grid.common.exception.GridException) (Selenium::WebDriver::Error::UnknownError)
cannot forward the request unexpected end of stream on Connection{100.113.109.19:40000, proxy=DIRECT hostAddress=/100.113.109.19:40000 cipherSuite=none protocol=http/1.1} (org.openqa.grid.common.exception.GridException) (Selenium::WebDriver::Error::UnknownError)
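
For context, the "Error during browser launch, Trying to open browser again" lines come from the retry logic in our test suite. A minimal sketch of that wrapper (the helper name and retry limit here are illustrative, not our exact code):

require 'watir'

# Illustrative retry wrapper; the helper name and retry limit are hypothetical.
def open_browser_with_retry(hub_url, http_client, attempts: 3)
  attempt = 0
  begin
    attempt += 1
    Watir::Browser.new :remote, url: hub_url, http_client: http_client
  rescue Selenium::WebDriver::Error::UnknownError, Net::ReadTimeout => e
    warn "Error during browser launch, Trying to open browser again: #{e.message}"
    retry if attempt < attempts
    raise
  end
end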

zalenium.yaml:

---
# Source: zalenium/templates/service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ template "selenium.hub.fullname" . }}
  labels:
    role: grid
    app: zalenium
    release: zalenium
    app.kubernetes.io/name: {{ template "selenium.hub.fullname" . }}
    app.kubernetes.io/instance: {{ template "selenium.hub.fullname" . }}
---
# Source: zalenium/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ template "selenium.hub.fullname" . }}
  labels:
    role: grid
    app: zalenium
    release: zalenium
    app.kubernetes.io/name: {{ template "selenium.hub.fullname" . }}
    app.kubernetes.io/instance: {{ template "selenium.hub.fullname" . }}
spec:
  type: "NodePort"
  sessionAffinity: "None"
  ports:
    - name: {{ template "selenium.hub.fullname" . }}
      port: 4444
      targetPort: 4444
  selector:
    app.kubernetes.io/name: {{ template "selenium.hub.fullname" . }}
    app.kubernetes.io/instance: {{ template "selenium.hub.fullname" . }}
---
# Source: zalenium/templates/deployment.yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  name: {{ template "selenium.hub.fullname" . }}
  labels:
    role: grid
    app: zalenium
    release: zalenium
    app.kubernetes.io/name: {{ template "selenium.hub.fullname" . }}
    app.kubernetes.io/instance: {{ template "selenium.hub.fullname" . }}
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ template "selenium.hub.fullname" . }}
      app.kubernetes.io/instance: {{ template "selenium.hub.fullname" . }}
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
      labels:
        role: grid
        app: zalenium
        release: zalenium
        app.kubernetes.io/name: {{ template "selenium.hub.fullname" . }}
        app.kubernetes.io/instance: {{ template "selenium.hub.fullname" . }}
    spec:
      containers:
        - name: zalenium
          image: "dosel/zalenium:3.141.59j"
          imagePullPolicy: Always
          ports:
            - containerPort: 4444
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /grid/console
              port: 4444
            initialDelaySeconds: 90
            periodSeconds: 5
            timeoutSeconds: 1
          readinessProbe:
            httpGet:
              path: /grid/console
              port: 4444
            timeoutSeconds: 1
          env:
            - name: ZALENIUM_KUBERNETES_CPU_REQUEST
              value: "800m"
            - name: ZALENIUM_KUBERNETES_MEMORY_REQUEST
              value: "1200Mi"
            - name: DESIRED_CONTAINERS
              value: "2"
            - name: MAX_DOCKER_SELENIUM_CONTAINERS
              value: "6"
            - name: SELENIUM_IMAGE_NAME
              value: "520314695264.dkr.ecr.us-east-1.amazonaws.com/collab/node-zalenium-with-data:3.14.0-p22"
            - name: VIDEO_RECORDING_ENABLED
              value: "false"
            - name: SCREEN_WIDTH
              value: "1920"
            - name: SCREEN_HEIGHT
              value: "1200"
            - name: MAX_TEST_SESSIONS
              value: "1"
            - name: NEW_SESSION_WAIT_TIMEOUT
              value: "1200000"
            - name: SEL_BROWSER_TIMEOUT_SECS
              value: "1200"
            - name: DEBUG_ENABLED
              value: "false"
            - name: SEND_ANONYMOUS_USAGE_INFO
              value: "false"
            - name: CHECK_CONTAINERS_INTERVAL
              value: "3000"
            - name: TZ
              value: "UTC"
            - name: KEEP_ONLY_FAILED_TESTS
              value: "false"
            - name: RETENTION_PERIOD
              value: "0"
            - name: CONTEXT_PATH
              value: "/"
          args:
            - start
          resources:
            requests:
              cpu: 200m
              memory: 300Mi
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
            - name: {{ template "storage.fullname" . }}
              mountPath: /tmp/shared
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: {{ template "storage.fullname" . }}
          persistentVolumeClaim:
            claimName: {{ template "storage.fullname" . }}
      serviceAccountName: {{ template "selenium.hub.fullname" . }}
---

As you can see, we are using a non-standard browser image, 520314695264.dkr.ecr.us-east-1.amazonaws.com/collab/node-zalenium-with-data:3.14.0-p22. This image is only modified with our testing data (audio/video for a fake webcam). If you want to reproduce this yourself, you can use the original elgalu/selenium:3.14.0-p22 instead.

Tests are executed from an executor pod:

apiVersion: v1
kind: Pod
metadata:
  name: {{ template "executor.fullname" . }}
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  labels:
    app: {{ template "executor.fullname" . }}
    release: {{ .Release.Name }}
    e2eRole: "executor"
spec:
  containers:
    - name: tests
      image: {{ .Values.executor.image }}:{{ .Values.executor.imageTag }}
      imagePullPolicy: {{ .Values.executor.imagePullPolicy }}
      env:
        - name: SELENIUM_HUB
          value: {{ template "selenium.hub.fullname" . }}
        - name: TZ
          value: "UTC"
      command: ['sleep']
      args: ['infinity']
      volumeMounts:
        - name: {{ template "storage.fullname" . }}
          mountPath: "/tmp/shared"
      resources:
        requests:
          cpu: "0.5"
          memory: "1000Mi"
  volumes:
    - name: {{ template "storage.fullname" . }}
      persistentVolumeClaim:
        claimName: {{ template "storage.fullname" . }}
  restartPolicy: Never

Data shared between browsers within one execution is stored in a PVC:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: {{ template "storage.fullname" . }}
  annotations:
    volume.beta.kubernetes.io/storage-class: "aws-efs"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi

_helpers.tpl

{{/* vim: set filetype=mustache: */}}

{{- define "selenium.hub.fullname" -}}
{{- printf "e2e-selenium-hub-%s" .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- end -}}

{{- define "selenium.node.fullname" -}}
{{- if eq .Values.tests.browser "chrome" }}
{{- printf "e2e-selenium-chrome-%s" .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "e2e-selenium-firefox-%s" .Release.Name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end -}}

{{- define "executor.fullname" -}}
{{- printf "e2e-executor-%s" .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- end -}}

{{- define "storage.fullname" -}}
{{- printf "e2e-storage-%s" .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- end -}}

To Reproduce

We use Jenkins pipeline to run tests.

I am deploying Zalenium using Helm: helm install ci-pipeline/helm/ --name='$job.BUILDNAME'

Simplified example:

require 'watir'
require 'socket'

@client = Selenium::WebDriver::Remote::Http::Default.new
@client.read_timeout = 720 # seconds
@client.open_timeout = 600
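# NOTE: 720 s is shorter than the hub's NEW_SESSION_WAIT_TIMEOUT above
# (1200000 ms = 1200 s), so a new-session request that waits at the hub for
# more than 720 s can surface as Net::ReadTimeout on the client side.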

# From executor.yaml
# - name: SELENIUM_HUB
# value: {{ template "selenium.hub.fullname" . }}
hub_address = ENV['SELENIUM_HUB'] || ''
@selenium_hub_address = IPSocket.getaddress(hub_address)

@browser = Watir::Browser.new :remote, url: "http://#{@selenium_hub_address}:4444/wd/hub", http_client: @client

@browser.goto 'www.google.com'
sleep 600 # do nothing
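
To simulate one execution with several browsers, we open them in parallel against the same hub. A rough, thread-based sketch (simplified; the real suite drives actual test cases, and each thread gets its own HTTP client):

require 'watir'

hub_url = "http://#{@selenium_hub_address}:4444/wd/hub"

# Open up to 5 browsers in parallel, one session per thread.
threads = 5.times.map do
  Thread.new do
    client = Selenium::WebDriver::Remote::Http::Default.new
    client.read_timeout = 720
    browser = Watir::Browser.new :remote, url: hub_url, http_client: client
    browser.goto 'www.google.com'
    sleep 600 # keep the session busy, as in the simplified example above
    browser.quit
  end
end
threads.each(&:join)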

Expected behavior

All browsers start correctly and the connection between the client and the browsers is not lost.

Environment

Everything is running in AWS (executor and browsers).

Machine(s) for K8s: AWS EC2 m5.large, AMI ami-050a5ee88521c50e4
Zalenium Image Version(s): 3.141.59j

An m5.large has 2 vCPUs and 8 GB RAM. Because of the requested resources

            - name: ZALENIUM_KUBERNETES_CPU_REQUEST
              value: "800m"
            - name: ZALENIUM_KUBERNETES_MEMORY_REQUEST
              value: "1200Mi"

we know that only 2 browser pods can run on one AWS node. I also tried an m5.xlarge instance with double the hardware specs and the results were the same.
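
As a back-of-the-envelope check (our own arithmetic from the numbers above): 15 parallel executions × 5 browsers = up to 75 concurrent browser pods. At 800m CPU per pod, two pods already take 1600m of an m5.large's 2 vCPUs, so roughly 75 / 2 ≈ 38 nodes would be needed at full load.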

kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T10:09:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

Thank you for any response!

diemol commented 5 years ago

Hi @aliaCZEk,

It is really hard to say why it does not work as expected. Does it work locally in a Docker setup with your custom containers?

kubiksamek commented 5 years ago

@diemol I tried it locally using minikube and it works. But I only tried a single test run; my CPU is not powerful enough to execute multiple runs.

In AWS it works if I run 1 or 2 parallel executions, for example. With 10 executions it starts to fail. We don't know whether the problem is in AWS or in Zalenium (starting nodes, moving containers, ...).

diemol commented 5 years ago

What do you see in the Zalenium log? Are all pods starting normally?

kubiksamek commented 5 years ago

How can I get the Zalenium log? Sorry for the delay.

diemol commented 5 years ago

It should be the log output from the Zalenium pod.

kubiksamek commented 5 years ago

@diemol So you mean using kubectl logs <pod_name>. We don't save logs from pods by default. I'll try to capture them in the next executions.

diemol commented 5 years ago

Closing as we didn't get more feedback. Feel free to reopen when more information that helps reproduce the issue is provided.