zalando / zalenium

A flexible and scalable container based Selenium Grid with video recording, live preview, basic auth & dashboard.
https://opensource.zalando.com/zalenium/

Frequent java.net.ConnectException: Connection refused while running tests #993

Open flavialetgo opened 5 years ago

flavialetgo commented 5 years ago

πŸ› Bug Report

I have a suite of web tests implemented with WebdriverIO 4 that I run against a Zalenium image. I execute 14 tests in parallel in Chrome and randomly get java.net.ConnectException: Connection refused. The same suite had been running without issues until the end of May 2019.

The test client logs the following exception (the stack trace is from WebdriverIO):

ERROR: java.net.ConnectException: Connection refused (Connection refused)

    at new RuntimeError (/home/jenkins/workspace/E2E_Tests_master/test/e2e/node_modules/webdriverio/build/lib/utils/ErrorHandler.js:143:12)
    at Request._callback (/home/jenkins/workspace/E2E_Tests_master/test/e2e/node_modules/webdriverio/build/lib/utils/RequestHandler.js:318:39)
    at Request.self.callback (/home/jenkins/workspace/E2E_Tests_master/test/e2e/node_modules/request/request.js:185:22)
    at Request.emit (events.js:160:13)
    at Request.<anonymous> (/home/jenkins/workspace/E2E_Tests_master/test/e2e/node_modules/request/request.js:1161:10)
    at Request.emit (events.js:160:13)
    at IncomingMessage.<anonymous> (/home/jenkins/workspace/E2E_Tests_master/test/e2e/node_modules/request/request.js:1083:12)
    at Object.onceWrapper (events.js:255:19)
    at IncomingMessage.emit (events.js:165:20)
    at endReadableNT (_stream_readable.js:1101:12)
    at process._tickCallback (internal/process/next_tick.js:152:19)

Relevant Zalenium logs while experiencing the Connection refused (note the behavior of node xx.xx.xx.166):

14:00:50.952 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.proxy.AutoStartProxySet - Timing out containers because active container count 15 is greater than min 0.
14:00:50.953 [http://xx.xx.xx.166:40000] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - No test activity, proxy has has been idle 300020 which is more than 300000
14:00:50.953 [http://xx.xx.xx.166:40000] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Proxy is idle.
14:00:50.953 [http://xx.xx.xx.166:40000] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Shutting down node due to proxy being idle after test.
14:00:50.953 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.proxy.AutoStartProxySet - 1 proxies are idle and will be removed.
14:00:50.953 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.proxy.AutoStartProxySet - Checked containers.
14:00:50.961 [http://xx.xx.xx.88:40000] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - lastCommand: POST - executing...
14:00:50.964 [http://xx.xx.xx.166:40000] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - lastCommand: DELETE - executing...
14:00:50.964 [http://xx.xx.xx.206:40000] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - lastCommand: DELETE - executing...
14:00:50.965 [http://xx.xx.xx.88:40000] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - lastCommand: POST - executing...
14:00:50.966 [http://xx.xx.xx.166:40000] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Marking the node as down because it was idle after the tests had finished.
14:00:50.966 [http://xx.xx.xx.166:40000] DEBUG d.z.e.z.registry.ZaleniumRegistry - Cleaning up stale test sessions on the unregistered node http://xx.xx.xx.166:40000
14:00:50.966 [http://xx.xx.xx.166:40000] DEBUG d.z.e.z.proxy.AutoStartProxySet - Stopping removed container [http://xx.xx.xx.166:40000]
14:00:50.974 [http://xx.xx.xx.166:40000] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - stopPolling() deactivated
14:00:50.977 [http://xx.xx.xx.166:40000] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - lastCommand: POST - executing...
14:00:50.977 [qtp1414845278-312] WARN  o.o.g.w.s.handler.RequestHandler - The client is gone for session ext. key 0ed8ac8aeb36d90d4ce0ebd5a19dca07, terminating
14:00:51.396 [qtp1414845278-311] ERROR o.o.g.w.s.handler.RequestHandler - cannot forward the request unexpected end of stream on Connection{xx.xx.xx.166:40000, proxy=DIRECT hostAddress=/xx.xx.xx.166:40000 cipherSuite=none protocol=http/1.1}
java.io.IOException: unexpected end of stream on Connection{xx.xx.xx.166:40000, proxy=DIRECT hostAddress=/xx.xx.xx.166:40000 cipherSuite=none protocol=http/1.1}
    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:208)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at org.openqa.selenium.remote.internal.OkHttpClient$Factory$1.lambda$createClient$1(OkHttpClient.java:152)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200)
    at okhttp3.RealCall.execute(RealCall.java:77)
    at org.openqa.selenium.remote.internal.OkHttpClient.execute(OkHttpClient.java:103)
    at org.openqa.grid.internal.TestSession.sendRequestToNode(TestSession.java:422)
    at org.openqa.grid.internal.TestSession.forward(TestSession.java:229)
    at org.openqa.grid.web.servlet.handler.RequestHandler.forwardRequest(RequestHandler.java:99)
    at org.openqa.grid.web.servlet.handler.RequestHandler.process(RequestHandler.java:133)
    at org.openqa.grid.web.servlet.DriverServlet.process(DriverServlet.java:85)
    at org.openqa.grid.web.servlet.DriverServlet.doPost(DriverServlet.java:69)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.seleniumhq.jetty9.servlet.ServletHolder.handle(ServletHolder.java:865)
    at org.seleniumhq.jetty9.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655)
    at io.prometheus.client.filter.MetricsFilter.doFilter(MetricsFilter.java:170)
    at org.seleniumhq.jetty9.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
    at org.seleniumhq.jetty9.servlet.ServletHandler.doHandle(ServletHandler.java:533)
    at org.seleniumhq.jetty9.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.seleniumhq.jetty9.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.seleniumhq.jetty9.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.seleniumhq.jetty9.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
    at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    at org.seleniumhq.jetty9.server.handler.ContextHandler.doHandle(ContextHandler.java:1340)
    at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    at org.seleniumhq.jetty9.servlet.ServletHandler.doScope(ServletHandler.java:473)
    at org.seleniumhq.jetty9.server.session.SessionHandler.doScope(SessionHandler.java:1564)
    at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    at org.seleniumhq.jetty9.server.handler.ContextHandler.doScope(ContextHandler.java:1242)
    at org.seleniumhq.jetty9.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.seleniumhq.jetty9.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)
    at org.seleniumhq.jetty9.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.seleniumhq.jetty9.server.Server.handle(Server.java:503)
    at org.seleniumhq.jetty9.server.HttpChannel.handle(HttpChannel.java:364)
    at org.seleniumhq.jetty9.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.seleniumhq.jetty9.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
    at org.seleniumhq.jetty9.io.FillInterest.fillable(FillInterest.java:103)
    at org.seleniumhq.jetty9.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    at org.seleniumhq.jetty9.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    at org.seleniumhq.jetty9.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
    at org.seleniumhq.jetty9.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException: \n not found: limit=0 content=…
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:237)
    at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
    ... 61 common frames omitted

I'm aware of a similar issue reported in the past and its fix (https://github.com/zalando/zalenium/issues/970). Is it possible I am facing the same one?

To Reproduce

Kick off tests in parallel so that they execute on different nodes. Unfortunately, I have not yet been able to identify a specific scenario that triggers the error, so I cannot give more details.

Steps to reproduce the behavior (including the docker command/docker-compose/Kubernetes manifests to start Zalenium):

Expected behavior

Tests should run with no errors

Actual behavior

Random java.net.ConnectException: Connection refused while running tests in parallel.

Environment

ubuntu:xenial-20181113
Zalenium Image Version(s): 3.141.59r

Hub config values:

hub:
  ## The repository and image
  ## ref: https://hub.docker.com/r/dosel/zalenium
  image: "dosel/zalenium"

  ## The tag for the image
  ## ref: https://hub.docker.com/r/dosel/zalenium/tags
  tag: "latest"

  ## Specify an imagePullPolicy
  ## ref: http://kubernetes.io/docs/user-guide/images/#pre-pulling-images
  pullPolicy: "IfNotPresent"

  ## Specify secrets to pull images from private repositories
  ## ref: https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod
  #imagePullSecret: "xxxxxxxx"

  ## The port which the hub listens on
  port: 4444

  ## Timeout for probe Hub liveness via HTTP request on Hub console
  livenessTimeout: 1

  ## Timeout for probe Hub readiness via HTTP request on Hub console
  readinessTimeout: 1

  ## Pod Security Context
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
  ## in case it is running outside openshift, this should fix this https://github.com/zalando/zalenium/issues/631
  securityContext:
    enabled: false
    fsGroup: 0
    runAsUser: 1001

  ## Configure resource requests and limits
  ## ref: http://kubernetes.io/docs/user-guide/compute-resources/
  resources:
    requests:
      cpu: "1000m"
      memory: "4Gi"
    ## For Java applications, it is currently better not to set a CPU limit.
    limits:
      memory: "4Gi"

  ## The type of service to create
  ##   Values: ClusterIP, NodePort, LoadBalancer, or ExternalName
  ## ref: https://kubernetes.io/docs/user-guide/services/
  serviceType: "LoadBalancer"

  serviceAnnotations:
    external-dns.alpha.kubernetes.io/hostname: "my.hostname.com"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
    service.beta.kubernetes.io/aws-load-balancer-internal: "0.0.0.0/0"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "True"

  #service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:us-east-1:072182941009:certificate/f6ab184c-e945-4463-9694-eee09b3620a2",
  ## If serviceType is LoadBalancer:
  ##   Add list of IPs allowed to connect to the service
  ## ref: https://kubernetes.io/docs/concepts/services-networking/service/
  serviceSourceRanges: ["0.0.0.0/0"]

  ## Control where client requests go, to the same pod or round-robin
  ##   Values: ClientIP or None
  ## ref: https://kubernetes.io/docs/user-guide/services/
  serviceSessionAffinity: "None"

  ## Environment variables passed to Zalenium hub.
  ## https://github.com/zalando/zalenium/blob/master/docs/usage_examples.md
  desiredContainers: 0
  maxDockerSeleniumContainers: 20
  videoRecordingEnabled: false
  cpuRequest: 1000m
  memRequest: 2Gi
  memLimit: 3Gi
  screenWidth: 1920
  screenHeight: 1080
  timeZone: "UTC"
  seleniumImageName: "elgalu/selenium"
  maxTestSessions: 1
  newSessionsWaitTimeout: 900000
  browserTimeout: 0
  idleTimeout: 500
  debugEnabled: true
  keepOnlyFailedTests: false
  retentionPeriod: 3
  sendAnonymousUsageInfo: true
  basicAuth:
    enabled: false
    username: "xxxxxxxx"
    password: "xxxxxxxx"
  sauceLabsEnabled: false
  sauceUserName: blank
  sauceAccessKey: blank
  browserStackEnabled: false
  browserStackUser: blank
  browserStackKey: blank
  testingBotEnabled: false
  testingBotKey: blank
  testingBotSecret: blank

  ## Arbitrary environment variables
  env: 
    # FOO: BAR
    ZALENIUM_EXTRA_JVM_PARAMS: -Dwebdriver.http.factory=apache

  ## Use Openshift DeploymentConfig instead of Kubernetes Deployment
  ## https://docs.okd.io/latest/architecture/core_concepts/deployments.html#deployments-and-deployment-configurations
  openshift:
    deploymentConfig:
      enabled: false
      triggers:
        - type: ConfigChange

## Enable persistence using Persistent Volume Claims
## ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
##
persistence:
  data:
    enabled: false
    ## If true will use existing PVC instead of creating one
    useExisting: false
    ## Name of existing PVC to be used in the zalenium deployment
    name: 
    ## If defined, storageClassName: <storageClass>
    ## If set to "-", storageClassName: "", which disables dynamic provisioning
    ## If undefined (the default) or set to null, no storageClassName spec is
    ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
    ##   GKE, AWS & OpenStack)
    ##
    storageClass: standard
    accessMode: ReadWriteOnce
    size: 15Gi
  video:
    enabled: false
    ## If true will use existing PVC instead of creating one
    useExisting: false
    ## Name of existing PVC to be used in the zalenium deployment
    name: 
    ## If defined, storageClassName: <storageClass>
    ## If set to "-", storageClassName: "", which disables dynamic provisioning
    ## If undefined (the default) or set to null, no storageClassName spec is
    ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
    ##   GKE, AWS & OpenStack)
    ##
    storageClass: standard
    accessMode: ReadWriteOnce
    size: 30Gi

ingress:
  enabled: false
  tls: false
  # secretName: my-tls-cert # only needed if tls above is true
  hostname: zalenium.foobar.com
  # If your ingress host name is shared with multiple
  # applications, Enter a path starting with / (eg. /zalenium)
  # If your ingress host name is only for Zalenium,
  # set the path as / (root)
  path: /
  annotations:
    # kubernetes.io/ingress.class: "nginx"
    # kubernetes.io/tls-acme: "true"

## For RBAC support:
rbac:
  create: true
  ## Run the zalenium hub container with the ability to deploy/manage containers of jobs
  ## cluster-wide or only within namespace
  clusterWideAccess: false

serviceAccount:
  create: true
  ## Use the following Kubernetes Service Account name if not created
  name:

nodeSelector:
  enabled: false
  ## Run zalenium hub on specified nodes
  ## key: test
  ## value: test

tolerations:
  enabled: false
  ## Set tolerations to run zalenium hub on nodes with taints
  ## key: test
  ## operator: Equal
  ## value: test
  ## effect: NoSchedule
diemol commented 5 years ago

@flavialetgo the only relevant change has been the release of Chrome 75, nothing else that I am aware of.

Please let us know when you find a clear way to reproduce this, so we can have a look. Thanks!

chickenZ42 commented 5 years ago

I'm experiencing the same issue lately

For me, it seems that after a certain number of test runs the hub becomes unstable, to the point where either node containers shut down in the middle of a test (no sleeps or extended waits are used), in which case I get the Connection refused error, or no more nodes get started at all. If I go to the grid console / live preview dashboard, I can see that the hub is up, but no nodes are connected and none are created for new test requests.

This might be related, but I haven't figured out yet what causes the instability in the first place. Usually restarting Zalenium does the trick for a while, until it becomes unstable again.

simonkoener-penguin commented 5 years ago

I think this is the same issue as https://github.com/zalando/zalenium/issues/560. I'm not sure whether it has been resolved on the Selenium side or at a deeper level.

flavialetgo commented 5 years ago

> @flavialetgo the only relevant change has been the release of Chrome 75, nothing else that I am aware of.
>
> Please let us know when you find a clear way to reproduce this, so we can have a look. Thanks!

I don't have a clear way to reproduce this yet. Under the same environment conditions the issue is not always reproducible, but there is a high chance of occurrence. I'll keep you posted.

flavialetgo commented 5 years ago

@diemol, the following could be a way to reproduce:

1. Have a set of tests/feature files to be executed in parallel, for example, 10 tests/feature files.
2. Have some of them throw a timeout. In my case (WebdriverIO + Cucumber), the step definition timeout is set to X milliseconds; force the test to time out at some point.
3. Execute the tests in parallel, one session per node, with fewer nodes than tests so that sessions get queued. For example, if you have 10 tests, configure Zalenium to allow up to 5 nodes.

Please take into account that it may be necessary to repeat the run a few times before the java.net.ConnectException: Connection refused appears.
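As a rough illustration of the load pattern in those steps (not real Selenium code): the sketch below simulates 10 tests competing for 5 node slots, with a couple of simulated timeouts thrown in. `runWithPool` and the simulated tasks are hypothetical stand-ins; in the actual repro each slot would be a Zalenium node running one session.

```javascript
// Hypothetical simulation of the repro: 10 "tests" competing for 5 "node"
// slots. The tasks are plain async functions standing in for real
// WebdriverIO sessions; only the queuing/concurrency pattern is the point.
async function runWithPool(tasks, poolSize) {
  let active = 0;
  let peak = 0; // highest number of tasks running at once
  const results = [];
  const queue = tasks.slice();

  async function worker() {
    while (queue.length > 0) {
      const task = queue.shift();
      active += 1;
      peak = Math.max(peak, active);
      try {
        results.push(await task());
      } catch (err) {
        results.push('FAILED: ' + err.message); // a "timed out" test
      } finally {
        active -= 1;
      }
    }
  }

  // poolSize concurrent workers drain the shared queue, so at most
  // poolSize sessions are ever active -- the rest wait, as on the grid.
  await Promise.all(Array.from({ length: poolSize }, () => worker()));
  return { results, peak };
}

// 10 simulated tests; per step 2, a couple of them "time out".
const tests = Array.from({ length: 10 }, (_, i) => async () => {
  await new Promise((resolve) => setTimeout(resolve, 10));
  if (i === 4 || i === 9) throw new Error('test ' + i + ' timed out');
  return 'test ' + i + ' passed';
});

runWithPool(tests, 5).then(({ results, peak }) => {
  console.log(results.length + ' tests finished, peak concurrency ' + peak);
});
```

The interesting window is the moment a queued session is dispatched to a node that the hub has just decided to tear down, which matches the idle-timeout teardown visible in the logs above.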

ls-sergii-buchovskyi-zz commented 4 years ago

I had a CI job with ~60 tests where that error happened once in a while. When I split that job into 6 jobs with 4-20 tests each, it now happens in 3-4 out of the 6 jobs, even in the job that has only 4 tests.

kumar1210 commented 4 years ago

Hi @diemol, I am also facing this issue, and I am able to replicate it quite frequently. Steps to reproduce:

1. Have a test suite with multiple test methods spread across classes.
2. Have multiple TestNG threads as well as multiple Zalenium nodes, say 5 TestNG threads and 5 Zalenium nodes.
3. Monitor the live console: after executing the first test case, the session is killed, the node also gets killed, and a new session gets created.

So when TestNG then tries to execute further steps, it is not able to find the session and throws the exception.

Command to start the Zalenium hub:

    docker run --rm -ti --name zalenium --hostname zalenium_hub -p 4444:4444 -e zalenium_no_proxy="localhost,127.0.0.1,172.17.0.*" -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/videos:/home/seluser/videos --privileged dosel/zalenium start --maxTestSessions 5 --desiredContainers 5

andresilva5 commented 4 years ago

Hi @kumar1210, I have the same problem. Were you able to solve it?

kumar1210 commented 4 years ago

> Hi @kumar1210, I have the same problem. Were you able to solve it?

I was not able to solve it, but I found a workaround; I am not sure of the reason yet. I am using multiple machines to run the grid. Before starting the hub machine, I start the nodes on the other machines, and then start the hub. When the hub starts, it registers all the nodes subscribed to it, and Zalenium routes test cases to the nodes in the order they registered. So all my test cases get executed on the slave nodes, and I am not seeing that exception anymore.

I am guessing it might be because of the different way we are starting the nodes.

To start the hub:

    docker run --rm -ti --name zalenium --hostname zalenium_hub -p 4444:4444 -e zalenium_no_proxy="localhost,127.0.0.1,172.17.0.*" -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/videos:/home/seluser/videos --privileged dosel/zalenium start --desiredContainers 5

To start a node on another machine:

    docker run -d --name hostname_node_0 -h hostname_node_0 -p 5550:5555 -e HUB_HOST= -e HUB_PORT=4444 -e REMOTE_HOST="http://`hostname --long`:5550" selenium/node-firefox

rakeshnambiar commented 3 years ago

@flavialetgo I am facing the same issue too, both in the latest version and in version 3.14.0g. Do you know of any workaround for the k8s deployment?