ajcann opened this issue 5 years ago
Hi @ajcann,
Do you have any error or stacktrace in the zalenium and/or selenium logs that can help a bit more ?
Hi @arnaud-deprez - the only errors I am seeing are of the following type. The containers themselves don't seem to have any errors. For reference, I am now cleaning them up with a scheduled job that runs:

```shell
kubectl get pods --selector=createdBy=zalenium -o json \
  | jq --raw-output '.items[] | select(.status.startTime | fromdateiso8601 < (now | floor) - 1200) | .metadata.name' \
  | xargs -I % kubectl delete pod/%
```

which deletes containers older than expected.
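For readers less fluent in jq, the age filter in that pipeline can be sketched in Python. This is only an illustration of the same logic; `find_stale_pods` is a hypothetical helper name, and the 1200-second threshold mirrors the command above:

```python
from datetime import datetime, timezone

def find_stale_pods(pods, now, max_age_seconds=1200):
    """Return names of pods whose startTime is older than max_age_seconds.

    `pods` mimics the `.items` list from `kubectl get pods -o json`:
    each entry has .metadata.name and .status.startTime (ISO 8601, UTC).
    """
    stale = []
    for pod in pods:
        # kubectl reports startTime like "2020-01-01T11:00:00Z"
        started = datetime.strptime(
            pod["status"]["startTime"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if (now - started).total_seconds() > max_age_seconds:
            stale.append(pod["metadata"]["name"])
    return stale
```

Each returned name would then be handed to `kubectl delete pod/<name>`, exactly as the `xargs` step does.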
```
14:52:35.742 [qtp2052435819-556462] ERROR d.z.e.z.p.DockerSeleniumRemoteProxy - Failed to create
java.lang.IllegalStateException: Unable to locate pod by ip address, registration will fail
	at de.zalando.ep.zalenium.container.kubernetes.KubernetesContainerClient.registerNode(KubernetesContainerClient.java:445)
	at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.<init>(DockerSeleniumRemoteProxy.java:111)
	at sun.reflect.GeneratedConstructorAccessor65.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.openqa.grid.internal.BaseRemoteProxy.getNewInstance(BaseRemoteProxy.java:360)
	at org.openqa.grid.web.servlet.RegistrationServlet.process(RegistrationServlet.java:103)
	at org.openqa.grid.web.servlet.RegistrationServlet.doPost(RegistrationServlet.java:70)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.seleniumhq.jetty9.servlet.ServletHolder.handle(ServletHolder.java:865)
	at org.seleniumhq.jetty9.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655)
	at io.prometheus.client.filter.MetricsFilter.doFilter(MetricsFilter.java:170)
	at org.seleniumhq.jetty9.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
	at org.seleniumhq.jetty9.servlet.ServletHandler.doHandle(ServletHandler.java:533)
	at org.seleniumhq.jetty9.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
	at org.seleniumhq.jetty9.security.SecurityHandler.handle(SecurityHandler.java:548)
	at org.seleniumhq.jetty9.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
	at org.seleniumhq.jetty9.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
	at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
	at org.seleniumhq.jetty9.server.handler.ContextHandler.doHandle(ContextHandler.java:1340)
	at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
	at org.seleniumhq.jetty9.servlet.ServletHandler.doScope(ServletHandler.java:473)
	at org.seleniumhq.jetty9.server.session.SessionHandler.doScope(SessionHandler.java:1564)
	at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
	at org.seleniumhq.jetty9.server.handler.ContextHandler.doScope(ContextHandler.java:1242)
	at org.seleniumhq.jetty9.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
	at org.seleniumhq.jetty9.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)
	at org.seleniumhq.jetty9.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.seleniumhq.jetty9.server.Server.handle(Server.java:503)
	at org.seleniumhq.jetty9.server.HttpChannel.handle(HttpChannel.java:364)
	at org.seleniumhq.jetty9.server.HttpConnection.onFillable(HttpConnection.java:260)
	at org.seleniumhq.jetty9.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
	at org.seleniumhq.jetty9.io.FillInterest.fillable(FillInterest.java:103)
	at org.seleniumhq.jetty9.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
	at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
	at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
	at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
	at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
	at org.seleniumhq.jetty9.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
	at org.seleniumhq.jetty9.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
	at org.seleniumhq.jetty9.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
	at java.lang.Thread.run(Thread.java:748)
```
There is some debugging you can enable that I think we probably should have left enabled by default.
If you look at this file: https://github.com/zalando/zalenium/blob/master/src/main/resources/logback.xml#L23
you need to change that line to DEBUG. You should simply be able to mount a new version of that file somewhere in the container and then override the logback file with the LOGBACK_PATH environment variable:
https://github.com/zalando/zalenium/blob/master/scripts/zalenium.sh#L25
Then every 30 seconds it will dump a table to the log output that shows what Zalenium believes is happening internally: starting containers, reusing containers, etc. It's a pretty handy debugging tool.
See: https://github.com/zalando/zalenium/blob/master/src/main/java/de/zalando/ep/zalenium/proxy/AutoStartProxySet.java#L78
And: https://github.com/zalando/zalenium/blob/master/src/main/java/de/zalando/ep/zalenium/proxy/AutoStartProxySet.java#L483-L485
Hopefully, that helps you get to the bottom of the issue.
Regarding that exception you posted, I'm pretty sure it doesn't matter because, from memory, the selenium pods retry registration if it fails.
It could be garbage collection related, a race condition, or I/O related at startup.
Kubernetes allows users to customize the container garbage collection policy via three kubelet flags (`--minimum-container-ttl-duration`, `--maximum-dead-containers-per-container`, and `--maximum-dead-containers`).
@pearj, I think indeed that from a Zalenium point of view it does not matter, as Zalenium will create a new Pod if needed. However, from the Kubernetes cluster point of view, Zalenium will load the cluster with Pods it does not manage, because they haven't been registered (if I understood the code correctly).
@antlong I think it's more a race condition or an I/O issue when the Zalenium Pod is created, rather than garbage collection when the pods are deleted, since from what I understand the Pods are still in the Running state and not in Terminating or similar.
However, this is all supposition and I'm afraid I don't have the infrastructure to try to reproduce it. I would have to pay a bit on Google, Amazon or Azure for it, and I'm not sure I would be able to reproduce it, as it can depend on network setup and VM sizing as well.
But if that is really the problem, it might be solvable by implementing a reconciliation loop that regularly checks the actual Kubernetes state against Zalenium's registry state and, if it detects a pod that is not registered in the registry, deletes it and warns in the logs.
Hi there,
There is not much we can do on our side since we don't have the infrastructure to reproduce the issue. Since it is possible to reproduce it on your end, we are open to receiving PRs, thanks!
Please run and provide the result of `kubectl describe nodes <NODE_NAME>`
🐛 Bug Report
Using a Zalenium installation in Kubernetes (via the Helm chart), occasionally containers are not removed when no tests are running, despite max test sessions being set to 1 and desired containers set to 0. We currently run a few thousand tests a day and this only seems to happen in about one in a thousand tests. Max container count is set to 15, and often throughout the day all 15 slots are being consumed.
This is particularly unfortunate for us, as the orphaned (probably not the right term here) containers often cause our Zalenium node pool to remain scaled up indefinitely.
As well, since max container count is set to 15, capacity is slowly (if not noticed) eaten up and tests run more and more slowly until they stop altogether.
To Reproduce
Install Zalenium via Helm into a Kubernetes cluster, with max test sessions set to 1, desired containers set to 0, and max containers set to 15. Repeatedly run tests at a rate which consumes all 15 slots for some time. Eventually some containers will not be properly cleaned up.
Expected behavior
When max test sessions is set to 1 and desired containers is set to 0, all containers should be removed when no tests are running.
Environment
- OS: Container Optimized OS (GKE)
- Zalenium Image Version(s): dosel/zalenium:3.141.59n
- Selenium Image: elgalu/selenium:3.141.59-p14
- If using Kubernetes, specify your environment, and if relevant your manifests: GKE - v1.13.7-gke.8