zalando / zalenium

A flexible and scalable container based Selenium Grid with video recording, live preview, basic auth & dashboard.
https://opensource.zalando.com/zalenium/

Kubernetes containers occasionally not removed #1009

Open ajcann opened 5 years ago

ajcann commented 5 years ago

🐛 Bug Report

Using a Zalenium installation in Kubernetes (via the helm chart), containers are occasionally not removed when no tests are running, despite max test sessions being set to 1 and desired containers set to 0. We currently run a few thousand tests a day, and this only seems to happen in roughly one in a thousand tests. Max containers is set to 15, and often throughout the day all 15 slots are consumed.

This is particularly unfortunate for us, as the orphaned (probably not the right term here) containers often cause our Zalenium node pool to remain scaled up indefinitely.

In addition, because max containers is set to 15, the capacity is slowly eaten up (if nobody notices) and tests run slower until they stop altogether.

To Reproduce

Install Zalenium via helm into a Kubernetes cluster with max test sessions set to 1, desired containers set to 0, and max containers set to 15 (see the sketch below). Repeatedly run tests at a rate that consumes all 15 slots for some time. Eventually some containers will not be properly cleaned up.
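
A sketch of that setup for reference; the `hub.*` value names are recalled from the Zalenium helm chart and are assumptions here, so check them against the chart's values.yaml before using:

```bash
# Illustrative only: value names (hub.desiredContainers, etc.) are assumptions.
helm install zalenium ./charts/zalenium \
  --set hub.desiredContainers=0 \
  --set hub.maxDockerSeleniumContainers=15 \
  --set hub.maxTestSessions=1
```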

Expected behavior

When max test sessions is set to 1 and desired containers is set to 0, all containers should be removed when no tests are running.

Environment

OS: Container Optimized OS (GKE)
Zalenium Image Version(s): dosel/zalenium:3.141.59n
Selenium Image: elgalu/selenium:3.141.59-p14
If using Kubernetes, specify your environment, and if relevant your manifests: GKE - v1.13.7-gke.8

arnaud-deprez commented 5 years ago

Hi @ajcann,

Do you have any error or stacktrace in the Zalenium and/or Selenium logs that could help a bit more?

ajcann commented 5 years ago

Hi @arnaud-deprez - the only errors I am seeing are of the following type. The containers themselves don't seem to have any errors. For reference, I am now cleaning them up with a scheduled job that runs `kubectl get pods --selector=createdBy=zalenium -o json | jq --raw-output '.items [] | select (.status.startTime | fromdateiso8601 < (now | floor) - (1200)) | .metadata.name' | xargs -I % kubectl delete pod/%`, which deletes pods that have been running longer than expected (more than 1,200 seconds).
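
For readability, here is the same cleanup one-liner split over multiple lines (nothing new added; the `createdBy=zalenium` selector and the 1200-second threshold come from the command above):

```bash
# Delete Zalenium-created pods whose startTime is more than 1200 s (20 min) in the past.
kubectl get pods --selector=createdBy=zalenium -o json \
  | jq --raw-output '.items[]
      | select(.status.startTime | fromdateiso8601 < (now | floor) - 1200)
      | .metadata.name' \
  | xargs -I % kubectl delete pod/%
```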

    14:52:35.742 [qtp2052435819-556462] ERROR d.z.e.z.p.DockerSeleniumRemoteProxy - Failed to create
    java.lang.IllegalStateException: Unable to locate pod by ip address, registration will fail
        at de.zalando.ep.zalenium.container.kubernetes.KubernetesContainerClient.registerNode(KubernetesContainerClient.java:445)
        at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.<init>(DockerSeleniumRemoteProxy.java:111)
        at sun.reflect.GeneratedConstructorAccessor65.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.openqa.grid.internal.BaseRemoteProxy.getNewInstance(BaseRemoteProxy.java:360)
        at org.openqa.grid.web.servlet.RegistrationServlet.process(RegistrationServlet.java:103)
        at org.openqa.grid.web.servlet.RegistrationServlet.doPost(RegistrationServlet.java:70)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        at org.seleniumhq.jetty9.servlet.ServletHolder.handle(ServletHolder.java:865)
        at org.seleniumhq.jetty9.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655)
        at io.prometheus.client.filter.MetricsFilter.doFilter(MetricsFilter.java:170)
        at org.seleniumhq.jetty9.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
        at org.seleniumhq.jetty9.servlet.ServletHandler.doHandle(ServletHandler.java:533)
        at org.seleniumhq.jetty9.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
        at org.seleniumhq.jetty9.security.SecurityHandler.handle(SecurityHandler.java:548)
        at org.seleniumhq.jetty9.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
        at org.seleniumhq.jetty9.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
        at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
        at org.seleniumhq.jetty9.server.handler.ContextHandler.doHandle(ContextHandler.java:1340)
        at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
        at org.seleniumhq.jetty9.servlet.ServletHandler.doScope(ServletHandler.java:473)
        at org.seleniumhq.jetty9.server.session.SessionHandler.doScope(SessionHandler.java:1564)
        at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
        at org.seleniumhq.jetty9.server.handler.ContextHandler.doScope(ContextHandler.java:1242)
        at org.seleniumhq.jetty9.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
        at org.seleniumhq.jetty9.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)
        at org.seleniumhq.jetty9.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.seleniumhq.jetty9.server.Server.handle(Server.java:503)
        at org.seleniumhq.jetty9.server.HttpChannel.handle(HttpChannel.java:364)
        at org.seleniumhq.jetty9.server.HttpConnection.onFillable(HttpConnection.java:260)
        at org.seleniumhq.jetty9.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
        at org.seleniumhq.jetty9.io.FillInterest.fillable(FillInterest.java:103)
        at org.seleniumhq.jetty9.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
        at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
        at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
        at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
        at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
        at org.seleniumhq.jetty9.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
        at org.seleniumhq.jetty9.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
        at org.seleniumhq.jetty9.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
        at java.lang.Thread.run(Thread.java:748)

pearj commented 5 years ago

There is some debugging you can enable that I think we probably should have left enabled by default.

If you see this file: https://github.com/zalando/zalenium/blob/master/src/main/resources/logback.xml#L23

You need to change that line to DEBUG. You should simply be able to mount a new version of that file somewhere in the container, and then override the logback file with the LOGBACK_PATH environment variable. https://github.com/zalando/zalenium/blob/master/scripts/zalenium.sh#L25
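
A minimal sketch of one way to wire that up on Kubernetes, assuming a Deployment named `zalenium` and a mount path of `/etc/zalenium`; the exact logger line in logback.xml and the way the ConfigMap is mounted depend on your chart, so treat this as illustrative rather than the official procedure:

```bash
# 1. Copy logback.xml from the repo and change the relevant logger level to DEBUG
#    (something like <logger name="de.zalando.ep.zalenium" level="DEBUG"/>;
#    hypothetical logger name - check the linked file).
# 2. Ship the modified file as a ConfigMap.
kubectl create configmap zalenium-logback --from-file=logback.xml
# 3. Mount the ConfigMap into the hub pod at /etc/zalenium (e.g. via the chart's
#    extra-volume values or by editing the Deployment), then point Zalenium at it.
kubectl set env deployment/zalenium LOGBACK_PATH=/etc/zalenium/logback.xml
```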

Then every 30 seconds it will dump a table to the log output showing the internal state of what Zalenium believes is happening: starting containers, reusing containers, etc. It's a pretty handy debugging tool.

See: https://github.com/zalando/zalenium/blob/master/src/main/java/de/zalando/ep/zalenium/proxy/AutoStartProxySet.java#L78 And: https://github.com/zalando/zalenium/blob/master/src/main/java/de/zalando/ep/zalenium/proxy/AutoStartProxySet.java#L483-L485

Hopefully, that helps you get to the bottom of the issue.

pearj commented 5 years ago

Regarding that exception you posted, I'm pretty sure it doesn't matter, because from memory the Selenium pods retry registration if it fails.

antlong commented 5 years ago

To start with, it could be garbage collection related, a race condition, or something I/O related.

Kubernetes allows users to customize the garbage collection policy via three flags.
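
For reference, a sketch of the kubelet container garbage collection flags as I understand them (flag names from the Kubernetes docs; values are placeholders, and on a managed platform like GKE you may not be able to set them directly):

```bash
# Kubelet container garbage collection tuning (illustrative values only).
kubelet \
  --minimum-container-ttl-duration=1m \
  --maximum-dead-containers-per-container=1 \
  --maximum-dead-containers=100
```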

arnaud-deprez commented 5 years ago

@pearj, indeed, from a Zalenium point of view I think it does not matter, as Zalenium will create a new Pod if needed. However, from the Kubernetes cluster's point of view, Zalenium loads the cluster with Pods it does not manage because they were never registered (if I understood the code correctly).

@antlong I think it's more likely a race condition or something I/O related when the Zalenium Pod is created, rather than garbage collection when the Pods are deleted, since from what I understand the Pods are still in the Running state and not in Terminating or similar.

However, this is all supposition, and I'm afraid I don't have the infrastructure to try to reproduce it like that. I'd have to pay a bit on Google, Amazon or Azure for it, and I'm not sure I would be able to reproduce it, as it can depend on the network setup and VM sizing as well.

But if that is really the problem, it might be solvable by implementing a reconciliation loop that regularly compares the actual Kubernetes state with Zalenium's registry state and, if it detects a Zalenium-created pod that is not registered in its registry, deletes it and warns in the logs. A rough external sketch of that idea follows.
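
This is not Zalenium's internal registry, just a rough external approximation of the idea. It assumes the hub is reachable at `ZALENIUM_URL` and that its /grid/console page contains the IP of every registered node, both of which you should verify in your own setup:

```bash
# Rough external reconciliation sketch: delete Zalenium-created pods that do not
# appear as registered nodes in the grid console.
ZALENIUM_URL="http://zalenium:4444"   # assumed hub address
console=$(curl -s "$ZALENIUM_URL/grid/console")

kubectl get pods --selector=createdBy=zalenium \
  -o jsonpath='{range .items[*]}{.metadata.name} {.status.podIP}{"\n"}{end}' |
while read -r name ip; do
  # Skip pods without an IP yet; a real implementation should also apply an age
  # threshold so freshly started pods get time to register before being deleted.
  if [ -n "$ip" ] && ! printf '%s' "$console" | grep -q "$ip"; then
    echo "Pod $name ($ip) does not appear in the grid console; deleting" >&2
    kubectl delete pod "$name"
  fi
done
```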

diemol commented 4 years ago

Hi there,

There is not much we can do on our side, since we don't have the infrastructure to reproduce the issue. Since it is possible to reproduce it on your end, we are open to receiving any PRs. Thanks!

antlong commented 4 years ago

Please run `kubectl describe nodes <NODE_NAME>` and provide the output.