yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[Platform] Yugaware creates zombie ssh process #8165

Open tylarb opened 3 years ago

tylarb commented 3 years ago

It appears that there is a process leak in yugaware: ssh processes are not reaped after they are created.

Normally this is due to some subprocess.exec() without a corresponding subprocess.wait() call.

The leak seems slow - on the order of tens of processes per month - but over time it could reach a "too many processes" state if yugaware is never restarted.
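For illustration, here is a minimal Python sketch of that pattern (the ssh command is hypothetical, not Yugaware's actual invocation): a child started with Popen() and never waited on stays in the process table as <defunct> until the parent reaps it.

```python
import subprocess
import time

# Hypothetical command, standing in for the ssh calls Yugaware makes.
proc = subprocess.Popen(["ssh", "user@host", "true"])

# The child exits almost immediately, but until the parent wait()s on it,
# it remains in the process table and shows up as [ssh] <defunct> in ps.
time.sleep(30)

# wait() (or communicate()) reaps the child and clears the zombie entry.
proc.wait()
```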

29226 29245 29245 29245 ?           -1 Ssl      0 124:50  |   \_ /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -XX:PermSize=1024m -XX:MaxPermSize=1024m -Duser.dir=/opt/yugabyte/yugaware -Dconfig.file=conf/application.docker.conf -cp /opt/yugabyte/yugaware/lib/../conf/:/opt/yugabyte/yugaware/lib/yugaware.yug
aware-2.4.2.0-b0-sans-externalized.jar:/opt/yugabyte/yugaware/lib/org.scala-lang.scala-library-2.11.7.jar:/opt/yugabyte/yugaware/lib/com.typesafe.play.play-enhancer-1.2.2.jar:/opt/yugabyte/yugaware/lib/org.javassist.javassist-3.21.0-GA.jar:/opt/yugabyte/yugaware/lib/com.typesafe.play.twirl-api_2.11-1.///[trimmed]
29245 18952 18952 18952 ?           -1 Zs       0   0:00  |   |   \_ [ssh] <defunct>
29245 18953 18953 18953 ?           -1 Zs       0   0:00  |   |   \_ [ssh] <defunct>
29245 18972 18972 18972 ?           -1 Zs       0   0:00  |   |   \_ [ssh] <defunct>
29245 18973 18973 18973 ?           -1 Zs       0   0:00  |   |   \_ [ssh] <defunct>
29245 18986 18986 18986 ?           -1 Zs       0   0:00  |   |   \_ [ssh] <defunct>
29245 18987 18987 18987 ?           -1 Zs       0   0:00  |   |   \_ [ssh] <defunct>
29245 24418 24418 24418 ?           -1 Zs       0   0:00  |   |   \_ [ssh] <defunct>
29245 24419 24419 24419 ?           -1 Zs       0   0:00  |   |   \_ [ssh] <defunct>
29245 28912 28912 28912 ?           -1 Zs       0   0:00  |   |   \_ [ssh] <defunct>
29245 28913 28913 28913 ?           -1 Zs       0   0:00  |   |   \_ [ssh] <defunct>
29245 31258 31258 31258 ?           -1 Zs       0   0:00  |   |   \_ [ssh] <defunct>
anmalysh-yb commented 3 years ago

Started looking at this one. Current findings are:

  1. It's not reproducible on the AWS portal, but it still reproduces on the GCP portal. Either it's fixed in the master branch, or it's specific to the installation type/OS/etc.
  2. Application logs show backup ssh connections failing to one of the nodes each time new defunct processes are spawned. But it's probably not backup - the Python code running ssh looks correct, and it retries 10 times with a 10-second delay, so some of those processes would always be started in the following minute, whereas the defunct processes are all started within the same minute.
  3. Enabled auditd logging to find out which exact ssh processes are getting stuck. Waiting for the issue to reproduce.
tylarb commented 3 years ago

Thanks for looking at this @anmalysh-yb

I've seen this on AWS - the logs above are from a system there. I believe that universe is on 2.5.3, but I can confirm that as well.

It looks like Python's Popen() needs a wait() as well, but my cursory search didn't find any Popen() without a corresponding wait().

anmalysh-yb commented 3 years ago

OK, I figured out it's the health check queries:

root 18187 14040 0 15:49 ? 00:00:00 [ssh] <defunct>
root 18188 14040 0 15:49 ? 00:00:00 [ssh] <defunct>
root 18189 14040 0 15:49 ? 00:00:00 [ssh] <defunct>

[centos@portal ~]$ sudo ausearch -i | grep pid=18189 -B 10 | grep type=EXECVE | grep ssh
type=EXECVE msg=audit(05/05/2021 15:49:21.133:1178610) : argc=15 a0=ssh a1=yugabyte@10.150.0.72 a2=-p a3=54422 a4=-o a5=StrictHostKeyChecking no a6=-o a7=ConnectTimeout=10 a8=-o a9=UserKnownHostsFile /dev/null a10=-o a11=LogLevel ERROR a12=-i a13=/opt/yugabyte/yugaware/data/keys/f8438937-bef6-4ed7-8304-2fad20f54946/yb-1-gcp-provider-key.pem a14=set -o pipefail; /home/yugabyte/tserver/bin/ysqlsh -h 10.150.0.72 -p 5433 -U yugabyte "sslmode=require" -c "\conninfo"
type=EXECVE msg=audit(05/05/2021 15:49:21.137:1178615) : argc=15 a0=ssh a1=yugabyte@10.150.0.72 a2=-p a3=54422 a4=-o a5=StrictHostKeyChecking no a6=-o a7=ConnectTimeout=10 a8=-o a9=UserKnownHostsFile /dev/null a10=-o a11=LogLevel ERROR a12=-i a13=/opt/yugabyte/yugaware/data/keys/f8438937-bef6-4ed7-8304-2fad20f54946/yb-1-gcp-provider-key.pem a14=set -o pipefail; df -h

[centos@portal ~]$ sudo ausearch -i | grep pid=18188 -B 10 | grep type=EXECVE | grep ssh
type=EXECVE msg=audit(05/05/2021 15:49:21.133:1178610) : argc=15 a0=ssh a1=yugabyte@10.150.0.72 a2=-p a3=54422 a4=-o a5=StrictHostKeyChecking no a6=-o a7=ConnectTimeout=10 a8=-o a9=UserKnownHostsFile /dev/null a10=-o a11=LogLevel ERROR a12=-i a13=/opt/yugabyte/yugaware/data/keys/f8438937-bef6-4ed7-8304-2fad20f54946/yb-1-gcp-provider-key.pem a14=set -o pipefail; /home/yugabyte/tserver/bin/ysqlsh -h 10.150.0.72 -p 5433 -U yugabyte "sslmode=require" -c "\conninfo"

[centos@portal ~]$ sudo ausearch -i | grep pid=18187 -B 10 | grep type=EXECVE | grep ssh
type=EXECVE msg=audit(05/05/2021 15:49:21.123:1178600) : argc=15 a0=ssh a1=yugabyte@10.150.0.72 a2=-p a3=54422 a4=-o a5=StrictHostKeyChecking no a6=-o a7=ConnectTimeout=10 a8=-o a9=UserKnownHostsFile /dev/null a10=-o a11=LogLevel ERROR a12=-i a13=/opt/yugabyte/yugaware/data/keys/f8438937-bef6-4ed7-8304-2fad20f54946/yb-1-gcp-provider-key.pem a14=set -o pipefail; find /home/yugabyte/tserver/logs/ -mmin -12 -name "*FATAL*" -type f -printf "%T@ %p\n" | sort -rn
type=EXECVE msg=audit(05/05/2021 15:49:21.129:1178605) : argc=15 a0=ssh a1=yugabyte@10.150.0.72 a2=-p a3=54422 a4=-o a5=StrictHostKeyChecking no a6=-o a7=ConnectTimeout=10 a8=-o a9=UserKnownHostsFile /dev/null a10=-o a11=LogLevel ERROR a12=-i a13=/opt/yugabyte/yugaware/data/keys/f8438937-bef6-4ed7-8304-2fad20f54946/yb-1-gcp-provider-key.pem a14=set -o pipefail; SSL_VERSION=TLSv1_2 SSL_CERTFILE=/home/yugabyte/yugabyte-tls-config/ca.crt /home/yugabyte/tserver/bin/cqlsh 10.150.0.72 9042 -e "SHOW HOST" --ssl

The only thing left is to find out what's wrong with the Popen usage in cluster_health.py. For now it seems correct, but I'm not a Python expert.
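For what it's worth, one way Popen usage can look correct at a glance and still leak zombies is an error/timeout path that kills the child but never reaps it. A hedged sketch of that general pattern (assuming Python 3's timeout support; this is not the actual cluster_health.py code):

```python
import subprocess

def run_ssh_command(cmd, timeout_sec=10):
    # Happy path: communicate() waits for the child, so it gets reaped.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        return proc.communicate(timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        proc.kill()
        # Without this second communicate()/wait(), the killed child is
        # never reaped and lingers as a zombie for the life of the parent.
        return proc.communicate()
```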

anmalysh-yb commented 3 years ago

Well, somehow the cluster_health.py change from this diff, https://phabricator.dev.yugabyte.com/D10954, fixes the issue - I applied it to the GCP portal, and it has been running with the patch for a whole day with no new defunct processes. I'm not sure why it works, though - it looks like some magic.

To me the issue looks like this one: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/ Daniel already attempted to fix it earlier in https://phabricator.dev.yugabyte.com/D10164, but that does not seem to fully fix the issue.
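For context, a rough Python illustration of the kind of reaping a proper PID 1/init does (generic, not code from Yugaware or either diff): the top-level process installs a SIGCHLD handler and reaps any children that have exited so they don't accumulate as zombies.

```python
import os
import signal

def reap_children(signum, frame):
    # Reap every child that has already exited so it doesn't linger as a zombie.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except OSError:
            break  # no children left to wait for
        if pid == 0:
            break  # children exist, but none have exited yet

# A minimal init for PID 1 (as the blog post recommends) effectively does
# this on behalf of everything running in the container.
signal.signal(signal.SIGCHLD, reap_children)
```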

I'll go ahead and port the fix from D10954 to earlier branches and see whether it helps - at least it helped on the GCP portal. If not, I'll need to look into this issue further. We can probably move to subprocess.check_output later, once we drop Python 2 support.
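For reference, a hedged sketch of what a check_output-based call could look like (hypothetical command; the timeout argument assumes Python 3): check_output waits for and reaps the child itself, so there is no separate wait() step to forget.

```python
import subprocess

# check_output runs the command, waits for it (reaping the child), and
# raises CalledProcessError on a non-zero exit status.
output = subprocess.check_output(
    ["ssh", "user@host", "df -h"],
    stderr=subprocess.STDOUT,
    timeout=10,
)
print(output.decode())
```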

ajcaldera1 commented 3 years ago

Reporting a regression here. This is being seen again on 2.9 and 2.6 portal instances internally.