Open · tylarb opened this issue 3 years ago
Started looking at this one. Current findings are:
Thanks for looking at this @anmalysh-yb
I've seen this on AWS - the above logs are from a system there. I believe that universe is 2.5.3 but I can confirm that as well.
It looks like Python's Popen() needs a wait() as well, but my cursory search didn't find any Popen() without a wait().
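For reference, this is the mechanism behind the `<defunct>` entries (a minimal sketch, not anything taken from the YW code): the kernel keeps a child's exit status in the process table until the parent collects it.

```python
import subprocess
import time

# Minimal illustration (not YW code): a child whose exit status is never
# collected stays in the process table as <defunct> after it exits.
p = subprocess.Popen(["sleep", "0.1"])
time.sleep(1)   # child has exited; `ps -ef | grep defunct` would now show it
p.wait()        # collecting the exit status removes the <defunct> entry
```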
Ok, I figured out it's the health queries:
```
root 18187 14040 0 15:49 ? 00:00:00 [ssh] <defunct>
root 18188 14040 0 15:49 ? 00:00:00 [ssh] <defunct>
root 18189 14040 0 15:49 ? 00:00:00 [ssh] <defunct>

[centos@portal ~]$ sudo ausearch -i | grep pid=18189 -B 10 | grep type=EXECVE | grep ssh
type=EXECVE msg=audit(05/05/2021 15:49:21.133:1178610) : argc=15 a0=ssh a1=yugabyte@10.150.0.72 a2=-p a3=54422 a4=-o a5=StrictHostKeyChecking no a6=-o a7=ConnectTimeout=10 a8=-o a9=UserKnownHostsFile /dev/null a10=-o a11=LogLevel ERROR a12=-i a13=/opt/yugabyte/yugaware/data/keys/f8438937-bef6-4ed7-8304-2fad20f54946/yb-1-gcp-provider-key.pem a14=set -o pipefail; /home/yugabyte/tserver/bin/ysqlsh -h 10.150.0.72 -p 5433 -U yugabyte "sslmode=require" -c "\conninfo"
type=EXECVE msg=audit(05/05/2021 15:49:21.137:1178615) : argc=15 a0=ssh a1=yugabyte@10.150.0.72 a2=-p a3=54422 a4=-o a5=StrictHostKeyChecking no a6=-o a7=ConnectTimeout=10 a8=-o a9=UserKnownHostsFile /dev/null a10=-o a11=LogLevel ERROR a12=-i a13=/opt/yugabyte/yugaware/data/keys/f8438937-bef6-4ed7-8304-2fad20f54946/yb-1-gcp-provider-key.pem a14=set -o pipefail; df -h

[centos@portal ~]$ sudo ausearch -i | grep pid=18188 -B 10 | grep type=EXECVE | grep ssh
type=EXECVE msg=audit(05/05/2021 15:49:21.133:1178610) : argc=15 a0=ssh a1=yugabyte@10.150.0.72 a2=-p a3=54422 a4=-o a5=StrictHostKeyChecking no a6=-o a7=ConnectTimeout=10 a8=-o a9=UserKnownHostsFile /dev/null a10=-o a11=LogLevel ERROR a12=-i a13=/opt/yugabyte/yugaware/data/keys/f8438937-bef6-4ed7-8304-2fad20f54946/yb-1-gcp-provider-key.pem a14=set -o pipefail; /home/yugabyte/tserver/bin/ysqlsh -h 10.150.0.72 -p 5433 -U yugabyte "sslmode=require" -c "\conninfo"

[centos@portal ~]$ sudo ausearch -i | grep pid=18187 -B 10 | grep type=EXECVE | grep ssh
type=EXECVE msg=audit(05/05/2021 15:49:21.123:1178600) : argc=15 a0=ssh a1=yugabyte@10.150.0.72 a2=-p a3=54422 a4=-o a5=StrictHostKeyChecking no a6=-o a7=ConnectTimeout=10 a8=-o a9=UserKnownHostsFile /dev/null a10=-o a11=LogLevel ERROR a12=-i a13=/opt/yugabyte/yugaware/data/keys/f8438937-bef6-4ed7-8304-2fad20f54946/yb-1-gcp-provider-key.pem a14=set -o pipefail; find /home/yugabyte/tserver/logs/ -mmin -12 -name "*FATAL*" -type f -printf "%T@ %p\n" | sort -rn
type=EXECVE msg=audit(05/05/2021 15:49:21.129:1178605) : argc=15 a0=ssh a1=yugabyte@10.150.0.72 a2=-p a3=54422 a4=-o a5=StrictHostKeyChecking no a6=-o a7=ConnectTimeout=10 a8=-o a9=UserKnownHostsFile /dev/null a10=-o a11=LogLevel ERROR a12=-i a13=/opt/yugabyte/yugaware/data/keys/f8438937-bef6-4ed7-8304-2fad20f54946/yb-1-gcp-provider-key.pem a14=set -o pipefail; SSL_VERSION=TLSv1_2 SSL_CERTFILE=/home/yugabyte/yugabyte-tls-config/ca.crt /home/yugabyte/tserver/bin/cqlsh 10.150.0.72 9042 -e "SHOW HOST" --ssl
```
The only thing left is to find out what's wrong with the Popen usage in cluster_health.py. For now it seems correct, but I'm not a Python expert.
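One subtlety worth checking: a Popen call can look complete and still leak zombies, because reading the child's output is not the same as reaping it. A hedged sketch of such a pattern (the command is made up; this is not the actual cluster_health.py code):

```python
import subprocess

SSH_CMD = ["ssh", "-o", "ConnectTimeout=10", "yugabyte@10.150.0.72", "df -h"]

# Looks complete, but never reaps the child:
p = subprocess.Popen(SSH_CMD, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out = p.stdout.read()   # drains the pipe but does NOT collect the exit status
rc = p.returncode       # still None: only wait()/poll()/communicate() set it
# ...the ssh child is now a zombie until something calls waitpid() on it.

# Zombie-free version: communicate() drains the pipes *and* waits on the child.
p = subprocess.Popen(SSH_CMD, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()
rc = p.returncode       # real exit code
```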
Well, somehow the change to cluster_health.py in this diff, https://phabricator.dev.yugabyte.com/D10954, fixes the issue. I applied it to the GCP portal and it ran with the patch for the whole day with no new defunct processes. I'm not sure why it works, though; it looks like some magic.
To me the issue looks like the one described here: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/ Daniel already attempted to fix it earlier in https://phabricator.dev.yugabyte.com/D10164, but that does not seem to fully fix the issue.
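For context, the technique that post describes for a PID-1 style parent is to reap every terminated child, whether or not anything explicitly waits on it. A minimal Python sketch of that general pattern (this illustrates the blog post's idea, not the actual D10164/D10954 change):

```python
import os
import signal
import subprocess
import time

# On SIGCHLD, reap any child that has exited so nothing lingers as <defunct>.
def reap_children(signum, frame):
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return      # no children left to wait for
        if pid == 0:
            return      # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap_children)

# Demo: spawn a short-lived child and never wait() on it explicitly;
# the handler reaps it, so no zombie remains.
subprocess.Popen(["sleep", "0.1"])
time.sleep(1)
```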
I'll go ahead and port the fix from D10954 to earlier branches and see whether it helps; at least it helped on the GCP portal. If not, we'll need to look into this issue further. We can probably move to subprocess.check_output later, once we drop Python 2 support.
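For reference, this is roughly what a check_output-based health check command could look like (a sketch only; the command line is copied from the audit log above and the timeout is arbitrary). check_output() waits on the child internally, so its exit status is always collected and no zombie is left behind:

```python
import subprocess

ssh_cmd = [
    "ssh", "-p", "54422",
    "-o", "StrictHostKeyChecking no",
    "-o", "ConnectTimeout=10",
    "yugabyte@10.150.0.72",
    "df -h",
]
try:
    # The timeout= keyword requires Python 3, which is presumably why this is
    # gated on dropping Python 2 support.
    output = subprocess.check_output(ssh_cmd, stderr=subprocess.STDOUT,
                                     timeout=30, universal_newlines=True)
except subprocess.CalledProcessError as e:
    output = e.output      # non-zero exit: the output is still available here
except subprocess.TimeoutExpired:
    output = "health check command timed out"
```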
Reporting a regression here. This is being seen again on 2.9 and 2.6 portal instances internally.
It appears that there is some process leak in yugaware which doesn't reap ssh processes after creation. Normally this is due to a subprocess exec call (`Popen()`) without a corresponding `wait()`. The leak seems slow, on the order of tens of processes per month, but over time it would be possible to get into a "too many procs" scenario if yugaware isn't restarted at some point.
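If it helps while tracking the regression, here is a small, illustrative way (not part of YW) to watch the leak rate by counting defunct ssh entries in the process table:

```python
import subprocess

# Count defunct ssh processes (state "Z") currently in the process table.
# Requires Python 3.7+ for capture_output/text.
ps = subprocess.run(["ps", "-eo", "pid,ppid,stat,comm"],
                    capture_output=True, text=True, check=True)
zombies = [line for line in ps.stdout.splitlines()[1:]
           if "Z" in line.split()[2] and "ssh" in line]
print("%d defunct ssh processes" % len(zombies))
```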