zalando / spilo

Highly available elephant herd: HA PostgreSQL cluster using Docker
Apache License 2.0

ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:997) in Spilo 3.0-p1 #889

Open sasadangelo opened 1 year ago

sasadangelo commented 1 year ago

Hi Team,

On my systems I was running a very old Spilo version, built from this commit (Nov 11, 2021): https://github.com/zalando/spilo/commit/f8179d9a5de5e9a78e5a447130f759f97811879b

I have now moved to Spilo 3.0-p1. Everything works as expected, but I see this error message in the logs:

    Exception in thread Thread-242 (process_request_thread):
    Traceback (most recent call last):
      File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
        self.run()
      File "/usr/lib/python3.10/threading.py", line 953, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 885, in process_request_thread
        request.do_handshake()
      File "/usr/lib/python3.10/ssl.py", line 1342, in do_handshake
        self._sslobj.do_handshake()
    ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:997)

My Patroni REST API endpoint (port 8008) is configured with certificates (as per the PR I submitted two years ago).

The problem seems related to this method:

    def process_request_thread(self, request, client_address):
        enable_keepalive(request, 10, 3)
        if hasattr(request, 'context'):  # SSLSocket
            request.do_handshake()
        super(RestApiServer, self).process_request_thread(request, client_address)

It's not clear to me who calls this method and why, or what the effect of having this message in the log is. Can anyone help?

sasadangelo commented 1 year ago

Comparing against my old configuration, from when I had a similar problem, I found these two extra lines in postgresql.yml:

    ssl_ciphers: 'HIGH:!aNULL:!SSLv2:!SSLv3:!TLSv1:!TLSv1.1'
    ssl_prefer_server_ciphers: true

Could they have any impact on the above problem? At the moment, these two extra lines are not present in my configuration. If I remember correctly, I was testing the removal of some weak algorithms.

sasadangelo commented 1 year ago

Can anyone help with this?

sasadangelo commented 1 year ago

The problem has been analyzed with Alexander here: https://postgresteam.slack.com/archives/CFYAXFT7D/p1688372311182259

It seems to be caused by a plain HTTP (non-TLS) connection to an endpoint that uses certificates. The problem is here: https://github.com/zalando/spilo/blob/41c888cfa43b04aa9d46e0bd640b36cd3d7d3fec/postgres-appliance/scripts/patroni_wait.sh#L65

That line cannot work if Patroni is configured with SSL. This is the script that keeps the pod running:

postgres     38     36  0 08:31 ?        00:00:00 /bin/bash /scripts/patroni_wait.sh --role master -- /usr/bin/pgqd /home/postgres/pgq_ticker.ini

This process keeps waiting as long as the curl on:

localhost:8008/ROLE

doesn't return 200. But it is buggy: it doesn't take into account that Patroni could be configured with SSL. When I did the PR to support SSL certificates in Spilo, I should probably have fixed this file too. At that time no error was shown, but the script still failed. We didn't notice any problem because the main patroni_wait.sh process, which is always up and keeps the pod running, runs:

/usr/bin/pgqd /home/postgres/pgq_ticker.ini

which waits indefinitely. So the fix should take these variables into account:

SSL_RESTAPI_CA_FILE
SSL_RESTAPI_CERTIFICATE_FILE
SSL_RESTAPI_PRIVATE_KEY_FILE

and, if they are set, pass them to curl.
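
As a rough illustration of that idea (a minimal sketch, not the code in patroni_wait.sh and not necessarily what the eventual patch does), a wrapper around curl could look like the following. The function name patroni_curl is hypothetical. The sketch assumes that a non-empty SSL_RESTAPI_CERTIFICATE_FILE means the REST API is served over HTTPS, and it maps the three variables onto curl's --cacert, --cert and --key options. Note that --cert and --key only matter if Patroni is configured to ask for client certificates, and that curl's hostname check may still fail if the certificate is not issued for localhost.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): query the Patroni REST API
# and print the HTTP status code, using TLS when SSL_RESTAPI_* are set.
patroni_curl() {
    path=$1          # REST API path, e.g. "master" or "replica"
    scheme=http      # assumption: plain HTTP unless a certificate is configured
    set --           # reuse the positional parameters as the curl option list

    if [ -n "${SSL_RESTAPI_CA_FILE:-}" ]; then
        # CA bundle used to verify the server certificate
        set -- "$@" --cacert "$SSL_RESTAPI_CA_FILE"
    fi
    if [ -n "${SSL_RESTAPI_CERTIFICATE_FILE:-}" ]; then
        # assumption: a configured certificate implies an HTTPS endpoint
        scheme=https
        set -- "$@" --cert "$SSL_RESTAPI_CERTIFICATE_FILE"
    fi
    if [ -n "${SSL_RESTAPI_PRIVATE_KEY_FILE:-}" ]; then
        set -- "$@" --key "$SSL_RESTAPI_PRIVATE_KEY_FILE"
    fi

    # Discard the response body and print only the HTTP status code
    curl -s -o /dev/null -w '%{http_code}' "$@" "$scheme://localhost:8008/$path"
}
```

The wait loop could then compare the output of `patroni_curl "$ROLE"` with 200 instead of calling curl on the plain-HTTP URL directly.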

sasadangelo commented 1 year ago

I am going to prepare an official patch for this problem.

sasadangelo commented 1 year ago

I created a PR to fix this issue: https://github.com/zalando/spilo/pull/909