zhmcclient / zhmc-prometheus-exporter

A Prometheus exporter for the IBM Z HMC
Apache License 2.0

Unhandled wsgiref exception in log #397

Closed Charles1000Chen closed 7 months ago

Charles1000Chen commented 8 months ago

Describe the bug The following error from the wsgiref package appears in the zhmc prometheus exporter log and is not handled:

Traceback (most recent call last):
  File "/usr/lib/python3.10/wsgiref/handlers.py", line 138, in run
    self.finish_response()
  File "/usr/lib/python3.10/wsgiref/handlers.py", line 184, in finish_response
    self.write(data)
  File "/usr/lib/python3.10/wsgiref/handlers.py", line 288, in write
    self.send_headers()
  File "/usr/lib/python3.10/wsgiref/handlers.py", line 346, in send_headers
    self.send_preamble()
  File "/usr/lib/python3.10/wsgiref/handlers.py", line 268, in send_preamble
    self._write(
  File "/usr/lib/python3.10/wsgiref/handlers.py", line 467, in _write
    result = self.stdout.write(data)
  File "/usr/lib/python3.10/socketserver.py", line 826, in write
    self._sock.sendall(b)
  File "/usr/lib/python3.10/ssl.py", line 1237, in sendall
    v = self.send(byte_view[count:])
  File "/usr/lib/python3.10/ssl.py", line 1206, in send
    return self._sslobj.write(data)
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2426)

Expected behavior The wsgiref exception should be caught by the zhmc prometheus exporter, which should output an understandable error message.

Charles1000Chen commented 8 months ago

The SSLEOFError exception needs to be handled when calling the "start_http_server" function.

andy-maier commented 8 months ago

@Charles1000Chen The code already catches ssl.SSLError when it is raised by the "start_http_server" function, because ssl.SSLError is a subclass of IOError: https://github.com/zhmcclient/zhmc-prometheus-exporter/blob/master/zhmc_prometheus_exporter/zhmc_prometheus_exporter.py#L1871
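The subclass relationship can be checked with the standard library alone. This is a minimal sketch, not the exporter's code: `start_server_stub` is a hypothetical stand-in for a start call that fails during setup.

```python
import ssl

# In Python 3, IOError is an alias of OSError, and the ssl exceptions derive from it:
# ssl.SSLEOFError -> ssl.SSLError -> OSError
print(issubclass(ssl.SSLEOFError, ssl.SSLError))
print(issubclass(ssl.SSLError, OSError))
print(IOError is OSError)


def start_server_stub():
    """Hypothetical stand-in for a server start call that fails during setup."""
    raise ssl.SSLEOFError(ssl.SSL_ERROR_EOF, "EOF occurred in violation of protocol")


caught = None
try:
    start_server_stub()
except IOError as exc:
    # An 'except IOError' clause also catches ssl.SSLEOFError.
    caught = type(exc).__name__

print(caught)
```

This only covers exceptions raised synchronously by the call itself.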

I think what probably happens is that the ssl exception is raised in the thread that is started.
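To illustrate the general Python behavior behind that suspicion: an exception raised inside a started thread does not propagate to the code that started the thread, so a try/except around the start call never sees it. The sketch below uses only the standard library; `serve_stub` is a hypothetical stand-in for request handling in a server thread, and it does not reproduce what wsgiref or the Prometheus client actually do.

```python
import ssl
import threading

captured = []


def record_thread_exception(args):
    # Called by the threading module for exceptions that escape Thread.run().
    captured.append(args.exc_type.__name__)


threading.excepthook = record_thread_exception


def serve_stub():
    """Hypothetical stand-in for request handling inside a server thread."""
    raise ssl.SSLEOFError(ssl.SSL_ERROR_EOF, "EOF occurred in violation of protocol")


caught_by_caller = False
try:
    worker = threading.Thread(target=serve_stub, daemon=True)
    worker.start()
    worker.join()
except Exception:
    # Never reached: the exception stays inside the worker thread.
    caught_by_caller = True

print(caught_by_caller, captured)
```

The caller's `except` clause is never entered; only the thread-level hook sees the exception.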

andy-maier commented 8 months ago

I put up PR #413, which in the HTTPS case simplifies the error message but keeps catching IOError around the call to start_http_server() (that also catches any ssl.SSLError exceptions; I verified that). It resulted in the following (properly caught) error messages for a few selected error situations:

Note that the error reported by you is not part of these tests. I suspect that reproducing that error would require cutting off the network between Prometheus and the zhmc exporter during the TLS handshake, which is hard to reproduce for me.

I think the improvement in PR #413 is as much as we can do in the zhmc exporter, because the handling of exceptions raised by the HTTP/HTTPS server while it runs would need to be done by the Python Prometheus client code, and not by the zhmc exporter code.
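For context, the traceback in the report passes through `wsgiref.handlers`, where the handler's `run()` method catches the exception and logs it itself. The sketch below shows, with the standard library only, what handling at that layer could look like: a handler subclass that overrides `log_exception()` to record a one-line message instead of a traceback. It is an assumption-laden illustration, not the Prometheus client's or the exporter's code; `BrokenTlsStream`, `ConciseErrorHandler`, and `metrics_app` are hypothetical names.

```python
import io
import ssl
from wsgiref.handlers import SimpleHandler
from wsgiref.util import setup_testing_defaults

logged = []


class BrokenTlsStream:
    """Hypothetical stand-in for a socket whose TLS peer has gone away."""

    def write(self, data):
        raise ssl.SSLEOFError(ssl.SSL_ERROR_EOF, "EOF occurred in violation of protocol")

    def flush(self):
        pass


class ConciseErrorHandler(SimpleHandler):
    """Records a one-line message instead of the default traceback dump."""

    def log_exception(self, exc_info):
        exc_type, exc_value, _ = exc_info
        logged.append(f"{exc_type.__name__}: {exc_value}")


def metrics_app(environ, start_response):
    """Trivial WSGI application standing in for a metrics endpoint."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


environ = {}
setup_testing_defaults(environ)  # fills in SERVER_PROTOCOL and other required keys
handler = ConciseErrorHandler(io.BytesIO(), BrokenTlsStream(), io.StringIO(), environ)
handler.run(metrics_app)  # the write failure is logged by the handler, not raised
print(logged)
```

Because the handler class is chosen by the code that creates the server, a change of this kind would live where the server is set up rather than around the call that starts it.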

If you have indications that the above is incorrect, please let me know.

Charles1000Chen commented 8 months ago

@andy-maier I basically agree with you. Currently, the error shows up in my tests every once in a while. Could we catch Exception as well and put that change into version 1.5.0b2, so that I can test it in my environment and see whether it makes any difference?

andy-maier commented 8 months ago

@Charles1000Chen Yes, I can widen the caught exception type from IOError to Exception to make sure we catch everything, and build a new beta version.

andy-maier commented 8 months ago

Beta version 1.5.0b3 has been released, with the change to the exception handling.

The exception handling change is only there to be doubly sure that the ssl.SSLEOFError gets handled if it is raised in the call to start_http_server().

I'll leave this issue open, since it is not solved yet. Once the exception reoccurs we will investigate what else can be done.

andy-maier commented 7 months ago

I am closing this issue and will release the official 1.5.0 version today. If the problem re-occurs, please reopen this issue or open a new issue.