zhmcclient / zhmc-prometheus-exporter

A Prometheus exporter for the IBM Z HMC
Apache License 2.0
14 stars 9 forks source link

Fixed handling of HMC connection restart #337

Closed andy-maier closed 1 year ago

andy-maier commented 1 year ago

For details, see the commit message. Merging without review - see the test results below.

coveralls commented 1 year ago

Coverage Status

Coverage: 40.574% (-0.3%) from 40.829% when pulling 2bc009bbd953bba6b10c286fbb8778ca11e39d32 on andy/fix-hmc-restart into f2fdff318f43ee7fe418dd86cc9b4a2a5c6aad85 on master.

andy-maier commented 1 year ago

I am testing this version of the PR together with zhmcclient 1.8.1.

andy-maier commented 1 year ago

Test case 1:

Result: Exporter handles it successfully

Details:

Exporter up and running

Metrics retrieval triggered in browser:
2023-05-26 15:21:20+0000 DEBUG zhmcclient.hmc: Request: GET https://9.114.87.7:6794/api/services/metrics/context/1, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*', 'X-API-Session': '********'}, content: None
2023-05-26 15:21:20+0000 DEBUG zhmcclient.hmc: Respons: GET https://9.114.87.7:6794/api/services/metrics/context/1, status: 200, headers: {'Strict-Transport-Security': 'max-age=0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "img-src * data:; frame-ancestors 'self'", 'x-request-id': 'Sx44a31d0c-fba4-11ed-b767-00106f234c71.e Rx759', 'Cache-control': 'no-cache', 'Content-Type': 'application/vnd.ibm-z-zmanager-metrics;charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Date': 'Fri, 26 May 2023 15:24:18 GMT', 'Keep-Alive': 'timeout=1', 'Connection': 'keep-alive', 'Server': 'Hardware management console API web server / 2.0'}, content(19391 B):  .....

Metrics shown in browser.

HMC restart ...

HMC is up again ...

Metrics retrieval triggered through browser, resulting in these activities:

Exporter issues "Get Metrics Context", which fails due to invalid session ID:
2023-05-26 15:30:39+0000 DEBUG zhmcclient.hmc: Request: GET https://9.114.87.7:6794/api/services/metrics/context/1, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*', 'X-API-Session': '********'}, content: None
2023-05-26 15:30:40+0000 DEBUG zhmcclient.hmc: Respons: GET https://9.114.87.7:6794/api/services/metrics/context/1, status: 403, headers: {'Strict-Transport-Security': 'max-age=0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "img-src * data:; frame-ancestors 'self'", 'Cache-control': 'no-cache', 'x-api-reason-code': '5', 'Content-Type': 'application/json', 'Content-Length': '438', 'Date': 'Fri, 26 May 2023 15:33:38 GMT', 'Keep-Alive': 'timeout=1', 'Connection': 'keep-alive', 'Server': 'Hardware management console API web server / 2.0'}, content: '{"http-status": 403, "reason": 5, "request-uri": "/api/services/metrics/context/1", "request-method": "GET", "message": "Specified x-api-session header did not map to a known session.", "request-headers": {"host": "9.114.87.7:6794", "connection": "keep-alive", "content-type": "application/json", "x-api-session": "********", "accept-encoding": "gzip, deflate", "user-agent": "python-zhmcclient/1.8.1", "accept": "*/*"}}'

zhmcclient performs automatic session renewal:
2023-05-26 15:30:40+0000 DEBUG zhmcclient.hmc: Request: POST https://9.114.87.7:6794/api/sessions, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*'}, content(63 B): '{"userid": "maiera@de.ibm.com", "password": "********"}'
2023-05-26 15:30:41+0000 DEBUG zhmcclient.hmc: Respons: POST https://9.114.87.7:6794/api/sessions, status: 200, headers: {'Strict-Transport-Security': 'max-age=0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "img-src * data:; frame-ancestors 'self'", 'x-request-id': 'Sx34ce251c-fbda-11ed-88ba-00106f234c71.2 Rx0', 'Cache-control': 'no-cache', 'x-wsa-provider-duration-ms': '376', 'X-API-Session': '********', 'Content-Type': 'application/json;charset=UTF-8', 'Content-Length': '524', 'Date': 'Fri, 26 May 2023 15:33:38 GMT', 'Keep-Alive': 'timeout=1', 'Connection': 'keep-alive', 'Server': 'Hardware management console API web server / 2.0'}, content(463 B): '{"notification-topic": "maiera@de.ibm.com.1", "api-major-version": 4, "api-features": ["internal-echo", "internal-trace-ffdc", "internal-wsapi-log", "internal-wsapi-stats", "internal-get-files-from-se", "internal-console-sud", "internal-se-sud", "internal-remote-execution", "internal-supportchecker"], "api-session": "********", "session-credential": "********", "password-expires": -1, "job-notification-topic": "maiera@de.ibm.com.1job", "api-minor-version": 2}'

zhmcclient repeats the last operation (Get Metrics Context). That fails because it uses the old metric context from before HMC restart, which the HMC has reset during the restart:
2023-05-26 15:30:41+0000 DEBUG zhmcclient.hmc: Request: GET https://9.114.87.7:6794/api/services/metrics/context/1, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*', 'X-API-Session': '********'}, content: None
2023-05-26 15:30:41+0000 DEBUG zhmcclient.hmc: Respons: GET https://9.114.87.7:6794/api/services/metrics/context/1, status: 404, headers: {'Strict-Transport-Security': 'max-age=0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "img-src * data:; frame-ancestors 'self'", 'x-request-id': 'Sx34ce251c-fbda-11ed-88ba-00106f234c71.3 Rx0', 'Cache-control': 'no-cache', 'x-wsa-provider-duration-ms': '41', 'x-api-reason-code': '1', 'Content-Type': 'application/json;charset=UTF-8', 'Content-Length': '472', 'Date': 'Fri, 26 May 2023 15:33:39 GMT', 'Keep-Alive': 'timeout=1', 'Connection': 'keep-alive', 'Server': 'Hardware management console API web server / 2.0'}, content: '{"http-status": 404, "reason": 1, "request-uri": "/api/services/metrics/context/1", "request-method": "GET", "message": "Context specified is not found.", "request-headers": {"host": "9.114.87.7:6794", "connection": "keep-alive", "content-type": "application/json", "x-api-session": "********", "accept-encoding": "gzip, deflate", "user-agent": "python-zhmcclient/1.8.1", "accept": "*/*"}, "request-body": {}, "request-authenticated-as": "maiera@de.ibm.com"}'

Exporter creates a new metrics context to handle that:
2023-05-26 15:30:41+0000 DEBUG zhmcclient.hmc: Request: POST https://9.114.87.7:6794/api/services/metrics/context, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*', 'X-API-Session': '********'}, content(399 B): '{"anticipated-frequency-seconds": 15, "metric-groups": ["cpc-usage-overview", "logical-partition-usage", "channel-usage", "crypto-usage", "flash-memory-usage", "roce-usage", "dpm-system-usage-overview", "partition-usage", "adapter-usage", "network-physical-adapter-port", "partition-attached-network-interface", "zcpc-environmentals-and-power", "zcpc-processor-usage", "environmental-power-status"]}'
2023-05-26 15:30:41+0000 DEBUG zhmcclient.hmc: Respons: POST https://9.114.87.7:6794/api/services/metrics/context, status: 201, headers: {'Strict-Transport-Security': 'max-age=0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "img-src * data:; frame-ancestors 'self'", 'x-request-id': 'Sx34ce251c-fbda-11ed-88ba-00106f234c71.3 Rx1', 'Cache-control': 'no-cache', 'location': '/api/services/metrics/context/1', 'Content-Type': 'application/json;charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Date': 'Fri, 26 May 2023 15:33:39 GMT', 'Keep-Alive': 'timeout=1', 'Connection': 'keep-alive', 'Server': 'Hardware management console API web server / 2.0'}, content(11061 B): .......

Exporter sets up the auto-updated resources (many operations, not shown)

Exporter finally performs the requested "Get Metrics Context" again, this time successful:
2023-05-26 15:31:48+0000 DEBUG zhmcclient.hmc: Request: GET https://9.114.87.7:6794/api/services/metrics/context/1, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*', 'X-API-Session': '********'}, content: None
2023-05-26 15:31:48+0000 DEBUG zhmcclient.hmc: Respons: GET https://9.114.87.7:6794/api/services/metrics/context/1, status: 200, headers: {'Strict-Transport-Security': 'max-age=0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "img-src * data:; frame-ancestors 'self'", 'x-request-id': 'Sx34ce251c-fbda-11ed-88ba-00106f234c71.3 Rx3e', 'Cache-control': 'no-cache', 'Content-Type': 'application/vnd.ibm-z-zmanager-metrics;charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Date': 'Fri, 26 May 2023 15:34:46 GMT', 'Keep-Alive': 'timeout=1', 'Connection': 'keep-alive', 'Server': 'Hardware management console API web server / 2.0'}, content(19400 B): .......

Metrics shown in browser.
andy-maier commented 1 year ago

Test case 2:

Result: Success.

Details:

HMC restart triggered.

Metrics retrieval through browser while HMC is not available. This causes all of the following to be performed:

Repeated "Get Metric Context" requests that time out with connection errors. The repetition interval is 40 seconds (10 sec timeout + 30 sec backoff time):
2023-05-26 15:54:35+0000 DEBUG zhmcclient.hmc: Request: GET https://9.114.87.7:6794/api/services/metrics/context/1, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*', 'X-API-Session': '********'}, content: None
2023-05-26 15:55:15+0000 DEBUG zhmcclient.hmc: Request: GET https://9.114.87.7:6794/api/services/metrics/context/1, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*', 'X-API-Session': '********'}, content: None
2023-05-26 15:55:55+0000 DEBUG zhmcclient.hmc: Request: GET https://9.114.87.7:6794/api/services/metrics/context/1, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*', 'X-API-Session': '********'}, content: None

........ (one attempt every 40 seconds)

Finally, the HMC is available again and the "Get Metric Context" request gets a response, which fails with invalid session:
2023-05-26 16:01:55+0000 DEBUG zhmcclient.hmc: Request: GET https://9.114.87.7:6794/api/services/metrics/context/1, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*', 'X-API-Session': '********'}, content: None
2023-05-26 16:02:06+0000 DEBUG zhmcclient.hmc: Respons: GET https://9.114.87.7:6794/api/services/metrics/context/1, status: 403, headers: {'Strict-Transport-Security': 'max-age=0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "img-src * data:; frame-ancestors 'self'", 'Cache-control': 'no-cache', 'x-api-reason-code': '5', 'Content-Type': 'application/json', 'Content-Length': '438', 'Date': 'Fri, 26 May 2023 16:05:05 GMT', 'Keep-Alive': 'timeout=1', 'Connection': 'keep-alive', 'Server': 'Hardware management console API web server / 2.0'}, content: '{"http-status": 403, "reason": 5, "request-uri": "/api/services/metrics/context/1", "request-method": "GET", "message": "Specified x-api-session header did not map to a known session.", "request-headers": {"host": "9.114.87.7:6794", "connection": "keep-alive", "content-type": "application/json", "x-api-session": "********", "accept-encoding": "gzip, deflate", "user-agent": "python-zhmcclient/1.8.1", "accept": "*/*"}}'

zhmcclient renews the session, successfully:
2023-05-26 16:02:06+0000 DEBUG zhmcclient.hmc: Request: POST https://9.114.87.7:6794/api/sessions, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*'}, content(63 B): '{"userid": "maiera@de.ibm.com", "password": "********"}'
2023-05-26 16:02:14+0000 DEBUG zhmcclient.hmc: Respons: POST https://9.114.87.7:6794/api/sessions, status: 200, headers: {'Strict-Transport-Security': 'max-age=0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "img-src * data:; frame-ancestors 'self'", 'x-request-id': 'Sxbf8d0fc0-fbde-11ed-bf93-00106f234c71.1 Rx0', 'Cache-control': 'no-cache', 'x-wsa-provider-duration-ms': '1447', 'X-API-Session': '********', 'Content-Type': 'application/json;charset=UTF-8', 'Content-Length': '524', 'Date': 'Fri, 26 May 2023 16:05:13 GMT', 'Keep-Alive': 'timeout=1', 'Connection': 'keep-alive', 'Server': 'Hardware management console API web server / 2.0'}, content(463 B): '{"notification-topic": "maiera@de.ibm.com.1", "api-major-version": 4, "api-features": ["internal-echo", "internal-trace-ffdc", "internal-wsapi-log", "internal-wsapi-stats", "internal-get-files-from-se", "internal-console-sud", "internal-se-sud", "internal-remote-execution", "internal-supportchecker"], "api-session": "********", "session-credential": "********", "password-expires": -1, "job-notification-topic": "maiera@de.ibm.com.1job", "api-minor-version": 2}'

zhmcclient retries the last operation ("Get Metric Context") which fails because the metric context has been reset by the HMC:
2023-05-26 16:02:14+0000 DEBUG zhmcclient.hmc: Request: GET https://9.114.87.7:6794/api/services/metrics/context/1, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*', 'X-API-Session': '********'}, content: None
2023-05-26 16:02:15+0000 DEBUG zhmcclient.hmc: Respons: GET https://9.114.87.7:6794/api/services/metrics/context/1, status: 404, headers: {'Strict-Transport-Security': 'max-age=0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "img-src * data:; frame-ancestors 'self'", 'x-request-id': 'Sxbf8d0fc0-fbde-11ed-bf93-00106f234c71.2 Rx0', 'Cache-control': 'no-cache', 'x-wsa-provider-duration-ms': '27', 'x-api-reason-code': '1', 'Content-Type': 'application/json;charset=UTF-8', 'Content-Length': '472', 'Date': 'Fri, 26 May 2023 16:05:13 GMT', 'Keep-Alive': 'timeout=1', 'Connection': 'keep-alive', 'Server': 'Hardware management console API web server / 2.0'}, content: '{"http-status": 404, "reason": 1, "request-uri": "/api/services/metrics/context/1", "request-method": "GET", "message": "Context specified is not found.", "request-headers": {"host": "9.114.87.7:6794", "connection": "keep-alive", "content-type": "application/json", "x-api-session": "********", "accept-encoding": "gzip, deflate", "user-agent": "python-zhmcclient/1.8.1", "accept": "*/*"}, "request-body": {}, "request-authenticated-as": "maiera@de.ibm.com"}'

Exporter handles that by creating the metrics context again:
2023-05-26 16:02:15+0000 DEBUG zhmcclient.hmc: Request: POST https://9.114.87.7:6794/api/services/metrics/context, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*', 'X-API-Session': '********'}, content(399 B): '{"anticipated-frequency-seconds": 15, "metric-groups": ["cpc-usage-overview", "logical-partition-usage", "channel-usage", "crypto-usage", "flash-memory-usage", "roce-usage", "dpm-system-usage-overview", "partition-usage", "adapter-usage", "network-physical-adapter-port", "partition-attached-network-interface", "zcpc-environmentals-and-power", "zcpc-processor-usage", "environmental-power-status"]}'
2023-05-26 16:02:15+0000 DEBUG zhmcclient.hmc: Respons: POST https://9.114.87.7:6794/api/services/metrics/context, status: 201, headers: {'Strict-Transport-Security': 'max-age=0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "img-src * data:; frame-ancestors 'self'", 'x-request-id': 'Sxbf8d0fc0-fbde-11ed-bf93-00106f234c71.2 Rx1', 'Cache-control': 'no-cache', 'location': '/api/services/metrics/context/1', 'Content-Type': 'application/json;charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Date': 'Fri, 26 May 2023 16:05:13 GMT', 'Keep-Alive': 'timeout=1', 'Connection': 'keep-alive', 'Server': 'Hardware management console API web server / 2.0'}, content(11061 B): .......

Exporter sets up the auto-updated resources (many operations, not shown)

Exporter performs "Get Metric Context" using the new metric context, successfully:
2023-05-26 16:03:24+0000 DEBUG zhmcclient.hmc: Request: GET https://9.114.87.7:6794/api/services/metrics/context/1, headers: {'User-Agent': 'python-zhmcclient/1.8.1', 'Content-type': 'application/json', 'Accept': '*/*', 'X-API-Session': '********'}, content: None
2023-05-26 16:03:25+0000 DEBUG zhmcclient.hmc: Respons: GET https://9.114.87.7:6794/api/services/metrics/context/1, status: 200, headers: {'Strict-Transport-Security': 'max-age=0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "img-src * data:; frame-ancestors 'self'", 'x-request-id': 'Sxbf8d0fc0-fbde-11ed-bf93-00106f234c71.2 Rx3e', 'Cache-control': 'no-cache', 'Content-Type': 'application/vnd.ibm-z-zmanager-metrics;charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Date': 'Fri, 26 May 2023 16:06:23 GMT', 'Keep-Alive': 'timeout=1', 'Connection': 'keep-alive', 'Server': 'Hardware management console API web server / 2.0'}, content(19394 B): ......

Metrics shown in browser.
andy-maier commented 1 year ago

This is a good enough test to release a fix version. I'll merge this PR without review for the future 1.5.0 and roll it back into 1.4.2.