zhmcclient / zhmc-prometheus-exporter

A Prometheus exporter for the IBM Z HMC
Apache License 2.0

Errors when starting to perform the first collection #539

Closed fulwang closed 4 months ago

fulwang commented 4 months ago

Describe the bug: The exporter cannot collect any data after the container is started.

Expected behavior: Data is collected and can be queried in the browser.

To Reproduce

Start the container with the command below:

podman run -itd -v /opt/zhmcexporter:/root/myconfig -p 9291:9291 --name zhmcexporter zhmcexporter:latest -c /root/myconfig/hmccreds.yaml -v

Environment information
zhmc_prometheus_exporter version: 1.7.0.dev1
zhmcclient version: 1.17.0
Verbosity level: 1

Command output

Log file zhmcexporter.log

fulwang commented 4 months ago

I checked out version 1.5.2 and built another container to try it, but I still can't get the data collected as before. Attached is the console log from running the new container: zhmcexporter-1.5.2.log

andy-maier commented 4 months ago

@fulwang If you use version 1.5.2 of the exporter, you also need to use the metric definition file for that version. The warning in your 1.5.2 log:

/usr/local/lib/python3.9/site-packages/zhmc_prometheus_exporter/zhmc_prometheus_exporter.py:540: UserWarning: Ignoring item because its condition "'storage-group-uris' in resource_obj.properties" does not properly evaluate: NameError: name 'resource_obj' is not defined
  warnings.warn("Ignoring item because its condition {!r} does not "

is caused by using a metric definition file whose conditions reference the resource object, together with an exporter version that does not yet support that.
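To illustrate the mechanism, here is a minimal sketch of evaluating a condition string against a set of variables. It is not the exporter's actual code: `eval_condition`, the stand-in resource object, and the `hmc_version` variable are assumptions made for illustration. It shows why a condition that references `resource_obj` ends in a NameError when the evaluating side does not provide that name.

```python
import warnings
from types import SimpleNamespace


def eval_condition(condition, variables):
    """Evaluate a condition string; return None (and warn) if it cannot be evaluated.

    Hypothetical helper for illustration, not taken from the exporter.
    """
    try:
        # Only the names in 'variables' are visible to the condition.
        return bool(eval(condition, {"__builtins__": {}}, dict(variables)))
    except Exception as exc:  # for example NameError for an unknown variable
        warnings.warn(
            f"Ignoring item because its condition {condition!r} does not "
            f"properly evaluate: {type(exc).__name__}: {exc}",
            UserWarning,
        )
        return None


condition = "'storage-group-uris' in resource_obj.properties"

# Stand-in for a resource that has the property used in the condition.
lpar = SimpleNamespace(properties={"storage-group-uris": []})

# Evaluation that provides resource_obj: the condition can be evaluated.
with_resource = eval_condition(condition, {"resource_obj": lpar})

# Evaluation that does not provide resource_obj (hmc_version is an assumed
# variable name): the condition fails and the item would be ignored.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    without_resource = eval_condition(condition, {"hmc_version": "2.15.0"})
```

In this sketch, the second call is the situation from the 1.5.2 log: the metric definition file expects a variable that the evaluating code never defined.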

On your original error with 1.7.0.dev1:

There are two main errors there:

HTTPError: 503,3: Too many concurrent threads per user [GET /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4]

I have never seen this before and have started a dialogue with the Z development team on that.

HTTPError: 400,14: 'absolute-ifl-capping'' is not a valid value for the corresponding query parm [GET /api/logical-partitions/0a92d550-d75c-35d2-bc51-2bd88fc01b3f]

That is an error in the exporter code, but to track it down, an exporter log file would be very helpful.

-> Could you please run this version of the exporter again and add the following options to its command line: --log-comp all=debug --log exporter.log ?

fulwang commented 4 months ago

@andy-maier Thanks for the analysis. Could you tell me where I can get the metric definition file for version 1.5.2, and how to replace it before I rebuild the container image?

fulwang commented 4 months ago

@andy-maier To rerun v1.7.0, do I need to rebuild the container to add the log options you mentioned, or is it enough to add them to the podman command line?

fulwang commented 4 months ago

I just started a run with the options added on the command line.

[root@lpar27 ~]# podman run -itd -v /opt/zhmcexporter:/root/myconfig -p 9291:9291 --name zhmcexporter zhmcexporter:v1.7.0 -c /root/myconfig/hmccreds.yaml -v --log-comp all=debug --log exporter.log
90f58391668f89d1ded5c3d4ebbdb23bb0ffde7bdf8db7f57fb6b0294c55e334
[root@lpar27 ~]#
[root@lpar27 ~]#
[root@lpar27 ~]# podman ps -a
CONTAINER ID  IMAGE                            COMMAND               CREATED        STATUS        PORTS                     NAMES
ccdb73d06b25  localhost/grafana:v10.3.4                              4 days ago     Up 4 days     0.0.0.0:3000->3000/tcp    grafana
82ed76756548  localhost/prometheus:v2.53.0     --config.file=/et...  4 days ago     Up 4 days     0.0.0.0:9090->9090/tcp    prometheus
a219e9f8fcf6  localhost/nginx:v1.23.3          nginx -g daemon o...  4 days ago     Up 4 days     0.0.0.0:1443->1443/tcp    rproxy
8ee926934022  localhost/zhmcexporter:v1.5.2    -c /root/myconfig...  4 days ago     Up 4 days     0.0.0.0:9292->9291/tcp    zhmcexporter_1
c1087def4b23  localhost/s390x/mariadb:10.5.13  mysqld                4 hours ago    Up 4 hours    0.0.0.0:3306->3306/tcp    ecs_db
05ead85a9219  localhost/nginx:1.17.9           nginx -g daemon o...  4 hours ago    Up 4 hours    0.0.0.0:443->443/tcp      ecs_nginx
159b9ef35fd6  localhost/ecs_api:test           python app.py         4 hours ago    Up 4 hours    0.0.0.0:18443->18443/tcp  ecs_api
90f58391668f  localhost/zhmcexporter:v1.7.0    -c /root/myconfig...  8 seconds ago  Up 8 seconds  0.0.0.0:9291->9291/tcp    zhmcexporter
[root@lpar27 ~]#

andy-maier commented 4 months ago

podman passes everything on the command line after the image name through to the invoked container, so your podman command line looks good to me.
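As a rough sketch of that split, the command line from this thread can be assembled from two parts: options consumed by podman itself, and arguments that follow the image reference and are handed to the container. The helper `build_podman_run` is a hypothetical name used only for illustration; all option values are copied from the command shown above.

```python
def build_podman_run(image, podman_options, exporter_args):
    """Return the argv list for 'podman run'.

    Everything placed after the image reference is handed to the container.
    Hypothetical helper for illustration only.
    """
    return ["podman", "run", *podman_options, image, *exporter_args]


# Options interpreted by podman itself.
podman_options = [
    "-itd",
    "-v", "/opt/zhmcexporter:/root/myconfig",
    "-p", "9291:9291",
    "--name", "zhmcexporter",
]

# Arguments that the exporter inside the container receives.
exporter_args = [
    "-c", "/root/myconfig/hmccreds.yaml",
    "-v",
    "--log-comp", "all=debug",
    "--log", "exporter.log",
]

image = "zhmcexporter:v1.7.0"
argv = build_podman_run(image, podman_options, exporter_args)

# The part after the image is what is passed through to the container.
passed_through = argv[argv.index(image) + 1:]
command_line = " ".join(argv)
```

Note that `-v` appears twice with different meanings: before the image it is podman's volume option, after the image it is the exporter's verbosity option.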

andy-maier commented 4 months ago

The metric definition file for a specific exporter version can be downloaded from the repo by selecting the tag for that version. For example, for version 1.5.2, this is the repo at that version: https://github.com/zhmcclient/zhmc-prometheus-exporter/tree/1.5.2, and the sample metric file for that version is https://github.com/zhmcclient/zhmc-prometheus-exporter/blob/1.5.2/examples/metrics.yaml
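The URL pattern can be written down as a small sketch. The function names are hypothetical, and the sketch assumes that other versions are tagged with their plain version string and keep the file at examples/metrics.yaml, which is only confirmed here for 1.5.2.

```python
REPO_URL = "https://github.com/zhmcclient/zhmc-prometheus-exporter"


def repo_url_for(version):
    """Return the URL of the repo as of the tag for the given version.

    Assumes the tag name equals the version string (confirmed above for 1.5.2).
    """
    return f"{REPO_URL}/tree/{version}"


def metrics_file_url_for(version):
    """Return the URL of the sample metric definition file for the given version.

    Assumes the file lives at examples/metrics.yaml in that version.
    """
    return f"{REPO_URL}/blob/{version}/examples/metrics.yaml"


repo_url = repo_url_for("1.5.2")
metrics_url = metrics_file_url_for("1.5.2")
```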

I don't know how you build your container image, and whether you have the metric definition file in the image (vs. mounting its directory). If you have it in the image (which I think is the case given your podman command line), then you need to rebuild your image, and then you probably already have a COPY directive in the Dockerfile that pulls it in from the local directory.

fulwang commented 4 months ago

@andy-maier I chose to have the metric definition file in the image, so I saved the metrics.yaml of version 1.5.2 to the /root directory, rebuilt the image as shown below, and ran it in the testing environment again, but no luck so far. Log file: zhmcexporter_1.log


cd /root
git clone https://github.com/zhmcclient/zhmc-prometheus-exporter
cd zhmc-prometheus-exporter/
git checkout 1.5.2
rm -fr examples/metrics.yaml
cp /root/metrics.yaml examples/
make docker

fulwang commented 4 months ago

@andy-maier Could this be caused by something wrong on the HMC side? The physical server was shut down for several days due to a malfunction of the cooling system and was powered on again last week. I see many errors, including "HTTPError: 409,272: Unable to obtain STP configuration data, rc=[0x1000] [GET /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4]".

I have checked the user for HMC access, and its "Web Services API" option is still ticked as before.

andy-maier commented 4 months ago

@fulwang The "Unable to obtain STP configuration data" errors are not severe; they only cause the "cpc" label not to be added to metrics for some types of resources.

Having said that, I suggest configuring STP on that HMC so that this error goes away.

Let's walk through the errors in the zhmcexporter_1.log file you attached above:

andy-maier commented 4 months ago

@fulwang On your Docker build:

If you use the "make docker" command, then it uses the Dockerfile in the repo. That Dockerfile gets the metrics.yaml file from examples/metrics.yaml.

Your commands shown above first check out version 1.5.2, and then replace the examples/metrics.yaml file with /root/metrics.yaml. That step is not necessary, because when you check out version 1.5.2, the examples/metrics.yaml file already has the correct version for 1.5.2. Depending on the version of /root/metrics.yaml, that might have introduced the version mismatch.

So your commands should be (after removing /root/zhmc-prometheus-exporter):

cd /root
git clone https://github.com/zhmcclient/zhmc-prometheus-exporter
cd zhmc-prometheus-exporter/
git checkout 1.5.2
make docker

andy-maier commented 4 months ago

@fulwang The messages have been improved in commit 511f7fe1dd and in PR #559 (merged).

You may want to try out the latest version from the master branch (including its matching metrics.yaml file) to see if there are any issues remaining. I'll keep this issue open for a while.

fulwang commented 4 months ago

@andy-maier I built with the latest code and ran it in the testing env a moment ago; here is the log for your review: zhmcexporter_new.log

fulwang commented 4 months ago

@andy-maier Regarding your note above on the Docker build: I realized this later and built the container image yesterday from the source code (tar.gz downloaded from your repo).

fulwang commented 4 months ago

@andy-maier How can we customize metrics.yaml to exclude CPC BZ17 from data collection? We just need to ignore it.


Enabling auto-update for CPC BZ17
Ignoring resource-based metrics for CPC BZ17, because enabling auto-update for it failed with ConnectionError: HTTPSConnectionPool(host='172.16.27.231', port=6794): Max retries exceeded with url: /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='172.16.27.231', port=6794): Read timed out. (read timeout=300)")), reason: HTTPSConnectionPool(host='172.16.27.231', port=6794): Read timed out. (read timeout=300)
Enabling auto-update for CPC BZ12
Enabling auto-update for CPC BZ09
Enabling auto-update for CPC BZ15
Enabling auto-update for CPC BZ16

andy-maier commented 4 months ago

@fulwang Excluding the metrics for specific CPCs is not possible at the moment. There is an issue #323 open for that, targeted for the upcoming 2.0 version.

andy-maier commented 4 months ago

I created issue #564 for the one traceback error in the new log file.

Update: PR #564 solved that issue and has been merged for the upcoming version 1.7.0.

andy-maier commented 4 months ago

I think we should release version 1.7.0 now - the remaining two issues (STP config, and too many threads) cannot be solved by the exporter.

To avoid the "too many threads" error, I suggest disabling the following metric groups in the metric definition file (set fetch: false):

If that causes the error to go away, you can gradually enable the metric groups again, starting from the top of the list.
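As a minimal sketch of that procedure, the following works on a plain dictionary shaped like a loaded metric definition file. The layout (a top-level "metric_groups" mapping whose entries carry a "fetch" flag) is an assumption here, and "group-a", "group-b" and "group-c" are placeholder names, not real metric groups; substitute the groups from the list above. The helper names are hypothetical as well.

```python
import copy


def set_fetch(metric_defs, group_names, fetch):
    """Return a copy of metric_defs with 'fetch' set for the named metric groups."""
    result = copy.deepcopy(metric_defs)
    for name in group_names:
        result["metric_groups"][name]["fetch"] = fetch
    return result


def reenable_steps(metric_defs, disabled_groups):
    """Yield (name, definitions) pairs that re-enable the groups one at a time.

    Groups are re-enabled in the order given, i.e. starting from the top of the list.
    """
    current = set_fetch(metric_defs, disabled_groups, False)
    for name in disabled_groups:
        current = set_fetch(current, [name], True)
        yield name, current


# Placeholder definitions; the structure and the group names are assumptions.
metric_defs = {
    "metric_groups": {
        "group-a": {"fetch": True},
        "group-b": {"fetch": True},
        "group-c": {"fetch": True},
    }
}

suspects = ["group-b", "group-c"]
all_disabled = set_fetch(metric_defs, suspects, False)
steps = list(reenable_steps(metric_defs, suspects))
```

Each step would be written back to the metric definition file and tested on its own, so that the first step at which the error returns points to the responsible metric group.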

andy-maier commented 4 months ago

@fulwang The "too many threads" error happens when the HMC user has more than 25 requests open at the WS-API that are being processed (i.e. request sent, but not yet complete). I think that also applies to asynchronous operations whose jobs are not yet complete.

The exporter can have a maximum of 2 concurrent HMC requests open (one from the main thread and one from a background fetch thread; both wait for their operation to complete before starting the next one).
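For other tools that share the same HMC userid, one way to stay well below such a per-user limit is to cap the number of requests in flight on the client side. The sketch below is not how the exporter works and does not talk to an HMC; `InFlightLimiter`, `fake_request` and `run_workers` are hypothetical names, and the request is simulated with a short sleep.

```python
import threading
import time


class InFlightLimiter:
    """Allow at most 'limit' requests in flight, and record the peak seen."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak = 0

    def __enter__(self):
        self._slots.acquire()  # blocks while 'limit' requests are already open
        with self._lock:
            self._in_flight += 1
            self.peak = max(self.peak, self._in_flight)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self._in_flight -= 1
        self._slots.release()
        return False


def fake_request(limiter, duration=0.01):
    """Stand-in for an HMC request; it only sleeps while holding a slot."""
    with limiter:
        time.sleep(duration)


def run_workers(limiter, workers, requests_per_worker):
    """Run several worker threads through the limiter and return the peak in flight."""

    def work():
        for _ in range(requests_per_worker):
            fake_request(limiter)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return limiter.peak


HMC_LIMIT = 25  # per-user limit mentioned above
peak = run_workers(InFlightLimiter(limit=2), workers=8, requests_per_worker=3)
```

Even with 8 worker threads, the recorded peak never exceeds the configured cap of 2, which matches the exporter's stated maximum and is far below the limit of 25.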

Are you using the HMC userid for other tasks that run at the same time?

Could you please post a log file (with --log-comp all=debug --log exporter.log) so I can see the interactions with the HMC?

fulwang commented 4 months ago

@andy-maier I have built an image with your latest code, and it is now running in the testing env for debugging purposes. Please let me know when to send you the logs or any other information you need.

andy-maier commented 4 months ago

@fulwang So you currently do not experience the "too many threads" error anymore? If so, I don't need any additional logs, and will release version 1.7.0.

andy-maier commented 4 months ago

FuLong confirmed that the "too many threads" error did not show up anymore. I am closing this ticket now. Please open a new one if there are other issues.