xcp-ng / xcp-ng-xapi-plugins

XCP-ng's specific XAPI plugins
GNU Affero General Public License v3.0
7 stars 9 forks source link

Update smartctl.py #46

Open 1234Erwan opened 2 months ago

1234Erwan commented 2 months ago

Try to add hard raid support for smartctl.py

olivierlambert commented 2 months ago

Hey thanks a lot for your contribution! We'll review your PR ASAP :+1:

gthvn1 commented 2 months ago

I thought about a solution like this instead of managing an index.

--- a/SOURCES/etc/xapi.d/plugins/smartctl.py
+++ b/SOURCES/etc/xapi.d/plugins/smartctl.py
@@ -14,36 +14,31 @@ def _list_disks():
     disks = []
     result = run_command(['smartctl', '--scan'])
     for line in result['stdout'].splitlines():
-        disks.append(line.split()[2])
-        disks.append(line.split()[0])
+        disks.append({'name': line.split()[0], 'type': line.split()[2]})
     return disks

 @error_wrapped
 def get_information(session, args):
     results = {}
-    i = 0
     with OperationLocker():
         disks = _list_disks()
         for disk in disks:
-            cmd = run_command(["smartctl", "-j", "-a", "-d", disks[i], disks[i+1]], check=False)
-            results[disk] = json.loads(cmd['stdout'])
-            i = i + 2
+            cmd = run_command(["smartctl", "-j", "-a", "-d", disk['type'], disk['name']], check=False)
+            results[disk['name']] = json.loads(cmd['stdout'])
         return json.dumps(results)

 @error_wrapped
 def get_health(session, args):
     results = {}
-    i = 0
     with OperationLocker():
         disks = _list_disks()
         for disk in disks:
-            cmd = run_command(["smartctl", "-j", "-H", "-d", disks[i], disks[i+1]])
+            cmd = run_command(["smartctl", "-j", "-H", "-d", disk['type'], disk['name']])
             json_output = json.loads(cmd['stdout'])
             if json_output['smart_status']['passed']:
-                results[disk] = "PASSED"
+                results[disk['name']] = "PASSED"
             else:
-                results[disk] = "FAILED"
-            i = i + 2
+                results[disk['name']] = "FAILED"
         return json.dumps(results)
stormi commented 2 weeks ago

How do we move on on this PR?

gthvn1 commented 2 weeks ago

@

How do we move on on this PR?

I think that managing index is error prone and IMHO using the solution I posted looks better but I don't have a strong opinion so if it is ok for you we can merge it and fix it later. @1234Erwan what do you think?

gthvn1 commented 1 week ago

Also I just tested the patch on a physical machine that has a megaraid and it fails:

', stderr: '', Traceback (most recent call last):
  File "/etc/xapi.d/plugins/xcpngutils/__init__.py", line 127, in wrapper
    return func(*args, **kwds)
  File "/etc/xapi.d/plugins/smartctl.py", line 40, in get_health
    cmd = run_command(["smartctl", "-j", "-H", "-d", disks[i], disks[i+1]])
  File "/etc/xapi.d/plugins/xcpngutils/__init__.py", line 79, in run_command
    raise ProcessException(code, command, stdout, stderr)
ProcessException: Command '['smartctl', '-j', '-H', '-d', 'megaraid,0', '/dev/bus/0']' failed with code: 4

In fact the issue is because the command smartctl -j -H -d megaraid,0 /dev/bus/0 returns an output:

[10:01 r620-x1 ~]# smartctl -j -H -d megaraid,0 /dev/bus/0
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      0
    ],
    "svn_revision": "4883",
    "platform_info": "x86_64-linux-4.19.0+1",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-j",
      "-H",
      "-d",
      "megaraid,0",
      "/dev/bus/0"
    ],
    "messages": [
      {
        "string": "Warning: This result is based on an Attribute check.",
        "severity": "warning"
      }
    ],
    "exit_status": 4
  },
  "device": {
    "name": "/dev/bus/0",
    "info_name": "/dev/bus/0 [megaraid_disk_00] [SAT]",
    "type": "sat+megaraid,0",
    "protocol": "ATA"
  },
  "smart_status": {
    "passed": true
  }
}

But the return code is not 0... it is 4. So we need to understand if the error is "normal" and if yes fix it to be able to use megaraid. Note: this error happens only with health function, the information function works well. It is probably why you didn't notice it @1234Erwan . @1234Erwan I will take some time and I will have a look to how to fix it.

gthvn1 commented 1 week ago

In fact the exit code status means : SMART status check returned "DISK FAILING". So I think that we need to add a better check of exit status because return values of smartctl are defined by a bitmask so we can exctract usefull information for the user.

1234Erwan commented 1 week ago

my python skills are quite limited, what is sure is that even if my script is not clean at all it works on my hard raid

gthvn1 commented 1 week ago

my python skills are quite limited, what is sure is that even if my script is not clean at all it works on my hard raid

When you say it works you mean also the xe host-call-plugin host-uuid=<uuid> plugin=smartctl.py fn=health right? No worries for the skills it already help us :+1: I will try to understand why I have the issue and we will update your PR to pass the checks. Thanks for the answer.

1234Erwan commented 1 week ago

my python skills are quite limited, what is sure is that even if my script is not clean at all it works on my hard raid

When you say it works you mean also the xe host-call-plugin host-uuid=<uuid> plugin=smartctl.py fn=health right? No worries for the skills it already help us ๐Ÿ‘ I will try to understand why I have the issue and we will update your PR to pass the checks. Thanks for the answer.

like 2 month ago i have replaced the original plugin in my xcp-ng with my last experiment and it's working on my hard RAID. Capture d'รฉcran 2024-10-25 112215

1234Erwan commented 1 week ago

@gthvn1 Thanks for all your suggestion and for your help, i'v just implemented your idea. It works for me I'm waiting to know if it works for you too.

1234Erwan commented 1 week ago

It's done, you've finnaly wrote the entierly of the code yourself ๐Ÿ˜….

gthvn1 commented 1 week ago

Your code helps us to think about the issue and even if it changed a little bit it helped us a lot :+1: Also you confirm that your last commit is working well on your host right?

gthvn1 commented 1 week ago

And for information I have created a branch gtn-pr-46 that has your current commit. The reason is because as it is a local branch it triggers more check (like pytest and pycodestyle) than yours. So don't worry I will use this new branch and fix the tests but your commits will be in. I will probably squash them into one commit but it will be there and your contribution will remain :)

1234Erwan commented 1 week ago

Fine, do as you please. I don't care much for my meager contribution. I much prefer to know that smartctl will work for everyone.

1234Erwan commented 1 week ago

Your code helps us to think about the issue and even if it changed a little bit it helped us a lot ๐Ÿ‘ Also you confirm that your last commit is working well on your host right?

No it didn't working, as you've said earlier the exit status is not 0.