xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

[xCAT probe] Wrong disk space calculation in the sub-command xcatmn #1528

Closed neo954 closed 7 years ago

neo954 commented 8 years ago

xcatprobe reports that /var has enough free disk space even though it does not.

# df -h /var
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/vg0-var  3.9G  3.1G  628M  84% /var

# xcatprobe xcatmn -n eth0
NIC eth0 exists on current server                                         [ OK ]
Get ip address of NIC eth0                                                [ OK ]
Sub process 'xcatd: SSL listener' is running                              [ OK ]
Sub process 'xcatd: DB Access' is running                                 [ OK ]
Sub process 'xcatd: UDP listener' is running                              [ OK ]
Sub process 'xcatd: install monitor' is running                           [ OK ]
Sub process 'xcatd: Discovery worker' is running                          [ OK ]
Sub process 'xcatd: Command log writer' is running                        [ OK ]
xcatd is listening on port 3001                                           [ OK ]
xcatd is listening on port 3002                                           [ OK ]
'lsxcatd -a' works                                                        [ OK ]
The value of 'master' in 'site' table is a IP address                     [ OK ]
The IP 10.2.3.27 of eth0 equals the value of 'master' in 'site' table     [ OK ]
IP 10.2.3.27 of NIC eth0 is a static IP on current server                 [ OK ]
10.2.3.27 belongs to one of networks defined in 'networks' table          [ OK ]
There is domain definition in 'site' table                                [ OK ]
There is configuration in 'passwd' table for 'system' for node provision  [ OK ]
There is /install directory on current server                             [ OK ]
There is /tftpboot directory on current server                            [ OK ]
The free space of / directory is more than 10 G                           [ OK ]
The free space of /var directory is more than 1 G                         [ OK ]
The free space of /tmp directory is more than 1 G                         [ OK ]
The free space of /install is less than 10 G                              [WARN]
SELinux is disabled on current server                                     [ OK ]
Firewall is closed on current server                                      [ OK ]
HTTP service is ready on 10.2.3.27                                        [ OK ]
TFTP service is ready on 10.2.3.27                                        [ OK ]
DNS server is ready on 10.2.3.27                                          [ OK ]
The size of /var/lib/dhcpd/dhcpd.leases is less than 100M                 [ OK ]
DHCP service is ready on 10.2.3.27                                        [ OK ]

neo954 commented 8 years ago

$expected  = 1;
$msg       = "The free space of /var directory is more than $expected G";
$diskspace = `df -h|awk '{print \$4,\$6}'|grep -E "/var\$"`;
if (!$?) {
    chomp($diskspace);
    my ($size, $dir) = split(" ", $diskspace);
    $size =~ s/G//g;
    probe_utils->send_msg("$output", "d", "The free space of /var is $size G") if ($verbose);
    if ($size < $expected) {
        probe_utils->send_msg("$output", "w", "The free space of /var is less than $expected G");
    } else {
        probe_utils->send_msg("$output", "o", "$msg");
    }
}

The free space is calculated by running df -h, which is wrong. With the -h option, df changes the unit according to the magnitude of the number in order to make the output "human readable". It shows 628M in the example above, or something like 1.2T when there is a large amount of disk space. The code strips only the G suffix, so 628M is compared numerically as 628 against the expected 1 and the check passes.

The h in df -h stands for "human readable"; see man df. Since xcatprobe is a program rather than a human, parsing the "human readable" output here is inappropriate.
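
A minimal sketch of a machine-readable alternative, written in shell since the probe shells out to df anyway. It assumes a POSIX df, where -P selects the portable one-line-per-file-system format and -k reports sizes in 1024-byte blocks, so the number never carries a unit suffix. The helper names avail_kib and meets_gib are hypothetical and are not taken from xcatprobe.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of xcatprobe.

# Read `df -P -k <path>` output on stdin and print the "Available"
# column (4th field) of the data line, in KiB.
# Assumes the file system name itself contains no whitespace.
avail_kib() {
    awk 'NR == 2 { print $4 }'
}

# Succeed if $1 (free space in KiB) is at least $2 GiB.
meets_gib() {
    [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}
```

With these, `free=$(df -P -k /var | avail_kib)` followed by `meets_gib "$free" 1` compares plain integers in a fixed unit, so 628 MiB of free space (643072 KiB) fails a 1 GiB threshold instead of passing it.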

neo954 commented 8 years ago

Also, when the /var directory is not on a separate file system, grep finds no matching mount point and this check is not performed at all, which is a pity :'(
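
One way to avoid the skipped check, sketched under the assumption of a POSIX df: pass the directory as an operand, and df reports on the file system that contains it, whether or not the directory is a mount point. The helper name free_kib_at is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of xcatprobe.
# Print the free space, in KiB, of the file system that contains path $1.
# With a path operand, df reports on the containing file system, so this
# works whether or not $1 is itself a mount point.
free_kib_at() {
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}
```

Called as `free_kib_at /var`, this yields a number even when /var is just a directory on the root file system.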

neo954 commented 8 years ago

When another file system is mounted on a directory whose name ends in /var, say /srv/chroot/ia32/var, this piece of code also reports an improper result, because it selects the line with grep -E "/var$", which matches any mount point ending in /var.
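
If the mount table has to be filtered at all, the mount-point field can be compared for exact equality instead of with a suffix pattern. The following is a sketch only; avail_kib_for_mount is a hypothetical name, and the input is assumed to be `df -P -k` output with mount points that contain no whitespace.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of xcatprobe.
# Read `df -P -k` output on stdin and print the "Available" column (KiB)
# of the line whose mount point (6th field) is exactly $1.
# Exits non-zero when no file system is mounted exactly at $1.
avail_kib_for_mount() {
    awk -v mp="$1" '
        NR > 1 && $6 == mp { print $4; found = 1 }
        END { exit !found }
    '
}
```

For example, `df -P -k | avail_kib_for_mount /var` ignores a file system mounted on /srv/chroot/ia32/var, and its exit status tells the caller when /var is not a mount point.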

hu-weihua commented 7 years ago

This bug is fixed by pull request #1701. @neo954, could you help verify it? Thanks.

neo954 commented 7 years ago

Please read the last comment I wrote. That problem is still there.

neo954 commented 7 years ago

Okay, I will consider this bug properly fixed and close this issue.

neo954 commented 7 years ago

Sorry, while reading the patch I found one problem that is still unfixed.

Consider the case where /tmp or /install is neither a separate file system nor part of the root file system, but a symbolic link pointing to a directory on a separate file system other than the root file system. In this case the code logic is wrong, and the free space is not calculated correctly.

For example, /install is a symbolic link pointing to /media/cnfs/install, while /media/cnfs is an NFS-mounted file system backed by GPFS. This kind of configuration is quite common in large HPC clusters.
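
A rough sketch of how the symbolic link case could be handled: resolve the link first, then ask df about the resolved path, so the answer describes the file system that actually holds the data. This assumes `readlink -f` is available (GNU coreutils provides it). The helper names resolve_path and mount_point_of are hypothetical, and this is not necessarily how the xCAT patch approaches it.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of xcatprobe.

# Resolve symbolic links in $1 to a canonical absolute path.
# Assumes `readlink -f` is available (GNU coreutils provides it).
resolve_path() {
    readlink -f "$1"
}

# Print the mount point of the file system that actually holds $1.
# Assumes the mount point contains no whitespace.
mount_point_of() {
    df -P -k "$(resolve_path "$1")" | awk 'NR == 2 { print $6 }'
}
```

Under the configuration described above, `mount_point_of /install` would be expected to print /media/cnfs rather than /.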

hu-weihua commented 7 years ago

@neo954, I have fixed this issue in PR #2071. Could you verify it? Thanks.

neo954 commented 7 years ago

@hu-weihua,

Cool. I will verify this issue ASAP after a new daily build is available.

neo954 commented 7 years ago

This issue was fixed and verified with the 2.12.4 RC build. I will close this one.