xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

xdsh and xdcp use 127.0.1.1 instead of remote node #6426

Open MasterGroosha opened 5 years ago

MasterGroosha commented 5 years ago

Hello, I hope this is the right place to report an issue with xCAT v2.14.6

In my setup, I'm trying to execute a command from my management node on a remote node as a non-root user. On the management node I run xCAT commands as root. On the target node (let's call it mynode) I have a user named test, and the ssh config for root on the management node looks like this:

Host mynode
  HostName 10.1.1.1
  User test
  IdentityFile /root/keys/mynode.privkey

Issuing the command xdsh mynode -l test "ping 10.1.2.2 -c 2" works fine and output is returned. After that, I put the commands ping 10.1.1.1 -c 2 and cat /etc/hostname into a file named "test_ssh.txt" and tried to execute it as: xdsh mynode -l test -e test_ssh.txt

This time I get a very weird error:

Error: [dev]: Error from pping                                                                                      
127.0.1.1: Permission denied, please try again.                                                                     

127.0.1.1: Permission denied, please try again.                                                                     

127.0.1.1: root@127.0.1.1: Permission denied (publickey,password).                                                  

127.0.1.1: Permission denied, please try again.                                                                     

127.0.1.1: Permission denied, please try again.                                                                     

127.0.1.1: root@127.0.1.1: Permission denied (publickey,password).                                                  

127.0.1.1: rsync: connection unexpectedly closed (0 bytes received so far) [sender]                                 
127.0.1.1: rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.2]                                    
127.0.1.1: Permission denied, please try again.                                                                     

127.0.1.1: Permission denied, please try again.                                                                     

127.0.1.1: root@127.0.1.1: Permission denied (publickey,password).                                                  

127.0.1.1: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
127.0.1.1: rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.2]

The following servicenodes: 127.0.1.1, have errors and cannot be updated
 Until the error is fixed, xdsh -e  will not work to nodes serviced by these service nodes. Run xdsh <servicenode,...> -c ,  to clean up the xdcp servicenode directory, and run the command again.

Why 127.0.1.1 when I explicitly named the exact node?! I tried xdcp instead of xdsh. Same problem. Maybe the issue is with rsync? I created a file "dummy.txt" and tried copying it: rsync -zvh dummy.txt test@mynode:/home/test/ And this command works like a charm, which leaves me believing that something is wrong with xCAT.

cxhong commented 5 years ago

What's your node definition for mynode? Do you have the servicenode attribute defined? Or can you try running nslookup mynode and nslookup <management node>?

MasterGroosha commented 5 years ago

mynode was added as mkdef -t node mynode groups=all mgt=ipmi cons=ipmi ip=10.1.1.1 bmc=10.0.1.1 bmcusername=user bmcpassword=password installnic=mac primarynic=mac mac=00:00:00:00:00:00 (I replaced sensitive parts)

root@server:/# nslookup mynode
Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
Name:   mynode.xxx
Address: 10.1.1.1

do u have servicenode attribute defined?

I guess no.

cxhong commented 5 years ago

What's in /etc/resolv.conf? Also run hostname on your xCAT management node and nslookup for that management node. Or maybe just run xcatprobe xcatmn -i <interface name>

MasterGroosha commented 5 years ago

/etc/resolv.conf:

nameserver 127.0.0.53
options edns0
search (2 private DNS servers in our intranet)

hostname and nslookup for the management node both return valid answers. How is that relevant? I can even rsync and SSH to the remote server. However, running xdsh with the -e option returns an error, while passing the command itself (not from a file) runs fine.

As a workaround, I can use: cat test_ssh.txt | xargs xdsh mynode -l test

But this is not what I expect from xCAT.

cxhong commented 5 years ago

Does your nslookup come back with the 127.0.1.1 IP address?
I can't recreate the issue. Here is the result when I ran the xdsh command:

[root@f6u13k13 ~]# xdsh  f6u13k14 "cat /etc/hostname"
f6u13k14: localhost.localdomain
[root@f6u13k13 ~]# xdsh f6u13k14 -l cxhong -e /tmp/test_ssh.txt
f6u13k14: PING 10.6.13.13 (10.6.13.13) 56(84) bytes of data.
f6u13k14: 64 bytes from 10.6.13.13: icmp_seq=1 ttl=64 time=0.177 ms
f6u13k14: 64 bytes from 10.6.13.13: icmp_seq=2 ttl=64 time=0.182 ms
f6u13k14:
f6u13k14: --- 10.6.13.13 ping statistics ---
f6u13k14: 2 packets transmitted, 2 received, 0% packet loss, time 9ms
f6u13k14: rtt min/avg/max/mdev = 0.177/0.179/0.182/0.013 ms
f6u13k14: localhost.localdomain

MasterGroosha commented 5 years ago

@cxhong could you please put that command to file and pass it as "-e" argument? Something like

echo "cat /etc/hostname" > commands.txt
xdsh f6u13k14 -e commands.txt

Sorry, English is not my native language. Passing commands explicitly in double quotes works fine for me too. But passing a file with commands (-e argument) causes errors described above. Btw I also found some strange things with how xCAT parses node names, will write a comment later

cxhong commented 5 years ago

Yes, I used the same file as yours:

[root@f6u13k13 ~]# cat /tmp/test_ssh.txt
ping 10.6.13.13 -c 2
cat /etc/hostname

MasterGroosha commented 5 years ago

@cxhong That's strange. I mean, xdsh mynode -l test "ping 10.1.2.2 -c 2" works fine while xdsh mynode -l test -e test_ssh.txt doesn't.

But there's more interesting stuff. There's a DNS server in our local network. And my server mynode has DNS name mynode.xxx (I can just ping mynode and get reply from 10.1.1.1 (mynode.xxx)).

As I said previously, I added mynode to the xCAT table as:

mkdef -t node mynode groups=all mgt=ipmi cons=ipmi ip=10.1.1.1 bmc=10.0.1.1 bmcusername=user bmcpassword=password installnic=mac primarynic=mac mac=00:00:00:00:00:00

Then I also added the same server under another name:

mkdef -t node m1 groups=all mgt=ipmi cons=ipmi ip=10.1.1.1 bmc=10.0.1.1 bmcusername=user bmcpassword=password installnic=mac primarynic=mac mac=00:00:00:00:00:00

This is where the strangest thing appears. Both commands, xdsh mynode -l test -e test_ssh.txt and xdsh m1 -l test -e test_ssh.txt, return the long error from my first post (even when mynode/m1 is offline!). However, when I try these commands: xdsh mynode -l test "ping 10.1.3.1 -c 2" and xdsh m1 -l test "ping 10.1.3.1 -c 2"

The first one (with mynode as nodename) works fine, while the second one (with m1 as nodename) returns error: [mainserver]: m1: ssh: Could not resolve hostname m1: Temporary failure in name resolution

And this is what bothers me. According to the documentation the word after "xdsh" must be nodename which was previously defined in xCAT internal table with mkdef command. And since mynode and m1 are basically the same (I used exactly the same definition for both entries), errors/successes should be the same as well! Now it turns out that in some situations nodename is expected to be, well, node name from xCAT table, but in other situations xCAT expects valid DNS entries (which in my case are valid too!).

I hope I managed to describe this as simply as possible.

cxhong commented 5 years ago

The -e option of the xdsh command takes a different code path: it checks whether hierarchy is in use and looks up the node's service node or management node, so that the executable file can be put in the correct directory. That lookup needs DNS to resolve the hostname/IP. For the ping test, it just runs the command as a remote shell command on the target nodes. The IP address 127.0.1.1 must come from m1, because DNS can't resolve it.

Can you use rmdef m1 to remove the m1 node definition, then run the xdsh command again?
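
Since the -e path depends on name resolution, one way to narrow this down is to check what the node name and the management node's own hostname resolve to. A minimal sketch, assuming getent is available on the management node; describe and lookup are hypothetical helper names, not xCAT commands:

```shell
# Describe a lookup result; the address is empty when the name did not resolve.
describe() {
    name=$1
    addr=$2
    case "$addr" in
        "")    echo "$name: does not resolve" ;;
        127.*) echo "$name: resolves to loopback address $addr" ;;
        *)     echo "$name: resolves to $addr" ;;
    esac
}

# Look a name up through the system resolver and print the first address.
# Assumption: getent exists (glibc-based systems).
lookup() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# Usage on the management node (not executed here):
#   describe mynode "$(lookup mynode)"
#   describe "$(hostname)" "$(lookup "$(hostname)")"
```

A loopback answer for the management node's hostname would be worth following up, since that is the address appearing in the errors.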

MasterGroosha commented 5 years ago

@cxhong I added m1 node after first encountered problems with xCAT. Anyway, I removed m1 (via rmdef) and nothing changed. I still have that long error regarding 127.0.1.1. However, yesterday I tried replacing mynode with mynode.xxx (DNS name) and now tried it again:

xdsh mynode.xxx -l test -e test_ssh.txt

Output: Error: Invalid nodes and/or groups in noderange: mynode.xxx

WTF? When I enter mynode as node name, it tries to resolve it as DNS name and throws an error. When I enter mynode.xxx as DNS name it throws an error because there's no such node name. Weird.

P.S. Just to clarify: dig mynode.xxx resolves to actual IP address, so my DNS server is working fine

cxhong commented 5 years ago

xCAT always uses the short name as the node name. I don't know how you replaced mynode with mynode.xxx; I don't think xCAT will allow you to change it if you use xCAT commands. Can you show me the output of:

tabdump site | grep master
tabdump site | grep nameservers
tabdump site | grep domain
cat /etc/resolv.conf
grep mynode /etc/hosts
xcatprobe xcatmn -i <interface name>

Did you run makedns for your system?
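
On the short-name point above: the short name is conventionally the part of the fully qualified name before the first dot. A small sketch of that mapping; short_name is a hypothetical helper, not part of xCAT:

```shell
# Strip everything from the first dot onward, leaving the short host name.
short_name() {
    printf '%s\n' "${1%%.*}"
}

short_name mynode.xxx   # prints: mynode
```

So a noderange would use mynode, while mynode.xxx is what DNS tools such as dig are given.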

MasterGroosha commented 5 years ago

root@mainserver:~# tabdump site | grep master
"master","127.0.1.1",,

root@mainserver:~# tabdump site | grep nameservers
"nameservers","127.0.1.1",,

root@mainserver:~# tabdump site | grep domain
(empty)

root@mainserver:~# cat /etc/resolv.conf
nameserver 127.0.0.53
options edns0
search xxx <-- my DNS server here, located on another machine

root@mainserver:~# grep mynode /etc/hosts
(empty, since I'm using DNS server)

root@mainserver:~# xcatprobe xcatmn -i enp193s0f1
[mn]: Checking all xCAT daemons are running...                                  [ OK ]
[mn]: Checking xcatd can receive command request...                             [ OK ]
[mn]: Checking 'site' table is configured...                                    [FAIL]
[mn]: There isn't 'domain' definition in 'site' table
=================================== SUMMARY ====================================
[MN]: Checking on MN...                                                         [FAIL]
    Checking 'site' table is configured...                                      [FAIL]
        There isn't 'domain' definition in 'site' table

did u run makedns for you system?

Can't say for sure now (it was 1.5 months ago when I installed xCAT), but executing it now results in error: Error: [mainserver]: domain not defined in site table

However, I don't see anything about the "site" table in either the "Prepare management node" documentation or "Quick Start".
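
The tabdump output above shows master and nameservers both set to 127.0.1.1, the same address that appears in the xdsh errors. A hedged sketch of pulling a value out of tabdump-style output and flagging a loopback address; site_value is a hypothetical helper name, and the sample line is copied from the output above rather than read from a live system:

```shell
# Print the value column for a given key from `tabdump site`-style CSV on stdin.
# Lines look like: "master","127.0.1.1",,
site_value() {
    awk -F'"' -v key="$1" '$2 == key { print $4 }'
}

# On a management node one might run:  tabdump site | site_value master
# Here the sample line from the thread is used instead.
value=$(printf '%s\n' '"master","127.0.1.1",,' | site_value master)

case "$value" in
    127.*) echo "site.master is a loopback address: $value" ;;
    *)     echo "site.master: $value" ;;
esac
```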

cxhong commented 5 years ago

Here is one of the docs that covers the site table; we will try to add it to the Prepare management node or Quick Start sections: https://xcat-docs.readthedocs.io/en/stable/guides/admin-guides/manage_clusters/ppc64le/configure/site.html?highlight=site

The master attribute in the site table should be the IP address of enp193s0f1 (I don't think that is 127.0.1.1, right?). nameservers should be the same as master, and domain should be xxx, the same as the search string in /etc/resolv.conf. Run makedns after you make those changes, and then run the xcatprobe command again.
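
A minimal sketch of the steps above, not a definitive procedure. It assumes the site table is edited with chdef against the clustersite object; MN_IP is a placeholder, since the thread never states the address of enp193s0f1. The run wrapper is a hypothetical helper that only prints the commands unless APPLY=1 is set:

```shell
# Placeholder values: replace MN_IP with the real address of the xCAT interface.
MN_IP="${MN_IP:-192.0.2.10}"    # hypothetical; NOT taken from the thread
DOMAIN="${DOMAIN:-xxx}"         # the search domain from /etc/resolv.conf
IFACE="${IFACE:-enp193s0f1}"    # interface name from the thread

# Dry run by default: print each command instead of executing it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

fix_site_table() {
    # Point master and nameservers at the management node's real address
    # and define the missing domain attribute.
    run chdef -t site clustersite master="$MN_IP" nameservers="$MN_IP" domain="$DOMAIN"
    # Regenerate DNS data, then re-check the management node.
    run makedns
    run xcatprobe xcatmn -i "$IFACE"
}

fix_site_table
```

Reviewing the printed commands before rerunning with APPLY=1 keeps the site table from being changed by accident.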

MasterGroosha commented 5 years ago

@cxhong Thank you so much for this comment! I did all the steps you mentioned here and finally it works! Though it shows "fail" since that interface (enp193s0f1) has some more IP addresses which are not used for xCAT, executing xdsh with -e works like a charm now.

Thanks again! Should I close the issue or leave it open until documentation is edited?