Open MasterGroosha opened 5 years ago
What's your node definition for mynode? Do you have the servicenode attribute defined? Or can you try running nslookup mynode and nslookup <management node>?
mynode was added as:
mkdef -t node mynode groups=all mgt=ipmi cons=ipmi ip=10.1.1.1 bmc=10.0.1.1 bmcusername=user bmcpassword=password installnic=mac primarynic=mac mac=00:00:00:00:00:00
(I replaced sensitive parts)
root@server:/# nslookup mynode
Server: 127.0.0.53
Address: 127.0.0.53#53
Non-authoritative answer:
Name: mynode.xxx
Address: 10.1.1.1
Do you have the servicenode attribute defined?
I guess no.
What's in /etc/resolv.conf? Also run hostname on your xCAT management node, and nslookup for your xCAT management node.
Or maybe just run xcatprobe xcatmn -i <interface name>
/etc/resolv.conf:
nameserver 127.0.0.53
options edns0
search (2 private DNS servers in our intranet)
hostname and nslookup for the management node both return valid answers. How is that relevant? I can even rsync and SSH to the remote server; however, running xdsh with the -e option returns an error, while executing the command itself (not from a file) runs fine.
As a workaround, I can use:
cat test_ssh.txt | xargs xdsh mynode -l test
But this is not what I expect from xCAT.
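One caveat with the xargs workaround: xargs splits its input on whitespace and appends every word to a single xdsh invocation, so a file with several lines is not run line by line. A minimal sketch of an alternative, assuming each line of the file is one complete command; run_each_line is a hypothetical helper, not part of xCAT:

```shell
#!/bin/sh
# run_each_line FILE PREFIX...
# Hypothetical helper: runs PREFIX once per non-empty line of FILE,
# passing the whole line as a single final argument.
run_each_line() {
    file=$1
    shift
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue
        # </dev/null keeps the invoked command from consuming the rest of FILE
        "$@" "$line" </dev/null
    done < "$file"
}

# Intended use on the management node (not executed here):
#   run_each_line test_ssh.txt xdsh mynode -l test
```

Each line then reaches xdsh as one quoted string, which is the form reported to work earlier in this thread.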
Does your nslookup come back with the 127.0.1.1 IP address?
I can't recreate the issue. Here is the result when I ran the xdsh command:
[root@f6u13k13 ~]# xdsh f6u13k14 "cat /etc/hostname"
f6u13k14: localhost.localdomain
[root@f6u13k13 ~]# xdsh f6u13k14 -l cxhong -e /tmp/test_ssh.txt
f6u13k14: PING 10.6.13.13 (10.6.13.13) 56(84) bytes of data.
f6u13k14: 64 bytes from 10.6.13.13: icmp_seq=1 ttl=64 time=0.177 ms
f6u13k14: 64 bytes from 10.6.13.13: icmp_seq=2 ttl=64 time=0.182 ms
f6u13k14:
f6u13k14: --- 10.6.13.13 ping statistics ---
f6u13k14: 2 packets transmitted, 2 received, 0% packet loss, time 9ms
f6u13k14: rtt min/avg/max/mdev = 0.177/0.179/0.182/0.013 ms
f6u13k14: localhost.localdomain
@cxhong could you please put that command in a file and pass it as the "-e" argument? Something like
echo "cat /etc/hostname" > commands.txt
xdsh f6u13k14 -e commands.txt
Sorry, English is not my native language. Passing commands explicitly in double quotes works fine for me too. But passing a file with commands (-e argument) causes errors described above. Btw I also found some strange things with how xCAT parses node names, will write a comment later
Yes, I used the same file as yours:
[root@f6u13k13 ~]# cat /tmp/test_ssh.txt
ping 10.6.13.13 -c 2
cat /etc/hostname
@cxhong That's strange. I mean, xdsh mynode -l test "ping 10.1.2.2 -c 2" works fine while xdsh mynode -l test -e test_ssh.txt doesn't.
But there's more interesting stuff. There's a DNS server in our local network, and my server mynode has the DNS name mynode.xxx (I can just ping mynode and get a reply from 10.1.1.1 (mynode.xxx)).
As I said previously, I added mynode to xCAT table as
mkdef -t node mynode groups=all mgt=ipmi cons=ipmi ip=10.1.1.1 bmc=10.0.1.1 bmcusername=user bmcpassword=password installnic=mac primarynic=mac mac=00:00:00:00:00:00
Then I also added the same server with the other name:
mkdef -t node m1 groups=all mgt=ipmi cons=ipmi ip=10.1.1.1 bmc=10.0.1.1 bmcusername=user bmcpassword=password installnic=mac primarynic=mac mac=00:00:00:00:00:00
This is when the strangest thing appears.
Both commands:
xdsh mynode -l test -e test_ssh.txt
and
xdsh m1 -l test -e test_ssh.txt
return the long error from my first post (even when mynode/m1 is offline!). However, when I try these commands:
xdsh mynode -l test "ping 10.1.3.1 -c 2"
and
xdsh m1 -l test "ping 10.1.3.1 -c 2"
The first one (with mynode as nodename) works fine, while the second one (with m1 as nodename) returns error: [mainserver]: m1: ssh: Could not resolve hostname m1: Temporary failure in name resolution
And this is what bothers me. According to the documentation the word after "xdsh" must be nodename which was previously defined in xCAT internal table with mkdef command. And since mynode and m1 are basically the same (I used exactly the same definition for both entries), errors/successes should be the same as well! Now it turns out that in some situations nodename is expected to be, well, node name from xCAT table, but in other situations xCAT expects valid DNS entries (which in my case are valid too!).
I hope I managed to describe this as simply as possible.
The -e option of the xdsh command takes a different code path: it checks whether hierarchy support is in use and looks up the node's service node or management node, so that the executable file can be put in the correct directory. That step needs DNS to resolve the hostname/IP. For the ping test, xdsh just executes the remote shell command on the target nodes. The 127.0.1.1 IP address must come from m1, because it can't be resolved through DNS.
Can you use rmdef m1 to remove the m1 node definition, then run the xdsh command again?
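Since the -e path is described as depending on name resolution, it may help to check what each name resolves to on the management node. A minimal sketch, assuming getent is available; unlike nslookup, which queries DNS directly, getent goes through the system's name service switch and so also sees /etc/hosts entries. The helper names lookup and report_resolution are hypothetical:

```shell
#!/bin/sh
# lookup NAME: print the first address the system resolver returns for NAME.
lookup() {
    getent hosts "$1" 2>/dev/null | awk '{ print $1; exit }'
}

# report_resolution NAME...: one line per name, flagging names that do not
# resolve and names that resolve to a loopback address (127.0.0.0/8).
report_resolution() {
    for name in "$@"; do
        addr=$(lookup "$name")
        case $addr in
            '')    echo "$name: UNRESOLVED" ;;
            127.*) echo "$name: $addr (loopback)" ;;
            *)     echo "$name: $addr" ;;
        esac
    done
}

# Intended use on the management node (not executed here):
#   report_resolution mynode m1 "$(hostname)"
```

A management node hostname that comes back as a loopback address would be consistent with the 127.0.1.1 seen in the error.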
@cxhong I added the m1 node after I first encountered problems with xCAT. Anyway, I removed m1 (via rmdef) and nothing changed. I still get that long error regarding 127.0.1.1. However, yesterday I tried replacing mynode with mynode.xxx (the DNS name), and I have now tried it again:
xdsh mynode.xxx -l test -e test_ssh.txt
Output: Error: Invalid nodes and/or groups in noderange: mynode.xxx
WTF? When I enter mynode as node name, it tries to resolve it as DNS name and throws an error. When I enter mynode.xxx as DNS name it throws an error because there's no such node name. Weird.
P.S. Just to clarify: dig mynode.xxx resolves to the actual IP address, so my DNS server is working fine.
xCAT always uses the short name as the node name. I don't know how you replaced mynode with mynode.xxx; I don't think xCAT will allow you to change it if you use xCAT commands. Can you show me:
tabdump site | grep master
tabdump site | grep nameservers
tabdump site | grep domain
cat /etc/resolv.conf
grep mynode /etc/hosts
xcatprobe xcatmn -i <interface name>
Did you run makedns for your system?
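On the short-name point: the noderange appears to be matched against the names given to mkdef rather than against DNS names, which would be consistent with the "Invalid nodes and/or groups in noderange: mynode.xxx" error. A minimal sketch for reducing a fully qualified name to its short form before passing it to xdsh, assuming the node was defined under its short host name; short_name is a hypothetical helper:

```shell
#!/bin/sh
# short_name NAME: strip everything from the first dot onward.
# Only meant for host names; an IP address would be truncated too.
short_name() {
    printf '%s\n' "${1%%.*}"
}

# Intended use (not executed here):
#   xdsh "$(short_name mynode.xxx)" -l test -e test_ssh.txt
```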
root@mainserver:~# tabdump site | grep master
"master","127.0.1.1",,
root@mainserver:~# tabdump site | grep nameservers
"nameservers","127.0.1.1",,
root@mainserver:~# tabdump site | grep domain
(empty)
root@mainserver:~# cat /etc/resolv.conf
nameserver 127.0.0.53
options edns0
search xxx <-- my DNS server here, located on another machine
root@mainserver:~# grep mynode /etc/hosts
(empty, since I'm using DNS server)
root@mainserver:~# xcatprobe xcatmn -i enp193s0f1
[mn]: Checking all xCAT daemons are running... [ OK ]
[mn]: Checking xcatd can receive command request... [ OK ]
[mn]: Checking 'site' table is configured... [FAIL]
[mn]: There isn't 'domain' definition in 'site' table
=================================== SUMMARY ====================================
[MN]: Checking on MN... [FAIL]
Checking 'site' table is configured... [FAIL]
There isn't 'domain' definition in 'site' table
Did you run makedns for your system?
Can't say for sure now (it was 1.5 months ago when I installed xCAT), but executing it now results in an error: Error: [mainserver]: domain not defined in site table
However, I don't see anything about the "site" table in either the "Prepare management node" documentation or the "Quick Start".
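The output above shows two things worth checking mechanically: master and nameservers both hold 127.0.1.1, a value the follow-up reply questions, and domain is missing, which xcatprobe reports. A minimal sketch that scans tabdump site output for both conditions; flag_site_issues is a hypothetical helper, and the comma split is naive (it assumes no commas inside the quoted values it inspects):

```shell
#!/bin/sh
# flag_site_issues: read `tabdump site` output on stdin and report
#   - master/nameservers values that start with 127. (loopback range)
#   - a missing or empty domain attribute
flag_site_issues() {
    awk -F, '
        { key = $1; val = $2; gsub(/"/, "", key); gsub(/"/, "", val) }
        (key == "master" || key == "nameservers") && val ~ /^127\./ {
            print key " is a loopback address: " val
        }
        key == "domain" && val != "" { have_domain = 1 }
        END { if (!have_domain) print "domain is not set" }
    '
}

# Intended use on the management node (not executed here):
#   tabdump site | flag_site_issues
```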
Here is one of the docs that talks about the site table. We will try to add this to the Prepare management node or Quick Start sections:
https://xcat-docs.readthedocs.io/en/stable/guides/admin-guides/manage_clusters/ppc64le/configure/site.html?highlight=site
The master attribute in the site table should be the IP address of enp193s0f1 (I don't think that is 127.0.1.1, right?). nameservers should be the same as master, and domain should be xxx, which is the same as the search string in /etc/resolv.conf. Run makedns after you make those changes, and run the xcatprobe command too.
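The advice above can be sketched as a small script that prints the commands for review instead of running them. It assumes the default site object name clustersite; site_fix_commands is a hypothetical helper, and 10.1.0.1 is a made-up address standing in for the real IP of enp193s0f1:

```shell
#!/bin/sh
# site_fix_commands MN_IP DOMAIN IFACE
# Prints (does not run) the site-table changes and follow-up checks.
# All three arguments are site-specific.
site_fix_commands() {
    mn_ip=$1 domain=$2 iface=$3
    printf 'chdef -t site clustersite master=%s nameservers=%s domain=%s\n' \
        "$mn_ip" "$mn_ip" "$domain"
    printf 'makedns\n'
    printf 'xcatprobe xcatmn -i %s\n' "$iface"
}

# 10.1.0.1 is a placeholder, not a value taken from this thread.
site_fix_commands 10.1.0.1 xxx enp193s0f1
```

Printing first keeps the sketch safe to run anywhere; once the values look right, the printed lines can be executed as root on the management node.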
@cxhong Thank you so much for this comment! I did all the steps you mentioned here and finally it works! Though it shows "fail" since that interface (enp193s0f1) has some more IP addresses which are not used for xCAT, executing xdsh with -e works like a charm now.
Thanks again! Should I close the issue or leave it open until documentation is edited?
Hello, I hope this is the right place to report an issue with xCAT v2.14.6
In my setup, I'm trying to execute a command from my management node on remote node as non-root user. On management node I'm executing xCAT commands with root. On target node (let's call it mynode) I have a user named test and ssh config on management node for root looks like this:
Issuing the command
xdsh mynode -l test "ping 10.1.2.2 -c 2"
works fine and output is returned. After that, I put the ping 10.1.1.1 -c 2 and cat /etc/hostname commands into a file named "test_ssh.txt" and tried to execute it as:
xdsh mynode -l test -e test_ssh.txt
This time I get a very weird error:
Why 127.0.1.1 when I explicitly specified the exact node?! I tried xdcp instead of xdsh. Same problem. Maybe the issue is with rsync? I created a file "dummy.txt" and tried copying it:
rsync -zvh dummy.txt test@mynode:/home/test/
And this command works like a charm, leaving me with the belief that there is something wrong with xCAT.