Open millingw opened 7 months ago
Details of the VM we've experienced this on:
# openstack \
--os-cloud "${cloudname:?}" \
server list
...
c76e93d4-3709-427c-880d-a4d3a33e6935 | iris-gaia-blue-20240221-zeppelin | ACTIVE | iris-gaia-blue-20240221-internal-network=10.10.3.8, 128.232.226.23 | gaia-dmp-fedora-cloud-38-1.6 | gaia.vm.cclake.54vcpu |
Potentially related: network traffic appears to be failing between different projects in OpenStack when using their floating IPs. Info can be found here: https://github.com/wfau/gaia-dmp/issues/1304
Corresponding Cambridge HPC support ticket: https://ucam-rcs.atlassian.net/servicedesk/customer/portal/4/HPCSSUP-67058
Connection fails when trying to ssh from a VM in one project on Arcus (iris-gaia-green) to a VM in another project on Arcus (iris-gaia-data) using the target VM's public IP address (128.232.222.153).
Source VM:
de5ddc6b4d1e445bb73e45c7b8971673 (iris-gaia-green)
76e46802-d35e-4018-8dd7-c6ea302a74af
Target VM:
e216e6b502134b6185380be6ccd0bf09 (iris-gaia-data)
6556a1f3-3182-4d97-8013-01de1c081c95
128.232.222.153
hostname
iris-gaia-green-20231027-zeppelin
host data.gaia-dmp.uk
data.gaia-dmp.uk is an alias for iris-gaia-data.duckdns.org.
iris-gaia-data.duckdns.org has address 128.232.222.153
ssh -v data.gaia-dmp.uk
OpenSSH_8.0p1, OpenSSL 1.1.1d FIPS 10 Sep 2019
....
debug1: Connecting to data.gaia-dmp.uk [128.232.222.153] port 22.
debug1: connect to address 128.232.222.153 port 22: Connection timed out
ssh: connect to host data.gaia-dmp.uk port 22: Connection timed out
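For anyone wanting to repeat this check from several VMs without reading through ssh debug output, here is a minimal Python sketch that attempts a single TCP connection and reports how it failed. The function name `probe_tcp` and the outcome labels are my own choices for illustration, not part of any existing tool, and the 5 second default timeout is an assumption.

```python
import errno
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try one TCP connection and return a short label for the outcome.

    Hypothetical helper for illustration; labels are not a standard.
    """
    try:
        # create_connection resolves the name and connects, honouring the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        # Name resolution failed before any connection attempt.
        return "dns-failure"
    except (socket.timeout, TimeoutError):
        # No answer at all: packets are probably being dropped on the way.
        return "timeout"
    except ConnectionRefusedError:
        # Host reachable, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # EHOSTUNREACH is what curl and ssh print as "No route to host".
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "no-route"
        return "error: {}".format(exc)
```

Calling `probe_tcp("data.gaia-dmp.uk", 22)` on the source VM would be expected to give `"timeout"` while the problem persists, matching the ssh output above, but I have not run this on Arcus.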
Connection also fails when trying to connect via HTTP from a VM in one Arcus project (iris-gaia-data) to a VM in a different Arcus project (iris-gaia-red) using the floating IP:
IP of VM on iris-gaia-red: 128.232.226.64
From source VM (on iris-gaia-data):
curl http://128.232.226.64
curl: (7) Failed to connect to 128.232.226.64 port 80: No route to host
From local machine (outside Arcus):
curl http://128.232.226.64
<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.24.0</center>
</body>
@millingw Can you check whether this is now fixed? If it is, then we need to update the corresponding ticket on the Cambridge HPC system: https://ucam-rcs.atlassian.net/servicedesk/customer/portal/4/HPCSSUP-67058
We are seeing internal connectivity issues within Arcus between OpenStack nodes and the object storage service.
From an Arcus OpenStack VM, the following download fails:
However, the download works fine when issued from an external VM (in this case, an EIDF OpenStack VM).
Downloads from external sites to the Arcus VM appear unaffected; for example, the following download works fine on the Arcus VM:
wget https://downloads.apache.org/zeppelin/zeppelin-0.11.0/zeppelin-0.11.0-bin-all.tgz
Therefore, we think there are currently internal routing errors within the Arcus service.
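The reasoning in this thread (a target fails from inside Arcus but works from outside, so the fault is likely internal) can be written down as a small Python sketch. The function name `classify_targets`, the vantage labels `internal` and `external`, and the result strings are all made up for illustration. The sample observations in the usage note are the ones reported earlier in this issue; the object storage download is left out because its URL is not shown here.

```python
from collections import defaultdict


def classify_targets(observations):
    """Group connection results per target and label the failure pattern.

    observations: iterable of (vantage, target, outcome) tuples, where
    vantage is "internal" or "external" and outcome "open" means success.
    Hypothetical helper; the labels are illustrative only.
    """
    seen = defaultdict(lambda: {"internal": [], "external": []})
    for vantage, target, outcome in observations:
        if vantage not in ("internal", "external"):
            raise ValueError("unknown vantage: {!r}".format(vantage))
        seen[target][vantage].append(outcome == "open")

    report = {}
    for target, results in seen.items():
        inside, outside = results["internal"], results["external"]
        if not inside:
            report[target] = "internal-untested"
        elif any(inside):
            report[target] = "ok-internally"
        elif any(outside):
            # Fails from every internal vantage but works from outside.
            report[target] = "internal-only-failure"
        elif outside:
            report[target] = "fails-everywhere"
        else:
            report[target] = "internal-failure-external-untested"
    return report
```

With the observations from this issue, `classify_targets([("internal", "128.232.222.153:22", "timeout"), ("internal", "128.232.226.64:80", "no-route"), ("external", "128.232.226.64:80", "open")])` labels the HTTP target as `internal-only-failure` and the ssh target as `internal-failure-external-untested`, since no external ssh attempt was reported.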