wfau / gaia-dmp

Gaia data analysis platform
GNU General Public License v3.0

Internal routing issues on arcus #1308

Open millingw opened 7 months ago

millingw commented 7 months ago

We are seeing internal connectivity issues within arcus between OpenStack nodes and the object storage service.

From an arcus OpenStack VM, the following download fails:

$ wget https://object.arcus.openstack.hpc.cam.ac.uk/swift/v1/AUTH_e216e6b502134b6185380be6ccd0bf09/archive/zeppelin-0.10.1-gaia-dmp-0.1.tar.gz
--2024-02-21 09:39:46--  https://object.arcus.openstack.hpc.cam.ac.uk/swift/v1/AUTH_e216e6b502134b6185380be6ccd0bf09/archive/zeppelin-0.10.1-gaia-dmp-0.1.tar.gz
Resolving object.arcus.openstack.hpc.cam.ac.uk (object.arcus.openstack.hpc.cam.ac.uk)... 128.232.222.148, 128.232.222.24
Connecting to object.arcus.openstack.hpc.cam.ac.uk (object.arcus.openstack.hpc.cam.ac.uk)|128.232.222.148|:443... failed: No route to host.
Connecting to object.arcus.openstack.hpc.cam.ac.uk (object.arcus.openstack.hpc.cam.ac.uk)|128.232.222.24|:443... failed: No route to host.

However, the download works fine when issued from an external VM (in this case, an EIDF OpenStack VM):

# wget https://object.arcus.openstack.hpc.cam.ac.uk/swift/v1/AUTH_e216e6b502134b6185380be6ccd0bf09/archive/zeppelin-0.10.1-gaia-dmp-0.1.tar.gz
--2024-02-21 11:02:20--  https://object.arcus.openstack.hpc.cam.ac.uk/swift/v1/AUTH_e216e6b502134b6185380be6ccd0bf09/archive/zeppelin-0.10.1-gaia-dmp-0.1.tar.gz
Resolving object.arcus.openstack.hpc.cam.ac.uk (object.arcus.openstack.hpc.cam.ac.uk)... 128.232.222.24, 128.232.222.148
Connecting to object.arcus.openstack.hpc.cam.ac.uk (object.arcus.openstack.hpc.cam.ac.uk)|128.232.222.24|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1716996866 (1.6G) [application/gzip]
Saving to: 'zeppelin-0.10.1-gaia-dmp-0.1.tar.gz'

zeppelin-0.10.1-gaia-dmp-0.1.tar.gz          100%[===========================================================================================>]   1.60G   111MB/s    in 16s     

2024-02-21 11:02:37 (100 MB/s) - 'zeppelin-0.10.1-gaia-dmp-0.1.tar.gz' saved [1716996866/1716996866]

External downloads to the arcus VM appear unaffected; for example, the following download works fine on the arcus VM:

wget https://downloads.apache.org/zeppelin/zeppelin-0.11.0/zeppelin-0.11.0-bin-all.tgz

Therefore, we think there are currently internal routing errors within the arcus service.
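The inside/outside comparison above can be repeated without downloading the 1.6G archive. The following is only a minimal sketch: `check_tcp` is a hypothetical helper name, and it assumes a bash binary with `/dev/tcp` support and the coreutils `timeout` command are available on the VM. It tests only whether a TCP connection can be opened, so it does not distinguish "No route to host" from a timeout.

```shell
# check_tcp HOST PORT [TIMEOUT_SECONDS]
# Hypothetical helper: tries to open a TCP connection and reports the result.
# Assumes bash (for /dev/tcp) and coreutils timeout are installed.
check_tcp() {
    local host="$1" port="$2" wait="${3:-5}"
    # Host and port are passed as positional arguments to the inner shell,
    # so they are never interpolated into the command string.
    if timeout "${wait}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "${host}" "${port}" 2>/dev/null; then
        echo "${host}:${port} reachable"
    else
        echo "${host}:${port} unreachable"
        return 1
    fi
}
```

Running `check_tcp 128.232.222.148 443` and `check_tcp 128.232.222.24 443` (the two addresses from the wget output above) on both the arcus VM and an external VM should show the same split if the problem is still present.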

millingw commented 7 months ago

Details of the VM we've experienced this on:

#  openstack \
        --os-cloud "${cloudname:?}" \
        server list

...

c76e93d4-3709-427c-880d-a4d3a33e6935 | iris-gaia-blue-20240221-zeppelin | ACTIVE | iris-gaia-blue-20240221-internal-network=10.10.3.8, 128.232.226.23 | gaia-dmp-fedora-cloud-38-1.6 | gaia.vm.cclake.54vcpu |

stvoutsin commented 7 months ago

Potentially related: network traffic also appears to be failing between different projects in OpenStack when using their floating IPs. Details can be found here: https://github.com/wfau/gaia-dmp/issues/1304

Zarquan commented 7 months ago

Corresponding Cambridge HPC support ticket: https://ucam-rcs.atlassian.net/servicedesk/customer/portal/4/HPCSSUP-67058

Zarquan commented 7 months ago

The connection fails when trying to ssh from a VM in one project on Arcus (iris-gaia-green) to a VM in another project on Arcus (iris-gaia-data) using the target VM's public IP address (128.232.222.153).

Source VM:

hostname

    iris-gaia-green-20231027-zeppelin

Target VM:

host data.gaia-dmp.uk

    data.gaia-dmp.uk is an alias for iris-gaia-data.duckdns.org.
    iris-gaia-data.duckdns.org has address 128.232.222.153
ssh -v data.gaia-dmp.uk

    OpenSSH_8.0p1, OpenSSL 1.1.1d FIPS  10 Sep 2019
    ....
    debug1: Connecting to data.gaia-dmp.uk [128.232.222.153] port 22.
    debug1: connect to address 128.232.222.153 port 22: Connection timed out
    ssh: connect to host data.gaia-dmp.uk port 22: Connection timed out

stvoutsin commented 7 months ago

The connection also fails when connecting via HTTP from a VM in one Arcus project (iris-gaia-data) to a VM in a different Arcus project (iris-gaia-red) using the floating IP:

IP of VM on iris-gaia-red: 128.232.226.64

From source VM (on iris-gaia-data):

curl http://128.232.226.64
curl: (7) Failed to connect to 128.232.226.64 port 80: No route to host

From local machine (outside Arcus):

    curl http://128.232.226.64
    <html>
    <head><title>301 Moved Permanently</title></head>
    <body>
    <center><h1>301 Moved Permanently</h1></center>
    <hr><center>nginx/1.24.0</center>
    </body>

Zarquan commented 7 months ago

@millingw Can you check whether this is now fixed? If it is, then we need to update the corresponding ticket on the Cambridge HPC system: https://ucam-rcs.atlassian.net/servicedesk/customer/portal/4/HPCSSUP-67058