wfau / gaia-dmp

Gaia data analysis platform

Compute resources meeting with Cambridge #995

Open Zarquan opened 1 year ago

Zarquan commented 1 year ago

We organised a meeting with people at Cambridge to discuss the Gaia-DMP resource requirements for 2022 and 2023. This issue is to keep track of the meeting planning, outline the meeting agenda, and pull together any resources we need.

Initial meeting scheduling was done via when2meet, followed by a Doodle poll.

Zarquan commented 1 year ago

Based on responses, Friday 9th is the best, so I will confirm the date/time and send an invite.

Zarquan commented 1 year ago

I ended up sending invites twice, but (I hope) only to our own team. The new Doodle user interface isn't the easiest to use; it defaulted to the 11-12 slot rather than the 10-11 slot.

I will send a personal email directly to all participants.

Zarquan commented 1 year ago

Invite email sent to everyone on the original email group.

The best date for the meeting is 11:00-12:00 on Friday Sep 9.

The main topic of the meeting is to discuss the resources our project currently has, what resources we think we will need going forward, and how best to fit that into what Cambridge can provide.

A key issue for us will be whether it is possible to get Tbytes of direct attached SSD storage installed in some of the compute nodes.

A longer term issue will be how to move from the pinned resource allocation we currently have to a more flexible allocation that can scale to meet the variable demand generated by an interactive notebook platform. This may include looking at StackHPC's work on adding Blazar to Openstack and whether it would be available on the Arcus cloud in the future.

Zarquan commented 1 year ago

Notes from the meeting.

At the end of the IRIS resource request process Cambridge ended up receiving a single figure of 144Tbytes of disc storage, with no indication of what type of disc. They have allocated this as 70Tbytes of CephFS and 74Tbytes of Cinder volumes.

Cambridge are happy to provide direct attached SSD, but they didn't know about our requirements.

We are not the only ML project asking for faster storage.

Cambridge are looking at faster storage - NVMe storage servers with high speed networking (InfiniBand / RDMA over Fabrics) - but there is no timescale yet.

The Gaia DR3 dataset is ballpark 5Tbytes at the moment, expected to grow as we ingest the large tables, and DR4 will be an order of magnitude larger.

We need space for more than one copy of the dataset: one for the live production service, at least one (possibly more) to experiment with partitioning to fit the worker memory, plus DR4-scale test data to prepare for the next release.

We don't need the 5Tbytes to be on a single SSD; we would implement our own network file system to combine multiple smaller SSD drives into the required 5Tbyte filesystem.

Cambridge will look into providing a 4 Tbyte (? size to be determined) SSD in each of the 8 compute nodes.

We can look at using a network file system like "BeeGFS On Demand" (BeeOND) to combine smaller ephemeral SSD drives into a larger filesystem. Initial experiments can be done using the existing local discs, although we may need to modify the Openstack flavors to facilitate this.

The current flavor with the largest ephemeral disc is also the high-memory, high-CPU gaia.vm.cclake.54vcpu flavor. To run something like the BeeOND filesystem it would be better to use a smaller cpu/memory flavor with a large local disc allocation. Alternatively, we could use an existing small flavor and add multiple ephemeral discs (see the email quote below) to create a BeeOND server.
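
As a rough, untested sketch of what a BeeOND experiment might look like (the hostnames, mount paths and sizes are illustrative, and it assumes the beeond packages are installed on each node):

# List the worker nodes that will contribute their local/ephemeral SSD.
cat > beeond-nodes.txt << EOF
gaia-worker-01
gaia-worker-02
gaia-worker-03
EOF

# Start an on-demand BeeGFS instance across those nodes:
#   -n  nodefile listing the participating nodes
#   -d  local path on each node backing the shared filesystem (the ephemeral SSD mount)
#   -c  mount point where the combined filesystem appears on every node
beeond start -n beeond-nodes.txt -d /mnt/local-ssd/beeond -c /mnt/shared

# Tear it down again when the experiment is finished
# (check the BeeOND documentation for the exact stop options).
beeond stop -n beeond-nodes.txt -L -d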

Regarding future IRIS resource requests: Cambridge would rather work to fix the IRIS request system than have 1:1 meetings with individual projects. This meeting was OK, but they can't scale to having 1:1 meetings with all ~100 projects using their system. Cambridge will raise this issue with IRIS, but Gaia should also raise it from our side via our PI (NicW).

Ideally IRIS should add a distinction for types of storage in the resource allocation spreadsheet. The spreadsheet already distinguishes between CPU and GPU, so there could be a similar distinction for storage.

Zarquan commented 1 year ago

On 2022-09-09 11:39, Paul Browne wrote:

On k8s/Magnum, we did run Magnum for a long time, but with little adoption. Magnum itself as an OpenStack project was a fair amount of trouble to run and maintain, largely due to the reason that Magnum's centering around manipulating Heat stacks to build and maintain a k8s cluster has a very poor UX for regular users.

Alternatives exist, such as the Azimuth cloud app portal developed by StackHPC, which leverages k8s ClusterAPI to bootstrap k8s clusters into an OpenStack cloud in a more user-friendly way. It's under active development, but we host a local-to-Arcus instance.

On ephemeral disks, the recent OS CLI will accept e.g. "openstack server create --ephemeral SIZE_X", and this argument can be repeated. The flavor defines a max ephemeral size, which can be broken up into multiple (if desired) ephemeral disks in addition to the root disk.

Zarquan commented 1 year ago

On 2022-09-09 11:39, Paul Browne wrote:

Here's the link with the current allocations/quotas under IRIS Gaia Science Platform 2022/23 round;

CPU (Available)                             22/23 Allocation
Lo-Mem CCLake   6 nodes   550 usable vCPU   495
Hi-Mem CCLake   2 nodes   220 usable vCPU   195

Storage (Quota)    CephFS (TiB)   Ceph Volume (TiB)
iris-gaia-red           0.05        24
iris-gaia-blue          0.05        24
iris-gaia-green         0.05        24
iris-gaia-data         70            1
Totals                 70.15       73

Grand Total 143.15 against the 144 (TiB) 22/23 allocation - under quota by 0.15.

Zarquan commented 1 year ago

My own thoughts:

The best compromise might be to keep the 70Tbytes of (Manila) CephFS storage as-is and ask Cambridge to split the 24Tbytes of (Cinder) Ceph volume storage into 12Tbytes of (Cinder) Ceph volume storage and 12Tbytes of direct attached SSD? We need to do the math to figure out how that would map to SSDs per compute node etc. (e.g. 12Tbytes spread evenly across the 8 compute nodes would be roughly 1.5Tbytes of SSD per node).

We will need to schedule time to experiment with BeeGFS using the existing ephemeral SSDs (current max 380G per large VM). There are other options which we looked at in the past and which may be worth a re-visit. A simple S3 system might be sufficient for our needs, and would map well into the K8s cloud environment.

Rather than specifying SSD and disc, network and remote, in the IRIS request spreadsheet, would it be better to have more general categories like CERN/Rucio and Amazon have [slow and safe with replication, medium general purpose, high-bandwidth fast and ephemeral], and allow the local sites to deal with the implementation details? I'd prefer not to know the details, but we are having to learn to cope with the issues.

Zarquan commented 1 year ago

Notes on the --ephemeral option that Paul mentioned.

This is available via the Openstack command line interface:

Attach swap or ephemeral disk to an instance https://docs.openstack.org/ocata/user-guide/cli-nova-launch-instance-from-volume.html

Use the nova boot --swap parameter to attach a swap disk on boot or the nova boot --ephemeral parameter to attach an ephemeral disk on boot. When you terminate the instance, both disks are deleted.

Boot an instance with a 512 MB swap disk and 2 GB ephemeral disk.

nova boot --flavor FLAVOR --image IMAGE_ID --swap 512 \
--ephemeral size=2 NAME

Note: The flavor defines the maximum swap and ephemeral disk size. You cannot exceed these maximum values.

That last note means the maximum ephemeral disk size is defined by the flavor, so we would still need to define a new flavor optimised for the BeeGFS storage nodes.
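
For reference, a rough (untested) sketch of the equivalent using the current openstack CLI with the repeated --ephemeral option Paul mentioned; the flavor, image, network and size values here are illustrative:

# Boot a small VM with two ephemeral disks in addition to the root disk.
openstack server create \
    --flavor gaia.vm.cclake.4vcpu \
    --image gaia-dmp-fedora \
    --network gaia-internal \
    --ephemeral size=100 \
    --ephemeral size=100 \
    gaia-beeond-01

The individual ephemeral sizes still have to fit within the maximum ephemeral size defined by the flavor.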

As far as I can tell, this isn't available via the Ansible module. https://docs.ansible.com/ansible/latest/collections/openstack/cloud/server_module.html

Which means we would need to restructure parts of our deployment code to use the Openstack command line rather than Ansible to allocate the BeeGFS storage nodes. Possible, but it would take extra time.

Zarquan commented 1 year ago

Regarding Kubernetes and Magnum. Magnum was useful for us because it gave us a single command line call to create a cluster, and then we would use the K8s tools to work inside that.
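
For comparison, the Magnum call was roughly of this form (a from-memory sketch; the template name, counts and cluster name are illustrative):

# Single CLI call to create a Kubernetes cluster from a Magnum cluster template.
openstack coe cluster create \
    --cluster-template gaia-k8s-template \
    --master-count 1 \
    --node-count 6 \
    gaia-k8s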

The Azimuth cloud app portal from StackHPC is not that useful for us, as our deployment process is based on using command line tools and standard service interfaces to make our deployments portable across different cloud platforms. If the Azimuth cloud app provided a REST API it might be useful, but it would mean our deployment process would become dependent on having the Azimuth cloud app installed, which would not be the case on other cloud systems.

Based on this I think it would be worth spending the time to learn how to use the Kubernetes Cluster API to deploy our own clusters. This would be a worthwhile step towards making our deployments portable across different platforms.

(*) The StackHPC code base for Azimuth will be a very useful resource for us to learn how to drive the Cluster API.
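
As a rough, untested sketch of what driving the Cluster API might look like with clusterctl and the OpenStack infrastructure provider (the cluster name, version, node counts and the cloud credentials expected by the provider template are illustrative, not something we have tried yet):

# Install the Cluster API controllers and the OpenStack provider into a
# small management cluster (e.g. one created with kind).
clusterctl init --infrastructure openstack

# Generate a workload cluster manifest from the provider template,
# apply it, and fetch the kubeconfig once the cluster is ready.
clusterctl generate cluster gaia-dmp-k8s \
    --kubernetes-version v1.25.3 \
    --control-plane-machine-count 1 \
    --worker-machine-count 6 \
    > gaia-dmp-k8s.yaml
kubectl apply -f gaia-dmp-k8s.yaml
clusterctl get kubeconfig gaia-dmp-k8s > gaia-dmp-k8s.kubeconfig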

Zarquan commented 1 year ago

Old issues that might need to be re-visited:

Zarquan commented 1 year ago

Keeping this issue open for now until we collect all this information together in a document/wiki page.

Zarquan commented 1 year ago

Keeping this open because I need to capture some of the information before we close it.