Redeploy GPU nodes to HTCondor cluster

xchem / xchem_it

Issues for XChem IT work


Redeploy GPU nodes to HTCondor cluster #8

Closed tdudgeon closed 3 years ago

tdudgeon commented 3 years ago

The v100.small GPU nodes request too much memory, which causes them to fail. STFC are planning to change the configuration to request slightly less memory, which should fix the problem. Once this change has been made we will need to re-create the GPU nodes to pick it up. Until this is done the GPU nodes will be unstable.

tdudgeon commented 3 years ago

The memory for the v100.small flavour has now been re-configured (set to 75GB), so these GPU VMs should now be re-created.

tdudgeon commented 3 years ago

These were rebuilt on 10 March using names of the form pulsar-exec-node-cuda-1.xchem. One day later they are all still running OK.

The /root/update-hosts.sh file on the pulsar-central-manager had to be edited to reflect the new names so that the gpu gender was set correctly.
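For anyone repeating this, here is a rough sketch of the kind of name matching involved. It is not the contents of update-hosts.sh, which are not shown in this issue. The pattern, the function name `genders_lines`, and the attribute name `gpu` are assumptions for illustration, and the output uses the plain `<node> <attribute>` form of a genders file.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern based on the new node names mentioned above;
# the rule actually used by update-hosts.sh is not shown in this issue.
GPU_NAME_PATTERN = "pulsar-exec-node-cuda-*.xchem"


def genders_lines(hostnames, pattern=GPU_NAME_PATTERN, attr="gpu"):
    """Return genders-style lines ("<node> <attr>") for hosts matching pattern."""
    return [f"{host} {attr}" for host in hostnames if fnmatchcase(host, pattern)]


if __name__ == "__main__":
    # The second hostname is made up, to show a non-GPU node being skipped.
    hosts = ["pulsar-exec-node-cuda-1.xchem", "pulsar-exec-node-1.xchem"]
    print("\n".join(genders_lines(hosts)))
```

If the node names change again, only the pattern would need updating in a setup like this.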