zhenrong-wang / hpc-now

A Cross-Platform, Multi-Cloud High-Performance Computing Platform
https://www.hpc-now.com
MIT License

HPC-NOW, start your HPC journey in the cloud now, with no operation workload!

A full-stack HPC solution in the cloud, for the HPC community.

Contributions are highly welcome and respected. Please see CONTRIBUTING.

This project is sponsored by the OpenAtom Foundation.

1. Project Background

Cloud High-Performance Computing (Cloud HPC) differs significantly from on-premise HPC. Cloud services bring high scalability and flexibility to High-Performance Computing. However, most HPC users are not familiar with building and maintaining HPC services in the cloud. The technical barrier of cloud computing is very high for researchers, engineers, and developers in different scientific and engineering domains, e.g. energy, chemistry, physics, materials, and bioscience.

To make it easy to start and manage HPC workloads in the cloud, we are developing this project: HPC-NOW. NOW stands for:

Currently, the HPC-NOW platform supports 8 popular cloud platforms, as shown below:

DEMO: How easy is it to create a cloud HPC cluster using HPC-NOW?

[Image: cluster_init]

DEMO: You can easily manage multiple clusters across multiple clouds.

[Image: multiple_clusters]

2. Core Components

Thanks to Terraform and OpenTofu for making it possible to orchestrate cloud resources in a unified and simple way.

In this project, we are developing several components:

The high-level architecture of this project is:

[Image: architecture]

NOTICE: This project integrates several third-party components at execution level. Please see the NOTICE.

3. How-To: Build, Install, Run, and Use

The HPC-NOW platform is very easy to build, run, and use. It is also cross-platform, which means you can run HPC-NOW on Microsoft Windows, GNU/Linux (with APT, DNF, or YUM), and macOS (Darwin).

Note 1: Currently, only the x86_64 platform is supported. If you are using another CPU platform, please let us know.

Note 2: Instead of compiling/building from the source code, you can download pre-built executables/binaries from the release of this repository. In this case, the dev/build tools (git, gcc, clang, or mingw-w64) are NOT needed.

Note 3: The HPC-NOW relies on some fundamental system utilities. In most cases, these utilities have been included in the OS distros. See the list below. If you are not sure whether the utilities are installed or not, please run the commands in a terminal/command prompt window.

If curl is not pre-installed, please install it manually from the official site or with a package manager (e.g. yum, apt), and add its location to the PATH environment variable. Usually, tar, unzip, ssh, and scp are pre-installed.
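On UNIX-like systems, one way to run that check is the minimal sketch below. It loops over the utilities named in this note and reports any that are not found on the PATH. The function name check_utils is hypothetical and not part of HPC-NOW, and the full utility list may contain more entries than the five shown here.

```shell
#!/bin/sh
# Hypothetical helper (not part of HPC-NOW): print "missing: NAME" for every
# utility passed as an argument that cannot be found on the PATH.
check_utils() {
    missing=0
    for util in "$@"; do
        if ! command -v "$util" >/dev/null 2>&1; then
            echo "missing: $util"
            missing=1
        fi
    done
    # Non-zero status means at least one utility is absent.
    return "$missing"
}

# Utilities named in this README.
check_utils curl tar unzip ssh scp || echo "Please install the utilities listed above."
```

If the script prints nothing, all five utilities were found.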

3.1 Build

Prerequisites

Step 1. Clone this repository

git clone https://github.com/zhenrong-wang/hpc-now

If your connectivity to github is not stable, you can also try to clone from gitee:

git clone https://gitee.com/zhenrong-wang/hpc-now

Step 2. Change the directory

cd hpc-now
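Steps 1 and 2 can also be combined into a small fallback routine, sketched below under the assumption that you want to try GitHub first and Gitee second. The function name clone_hpc_now and the GIT override variable are hypothetical conveniences, not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper: try the GitHub URL first, then the Gitee mirror.
# GIT is an assumed override hook (defaults to "git") so that the fallback
# logic can be exercised without network access.
clone_hpc_now() {
    for url in \
        https://github.com/zhenrong-wang/hpc-now \
        https://gitee.com/zhenrong-wang/hpc-now
    do
        if "${GIT:-git}" clone "$url"; then
            echo "cloned from $url"
            return 0
        fi
        echo "clone from $url failed" >&2
    done
    return 1
}
```

After defining the function, `clone_hpc_now && cd hpc-now` performs both steps and stops if neither source is reachable.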

Step 3. Run the build script

If everything goes well, the binaries will be built to the build folder.

3.2 Install

Step 1. Run the installer

Temporary administrator or root privilege is required to run the installer (see section 3.5 for why).

IMPORTANT: Please replace the sample version code 0.3.2 with the real code of your own build.

IMPORTANT: Please keep the window open for the next step.

Step 2. Initialize the hpcopr

The hpcopr.exe is designed to be executed by the dedicated OS user named hpc-now, which was created by the installer in the previous step.

In order to run the hpcopr.exe, you'll need to set a password and switch to that user. See the steps below:

Several extra packages (around 500 MB) will be downloaded and installed. This process may take several minutes, depending on your internet connectivity.

NOTE 1: On UNIX-like OSes, it is not necessary to set a password for 'hpc-now' and switch to that user in the terminal. You can simply prefix hpcopr commands with sudo -Hu hpc-now, e.g.:

sudo -Hu hpc-now hpcopr envcheck

Here -u hpc-now runs the command as the user hpc-now, and -H sets HOME to that user's home directory.

This method is only valid for sudoers.
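To avoid typing the prefix each time, a small wrapper function can be defined in your shell. The sketch below is only an illustration: the names hpcopr_cmd and HPCOPR_DRY_RUN are hypothetical, and the dry-run switch exists solely so the resulting command line can be inspected without running it.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of HPC-NOW): run hpcopr as the hpc-now user.
# When HPCOPR_DRY_RUN=1, only print the command that would be executed.
hpcopr_cmd() {
    if [ "${HPCOPR_DRY_RUN:-0}" = "1" ]; then
        echo sudo -Hu hpc-now hpcopr "$@"
    else
        sudo -Hu hpc-now hpcopr "$@"
    fi
}

# Dry run: shows the command line for the envcheck example above.
HPCOPR_DRY_RUN=1
hpcopr_cmd envcheck
```

Unset HPCOPR_DRY_RUN (or set it to 0) to actually execute the command; as noted, this only works for sudoers.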

NOTE 2: If you are using a GNU/Linux distro with a desktop environment (e.g. Debian with GNOME), after switching to the user hpc-now in a terminal, your desktop environment may not be authorized for hpc-now by default, and the hpcopr rdp --copypass function will not work properly. Please follow the instructions below:

3.3 Run

The hpcopr is the main CLI for you to run. Please see the description above.

If you'd like to update or uninstall the HPC-NOW services, you will need to run the installer with sudo (for UNIX-like OS) or as administrator (for Windows).

3.4 Basic Workflow

In order to use and manage HPC in the cloud with HPC-NOW, please follow the workflow:

DEMO: An example of an HPC-NOW cluster running ParaView

[Image: example]

3.5 The Installer Commands

The installer is designed to manage the installation/update/removal of the HPC-NOW services. It needs temporary administrator privilege to:

We follow the least privilege principle. Please check the source code directory of installer.

USAGE:

General Options (Required)

Advanced Options (Optional)

Examples

3.6 The hpcopr Commands

The hpcopr is a very powerful Command Line Interface (CLI) for you to use.

USAGE: hpcopr [-b] CMD_NAME CMD_FLAG ... [CMD_KEYWORD1 CMD_KEY_STRING1] ...

Examples:

CMD_NAME LIST:

Get-Started

Multi-Cluster Management

Global Management

Advanced - For developers:

Cluster Initialization

Cluster Management

Cluster Operation

Cluster User Management

Usage: hpcopr userman --ucmd USER_CMD [ KEY_WORD1 KEY_STRING1 ] ...

The cluster must be in running state (minimal or all).

--ucmd list      List all the current cluster users.
--ucmd add       Add a user to the cluster. By default, added users are enabled.
--ucmd delete    Delete a user from the cluster.
--ucmd enable    Enable a *disabled* user. Enabled users can run HPC workloads.
--ucmd disable   Disable a user. Disabled users can still access the cluster.
--ucmd passwd    Change user's password.

Cluster Data Management

Usage: hpcopr dataman CMD_FLAG... [ KEY_WORD1 KEY_STRING1 ] ...

General Flags: -r, -rf, --recursive, --force, -f.

-s SOURCE_PATH    Source path of the binary operations. e.g. cp
-d DEST_PATH      Destination path of binary operations. e.g. cp
-t TARGET_PATH    Target path of unary operations. e.g. ls

Bucket Operations

Transfer and manage data with the bucket.

--dcmd put         Upload a local file or folder to the bucket path.
--dcmd get         Download a bucket object (file or folder) to the local path.
--dcmd copy        Copy a bucket object to another folder/path.
--dcmd list        Show the object list of a specified folder/path.
--dcmd delete      Delete an object (file or folder) of the bucket.
--dcmd move        Move an existing object (file or folder) in the bucket.

Example: hpcopr dataman --dcmd put -s ./foo -d /foo -u user1

Direct Operations

Transfer and manage data in the cluster storage.

The cluster must be in running state (minimal or all).

--dcmd cp          Remote copy between local and the cluster storage.
--dcmd mv          Move the remote files/folders in the cluster storage.
--dcmd ls          List the files/folders in the cluster storage.
--dcmd rm          Remove the files/folders in the cluster storage.
--dcmd mkdir       Make a directory in the cluster storage.
--dcmd cat         Print out a remote plain text file.
--dcmd more        Read a remote file.
--dcmd less        Read a remote file.
--dcmd tail        Stream a remote file dynamically.
--dcmd rput        Upload a *remote* file or folder to the bucket path.
--dcmd rget        Download a bucket object (file or folder) to the *remote* path.

    @h/ to specify the $HOME prefix of the cluster.
    @d/ to specify the /hpc_data/user_data prefix.
    @a/ to specify the /hpc_apps/ prefix, only for root or user1.
    @p/ to specify the public folder prefix ( INSECURE !).
    @R/ to specify the / prefix, only for root or user1.
    @t/ to specify the /tmp prefix.

Example: hpcopr dataman --dcmd cp -s ~/foo/ -d @h/foo -r -u user1
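To make the prefix table concrete, the sketch below expands each documented prefix into the cluster-side path it stands for. It is an illustration only, not hpcopr's actual implementation: the function name expand_prefix is hypothetical, the home directory /home/user1 is an assumed value, and the @p/ prefix is left unexpanded because its real path is not given in this README.

```shell
#!/bin/sh
# Hypothetical illustration (not hpcopr's code): expand the documented dataman
# prefixes. $1 is a path such as "@d/foo"; $2 is the cluster user's home
# directory (an assumed value such as /home/user1).
expand_prefix() {
    path=$1
    home=$2
    case "$path" in
        @h/*) echo "$home/${path#@h/}" ;;
        @d/*) echo "/hpc_data/user_data/${path#@d/}" ;;
        @a/*) echo "/hpc_apps/${path#@a/}" ;;
        @R/*) echo "/${path#@R/}" ;;
        @t/*) echo "/tmp/${path#@t/}" ;;
        # @p/ and unprefixed paths are returned unchanged.
        *)    echo "$path" ;;
    esac
}

expand_prefix "@d/results/run1.log" "/home/user1"   # prints /hpc_data/user_data/results/run1.log
expand_prefix "@h/foo" "/home/user1"                # prints /home/user1/foo
```

The restrictions noted in the table still apply: @a/ and @R/ are only usable by root or user1.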

Cluster App Management

Usage: hpcopr appman --acmd APP_CMD CMD_FLAG [ KEY_WORD1 KEY_STRING1 ] ...

The cluster must be in running state (minimal or all).

-u USERNAME A valid user name. Use 'root' for all users. Admin or Operator role is required for root.

--acmd store         List out the apps in store.
--acmd avail         List out all the installed apps.
--acmd check         Check whether an app is available.
--acmd install       Install an app to all users or a specified user.
--acmd build         Compile and build an app to all users or a specified user.
--acmd remove        Remove an app from the cluster.
--acmd update-config Update the locations for scripts and package repository.
--acmd show-config   Display the locations for scripts and package repository.

Cluster Job Management

Usage: hpcopr jobman --jcmd APP_CMD [ KEY_WORD1 KEY_STRING1 ] ...

The cluster must be in running state (minimal or all).

-u USERNAME A valid user name. The root user CANNOT submit jobs.

--jcmd submit    Submit a job to the cluster.
--jcmd list      List out all the jobs.
--jcmd cancel    Cancel a job with the specified ID.
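The two rules above (three valid --jcmd values, and no job submission as root) can be checked before invoking hpcopr. The sketch below is a hypothetical client-side pre-check, not part of hpcopr; the function name check_jobman_args is an assumption, and it only prints the command line it would build, using the --jcmd and -u options documented here.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of hpcopr): validate the --jcmd
# value and the user name, then print the resulting jobman command line.
check_jobman_args() {
    jcmd=$1
    user=$2
    case "$jcmd" in
        submit|list|cancel) ;;
        *) echo "unknown --jcmd: $jcmd"; return 1 ;;
    esac
    # Documented rule: the root user cannot submit jobs.
    if [ "$jcmd" = "submit" ] && [ "$user" = "root" ]; then
        echo "root cannot submit jobs"
        return 1
    fi
    echo "hpcopr jobman --jcmd $jcmd -u $user"
}

check_jobman_args list user1   # prints: hpcopr jobman --jcmd list -u user1
```

Submitting or cancelling a job needs further keywords (for example the job ID); see hpcopr help for those.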

Others

For more information, please refer to docs/UserManual-EN.pdf. CAUTION: This file may be out of date.

The most detailed and up-to-date help info is available via the command hpcopr help. We are also considering writing a standard manual for hpcopr. If you are interested, please let us know.

4. Contributing

Please see the contributing guide.

Also, please feel free to mailto:

5. Appendix: HPC-NOW Directories

The hpc-now service manages 2 top-level directories and several subdirectories on your device and OS. Here is the architecture:

5.1 Top-level directories

5.2 Sub-directories

+- BINARY_ROOT/
  +- hpcopr  The hpcopr executable
  +- utils/  Including crypto, terraform/tofu and cloud utilities
    +- now-crypto-aes
    +- terraform/tofu
    +- cloud utilities
+- RUNNING_ROOT/
  +- .now_crypto_seed.lock  The hpcopr crypto string
  +- now_logs/  Usages and Logs
    +- log_trashbin.txt  The trashbin of clusters' logs
    +- now-cluster-usage.log  The cluster usage log
    +- system_command_error.log  The system command error
    +- now-cluster-operation.log  The hpcopr command log
  +- .tmp/  Temporary files
  +- .now-ssh/  SSH keys for connectivity with your clusters
    +- now-cluster-login.tmp  Encrypted operator's private key
    +- now-cluster-login.pub  Operator's public key
    +- .CLUSTER_NAME/  Each cluster has its own directory
      +- USER_PRIVATE_KEYS.tmp  Cluster users' encrypted private keys
  +- .etc/  General configuration files
    +- .all_clusters.dat.tmp  Encrypted cluster registry
    +- .all_clusters.dat.dec.bak  Decrypted cluster registry
    +- current_cluster.dat  Current cluster indicator
    +- google_check.dat  Google connectivity indicator
    +- locations.conf  Locations of components
    +- components.conf  Version and SHA of components
    +- tf_running.conf  TF running configuration
  +- .destroyed/  Files of destroyed clusters
  +- workdir/  Working directories for all clusters
    +- CLUSTER_NAME/  Each cluster has its own directory
      +- log/  Cluster-level running logs
      +- stack/  TF running directory
      +- conf/  Cluster configuration
      +- vault/  Cluster's sensitive files
  +- mon_data/  Monitoring data of all clusters

NOTE:

1. All the directories and files except the .now_crypto_seed.lock are set to be readable, writable, and executable only by the system user hpc-now.

2. The .now_crypto_seed.lock file is set to be readable only by root/Admin and hpc-now. And it is NOT writable even for root/Admin.

3. Manual modification of the .now_crypto_seed.lock file will break the whole HPC-NOW installation, because you may no longer be able to decrypt critical files.
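If you want to confirm that a file such as .now_crypto_seed.lock carries no write permission bits, a check along the following lines may help on UNIX-like systems. This is a sketch under stated assumptions: the function name has_no_write_bits is hypothetical, it inspects only the classic rwx mode bits (HPC-NOW may enforce the read-only state by other means), and the demo runs on a temporary file because the actual RUNNING_ROOT location is not spelled out here.

```shell
#!/bin/sh
# Hypothetical check (not part of HPC-NOW): succeed only if none of the
# owner/group/other write bits are set on the given file.
has_no_write_bits() {
    [ -e "$1" ] || return 2
    # Characters 2-10 of the "ls -ld" output are the rwx permission bits.
    mode=$(ls -ld "$1" | cut -c2-10)
    case "$mode" in
        *w*) return 1 ;;
        *)   return 0 ;;
    esac
}

# Demo on a temporary file standing in for the seed file.
demo=$(mktemp)
chmod 400 "$demo"
has_no_write_bits "$demo" && echo "no write bits set"
rm -f "$demo"
```

To check the real file, pass its full path to has_no_write_bits; reading it requires root/Admin or hpc-now privileges, as stated in note 2.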