xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

Customizing osimage for diskless deployment #6350

Open morepixel2crsi opened 5 years ago

morepixel2crsi commented 5 years ago

Hi,

I have just discovered xCAT and learned the basic usage. I would like to customize a CentOS image for GPU nodes:

1/ Disable the nouveau driver
2/ Add the NVIDIA driver and all CUDA tools

wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm

mkdir -p /tmp/cuda

mv /root/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm /tmp/cuda/

cd /tmp/cuda

rpm2cpio /tmp/cuda/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm | cpio -i -d

mkdir -p /install/cuda-10.1/x86_64/cuda-core

cp /tmp/cuda/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm /install/cuda-10.1/x86_64/cuda-core

createrepo /install/cuda-10.1/x86_64/cuda-core

wget http://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/d/dkms-2.6.1-1.el7.noarch.rpm

mkdir -p /install/cuda-10.1/x86_64/cuda-deps/

mv /root/dkms-2.6.1-1.el7.noarch.rpm /install/cuda-10.1/x86_64/cuda-deps/

createrepo /install/cuda-10.1/x86_64/cuda-deps/

lsdef -t osimage -z centos7.6-x86_64-netboot-compute | sed 's/netboot-compute:/netboot-nvidia:/' | mkdef -z

mkdef -t osimage -o centos7.6-x86_64-netboot-cudafull --template centos7.6-x86_64-netboot-compute otherpkgdir=/install/post/otherpkgs/centos7.6/x86_64/cuda-10.1

lsdef -t osimage centos7.6-x86_64-netboot-cudafull -i otherpkgdir

ln -s /install/cuda-10.1 /install/post/otherpkgs/centos7.6/x86_64/cuda-10.1

chdef -t osimage -o centos7.6-x86_64-netboot-cudafull rootimgdir=/install/netboot/centos7.6/x86_64/cudafull

mkdir -p /install/custom/netboot/centos/

vi /install/custom/netboot/centos/cudafull.centos7-6.x86_64.pkglist => ADD : pciutils ( example )

chdef -t osimage -o centos7.6-x86_64-netboot-cudafull pkglist=/install/custom/netboot/centos/cudafull.centos7-6.x86_64.pkglist

vi /install/custom/netboot/centos/cudafull.centos7-6.x86_64.otherpkgs.pkglist => ADD : cuda-10.1/x86_64/cuda-deps/dkms

chdef -t osimage -o centos7.6-x86_64-netboot-cudafull otherpkglist=/install/custom/netboot/centos/cudafull.centos7-6.x86_64.otherpkgs.pkglist

genimage centos7.6-x86_64-netboot-cudafull

packimage centos7.6-x86_64-netboot-cudafull

nodeset testnvidia osimage=centos7.6-x86_64-netboot-cudafull

Point 1 is not that difficult

Point 2: I followed the xCAT docs to create the repos and so on, but after the PXE boot the driver is not loaded.

Can you explain how to add these packages to a new osimage for diskless deployment? (Something like driverupdatesrc=rpm: )

Regards,

whowutwut commented 5 years ago

Hi @morepixel2crsi, welcome! OK, let me see if I can help out.

1. disable nouveau drivers

For this, you can add the addkcmdline attribute to the xCAT node definition. xCAT then passes its value as a kernel boot option in the node's boot config file, which disables nouveau:

chdef -t node <your node> addkcmdline=modprobe.blacklist=nouveau
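
As far as I understand, the boot config is only regenerated when nodeset is run again, so the new option should show up on the next boot after that; treat this as an assumption rather than documented behaviour. Once the node is up, a small check like the sketch below can confirm the option arrived and nouveau stayed out. has_kernel_arg and check_nouveau_disabled are hypothetical helper names, not xCAT commands.

```shell
#!/bin/bash
# has_kernel_arg CMDLINE ARG
# Succeeds if ARG appears as a whole word in the CMDLINE string.
has_kernel_arg() {
    local cmdline=$1 arg=$2 word
    local -a words
    read -r -a words <<< "$cmdline"      # split on whitespace, no globbing
    for word in "${words[@]}"; do
        [ "$word" = "$arg" ] && return 0
    done
    return 1
}

# check_nouveau_disabled
# Run on the booted compute node: verifies the blacklist option is on the
# kernel command line and that the nouveau module is not loaded.
check_nouveau_disabled() {
    if ! has_kernel_arg "$(cat /proc/cmdline)" "modprobe.blacklist=nouveau"; then
        echo "blacklist option missing from /proc/cmdline"
        return 1
    fi
    if grep -q '^nouveau ' /proc/modules; then
        echo "nouveau is still loaded"
        return 1
    fi
    echo "nouveau is blacklisted and not loaded"
}
```

On the compute node, source the file and call check_nouveau_disabled.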

2. Cuda Driver

I copied your steps below for readability so I can double-check them:

wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm
mkdir -p /tmp/cuda
mv /root/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm /tmp/cuda/
cd /tmp/cuda

rpm2cpio /tmp/cuda/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm | cpio -i -d
mkdir -p /install/cuda-10.1/x86_64/cuda-core
cp /tmp/cuda/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm /install/cuda-10.1/x86_64/cuda-core
createrepo /install/cuda-10.1/x86_64/cuda-core

In the steps after the rpm2cpio extraction, you made a mistake in the copy: you should be copying the RPMs under /tmp/cuda/var..., not the cuda-repo RPM file itself.

See https://xcat-docs.readthedocs.io/en/stable/advanced/gpu/nvidia/repo/rhels.html [image from the original post not available]

When I ran the steps above on my machine, rpm2cpio extracted the RPMs and the result looked like this. You want the files under /tmp/cuda/var...:

# ls -ltr /tmp/cuda/
total 2359472
-rw-r--r-- 1 root root 2416096987 May  6 20:33 cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm
drwxr-xr-x 3 root root         50 Jun  6 17:28 var   <--------
drwxr-xr-x 3 root root         25 Jun  6 17:28 etc
[root@c910f04x18 cuda]#

So to fix, run:

rm /install/cuda-10.1/x86_64/cuda-core/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm
cp /tmp/cuda/var/cuda-repo-10-1-local-10.1.168-418.67/*.rpm /install/cuda-10.1/x86_64/cuda-core/
createrepo /install/cuda-10.1/x86_64/cuda-core
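
To double-check the repo after this fix, something like the following sketch could help. It only tests two things: that createrepo metadata exists, and that the directory holds RPMs other than the cuda-repo-* wrapper. check_repo_dir is a hypothetical helper, not part of xCAT or yum.

```shell
#!/bin/bash
# check_repo_dir DIR
# Succeeds when DIR holds yum metadata plus at least one payload RPM,
# i.e. an RPM other than the cuda-repo-* wrapper package.
check_repo_dir() {
    local dir=$1 rpm payload=0
    if [ ! -f "$dir/repodata/repomd.xml" ]; then
        echo "no repodata in $dir (createrepo not run?)"
        return 1
    fi
    for rpm in "$dir"/*.rpm; do
        [ -e "$rpm" ] || continue          # glob matched nothing
        case "$(basename "$rpm")" in
            cuda-repo-*) ;;                # wrapper package, not payload
            *) payload=$((payload + 1)) ;;
        esac
    done
    if [ "$payload" -eq 0 ]; then
        echo "only the cuda-repo wrapper (or nothing) in $dir"
        return 1
    fi
    echo "$payload payload RPM(s) in $dir"
}
```

For example: check_repo_dir /install/cuda-10.1/x86_64/cuda-core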

Ok, then let's go to the cuda-deps...

wget http://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/d/dkms-2.6.1-1.el7.noarch.rpm
mkdir -p /install/cuda-10.1/x86_64/cuda-deps/
mv /root/dkms-2.6.1-1.el7.noarch.rpm /install/cuda-10.1/x86_64/cuda-deps/
createrepo /install/cuda-10.1/x86_64/cuda-deps/

Then you create an image and change some properties...

lsdef -t osimage -z centos7.6-x86_64-netboot-compute | sed 's/netboot-compute:/netboot-nvidia:/' | mkdef -z
mkdef -t osimage -o centos7.6-x86_64-netboot-cudafull --template centos7.6-x86_64-netboot-compute otherpkgdir=/install/post/otherpkgs/centos7.6/x86_64/cuda-10.1
ln -s /install/cuda-10.1 /install/post/otherpkgs/centos7.6/x86_64/cuda-10.1

chdef -t osimage -o centos7.6-x86_64-netboot-cudafull \
 rootimgdir=/install/netboot/centos7.6/x86_64/cudafull

Set custom values for otherpkgdir and rootimgdir... 👍

mkdir -p /install/custom/netboot/centos/
vi /install/custom/netboot/centos/cudafull.centos7-6.x86_64.pkglist  # ADD : pciutils ( example )

chdef -t osimage -o centos7.6-x86_64-netboot-cudafull \
 pkglist=/install/custom/netboot/centos/cudafull.centos7-6.x86_64.pkglist

Ok, added pciutils to the OS pkglist....

vi /install/custom/netboot/centos/cudafull.centos7-6.x86_64.otherpkgs.pkglist # ADD : cuda-10.1/x86_64/cuda-deps/dkms
chdef -t osimage -o centos7.6-x86_64-netboot-cudafull otherpkglist=/install/custom/netboot/centos/cudafull.centos7-6.x86_64.otherpkgs.pkglist

Ok, added the dkms to the otherpkglist...

Then create the image:

genimage centos7.6-x86_64-netboot-cudafull
packimage centos7.6-x86_64-netboot-cudafull
nodeset testnvidia osimage=centos7.6-x86_64-netboot-cudafull

OK, so normally the way I like to organize this is: pkgdir holds what is shipped by the OS distribution (e.g. pciutils), and otherpkgdir holds everything else, such as third-party software (CUDA, etc.).

But for diskless, you should specify only a single otherpkgdir, and the entries in otherpkglist should be relative to that path. For example, I put a bunch of software under /install/REPO/software, and that is what goes into my otherpkgdir:

# lsdef -t osimage  xcat.netboot.redhat-alt.cuda.mofed -i otherpkgdir
Object name: xcat.netboot.redhat-alt.cuda.mofed
    otherpkgdir=/install/REPO/software

Then for CUDA support, under otherpkglist I have cuda.otherpkglist which contains:

nvidia/cuda-core/10.1.168-418.67-1.0-1/repo/ppc64le/cuda
nvidia/cuda-core/10.1.168-418.67-1.0-1/repo/ppc64le/cuda-drivers
nvidia/cuda-dep/repo/ppc64le/dkms

Specifying cuda and cuda-drivers would pull in the toolkit. The last path component of each entry is the package name as recorded in the RPM, for example:

# rpm -qip /install/REPO/software/nvidia/cuda-core/10.1.168-418.67-1.0-1/repo/ppc64le/cuda-10.1.168-1.ppc64le.rpm | grep Name
Name        : cuda

The genimage output should then show whether the packages get installed. You can also chroot into the rootimgdir to check...
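
As an alternative to a chroot, the image's RPM database can be queried from the management node with rpm --root. The sketch below assumes the unpacked tree lives in the rootimg subdirectory of the osimage's rootimgdir. rootimg_of and missing_pkgs are hypothetical helper names.

```shell
#!/bin/bash
# rootimg_of ROOTIMGDIR
# Prints the path of the unpacked image tree, assumed to be the
# "rootimg" subdirectory of the osimage's rootimgdir.
rootimg_of() {
    printf '%s/rootimg\n' "${1%/}"
}

# missing_pkgs ROOTIMGDIR PKG...
# Prints each named package that is NOT installed inside the image.
missing_pkgs() {
    local root pkg
    if ! command -v rpm >/dev/null 2>&1; then
        echo "rpm command not found" >&2
        return 2
    fi
    root=$(rootimg_of "$1"); shift
    for pkg in "$@"; do
        rpm --root "$root" -q "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}
```

For example: missing_pkgs /install/netboot/centos7.6/x86_64/cudafull cuda cuda-drivers dkms prints nothing when all three are in the image.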

This was a lot of info, so try fixing the cuda-core repo first and re-run genimage. You should see the cuda packages get installed. Let me know where it ends up and I can help some more.

adityashewale commented 4 years ago

**simple method to create cuda 10.2 state less image**

yum -y --installroot=$CHROOT install kernel-headers-3.10.0-957.el7.x86_64.rpm
yum -y --installroot=$CHROOT install kernel-devel-3.10.0-957.el7.x86_64.rpm
yum -y --installroot=$CHROOT install dkms
rpm -ivh cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm

mount --bind /proc /install/netboot/centos7.6/x86_64/cuda10.2/rootimg/proc
mount --bind /dev /install/netboot/centos7.6/x86_64/cuda10.2/rootimg/dev
mount --bind /sys /install/netboot/centos7.6/x86_64/cuda10.2/rootimg/sys

yum -y --installroot=$CHROOT install cuda
cp cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm $CHROOT/
yum -y --installroot=$CHROOT install cuda
umount $CHROOT/sys
umount $CHROOT/proc
umount $CHROOT/dev

packimage centos7.6-x86_64-netboot-cuda10.2
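
The bind mounts in the steps above stay in place if the yum step fails partway, and a later packimage might then pick up /proc, /dev and /sys content. One possible way to guard against that is to wrap the install in a helper that always unmounts. This is only a sketch: with_bind_mounts, MOUNT and UMOUNT are hypothetical names introduced here, and the real mounts need root.

```shell
#!/bin/bash
# MOUNT and UMOUNT can be overridden (e.g. with "echo") for a dry run.
MOUNT=${MOUNT:-mount}
UMOUNT=${UMOUNT:-umount}

# with_bind_mounts CHROOT CMD [ARG...]
# Bind-mounts /proc, /dev and /sys into CHROOT, runs CMD, then unmounts
# whatever was mounted (in reverse order), even if CMD fails.
# Returns CMD's exit status, or 1 if a mount failed.
with_bind_mounts() {
    local chroot=$1; shift
    local fs rc=0 i
    local -a mounted=()
    for fs in proc dev sys; do
        if "$MOUNT" --bind "/$fs" "$chroot/$fs"; then
            mounted+=("$chroot/$fs")
        else
            rc=1                           # stop at the first failed mount
            break
        fi
    done
    if [ "$rc" -eq 0 ]; then
        "$@" || rc=$?                      # run the wrapped command
    fi
    for (( i = ${#mounted[@]} - 1; i >= 0; i-- )); do
        "$UMOUNT" "${mounted[i]}"          # clean up in reverse order
    done
    return "$rc"
}
```

For example: with_bind_mounts "$CHROOT" yum -y --installroot="$CHROOT" install cuda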

greemi7 commented 1 year ago

adityashewale - You may want to show more details here. There are many assumptions we have to make to fill in the gaps. We need to know things like: how do you version-control your rootimg so we don't destroy the current image, where did you get your RPM, why do we have to install cuda twice, and how do you point your nodes to the new image? This looks like a cut-and-paste from some other source that is missing key information and steps. Thanks! -greemi7