Open morepixel2crsi opened 5 years ago
Hi @morepixel2crsi, welcome! OK, let me see if I can help out.
For this, you can add the addkcmdline attribute to the xCAT node definition.
xCAT then passes it as a kernel boot option in the node's boot config file, which disables nouveau:
chdef -t node <your node> addkcmdline=modprobe.blacklist=nouveau
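Once the node has been rebooted with the updated definition, one way to confirm the option actually reached the kernel is to look for it in /proc/cmdline on the node. A minimal sketch; the helper name `has_kernel_arg` is made up for illustration and is not part of xCAT:

```shell
# has_kernel_arg ARG [CMDLINE]
# Succeeds if ARG appears as a whole word in the kernel command line.
# CMDLINE defaults to the contents of /proc/cmdline on the running node.
has_kernel_arg() {
    arg=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $arg "*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Example, run on the booted compute node:
#   has_kernel_arg modprobe.blacklist=nouveau && echo "nouveau is blacklisted"
```

Passing the command line as an optional second argument makes it possible to check a string copied from a boot config file as well.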
I copied your steps down here for readability so I can double-check them:
wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm
mkdir -p /tmp/cuda
mv /root/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm /tmp/cuda/
cd /tmp/cuda
rpm2cpio /tmp/cuda/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm | cpio -i -d
mkdir -p /install/cuda-10.1/x86_64/cuda-core
cp /tmp/cuda/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm /install/cuda-10.1/x86_64/cuda-core
createrepo /install/cuda-10.1/x86_64/cuda-core
After the rpm2cpio step above, you made a mistake in the copy: you should be copying the rpms under /tmp/cuda/var, not the rpm file itself.
From https://xcat-docs.readthedocs.io/en/stable/advanced/gpu/nvidia/repo/rhels.html should be
When I ran the steps above on my machine, rpm2cpio extracted the rpms and the result looked like the listing below. You want the files under /tmp/cuda/var:
# ls -ltr /tmp/cuda/
total 2359472
-rw-r--r-- 1 root root 2416096987 May 6 20:33 cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm
drwxr-xr-x 3 root root 50 Jun 6 17:28 var <--------
drwxr-xr-x 3 root root 25 Jun 6 17:28 etc
[root@c910f04x18 cuda]#
So to fix, run:
rm /install/cuda-10.1/x86_64/cuda-core/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm
cp /tmp/cuda/var/cuda-repo-10-1-local-10.1.168-418.67/*.rpm /install/cuda-10.1/x86_64/cuda-core/
createrepo /install/cuda-10.1/x86_64/cuda-core
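Before re-running createrepo, a quick sanity check can show whether the directory holds the extracted packages or only the wrapper rpm. This is a rough sketch; `count_payload_rpms` is a hypothetical helper name, and it assumes the wrapper is the only file whose name starts with cuda-repo-:

```shell
# count_payload_rpms DIR
# Prints how many *.rpm files in DIR are real packages, i.e. not the
# cuda-repo-* wrapper rpm that only carries the local repository.
count_payload_rpms() {
    dir=$1
    n=0
    for f in "$dir"/*.rpm; do
        [ -e "$f" ] || continue          # glob matched nothing
        case ${f##*/} in
            cuda-repo-*) ;;              # wrapper rpm, do not count it
            *) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Example on the management node:
#   count_payload_rpms /install/cuda-10.1/x86_64/cuda-core
```

A result of 0 means the repo still contains only the wrapper, so genimage would have no cuda packages to install from it.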
Ok, then let's go to the cuda-deps...
wget http://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/d/dkms-2.6.1-1.el7.noarch.rpm
mkdir -p /install/cuda-10.1/x86_64/cuda-deps/
mv /root/dkms-2.6.1-1.el7.noarch.rpm /install/cuda-10.1/x86_64/cuda-deps/
createrepo /install/cuda-10.1/x86_64/cuda-deps/
Then you create an image and change some properties...
lsdef -t osimage -z centos7.6-x86_64-netboot-compute | sed 's/netboot-compute:/netboot-nvidia:/' | mkdef -z
mkdef -t osimage -o centos7.6-x86_64-netboot-cudafull --template centos7.6-x86_64-netboot-compute otherpkgdir=/install/post/otherpkgs/centos7.6/x86_64/cuda-10.1
ln -s /install/cuda-10.1 /install/post/otherpkgs/centos7.6/x86_64/cuda-10.1
chdef -t osimage -o centos7.6-x86_64-netboot-cudafull \
rootimgdir=/install/netboot/centos7.6/x86_64/cudafull
Set the custom otherpkgdir and rootimgdir attributes... 👍
mkdir -p /install/custom/netboot/centos/
vi /install/custom/netboot/centos/cudafull.centos7-6.x86_64.pkglist # ADD : pciutils ( example )
chdef -t osimage -o centos7.6-x86_64-netboot-cudafull pkglist=/install/custom/netboot/centos/cudafull.centos7-6.x86_64.pkglist
Ok, added pciutils to the OS pkglist....
vi /install/custom/netboot/centos/cudafull.centos7-6.x86_64.otherpkgs.pkglist # ADD : cuda-10.1/x86_64/cuda-deps/dkms
chdef -t osimage -o centos7.6-x86_64-netboot-cudafull otherpkglist=/install/custom/netboot/centos/cudafull.centos7-6.x86_64.otherpkgs.pkglist
Ok, added dkms to the otherpkglist...
Then create the image:
genimage centos7.6-x86_64-netboot-cudafull
packimage centos7.6-x86_64-netboot-cudafull
nodeset testnvidia osimage=centos7.6-x86_64-netboot-cudafull
OK, so right. The way I normally like to organize this: pkgdir is what is shipped by the OS distribution (e.g. pciutils), and otherpkgdir is everything else, like third-party software (cuda, etc.).
But for diskless, you should only specify a single otherpkgdir, and the otherpkglist entries should be relative to that path. For example, I put a bunch of software under /install/REPO/software, and that is what goes into my otherpkgdir:
# lsdef -t osimage xcat.netboot.redhat-alt.cuda.mofed -i otherpkgdir
Object name: xcat.netboot.redhat-alt.cuda.mofed
otherpkgdir=/install/REPO/software
Then for CUDA support, under otherpkglist I have cuda.otherpkglist, which contains:
nvidia/cuda-core/10.1.168-418.67-1.0-1/repo/ppc64le/cuda
nvidia/cuda-core/10.1.168-418.67-1.0-1/repo/ppc64le/cuda-drivers
nvidia/cuda-dep/repo/ppc64le/dkms
Specifying cuda and cuda-drivers would pull in the toolkit. The name is the rpm name from the package, for example:
# rpm -qip /install/REPO/software/nvidia/cuda-core/10.1.168-418.67-1.0-1/repo/ppc64le/cuda-10.1.168-1.ppc64le.rpm | grep Name
Name : cuda
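Since every otherpkglist entry has to resolve relative to otherpkgdir, a small check before running genimage can catch entries that point nowhere. This is only a sketch: `missing_otherpkgs` is a hypothetical helper, it does a loose prefix match (so an entry ending in cuda also matches cuda-drivers-*.rpm), and it skips comment and blank lines without handling any other pkglist directives:

```shell
# missing_otherpkgs OTHERPKGDIR PKGLIST
# Prints every plain entry of PKGLIST that has no matching rpm under
# OTHERPKGDIR and returns non-zero if any entry was printed.
# Blank lines and lines starting with '#' are skipped.
missing_otherpkgs() {
    base=$1
    list=$2
    rc=0
    while IFS= read -r entry || [ -n "$entry" ]; do
        case $entry in
            ''|'#'*) continue ;;
        esac
        found=no
        for f in "$base/$entry"*.rpm; do
            if [ -e "$f" ]; then
                found=yes
                break
            fi
        done
        if [ "$found" = no ]; then
            echo "$entry"
            rc=1
        fi
    done < "$list"
    return "$rc"
}

# Example with the paths from this thread:
#   missing_otherpkgs /install/post/otherpkgs/centos7.6/x86_64 \
#       /install/custom/netboot/centos/cudafull.centos7-6.x86_64.otherpkgs.pkglist
```

An empty output means each entry matched at least one rpm file; it does not prove that the repo metadata from createrepo is up to date.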
Then genimage should show whether the package is there; you can also chroot into the rootimgdir to check.
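As an alternative to a full chroot, the image's rpm database can be queried from the management node with rpm's --root option. A minimal sketch, assuming the image tree lives in the rootimg directory under rootimgdir; `rootimg_has_pkg` is a made-up helper name:

```shell
# rootimg_has_pkg ROOTIMG PKG...
# Queries the rpm database inside ROOTIMG for the given package names.
# Returns 2 if ROOTIMG is not a directory, otherwise rpm's own exit status.
rootimg_has_pkg() {
    rootimg=$1
    shift
    if [ ! -d "$rootimg" ]; then
        echo "no such rootimg: $rootimg" >&2
        return 2
    fi
    rpm --root "$rootimg" -q "$@"
}

# Example on the management node after genimage:
#   rootimg_has_pkg /install/netboot/centos7.6/x86_64/cudafull/rootimg cuda cuda-drivers dkms
```

rpm prints "package ... is not installed" for anything missing, which gives a quick answer without entering the image.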
This was a lot of info, so try fixing the cuda-core repo first and re-run genimage. You should see the cuda packages get installed. Let me know where it ends up and I can help some more.
**simple method to create cuda 10.2 state less image**
yum -y --installroot=$CHROOT install kernel-headers-3.10.0-957.el7.x86_64.rpm
yum -y --installroot=$CHROOT install kernel-devel-3.10.0-957.el7.x86_64.rpm
yum -y --installroot=$CHROOT install dkms
rpm -ivh cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm
mount --bind /proc /install/netboot/centos7.6/x86_64/cuda10.2/rootimg/proc
mount --bind /dev /install/netboot/centos7.6/x86_64/cuda10.2/rootimg/dev
mount --bind /sys /install/netboot/centos7.6/x86_64/cuda10.2/rootimg/sys
yum -y --installroot=$CHROOT install cuda
cp cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm $CHROOT/
yum -y --installroot=$CHROOT install cuda
umount $CHROOT/sys
umount $CHROOT/proc
umount $CHROOT/dev
packimage centos7.6-x86_64-netboot-cuda10.2
adityashewale - You may want to show more details here. There are many assumptions we would have to make to come up with answers. We need to know things like: how you do version control of your rootimg so we don't destroy the current image, where you got your rpms, why cuda has to be installed twice, and how you point your nodes to the new image. This looks like a cut-and-paste from some other source that is missing some key information and steps. Thanks! -greemi7
Hi,
I just discovered the software and its basic usage. I would like to customize a CentOS image for GPU use:
wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm
mkdir -p /tmp/cuda
mv /root/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm /tmp/cuda/
cd /tmp/cuda
rpm2cpio /tmp/cuda/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm | cpio -i -d
mkdir -p /install/cuda-10.1/x86_64/cuda-core
cp /tmp/cuda/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm /install/cuda-10.1/x86_64/cuda-core
createrepo /install/cuda-10.1/x86_64/cuda-core
wget http://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/d/dkms-2.6.1-1.el7.noarch.rpm
mkdir -p /install/cuda-10.1/x86_64/cuda-deps/
mv /root/dkms-2.6.1-1.el7.noarch.rpm /install/cuda-10.1/x86_64/cuda-deps/
createrepo /install/cuda-10.1/x86_64/cuda-deps/
lsdef -t osimage -z centos7.6-x86_64-netboot-compute | sed 's/netboot-compute:/netboot-nvidia:/' | mkdef -z
mkdef -t osimage -o centos7.6-x86_64-netboot-cudafull --template centos7.6-x86_64-netboot-compute otherpkgdir=/install/post/otherpkgs/centos7.6/x86_64/cuda-10.1
lsdef -t osimage centos7.6-x86_64-netboot-cudafull -i otherpkgdir
ln -s /install/cuda-10.1 /install/post/otherpkgs/centos7.6/x86_64/cuda-10.1
chdef -t osimage -o centos7.6-x86_64-netboot-cudafull \
rootimgdir=/install/netboot/centos7.6/x86_64/cudafull
mkdir -p /install/custom/netboot/centos/
vi /install/custom/netboot/centos/cudafull.centos7-6.x86_64.pkglist
=> ADD : pciutils ( example )
chdef -t osimage -o centos7.6-x86_64-netboot-cudafull pkglist=/install/custom/netboot/centos/cudafull.centos7-6.x86_64.pkglist
vi /install/custom/netboot/centos/cudafull.centos7-6.x86_64.otherpkgs.pkglist
=> ADD : cuda-10.1/x86_64/cuda-deps/dkms
chdef -t osimage -o centos7.6-x86_64-netboot-cudafull otherpkglist=/install/custom/netboot/centos/cudafull.centos7-6.x86_64.otherpkgs.pkglist
genimage centos7.6-x86_64-netboot-cudafull
packimage centos7.6-x86_64-netboot-cudafull
nodeset testnvidia osimage=centos7.6-x86_64-netboot-cudafull
Point 1 is not that difficult
Can you explain how to add these packages into new osimage for diskless deployment ? ( like driverupdatesrc=rpm: )
Regards,