tcchau closed this issue 2 years ago
miopenmlir is one of the dependencies in requirements.txt: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/release/rocm-rel-5.1/requirements.txt#L4
It should be installed when we run install_deps.cmake.
Not every ROCm component has its own install_deps script, so for now we have to install the dependencies one by one.
@xuhuisheng I figured it out. For some reason the call to
sudo cmake -P $ROCM_GIT_DIR/MIOpen/install_deps.cmake --minimum --prefix /usr/local
did not work. I ran it manually with a different, temporary directory as the prefix, and it did build the libraries. I then manually copied them over to /usr/local/lib, after which script # 35 could run because it was able to find the library files.
Since I'm trying to document steps for a Dockerfile, I retried this. Apparently, the dependencies installer here
sudo cmake -P $ROCM_GIT_DIR/MIOpen/install_deps.cmake --minimum --prefix /usr/local
works to install the "minimum", as suggested by the command line argument to cmake, but the build scripts still expect the MLIR library. After finding this issue: https://github.com/ROCmSoftwarePlatform/MIOpen/issues/908 I modified script # 35 to define MIOPEN_USE_MLIR=0 in order to remove the dependency on MLIR.
So far, this seems to be okay.
Thank you for noticing this. I can append the MIOPEN_USE_MLIR=0 parameter to skip MLIR.
No problem, @xuhuisheng ! However, I'm really stuck this time: while running script # 43, after a certain point, the build process actually crashes and takes down my docker container....
CXX p-exp.o
CXX version.o
CXX xml-builtin.o
CXX init.o
It gets up to here, and then the docker container process exits. Any clues as to what is going on at all, or how to troubleshoot? Unfortunately, I'm a noob when it comes to Ubuntu.
Actually, I can only build ROCgdb, not package it into a deb file. So I suggest skipping ROCgdb and continuing to build the other components. We can use ROCm without ROCgdb.
Cool @xuhuisheng good to know. I'm going to give that a try.
Just to update you @xuhuisheng , I was able to find that the exit code for the container was 137, which means either it was sent a SIGKILL or it ran out of memory. By doing
sudo docker stats <container_id>
I was able to see that during the build of ROCgdb, it was using more and more memory to the point that it exhausted my 32GB of memory and then crashed.
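The exit code 137 mentioned above decodes as 128 + 9, meaning the process was terminated by signal 9 (SIGKILL). A minimal shell sketch of that arithmetic follows. The function names `exit_signal` and `was_oom_killed` are made up for illustration, and `was_oom_killed` assumes the stopped container has not been removed yet, so that `docker inspect` can still read its state.

```shell
#!/bin/sh
# Hypothetical helper: map an exit code to the number of the signal that
# terminated the process. Codes above 128 mean "killed by signal (code - 128)".
exit_signal() {
    code=$1
    if [ "$code" -gt 128 ]; then
        echo $((code - 128))
    else
        echo 0   # the process exited on its own, not because of a signal
    fi
}

# Hypothetical helper: ask Docker whether the kernel OOM killer stopped the
# container. Prints "true" or "false". Defined only, not called here,
# because it needs a running Docker daemon and a real container ID.
was_oom_killed() {
    sudo docker inspect --format '{{.State.OOMKilled}}' "$1"
}

exit_signal 137   # prints 9, the signal number of SIGKILL
```

Running `was_oom_killed <container_id>` after the crash would help tell an out-of-memory kill apart from a SIGKILL sent for some other reason.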
Per your suggestion, I'm going to skip this and go to the next build script; this looks like some kind of error potentially in the ROCm scripts themselves.
Sorry to keep bugging you, @xuhuisheng. I'm having a problem with script # 61. I'm getting this strange error and I'm not sure how critical it is:
CMake Error at /usr/local/share/cmake/cmakeget/CMakeGet.cmake:169 (message):
cget_install_dir(): /tmp/cget-11-48-50-TPugQ-1/download/protobuf-3.2.0 is
not a cmake package
Any ideas?
AMDMIGraphX is an unnecessary component, too. It is AMD's ONNX runtime, and like MIOpen it needs cmake to install its dependencies, so if your network connection is unreliable, the cmake script may throw errors. I suggest you skip this component and carry on with the others.
But the protobuf version in requirements.txt is 3.11.0, so I have no idea why it is trying to download protobuf-3.2.0. https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/release/rocm-rel-5.1/requirements.txt#L1
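To see which protobuf version a local checkout actually pins, one could search its requirements.txt and compare the result against the 3.11.0 on the release branch linked above. A small sketch: the helper name `pinned_lines` is made up, and the AMDMIGraphX path in the usage note is only assumed to mirror the MIOpen path used earlier in the thread.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of a requirements file that mention
# a package (case-insensitive), prefixed with their line numbers.
# Usage: pinned_lines PACKAGE FILE
pinned_lines() {
    grep -i -n -- "$1" "$2"
}
```

For example, `pinned_lines protobuf "$ROCM_GIT_DIR/AMDMIGraphX/requirements.txt"` would show what the local sources ask for. If it reports something other than 3.11.0, the checkout may not be on the rocm-rel-5.1 release branch.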
@xuhuisheng Final follow-up on this issue: I tried replicating the build in a fresh Docker image and discovered that it's not sufficient to use the MIOPEN_USE_MLIR=0 define... you also have to build the dependencies without the --minimum flag. Otherwise some header files that the MIOpen build needs are missing and it will fail.
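Putting the two findings together, the MIOpen step would look roughly like the sketch below. It only prints the commands unless DRY_RUN=0 is set. The `run` wrapper, the `DRY_RUN` variable, the function names, and the build directory are assumptions made for this illustration. In the thread the MIOPEN_USE_MLIR=0 define was added by editing script # 35, whose contents are not shown here, so passing it as `-DMIOPEN_USE_MLIR=0` on the cmake command line is an assumption too, and any other options that script passes are omitted.

```shell
#!/bin/sh
# Print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Step 1: install MIOpen's dependencies WITHOUT --minimum, so that the
# header files the MIOpen build needs are installed as well.
install_miopen_deps() {
    run sudo cmake -P "$ROCM_GIT_DIR/MIOpen/install_deps.cmake" --prefix /usr/local
}

# Step 2: configure MIOpen with MLIR support switched off.
# Assumptions: the define is passed as a CMake cache entry, and the build
# directory is hypothetical. Other configure options are left out.
configure_miopen() {
    run cmake -S "$ROCM_GIT_DIR/MIOpen" -B "$ROCM_GIT_DIR/MIOpen/build" -DMIOPEN_USE_MLIR=0
}
```

Calling `install_miopen_deps` and then `configure_miopen` with `ROCM_GIT_DIR` set prints both commands for review. Exporting `DRY_RUN=0` first would execute them instead.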
Anyway, I think it's okay to close this issue now. I was eventually able to build a local version of pytorch v1.11.0 with ROCm enabled, and I'm using it in my project at the moment. It gives about a 3x training speedup in my particular project.
Thanks for your help!
Environment
What is the expected behavior
No error
What actually happens
Error while running script # 35
How to reproduce
Run script #35
Is this an expected hiccup in the build process? Should we be going through each of the ROCm projects and running the install_deps script?