xuhuisheng / rocm-build

build scripts for ROCm
Apache License 2.0

libMLIRMIOpen not found problem while running script #35 #28

Closed tcchau closed 2 years ago

tcchau commented 2 years ago

Environment

Hardware description
GPU RX580
CPU i7
Software version
OS Ubuntu 20.04
ROCm 5.1.3
Python 3.8

What is the expected behavior

No error

What actually happens

Error while running script # 35

How to reproduce

Run script #35

Is this an expected hiccup in the build process? Should we be going through each of the ROCm projects and running the install_deps script?

xuhuisheng commented 2 years ago

The MIOpen MLIR library is one of the dependencies listed in requirements.txt: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/release/rocm-rel-5.1/requirements.txt#L4

It should be installed when we run install_deps.cmake.

Not every ROCm component has its own install_deps script, so for now we have to install dependencies one by one.

tcchau commented 2 years ago

@xuhuisheng I figured it out. For some reason the call to

sudo cmake -P $ROCM_GIT_DIR/MIOpen/install_deps.cmake --minimum --prefix /usr/local

did not work. I ran it manually with a different, temporary directory as the prefix, and it did build the libraries. I then copied them over to /usr/local/lib, after which script #35 could run because it was able to find the library files.
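For readers following along, the workaround described above might look roughly like the sketch below. The scratch prefix `DEPS_PREFIX`, the `DRY_RUN` switch, the `run` helper, and the default value of `ROCM_GIT_DIR` are hypothetical names chosen for illustration; only the `install_deps.cmake` invocation and the `/usr/local/lib` target come from the thread. The `--minimum` flag is left out here because later comments in this thread report that it skips the MLIR library.

```shell
#!/usr/bin/env bash
# Sketch only: install MIOpen's dependencies into a scratch prefix, then copy
# the built libraries into /usr/local/lib.
# ASSUMPTIONS: DEPS_PREFIX, DRY_RUN, run() and the ROCM_GIT_DIR default are
# illustrative and not part of rocm-build.

ROCM_GIT_DIR="${ROCM_GIT_DIR:-$HOME/ROCm}"      # placeholder checkout location
DEPS_PREFIX="${DEPS_PREFIX:-/tmp/miopen-deps}"  # temporary install prefix
DRY_RUN="${DRY_RUN:-1}"                         # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_miopen_deps() {
  # Build the dependencies into the scratch prefix (no sudo needed there).
  run cmake -P "$ROCM_GIT_DIR/MIOpen/install_deps.cmake" --prefix "$DEPS_PREFIX" &&
  # Copy the resulting libraries to where script #35 looks for them.
  run sudo cp -a "$DEPS_PREFIX/lib/." /usr/local/lib/
}

install_miopen_deps
```

With the default `DRY_RUN=1` the script only echoes the two commands so they can be reviewed; setting `DRY_RUN=0` would actually run them.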

tcchau commented 2 years ago

Since I'm trying to document steps for a Dockerfile, I retried this. Apparently, the dependencies installer here

sudo cmake -P $ROCM_GIT_DIR/MIOpen/install_deps.cmake --minimum --prefix /usr/local

works to install the "minimum", as suggested by the command-line argument to cmake, but the build scripts still expect the MLIR library. After finding this issue: https://github.com/ROCmSoftwarePlatform/MIOpen/issues/908 I modified script #35 to define MIOPEN_USE_MLIR=0 in order to remove the dependency on MLIR.

So far, this seems to be okay.

xuhuisheng commented 2 years ago

Thank you for noticing this. I can add the MIOPEN_USE_MLIR=0 parameter to skip MLIR.

tcchau commented 2 years ago

No problem, @xuhuisheng! However, I'm really stuck this time: while running script #43, after a certain point, the build process actually crashes and takes down my Docker container.

  CXX    p-exp.o
  CXX    version.o
  CXX    xml-builtin.o
  CXX    init.o

It gets up to here, and then the docker container process exits. Any clues as to what is going on at all, or how to troubleshoot? Unfortunately, I'm a noob when it comes to Ubuntu.

xuhuisheng commented 2 years ago

Actually, I can only build rocGDB; I cannot package it into a deb file. So I suggest skipping rocGDB and continuing to build the other components. We can use ROCm without rocGDB.

tcchau commented 2 years ago

Cool @xuhuisheng good to know. I'm going to give that a try.

tcchau commented 2 years ago

Just to update you @xuhuisheng , I was able to find that the exit code for the container was 137, which means either it was sent a SIGKILL or it ran out of memory. By doing

sudo docker stats <container_id>

I was able to see that during the build of ROCgdb, it was using more and more memory to the point that it exhausted my 32GB of memory and then crashed.
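As a small illustration of how the exit code above can be read: by shell convention, a status above 128 means the process was terminated by signal number (status minus 128), so 137 corresponds to signal 9, SIGKILL. The helper name `describe_exit_code` below is made up for this sketch.

```shell
#!/usr/bin/env bash
# Decode an exit status. By shell convention, a status above 128 means the
# process was terminated by signal number (status - 128).
# describe_exit_code is a hypothetical helper name, used only for this sketch.
describe_exit_code() {
  local code="$1"
  if [ "$code" -le 128 ]; then
    echo "exit $code: process exited on its own"
    return
  fi
  local sig=$((code - 128))
  case "$sig" in
    9)  echo "exit $code: killed by signal 9 (SIGKILL)" ;;
    15) echo "exit $code: killed by signal 15 (SIGTERM)" ;;
    *)  echo "exit $code: killed by signal $sig" ;;
  esac
}

describe_exit_code 137   # prints: exit 137: killed by signal 9 (SIGKILL)
```

To tell an out-of-memory kill apart from some other SIGKILL, `sudo docker inspect --format '{{.State.OOMKilled}}' <container_id>` should report `true` when the kernel's OOM killer stopped the container.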

Per your suggestion, I'm going to skip this and go to the next build script; this looks like some kind of error potentially in the ROCm scripts themselves.

tcchau commented 2 years ago

Sorry to keep bugging you, @xuhuisheng, but I'm having a problem with script #61. I'm getting this strange error and I'm not sure how critical it is:

CMake Error at /usr/local/share/cmake/cmakeget/CMakeGet.cmake:169 (message):
  cget_install_dir(): /tmp/cget-11-48-50-TPugQ-1/download/protobuf-3.2.0 is
  not a cmake package

Any ideas?

xuhuisheng commented 2 years ago

AMDMIGraphX is an optional component, too. It is the AMD ONNX runtime, and like MIOpen it needs cmake to install its dependencies, so if your network connection is unreliable, the cmake script may throw errors. I suggest you skip this component and go on to the others.

However, the protobuf version in requirements.txt is 3.11.0, so I have no idea why it is trying to download protobuf-3.2.0. https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/release/rocm-rel-5.1/requirements.txt#L1

tcchau commented 2 years ago

@xuhuisheng Final follow up on this issue: I tried replicating the build in a fresh docker image and discovered that it's not sufficient to use the MIOPEN_USE_MLIR=0 define... you also have to build the dependencies without the --minimum flag. Otherwise there are some header files that are missing that the MIOpen build needs and it'll fail.
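Put together, the two changes might look something like the sketch below. It only prints the commands so they can be reviewed before running. The `BUILD_DIR` location, the `ROCM_GIT_DIR` default, and the function name `print_miopen_steps` are assumptions made for illustration, and the real script #35 passes further cmake options that are not reproduced here.

```shell
#!/usr/bin/env bash
# Sketch only: the two adjustments described above, printed rather than run.
# ASSUMPTIONS: BUILD_DIR, the ROCM_GIT_DIR default and print_miopen_steps are
# illustrative; script #35 in rocm-build sets additional cmake options.

ROCM_GIT_DIR="${ROCM_GIT_DIR:-$HOME/ROCm}"   # placeholder checkout location
BUILD_DIR="${BUILD_DIR:-/tmp/miopen-build}"  # hypothetical build directory

print_miopen_steps() {
  # 1. Install the full dependency set: same command as before, minus --minimum.
  echo "sudo cmake -P $ROCM_GIT_DIR/MIOpen/install_deps.cmake --prefix /usr/local"
  # 2. Configure MIOpen with the MLIR dependency switched off.
  echo "cmake -S $ROCM_GIT_DIR/MIOpen -B $BUILD_DIR -DMIOPEN_USE_MLIR=0"
}

print_miopen_steps
```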

Anyway, I think it's okay to close this issue now. I was eventually able to build a local version of PyTorch v1.11.0 with ROCm enabled, and I'm using it in my project at the moment. In my particular project, it gives about a 3x speedup for training.

Thanks for your help!