Problem in training process

CarrieX6 commented 9 months ago

Hi, I'm trying to reproduce your work after recompiling the NMS and 3D Crop files with CUDA 11.3, TensorFlow 2.5.0, and Python 3.8. However, I am still getting error messages. Could you help with this problem? Thanks a lot!

xychen2022 commented 9 months ago

This is because you haven't compiled the CUDA C code successfully. Please follow the instructions given in ReadMe.txt to compile it. Since you are using CUDA 11.3, you need to change cuda-11.2 to cuda-11.3 in the commands. BTW, you need to make sure the paths are valid on your machine. Let me know if you have problems when doing this.

xychen2022 commented 9 months ago

If you have compiled the code without any problem, make sure you have replaced the old non_max_suppression_op.so and crop_and_resize_op_gpu.so with the new ones you built. The old ones were compiled by me and are not suitable for your environment.

CarrieX6 commented 9 months ago

Thank you for your response! I have compiled the code without any problem and replaced non_max_suppression_op.so and crop_and_resize_op_gpu.so, but I still get the same error. I also tried non_max_suppression_op_gpu.so, but it didn't work.

xychen2022 commented 9 months ago

Judging from the 3rd screenshot above, the problem is that several dynamic libraries, e.g., libcublas.so.11, are not being loaded on your side. Please solve this problem first. Sorry, I am not sure what else could be the cause apart from code recompilation.
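A quick way to see which libraries the dynamic loader can actually open is to try loading them from Python. This is only a minimal sketch: the two library names come from this thread, and the helper names `can_load` and `report` are made up for illustration. Extend the list with whatever names appear in your own error stack.

```python
import ctypes

# Library names mentioned in this thread; add any others from your error stack.
LIBRARIES = ["libcudart.so.11.0", "libcublas.so.11"]


def can_load(library_name):
    """Return True if the dynamic loader can open library_name."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False


def report(names=LIBRARIES):
    """Map each library name to whether it could be loaded."""
    return {name: can_load(name) for name in names}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {'loadable' if ok else 'NOT loadable'}")
```

The loader reads LD_LIBRARY_PATH when the Python process starts, so run this in the same shell you use for training.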

CarrieX6 commented 9 months ago

I'm also trying to fix the issue with the dynamic libraries. I've confirmed that these libraries exist but still get errors, so I'm not sure if this is related to recompilation. Anyway, I will try to work on it. Thank you again for your prompt reply. I appreciate your help!

xychen2022 commented 9 months ago

Let me know if you find any other causes of the problem. Good luck! BTW, you can also try installing CUDA 11.2 instead and see if the code runs normally.

CarrieX6 commented 9 months ago

Sure, I'll try it if I cannot handle this problem. Thank you!

xychen2022 commented 9 months ago

The cause might be related to missing files/invalid paths. I googled a little bit and found the following highly voted solution. Hope it helps. You may need to change the command slightly to adapt to your machine.

First, find out where "libcudart.so.11.0" is. If your error stack reports a different missing library, replace "libcudart.so.11.0" with that name in the command below:

sudo find / -name 'libcudart.so.11.0'

Output on my system, showing where "libcudart.so.11.0" is located:

/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudart.so.11.0

If the command prints nothing, please make sure that CUDA and any other required packages are installed on your system.
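If you would rather not scan the whole filesystem with sudo, the same search can be sketched in Python over a few likely directories. The function name `find_library` and the default roots are assumptions for illustration, not something the repository provides; adjust the roots to wherever CUDA is installed on your machine.

```python
import os


def find_library(name, roots=("/usr/local", "/usr/lib", "/opt")):
    """Return every path under the given roots whose file name equals name."""
    matches = []
    for root in roots:
        # os.walk silently skips roots that do not exist or cannot be read.
        for dirpath, _dirnames, filenames in os.walk(root):
            if name in filenames:
                matches.append(os.path.join(dirpath, name))
    return matches
```

For example, `find_library("libcudart.so.11.0")` returns a list of matching paths, or an empty list if nothing was found under the chosen roots.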

Second, add the path to the environment file.

Edit /etc/profile:

sudo vim /etc/profile

Append the path to "LD_LIBRARY_PATH" in that file:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.1/targets/x86_64-linux/lib

Then make the change take effect:

source /etc/profile
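Appending the same directory every time the profile is sourced makes LD_LIBRARY_PATH grow with repeated entries. As a rough sketch, the helper below (the name `append_once` is hypothetical) builds the new value only if the directory is not already listed, then prints an export line you could paste into the profile. The directory is the one from the example output above; use your own.

```python
import os


def append_once(current, new_dir):
    """Return current with new_dir appended, unless it is already listed.

    Empty entries (which the loader treats as the current directory) are dropped.
    """
    entries = [entry for entry in current.split(":") if entry]
    if new_dir not in entries:
        entries.append(new_dir)
    return ":".join(entries)


# Directory taken from the example output above; replace it with your own path.
cuda_lib = "/usr/local/cuda-11.1/targets/x86_64-linux/lib"
value = append_once(os.environ.get("LD_LIBRARY_PATH", ""), cuda_lib)
print(f'export LD_LIBRARY_PATH="{value}"')
```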

You can also check out: https://github.com/tensorflow/tensorflow/issues/45930 and https://stackoverflow.com/questions/70967651/could-not-load-dynamic-library-libcudart-so-11-0

CarrieX6 commented 9 months ago

Thanks a lot! I've solved this problem by updating LD_LIBRARY_PATH. I found that recompiling adds redundant paths to LD_LIBRARY_PATH when two different CUDA versions (11.3 and 11.8) are installed.

The following steps worked for me:

Check that the dynamic libraries exist. Ensure that cuDNN and the associated libraries are installed.
With CUDA 11.3, recompile and train: errors about dynamic libraries. tf.test.is_gpu_available() returns False.
Switch to CUDA 11.8: tf.test.is_gpu_available() returns True.
With CUDA 11.8, recompile and train: errors about dynamic libraries again. tf.test.is_gpu_available() returns False.
Switch back to CUDA 11.3, inspect the output of the 'export' command, and find redundant paths in LD_LIBRARY_PATH.
Export LD_LIBRARY_PATH again.
Switch back to CUDA 11.8: tf.test.is_gpu_available() returns True. Running the training directly then worked.
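The check for redundant paths in the steps above can be sketched roughly as follows. The function name `inspect_ld_path` is made up for illustration, and the `cuda-<major>.<minor>` directory pattern is an assumption based on the default install layout, so paths that use a different naming scheme will not be detected.

```python
import os
import re
from collections import Counter


def inspect_ld_path(value):
    """Return (duplicate entries, CUDA versions mentioned) for an LD_LIBRARY_PATH value."""
    entries = [entry for entry in value.split(":") if entry]
    counts = Counter(entries)
    duplicates = sorted(entry for entry, n in counts.items() if n > 1)
    versions = set()
    for entry in entries:
        # Assumes directories are named like /usr/local/cuda-11.3/...
        match = re.search(r"cuda-(\d+\.\d+)", entry)
        if match:
            versions.add(match.group(1))
    return duplicates, sorted(versions)


if __name__ == "__main__":
    dups, versions = inspect_ld_path(os.environ.get("LD_LIBRARY_PATH", ""))
    print("duplicate entries:", dups)
    print("CUDA versions referenced:", versions)
```

More than one version in the second list suggests the loader may pick up libraries from a CUDA install other than the one the ops were compiled against.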

So I guess the export can produce incorrect links to dynamic libraries when several CUDA versions are present in the environment. In summary, the C files must be recompiled without errors (make sure the recompilation is correct) and TensorFlow must be able to see the GPU (make sure the libraries are installed). Then it should work.
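For the GPU check, note that tf.test.is_gpu_available() is deprecated in TensorFlow 2.x in favor of tf.config.list_physical_devices('GPU'). A minimal sketch of the same check with the newer call is shown below; the wrapper name `gpu_visible` is hypothetical, and it returns None when TensorFlow cannot be imported at all.

```python
def gpu_visible():
    """Return True/False for GPU visibility, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow did not register any GPU device.
    return len(tf.config.list_physical_devices("GPU")) > 0


if __name__ == "__main__":
    print("GPU visible to TensorFlow:", gpu_visible())
```

Running this before and after changing LD_LIBRARY_PATH (in a fresh shell each time) shows whether the change affected what TensorFlow can load.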

xychen2022 commented 9 months ago

Glad to know that your problem is solved. It seems to be caused by the environment variables rather than by any bug in the code. I will close this issue for now.

xychen2022 / 3DFasterRCNN

Problem in training process #5