ybkscht / EfficientPose

NAN loss using your conda env and data and command after 500 epochs #69

Open monajalal opened 8 months ago

monajalal commented 8 months ago

I am getting NaN loss using your own processed data, conda env, and command. Is there a fix for it?

(EfficientPose) mona@mona-ThinkStation-P7:~/EfficientPose$ python train.py --phi 0 --weights weights/Weights/Linemod/object_8/phi_0_linemod_best_ADD.h5   linemod data/Linemod_preprocessed/ --object-id 8

etc etc etc

Epoch 00499: ADD did not improve from 0.00000
1790/1790 [==============================] - 240s 134ms/step - loss: nan - classification_loss: 17758.1582 - regression_loss: nan - transformation_loss: 0.0000e+00
Epoch 500/500
Running network: 100% (1009 of 1009) |##############################################################################################################################| Elapsed Time: 0:00:32 Time:  0:00:32
Parsing annotations: 100% (1009 of 1009) |##########################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1009/1009 [00:06<00:00, 152.32it/s]
1009 instances of class object with average precision: 0.0000
1009 instances of class object with ADD accuracy: 0.0000
1009 instances of class object with ADD-S-Accuracy: 0.0000
1009 instances of class object with 5cm-5degree-Accuracy: 0.0000
class object with Translation Differences in mm: Mean: nan and Std: nan
class object with Rotation Differences in degree: Mean: nan and Std: nan
1009 instances of class object with 2d-projection-Accuracy: 0.0000
1009 instances of class object with ADD(-S)-Accuracy: 0.0000
class object with Transformed Point Distances in mm: Mean: nan and Std: nan
class object with Transformed Symmetric Point Distances in mm: Mean: nan and Std: nan
class object with Mixed Transformed Point Distances in mm: Mean: nan and Std: nan
mAP: 0.0000
ADD: 0.0000
ADD-S: 0.0000
5cm_5degree: 0.0000
TranslationErrorMean_in_mm: nan
TranslationErrorStd_in_mm: nan
RotationErrorMean_in_degree: nan
RotationErrorStd_in_degree: nan
2D-Projection: 0.0000
Summed_Translation_Rotation_Error: nan
ADD(-S): 0.0000
AveragePointDistanceMean_in_mm: nan
AveragePointDistanceStd_in_mm: nan
AverageSymmetricPointDistanceMean_in_mm: nan
AverageSymmetricPointDistanceStd_in_mm: nan
MixedAveragePointDistanceMean_in_mm: nan
MixedAveragePointDistanceStd_in_mm: nan

Epoch 00500: ADD did not improve from 0.00000

Epoch 00500: ReduceLROnPlateau reducing learning rate to 1e-07.
1790/1790 [==============================] - 221s 123ms/step - loss: nan - classification_loss: 17888.1914 - regression_loss: nan - transformation_loss: 0.0000e+00
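
To see which loss terms have actually diverged, the Keras progress line can be parsed directly. The sketch below is only an illustration: `nan_losses` is a hypothetical helper, not part of EfficientPose, and it assumes the progress-line format shown in the log above.

```python
import math
import re

# Matches "name: value" pairs such as "regression_loss: nan" in a Keras progress line.
LOSS_PATTERN = re.compile(r"(\w*loss): (\S+)")


def nan_losses(progress_line):
    """Return the names of all loss terms whose reported value is NaN."""
    return [
        name
        for name, value in LOSS_PATTERN.findall(progress_line)
        if math.isnan(float(value))
    ]


# Progress line shortened from the log above (only the bar was trimmed).
line = (
    "1790/1790 [====] - 240s 134ms/step - loss: nan - "
    "classification_loss: 17758.1582 - regression_loss: nan - "
    "transformation_loss: 0.0000e+00"
)
print(nan_losses(line))  # ['loss', 'regression_loss']
```

Here only the regression term (and therefore the total) is NaN, while the classification loss is finite but very large. Keras also provides a `TerminateOnNaN` callback that stops training as soon as the loss becomes NaN; whether train.py can take it without modification is not shown in this thread.
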
Jingranxia commented 8 months ago

Use this command to install TensorFlow in the Python 3.8 environment:

pip install nvidia-tensorflow==1.15.4

Jingranxia commented 8 months ago

There is an issue with your environment setup.

monajalal commented 7 months ago

@Jingranxia

Your command didn't work. How did you create the environment?

(base) mona@ada:~/EfficientPose$ conda create --name effpose python=3.8
Collecting package metadata (current_repodata.json): done
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 23.7.4
  latest version: 23.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.10.0

## Package Plan ##

  environment location: /home/mona/anaconda3/envs/effpose

  added / updated specs:
    - python=3.8

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    pip-23.3.1                 |   py38h06a4308_0         2.6 MB
    python-3.8.18              |       h955ad1f_0        25.3 MB
    setuptools-68.0.0          |   py38h06a4308_0         927 KB
    wheel-0.41.2               |   py38h06a4308_0         108 KB
    ------------------------------------------------------------
                                           Total:        28.9 MB

The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main 
  _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 
  ca-certificates    pkgs/main/linux-64::ca-certificates-2023.08.22-h06a4308_0 
  ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.38-h1181459_1 
  libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_0 
  libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 
  libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 
  libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 
  ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 
  openssl            pkgs/main/linux-64::openssl-3.0.12-h7f8727e_0 
  pip                pkgs/main/linux-64::pip-23.3.1-py38h06a4308_0 
  python             pkgs/main/linux-64::python-3.8.18-h955ad1f_0 
  readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0 
  setuptools         pkgs/main/linux-64::setuptools-68.0.0-py38h06a4308_0 
  sqlite             pkgs/main/linux-64::sqlite-3.41.2-h5eee18b_0 
  tk                 pkgs/main/linux-64::tk-8.6.12-h1ccaba5_0 
  wheel              pkgs/main/linux-64::wheel-0.41.2-py38h06a4308_0 
  xz                 pkgs/main/linux-64::xz-5.4.2-h5eee18b_0 
  zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_0 

Proceed ([y]/n)? y

Downloading and Extracting Packages

Preparing transaction: done                                                                                                                                                                                 
Verifying transaction: done                                                                                                                                                                                 
Executing transaction: done                                                                                                                                                                                 
#
# To activate this environment, use
#
#     $ conda activate effpose
#
# To deactivate an active environment, use
#
#     $ conda deactivate

(base) mona@ada:~/EfficientPose$ conda activate effpose
(effpose) mona@ada:~/EfficientPose$ pip install nvidia-tensorflow==1.15.4
ERROR: Could not find a version that satisfies the requirement nvidia-tensorflow==1.15.4 (from versions: 0.0.1.dev4, 0.0.1.dev5)
ERROR: No matching distribution found for nvidia-tensorflow==1.15.4

This is what Bard says:

The error message indicates that the package nvidia-tensorflow==1.15.4 is not available for your current version of Python (3.8). To fix this, you can either install a different version of TensorFlow that is compatible with Python 3.8, or you can downgrade your version of Python to 3.6, which is the version that nvidia-tensorflow==1.15.4 was built for.
monajalal commented 7 months ago

Even with Python 3.6, I couldn't install the version of TensorFlow you mentioned, @Jingranxia.

(base) mona@ada:~/EfficientPose$ conda create --name effpose python=3.6
Collecting package metadata (current_repodata.json): done
Solving environment: unsuccessful attempt using repodata from current_repodata.json, retrying with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 23.7.4
  latest version: 23.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.10.0

## Package Plan ##

  environment location: /home/mona/anaconda3/envs/effpose

  added / updated specs:
    - python=3.6

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    python-3.6.13              |       h12debd9_1        32.5 MB
    ------------------------------------------------------------
                                           Total:        32.5 MB

The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main 
  _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 
  ca-certificates    pkgs/main/linux-64::ca-certificates-2023.08.22-h06a4308_0 
  certifi            pkgs/main/linux-64::certifi-2021.5.30-py36h06a4308_0 
  ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.38-h1181459_1 
  libffi             pkgs/main/linux-64::libffi-3.3-he6710b0_2 
  libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 
  libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 
  libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 
  ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 
  openssl            pkgs/main/linux-64::openssl-1.1.1w-h7f8727e_0 
  pip                pkgs/main/linux-64::pip-21.2.2-py36h06a4308_0 
  python             pkgs/main/linux-64::python-3.6.13-h12debd9_1 
  readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0 
  setuptools         pkgs/main/linux-64::setuptools-58.0.4-py36h06a4308_0 
  sqlite             pkgs/main/linux-64::sqlite-3.41.2-h5eee18b_0 
  tk                 pkgs/main/linux-64::tk-8.6.12-h1ccaba5_0 
  wheel              pkgs/main/noarch::wheel-0.37.1-pyhd3eb1b0_0 
  xz                 pkgs/main/linux-64::xz-5.4.2-h5eee18b_0 
  zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_0 

Proceed ([y]/n)? y

Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate effpose
#
# To deactivate an active environment, use
#
#     $ conda deactivate

(base) mona@ada:~/EfficientPose$ conda activate effpose
(effpose) mona@ada:~/EfficientPose$ pip install nvidia-tensorflow==1.15.4
ERROR: Could not find a version that satisfies the requirement nvidia-tensorflow==1.15.4 (from versions: 0.0.1.dev4, 0.0.1.dev5)
ERROR: No matching distribution found for nvidia-tensorflow==1.15.4
(effpose) mona@ada:~/EfficientPose$ python
Python 3.6.13 |Anaconda, Inc.| (default, Jun  4 2021, 14:25:59) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Jingranxia commented 7 months ago

Hello, run pip install nvidia-pyindex first, before installing nvidia-tensorflow.
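
The order matters because nvidia-pyindex registers NVIDIA's package index with pip; without it, pip only sees the placeholder versions listed in the error above. A minimal sketch of the two-step install, driven from Python so it uses the interpreter of the active environment. The helper names `install_commands` and `run_install` are hypothetical, and the version pin is simply the one mentioned in this thread.

```python
import subprocess
import sys


def install_commands(python=sys.executable):
    """Return the two pip invocations in the order they need to run."""
    return [
        # Step 1: make NVIDIA's package index known to pip.
        [python, "-m", "pip", "install", "nvidia-pyindex"],
        # Step 2: the pinned TensorFlow build can now be resolved.
        [python, "-m", "pip", "install", "nvidia-tensorflow==1.15.4"],
    ]


def run_install(dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in install_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run_install(dry_run=True)  # only prints the commands
```

Calling `run_install(dry_run=False)` inside the activated conda environment would perform the actual installation.
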

Jingranxia commented 7 months ago

Also, you should use Python 3.8, not 3.6, because with Python 3.6 the add-on packages are not installed automatically.

monajalal commented 7 months ago

@Jingranxia thank you, that helped, but now I get this error. How did you fix it?

(EfficientPose) mona@ada:~/EfficientPose$ python train.py --phi 0 --weights weights/Weights/Linemod/object_8/phi_0_linemod_best_ADD.h5   linemod data/Linemod_preprocessed/ --object-id 8
2023-11-29 13:13:42.445715: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:From train.py:204: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From train.py:206: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2023-11-29 13:13:43.505241: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3096000000 Hz
2023-11-29 13:13:43.510439: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x434bf50 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-11-29 13:13:43.510479: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-11-29 13:13:43.512905: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-11-29 13:13:43.584707: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4321be0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-29 13:13:43.584796: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA RTX 6000 Ada Generation, Compute Capability 8.9
2023-11-29 13:13:43.585722: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1665] Found device 0 with properties: 
name: NVIDIA RTX 6000 Ada Generation major: 8 minor: 9 memoryClockRate(GHz): 2.505
pciBusID: 0000:52:00.0
2023-11-29 13:13:43.585768: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-11-29 13:13:43.613692: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-11-29 13:13:43.617581: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-11-29 13:13:43.617997: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-11-29 13:13:43.618652: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2023-11-29 13:13:43.619621: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2023-11-29 13:13:43.619835: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-11-29 13:13:43.620187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1793] Adding visible gpu devices: 0
2023-11-29 13:13:43.620216: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-11-29 13:13:43.625597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-11-29 13:13:43.625632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212]      0 
2023-11-29 13:13:43.625649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0:   N 
2023-11-29 13:13:43.626098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 39203 MB memory) -> physical GPU (device: 0, name: NVIDIA RTX 6000 Ada Generation, pci bus id: 0000:52:00.0, compute capability: 8.9)
{'dataset_type': 'linemod', 'rotation_representation': 'axis_angle', 'weights': 'weights/Weights/Linemod/object_8/phi_0_linemod_best_ADD.h5', 'freeze_backbone': False, 'no_freeze_bn': False, 'batch_size': 1, 'lr': 0.0001, 'no_color_augmentation': False, 'no_6dof_augmentation': False, 'phi': 0, 'gpu': None, 'epochs': 500, 'steps': 1790, 'snapshot_path': 'checkpoints/29_11_2023_13_13_43', 'tensorboard_dir': 'logs/29_11_2023_13_13_43', 'snapshots': True, 'evaluation': True, 'compute_val_loss': False, 'score_threshold': 0.5, 'validation_image_save_path': None, 'multiprocessing': False, 'workers': 4, 'max_queue_size': 10, 'linemod_path': 'data/Linemod_preprocessed/', 'object_id': 8}

Creating the Generators...
Done!

Building the Model...
Traceback (most recent call last):
  File "train.py", line 368, in <module>
    main()
  File "train.py", line 132, in main
    model, prediction_model, all_layers = build_EfficientPose(args.phi,
  File "/home/mona/EfficientPose/model.py", line 99, in build_EfficientPose
    image_input = layers.Input(input_shape)
  File "/home/mona/anaconda3/envs/EfficientPose/lib/python3.8/site-packages/tensorflow_core/python/keras/engine/input_layer.py", line 265, in Input
    input_layer = InputLayer(**input_layer_config)
  File "/home/mona/anaconda3/envs/EfficientPose/lib/python3.8/site-packages/tensorflow_core/python/keras/engine/input_layer.py", line 121, in __init__
    input_tensor = backend.placeholder(
  File "/home/mona/anaconda3/envs/EfficientPose/lib/python3.8/site-packages/tensorflow_core/python/keras/backend.py", line 1051, in placeholder
    x = array_ops.placeholder(dtype, shape=shape, name=name)
  File "/home/mona/anaconda3/envs/EfficientPose/lib/python3.8/site-packages/tensorflow_core/python/ops/array_ops.py", line 2619, in placeholder
    return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
  File "/home/mona/anaconda3/envs/EfficientPose/lib/python3.8/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 6668, in placeholder
    _, _, _op = _op_def_lib._apply_op_helper(
  File "/home/mona/anaconda3/envs/EfficientPose/lib/python3.8/site-packages/tensorflow_core/python/framework/op_def_library.py", line 792, in _apply_op_helper
    op = g.create_op(op_type_name, inputs, dtypes=None, name=scope,
  File "/home/mona/anaconda3/envs/EfficientPose/lib/python3.8/site-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/home/mona/anaconda3/envs/EfficientPose/lib/python3.8/site-packages/tensorflow_core/python/framework/ops.py", line 3356, in create_op
    return self._create_op_internal(op_type, inputs, dtypes, input_types, name,
  File "/home/mona/anaconda3/envs/EfficientPose/lib/python3.8/site-packages/tensorflow_core/python/framework/ops.py", line 3411, in _create_op_internal
    node_def = _NodeDef(op_type, name, device=None, attrs=attrs)
  File "/home/mona/anaconda3/envs/EfficientPose/lib/python3.8/site-packages/tensorflow_core/python/framework/ops.py", line 1552, in _NodeDef
    node_def.attr[k].CopyFrom(v)
  File "/home/mona/anaconda3/envs/EfficientPose/lib/python3.8/site-packages/google/protobuf/internal/containers.py", line 70, in __getitem__
    return self._values[key]
TypeError: list indices must be integers or slices, not str

Here is my environment.yml file:

(EfficientPose) mona@ada:~/EfficientPose$ cat environment.yml 
name: EfficientPose
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - ca-certificates=2023.08.22=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_0
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.12=h7f8727e_0
  - pip=23.3.1=py38h06a4308_0
  - python=3.8.18=h955ad1f_0
  - pyyaml=6.0.1=py38h5eee18b_0
  - readline=8.2=h5eee18b_0
  - setuptools=68.0.0=py38h06a4308_0
  - sqlite=3.41.2=h5eee18b_0
  - tk=8.6.12=h1ccaba5_0
  - wheel=0.41.2=py38h06a4308_0
  - xz=5.4.2=h5eee18b_0
  - yaml=0.2.5=h7b6447c_0
  - zlib=1.2.13=h5eee18b_0
  - pip:
      - absl-py==2.0.0
      - astor==0.8.1
      - contourpy==1.1.1
      - cycler==0.12.1
      - cython==3.0.6
      - fonttools==4.45.1
      - gast==0.2.2
      - google-pasta==0.2.0
      - grpcio==1.59.3
      - h5py==3.10.0
      - imageio==2.33.0
      - imgaug==0.4.0
      - importlib-metadata==6.8.0
      - importlib-resources==6.1.1
      - keras-applications==1.0.8
      - keras-preprocessing==1.1.2
      - kiwisolver==1.4.5
      - lazy-loader==0.3
      - markdown==3.5.1
      - markupsafe==2.1.3
      - matplotlib==3.7.4
      - networkx==3.1
      - numpy==1.24.4
      - nvidia-cublas==11.3.0.106
      - nvidia-cuda-cupti==11.1.105
      - nvidia-cuda-nvcc==11.1.105
      - nvidia-cuda-nvrtc==11.1.105
      - nvidia-cuda-runtime==11.1.74
      - nvidia-cudnn==8.0.5.43
      - nvidia-cufft==10.3.0.105
      - nvidia-curand==10.2.2.105
      - nvidia-cusolver==11.0.1.105
      - nvidia-cusparse==11.3.0.10
      - nvidia-dali-cuda110==0.28.0
      - nvidia-dali-nvtf-plugin==0.28.0+nv20.12
      - nvidia-nccl==2.8.3
      - nvidia-pyindex==1.0.9
      - nvidia-tensorboard==1.15.0+nv20.12
      - nvidia-tensorflow==1.15.4+nv20.12
      - nvidia-tensorrt==7.2.2.1
      - opencv-python==4.8.1.78
      - opt-einsum==3.3.0
      - packaging==23.2
      - pillow==10.1.0
      - plyfile==1.0.2
      - protobuf==4.25.1
      - pyparsing==3.1.1
      - python-dateutil==2.8.2
      - pywavelets==1.4.1
      - scikit-image==0.21.0
      - scipy==1.10.1
      - shapely==2.0.2
      - six==1.16.0
      - tensorboard==1.15.0
      - tensorflow-estimator==1.15.1
      - termcolor==2.3.0
      - tifffile==2023.7.10
      - typeguard==4.1.5
      - typing-extensions==4.8.0
      - webencodings==0.5.1
      - werkzeug==3.0.1
      - wrapt==1.16.0
      - zipp==3.17.0
prefix: /home/mona/anaconda3/envs/EfficientPose

Here's my sys info:

(EfficientPose) mona@ada:~$ uname -a
Linux ada 6.2.0-37-generic #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov  2 18:01:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
(EfficientPose) mona@ada:~$ lsb_release -a
LSB Version:    core-11.1.0ubuntu4-noarch:security-11.1.0ubuntu4-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:    22.04
Codename:   jammy
(EfficientPose) mona@ada:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
(EfficientPose) mona@ada:~$ nvidia-smi
Wed Nov 29 13:20:45 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  | 00000000:52:00.0  On |                  Off |
| 32%   61C    P2              76W / 300W |   7489MiB / 49140MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2317      G   /usr/lib/xorg/Xorg                          740MiB |
|    0   N/A  N/A      2519      G   /usr/bin/gnome-shell                         61MiB |
|    0   N/A  N/A      2994      G   ...AAAAAAAACAAAAAAAAAA= --shared-files       98MiB |
|    0   N/A  N/A     25264      G   ...0208189,17325718055376231948,262144       60MiB |
|    0   N/A  N/A    652962      G   ...irefox/3358/usr/lib/firefox/firefox      422MiB |
|    0   N/A  N/A    703622      G   blender                                     205MiB |
|    0   N/A  N/A    829624      G   /usr/bin/gnome-control-center                79MiB |
|    0   N/A  N/A    837524      C   python                                      844MiB |
|    0   N/A  N/A    842408      G   ...sion,SpareRendererForSitePerProcess      106MiB |
|    0   N/A  N/A    847224      C   python                                     1046MiB |
|    0   N/A  N/A    855952      C   python                                      984MiB |
|    0   N/A  N/A    856952      C   python                                      914MiB |
|    0   N/A  N/A    857675      C   python                                      730MiB |
|    0   N/A  N/A   1068492      G   meshlab                                      12MiB |
|    0   N/A  N/A   1118791      C   python                                     1046MiB |
+---------------------------------------------------------------------------------------+

Please let me know if you need more information.
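
The traceback above ends inside google/protobuf, and the environment.yml pins protobuf==4.25.1, which is several years newer than TensorFlow 1.15. One thing that may be worth checking (an assumption, not a confirmed diagnosis) is whether a protobuf 3.x release avoids the error. The sketch below only inspects the pins; `major_versions` and `suspicious` are hypothetical helpers.

```python
# A few pins copied from the pip section of the environment.yml above.
PINS = [
    "h5py==3.10.0",
    "numpy==1.24.4",
    "nvidia-tensorflow==1.15.4+nv20.12",
    "protobuf==4.25.1",
]


def major_versions(pins, names=("protobuf", "h5py", "numpy")):
    """Map each package of interest to the major version it is pinned to."""
    found = {}
    for pin in pins:
        name, _, version = pin.partition("==")
        if name in names and version:
            found[name] = int(version.split(".")[0])
    return found


def suspicious(pins):
    """List packages whose pinned major version looks too new for TF 1.15."""
    # Assumption: TF 1.15 predates protobuf 4.x, so a 4.x pin deserves a second look.
    limits = {"protobuf": 4}
    majors = major_versions(pins)
    return [name for name, limit in limits.items() if majors.get(name, 0) >= limit]


print(major_versions(PINS))
print(suspicious(PINS))
```

If protobuf shows up in the second list, trying an older 3.x release in a scratch environment is a cheap experiment before changing anything else.
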

madhanuman commented 5 months ago

I also had the same issue: training produced only NaN values, and I could see in the task manager that my GPU wasn't being used during training. I figured out that my GPU was not supported by CUDA 10.0 (see this graph). I had an RTX 3070, which uses the Ampere architecture. I have now switched to a GTX 1070 Ti, which uses the Pascal architecture, and it works fine.
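
The mismatch described here can be expressed as a small lookup: each CUDA toolkit release can natively target GPUs only up to a certain compute capability. The table below is written from memory of NVIDIA's release notes, so treat the values as an assumption and verify them; `natively_supported` is a hypothetical helper.

```python
# Highest compute capability each CUDA toolkit release targets natively.
# Assumption: values recalled from NVIDIA's release notes, not taken from this thread.
MAX_COMPUTE_CAPABILITY = {
    "10.0": (7, 5),  # up to Turing
    "11.0": (8, 0),  # adds Ampere (A100)
    "11.1": (8, 6),  # adds Ampere (RTX 30 series)
    "11.8": (9, 0),  # adds Ada (8.9) and Hopper (9.0)
}


def natively_supported(cuda_version, capability):
    """Return True if the toolkit can target a GPU of this compute capability."""
    return capability <= MAX_COMPUTE_CAPABILITY[cuda_version]


print(natively_supported("10.0", (8, 6)))  # RTX 3070 (Ampere)
print(natively_supported("10.0", (6, 1)))  # GTX 1070 Ti (Pascal)
print(natively_supported("11.1", (8, 9)))  # RTX 6000 Ada, as reported in the log above
```

Under these assumptions the RTX 3070 falls outside CUDA 10.0, the GTX 1070 Ti falls inside it, and the RTX 6000 Ada from the original report is newer than what CUDA 11.1 targets.
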

One option, if you can't get your hands on a suitable GPU, is to use the CPU, although it is significantly slower. Just run the following commands:

pip install tensorflow-cpu==1.15
pip install h5py==2.10.0 --force-reinstall
pip install numpy==1.19.5
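
If a GPU build of TensorFlow is still importable in the same environment, installing the CPU package may not be enough to keep the GPU out of the picture. A common way to force CPU execution, regardless of which package is installed, is to hide the GPUs through the CUDA_VISIBLE_DEVICES environment variable before TensorFlow is imported. A minimal sketch; `force_cpu` is a hypothetical helper name.

```python
import os


def force_cpu():
    """Hide all CUDA devices from libraries imported after this call."""
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


force_cpu()
# Import TensorFlow only after the variable is set, for example:
# import tensorflow as tf
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The same effect can be had from the shell by setting the variable on the command line that launches the script.
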

monajalal commented 4 months ago

@madhanuman thanks a lot for your response. I ran on the CPU with the versions you suggested above. Does the following look correct to you? I still get some NaNs.

(EfficientPose) mona@ada:~/effpose/EfficientPose$ python evaluate.py --phi 0 --weights weights/Weights/Linemod/object_8/phi_0_linemod_best_ADD.h5 --validation-image-save-path val_imgs linemod data/Linemod_preprocessed/ --object-id 8
WARNING:tensorflow:From evaluate.py:132: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From evaluate.py:134: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2024-02-21 15:40:46.017484: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2024-02-21 15:40:46.023920: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3096000000 Hz
2024-02-21 15:40:46.025134: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1e9cac0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2024-02-21 15:40:46.025158: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2024-02-21 15:40:46.026767: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2024-02-21 15:40:46.117532: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x20b98e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-02-21 15:40:46.117555: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA RTX 6000 Ada Generation, Compute Capability 8.9
2024-02-21 15:40:46.117876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: NVIDIA RTX 6000 Ada Generation major: 8 minor: 9 memoryClockRate(GHz): 2.505
pciBusID: 0000:52:00.0
2024-02-21 15:40:46.118055: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2024-02-21 15:40:46.119033: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2024-02-21 15:40:46.120008: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2024-02-21 15:40:46.120232: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2024-02-21 15:40:46.121344: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2024-02-21 15:40:46.122185: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2024-02-21 15:40:46.124973: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2024-02-21 15:40:46.125178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2024-02-21 15:40:46.125203: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2024-02-21 15:40:46.125365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-02-21 15:40:46.125371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2024-02-21 15:40:46.125375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2024-02-21 15:40:46.125539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 27811 MB memory) -> physical GPU (device: 0, name: NVIDIA RTX 6000 Ada Generation, pci bus id: 0000:52:00.0, compute capability: 8.9)
{'dataset_type': 'linemod', 'rotation_representation': 'axis_angle', 'weights': 'weights/Weights/Linemod/object_8/phi_0_linemod_best_ADD.h5', 'batch_size': 1, 'phi': 0, 'gpu': None, 'score_threshold': 0.5, 'validation_image_save_path': 'val_imgs', 'linemod_path': 'data/Linemod_preprocessed/', 'object_id': 8}

Creating the Generators...
Done!

Building the Model...
input shape is:  (512, 512, 3)
ArgSpec(args=['shape', 'batch_size', 'name', 'dtype', 'sparse', 'tensor', 'ragged'], varargs=None, keywords='kwargs', defaults=(None, None, None, None, False, None, False))
WARNING:tensorflow:From /home/mona/anaconda3/envs/EfficientPose/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py:507: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with distribution=normal is deprecated and will be removed in a future version.
Instructions for updating:
`normal` is a deprecated alias for `truncated_normal`
WARNING:tensorflow:From /home/mona/anaconda3/envs/EfficientPose/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2024-02-21 15:40:57.013591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: NVIDIA RTX 6000 Ada Generation major: 8 minor: 9 memoryClockRate(GHz): 2.505
pciBusID: 0000:52:00.0
2024-02-21 15:40:57.013646: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2024-02-21 15:40:57.013653: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2024-02-21 15:40:57.013658: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2024-02-21 15:40:57.013663: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2024-02-21 15:40:57.013668: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2024-02-21 15:40:57.013673: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2024-02-21 15:40:57.013678: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2024-02-21 15:40:57.013816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2024-02-21 15:40:57.014131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: NVIDIA RTX 6000 Ada Generation major: 8 minor: 9 memoryClockRate(GHz): 2.505
pciBusID: 0000:52:00.0
2024-02-21 15:40:57.014143: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2024-02-21 15:40:57.014150: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2024-02-21 15:40:57.014157: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2024-02-21 15:40:57.014162: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2024-02-21 15:40:57.014167: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2024-02-21 15:40:57.014173: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2024-02-21 15:40:57.014177: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2024-02-21 15:40:57.014281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2024-02-21 15:40:57.014301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-02-21 15:40:57.014305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2024-02-21 15:40:57.014307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2024-02-21 15:40:57.014434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 27811 MB memory) -> physical GPU (device: 0, name: NVIDIA RTX 6000 Ada Generation, pci bus id: 0000:52:00.0, compute capability: 8.9)
WARNING:tensorflow:From /home/mona/effpose/EfficientPose/layers.py:298: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Done!
Loading model, this may take a second...

Done!
Running network:   0% (0 of 1009) |                                                                                                                                  | Elapsed Time: 0:00:00 ETA:  --:--:--
2024-02-21 15:45:35.072567: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2024-02-21 15:49:12.814151: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
2024-02-21 15:49:12.885611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Running network: 100% (1009 of 1009) |###############################################################################################################################| Elapsed Time: 0:04:49 Time:  0:04:49
Parsing annotations: 100% (1009 of 1009) |###########################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1009/1009 [00:07<00:00, 137.57it/s]
/home/mona/anaconda3/envs/EfficientPose/lib/python3.7/site-packages/numpy/core/_methods.py:193: RuntimeWarning: invalid value encountered in subtract
  x = asanyarray(arr - arrmean)
1009 instances of class object with average precision: 0.0000
1009 instances of class object with ADD accuracy: 0.0000
1009 instances of class object with ADD-S-Accuracy: 0.0000
1009 instances of class object with 5cm-5degree-Accuracy: 0.0000
class object with Translation Differences in mm: Mean: 3187475330973234436726203613184.0000 and Std: 7807679963186587234277502484480.0000
class object with Rotation Differences in degree: Mean: 144.6458 and Std: 17.1566
1009 instances of class object with 2d-projection-Accuracy: 0.0000
1009 instances of class object with ADD(-S)-Accuracy: 0.0000
class object with Transformed Point Distances in mm: Mean: 3187475330973234436726203613184.0000 and Std: 7807679963186587234277502484480.0000
class object with Transformed Symmetric Point Distances in mm: Mean: inf and Std: nan
class object with Mixed Transformed Point Distances in mm: Mean: 3187475330973234436726203613184.0000 and Std: 7807679963186587234277502484480.0000
mAP: 0.0000
ADD: 0.0000
ADD-S: 0.0000
5cm_5degree: 0.0000
TranslationErrorMean_in_mm: 3187475330973234436726203613184.0000
TranslationErrorStd_in_mm: 7807679963186587234277502484480.0000
RotationErrorMean_in_degree: 144.6458
RotationErrorStd_in_degree: 17.1566
2D-Projection: 0.0000
Summed_Translation_Rotation_Error: 10995155294159821671003706097664.0000
ADD(-S): 0.0000
AveragePointDistanceMean_in_mm: 3187475330973234436726203613184.0000
AveragePointDistanceStd_in_mm: 7807679963186587234277502484480.0000
AverageSymmetricPointDistanceMean_in_mm: inf
AverageSymmetricPointDistanceStd_in_mm: nan
MixedAveragePointDistanceMean_in_mm: 3187475330973234436726203613184.0000
MixedAveragePointDistanceStd_in_mm: 7807679963186587234277502484480.0000

monajalal commented 4 months ago

@madhanuman

Also, training with tensorflow-cpu yields a NaN loss:

(EfficientPose) mona@ada:~/effpose/EfficientPose$ python train.py --phi 0 --weights weights/Weights/Linemod/object_8/phi_0_linemod_best_ADD.h5 linemod data/Linemod_preprocessed/ --object-id 8
WARNING:tensorflow:From train.py:204: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From train.py:206: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2024-02-21 15:54:47.492224: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2024-02-21 15:54:47.498868: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3096000000 Hz
2024-02-21 15:54:47.500513: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2fd8c10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2024-02-21 15:54:47.500538: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2024-02-21 15:54:47.502165: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2024-02-21 15:54:47.590605: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xc1e7f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-02-21 15:54:47.590664: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA RTX 6000 Ada Generation, Compute Capability 8.9
2024-02-21 15:54:47.591259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: NVIDIA RTX 6000 Ada Generation major: 8 minor: 9 memoryClockRate(GHz): 2.505
pciBusID: 0000:52:00.0
2024-02-21 15:54:47.591777: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2024-02-21 15:54:47.593401: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2024-02-21 15:54:47.594326: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2024-02-21 15:54:47.594543: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2024-02-21 15:54:47.595626: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2024-02-21 15:54:47.596433: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2024-02-21 15:54:47.599096: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2024-02-21 15:54:47.599317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2024-02-21 15:54:47.599349: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2024-02-21 15:54:47.599518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-02-21 15:54:47.599524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2024-02-21 15:54:47.599528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2024-02-21 15:54:47.599706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 27792 MB memory) -> physical GPU (device: 0, name: NVIDIA RTX 6000 Ada Generation, pci bus id: 0000:52:00.0, compute capability: 8.9)
{'dataset_type': 'linemod', 'rotation_representation': 'axis_angle', 'weights': 'weights/Weights/Linemod/object_8/phi_0_linemod_best_ADD.h5', 'freeze_backbone': False, 'no_freeze_bn': False, 'batch_size': 1, 'lr': 0.0001, 'no_color_augmentation': False, 'no_6dof_augmentation': False, 'phi': 0, 'gpu': None, 'epochs': 500, 'steps': 1790, 'snapshot_path': 'checkpoints/21_02_2024_15_54_47', 'tensorboard_dir': 'logs/21_02_2024_15_54_47', 'snapshots': True, 'evaluation': True, 'compute_val_loss': False, 'score_threshold': 0.5, 'validation_image_save_path': None, 'multiprocessing': False, 'workers': 4, 'max_queue_size': 10, 'linemod_path': 'data/Linemod_preprocessed/', 'object_id': 8}

Creating the Generators...
Done!

[Screenshot from 2024-02-21 16-00-39]

madhanuman commented 4 months ago

@monajalal It seems like something is not loading correctly. At the bottom of the screenshot, it says that you have a CUPTI error. I encountered something similar. If I remember correctly, you need to change something in the path environment variables.
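
One common cause of CUPTI errors is that the CUPTI shared library is not on the dynamic loader's search path. The sketch below looks for it there. It assumes a Linux install where CUPTI ships as `libcupti.so*`, typically under the CUDA toolkit's `extras/CUPTI/lib64` directory, and that the relevant variable is `LD_LIBRARY_PATH`. The thread does not confirm which variable was changed, and `find_cupti_dirs` is a hypothetical helper.

```python
import glob
import os


def find_cupti_dirs(search_path):
    """Return the directories in an os.pathsep-separated path that contain libcupti."""
    hits = []
    for directory in search_path.split(os.pathsep):
        # Skip empty entries, then look for any versioned libcupti file.
        if directory and glob.glob(os.path.join(directory, "libcupti.so*")):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    dirs = find_cupti_dirs(os.environ.get("LD_LIBRARY_PATH", ""))
    if dirs:
        print("libcupti found in:", dirs)
    else:
        print("libcupti not found on LD_LIBRARY_PATH")
```

If nothing is found, appending the toolkit's CUPTI directory to `LD_LIBRARY_PATH` before launching `train.py` is one thing to try. The exact directory depends on where CUDA is installed.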