zyqz97 / Aerial_lifting

[CVPR'24] Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery
https://zyqz97.github.io/Aerial_Lifting/
MIT License

train_instance error #6

Closed sihuanian-2 closed 3 months ago

sihuanian-2 commented 5 months ago

When I train the instance segmentation task from Section 3.3, the following error occurs: AssertionError: No inf checks were recorded for this optimizer.

sh bash/train_instance.sh 
stop_semantic_grad:True
ignore_index: 0
Namespace(gpnerf=True, debug=False, val_type='val', logger_interval=10, separate_semantic=True, freeze_geo=True, dataset_type='memory_depth_dji_instance_crossview', balance_weight=True, remove_cluster=True, use_subset=True, label_name_3d_to_2d='label_pc', start=-1, end=-1, check_depth=False, contract_new=True, use_plane=True, geo_init_method='idr', save_individual=False, continue_train=False, depth_dji_loss=False, depth_dji_type='mesh', sampling_mesh_guidance=True, wgt_air_sigma_loss=0, around_mesh_meter=5, wgt_depth_mse_loss=0, wgt_sigma_loss=0, sample_ray_num=1024, visual_normal=True, normal_loss=False, wgt_nl1_loss=0.0001, wgt_ncos_loss=0.0001, depth_loss=False, wgt_depth_loss=0.0, auto_grad=False, decay_min=0.1, save_depth=False, fushi=False, enable_instance=True, num_instance_classes=50, wgt_instance_loss=1, freeze_semantic=True, instance_name='instances_mask_0.001_depth', instance_loss_mode='linear_assignment', cached_centroids_path=None, use_dbscan=True, wgt_concentration_loss=1, crossview_process_path='zyq/test', crossview_all=False, stop_semantic_grad=True, ignore_index=0, label_name='fusion', enable_semantic=True, num_semantic_classes=5, num_layers_semantic_hidden=3, semantic_layer_dim=128, wgt_sem_loss=1, network_type='gpnerf_nr3d', clip_grad_max=0, num_layers=2, num_layers_color=3, layer_dim=64, appearance_dim=48, geo_feat_dim=15, num_levels=16, base_resolution=16, desired_resolution=8192, log2_hashmap_size=22, hash_feat_dim=2, writer_log=True, wandb_id='None', wandb_run_name='test', use_scaling=False, contract_norm='l2', contract_bg_len=1, aabb_bound=1.6, train_iterations=100000, val_interval=50000, ckpt_interval=50000, model_chunk_size=10485760, ray_chunk_size=20480, batch_size=16384, coarse_samples=128, fine_samples=128, ckpt_path='yingrenshi/yingrenshi_semantic.pt', config_file='configs/yingrenshi.yaml', chunk_paths=None, desired_chunks=20, num_chunks=20, disk_flush_size=10000000, train_every=1, cluster_mask_path=None, container_path=None, 
bg_layer_dim=256, near=1, far=None, ray_altitude_range=[-95.0, 54.0], train_scale_factor=4, val_scale_factor=4, pos_xyz_dim=10, pos_dir_dim=4, layers=8, skip_layers=[4], affine_appearance=False, use_cascade=False, train_mega_nerf=None, boundary_margin=1.15, all_val=False, cluster_2d=False, center_pixels=True, shifted_softplus=True, image_pixel_batch_size=8192, perturb=1.0, noise_std=1.0, lr=0.01, lr_decay_factor=1, bg_nerf=False, ellipse_scale_factor=1.1, ellipse_bounds=True, resume_ckpt_state=True, amp=True, detect_anomalies=False, random_seed=42, render_zyq=False, render_zyq_far_view='render_far0.3', exp_name='logs/yingrenshi_instance_la', dataset_path='Yingrenshi')
Origin: tensor([-9.4238e+01, -1.2068e+06, -2.3388e+06]), scale factor: 334.7266229371708
Ray bounds: 0.0029875125893039675, 100000.0
Ray altitude range in [-1, 1] space: [tensor(-0.0023, dtype=torch.float64), tensor(0.4429, dtype=torch.float64)]
Ray altitude range in metric space: [-95.0, 54.0]
Using 854 train images and 15 val images
Camera range in metric space: tensor([-1.3072e+02, -1.2071e+06, -2.3390e+06]) tensor([-5.7758e+01, -1.2065e+06, -2.3385e+06])
Camera range in [-1, 1] space: tensor([-0.1090, -0.9933, -0.6832]) tensor([0.1090, 0.9933, 0.6832])
Camera range in [-1, 1] space with ray altitude range: tensor([-0.1090, -0.9933, -0.6832]) tensor([0.4429, 0.9933, 0.6832])
Sphere center: tensor([0.1669, 0.0000, 0.0000], device='cuda:0'), radius: tensor([0.4785, 1.7223, 1.1847], device='cuda:0')
2024-04-24 10:20:19,531-rk0-utils.py#20:kaolin is not installed. OctreeAS / ForestAS disabled.
2024-04-24 10:20:19,531-rk0-lotd_encoding.py#35:tensorly is not installed.
the dataset_type is :memory_depth_dji_instance_crossview
layer_dim: 64
semantic layer_dim: 128
use two mlp
2024-04-24 10:20:19,535-rk0-lotd_cfg.py#129:NGP auto-computed config: layer resolutions: [[24, 190, 130], [34, 263, 180], [46, 363, 250], [64, 502, 345], [89, 694, 477], [124, 959, 660], [171, 1326, 912], [236, 1833, 1260], [327, 2533, 1742], [452, 3501, 2408], [625, 4838, 3328], [864, 6686, 4599], [1194, 9241, 6356], [1650, 12771, 8784], [2280, 17649, 12140], [3152, 24391, 16778], [4356, 33709, 23187]]
2024-04-24 10:20:19,535-rk0-lotd_cfg.py#130:NGP auto-computed config: layer types: ['Dense', 'Dense', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash']
2024-04-24 10:20:19,535-rk0-lotd_cfg.py#131:NGP auto-computed config: layer n_feats: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
2024-04-24 10:20:19,535-rk0-lotd_cfg.py#132:NGP auto-computed config: expected num_params=134217728; generated: 130233840 [0.97x]
Hash and Plane
Hash and Plane
separate the semantic mlp from nerf
separate the instance mlp from nerf
the parameters of whole model:   total: 152016670, fg: 152016670, bg: 0
no using wandb
load weights from yingrenshi/yingrenshi_semantic.pt, strat training from 0
dicard_index:-1
Loading data
  0%|          | 0/854 [00:00<?, ?it/s]/data/Aerial_lifting/./gp_nerf/image_metadata.py:42: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
  return torch.ByteTensor(np.asarray(rgbs))
100%|██████████| 854/854 [01:56<00:00,  7.35it/s]
load_subset: 499
Finished loading data
  0%|          | 0/100000 [00:00<?, ?it/s]/home/kpn/.conda/envs/aerial/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
  0%|          | 299/100000 [01:59<10:40:56,  2.59it/s]Traceback (most recent call last):
  File "/data/Aerial_lifting/gp_nerf/train.py", line 67, in <module>
    main(_get_train_opts())
  File "/home/kpn/.conda/envs/aerial/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data/Aerial_lifting/gp_nerf/train.py", line 63, in main
    Runner(hparams).train()
  File "/data/Aerial_lifting/./gp_nerf/runner_gpnerf.py", line 593, in train
    scaler.step(optimizer)
  File "/home/kpn/.conda/envs/aerial/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 372, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
  0%|          | 299/100000 [01:59<11:05:36,  2.50it/s]
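For anyone hitting the same assertion: it typically fires when `scaler.step(optimizer)` is called while none of that optimizer's parameters received gradients in the backward pass (for example, everything the optimizer owns was frozen). Below is a simplified pure-Python sketch of the bookkeeping involved; `TinyScaler` is hypothetical and only mimics the relevant logic of `torch.cuda.amp.GradScaler`, it is not the real implementation:

```python
import math

class TinyScaler:
    """Hypothetical, minimal stand-in for torch.cuda.amp.GradScaler's
    inf-check bookkeeping. Not the real implementation."""

    def __init__(self):
        self.found_inf_per_device = {}

    def unscale_(self, grads_by_device):
        # One inf/NaN check is recorded per device that actually holds gradients.
        for device, grads in grads_by_device.items():
            self.found_inf_per_device[device] = any(
                math.isinf(g) or math.isnan(g) for g in grads
            )

    def step(self, grads_by_device):
        self.unscale_(grads_by_device)
        # If every parameter behind this optimizer was frozen
        # (requires_grad=False), no gradients exist, no checks are
        # recorded, and this assertion fires.
        assert len(self.found_inf_per_device) > 0, \
            "No inf checks were recorded for this optimizer."

scaler = TinyScaler()
try:
    scaler.step({})  # no trainable params -> no gradients -> assertion
except AssertionError as e:
    print(e)  # prints: No inf checks were recorded for this optimizer.
```

With `freeze_geo=True` and `freeze_semantic=True` in the config above, a misloaded checkpoint or a wrongly scoped parameter group could plausibly leave an optimizer with no trainable parameters, which would reproduce exactly this error.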
zyqz97 commented 5 months ago

Could you please share the trained semantic field model and your training script (bash)? I tried to reproduce the error, but the training process is normal on my machine.

sihuanian-2 commented 5 months ago

This is my training bash script:

#!/bin/bash
export OMP_NUM_THREADS=4
export CUDA_VISIBLE_DEVICES=0

dataset_path=Yingrenshi
config_file=configs/yingrenshi.yaml

batch_size=16384
train_iterations=100000
val_interval=50000
ckpt_interval=50000

exp_name=logs/yingrenshi_instance_la
dataset_type=memory_depth_dji_instance_crossview

enable_semantic=True
ckpt_path=logs/yingrenshi_semantic/0/models/100000.pt

enable_instance=True
instance_loss_mode=linear_assignment
instance_name=instances_mask_0.001_depth

python gp_nerf/train.py  \
    --dataset_path  $dataset_path  --config_file  $config_file   \
    --batch_size  $batch_size  --train_iterations   $train_iterations   --val_interval  $val_interval   --ckpt_interval   $ckpt_interval  \
    --dataset_type $dataset_type    \
    --enable_semantic=$enable_semantic  \
    --exp_name  $exp_name   \
    --instance_loss_mode=$instance_loss_mode   --instance_name=$instance_name --enable_instance=$enable_instance    \
    --ckpt_path=$ckpt_path 

My trained semantic field model is 100000.pt. I also used the trained model provided in this repository, and the same problem occurs.

zyqz97 commented 5 months ago

Thanks. Currently, I don't have much time for debugging. I will try it later.

zyqz97 commented 5 months ago

Hello, I don't have access to your files. I checked the released checkpoint and the training again, and it looks normal.