zerchen / AlignSDF

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction, ECCV 2022

Training #19

Open shqmffl486 opened 1 year ago

shqmffl486 commented 1 year ago

How do I train with the sdf_hand_mini and sdf_obj_mini data that you uploaded? Training fails on a .npz file that does not exist, which I think is because I am using the mini version.

(alignsdf) MS-7B23:~/mount4t/AlignSDF$ CUDA_VISIBLE_DEVICES=0 bash dist_train.sh 4 6666 -e experiments/obman/30k_1e2d_mlp5.json
do not support renderer in this machine
DeepSdf - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
DeepSdf - INFO - Training in distributed mode, 1 GPU per process. Process 0, total 1.
DeepSdf - INFO - Experiment description:
3D hand reconstruction on the mini obman dataset.
Hand branch: True
Object branch: True
Mano branch: False
Depth branch: False
Classifier Weight: 0
Penetration Loss: False
Penetration Loss Weight: 0
Additional Loss start at epoch: 1201
Contact Loss: False
Contact Loss Weight: 0
Contact Loss Sigma (m): 0.005
Independent Obj Scale: False
Ignore other: False
nb_label_class: 6
Image encoder, the branch has latent size 256
DeepSdf - INFO - Finish constructing the dataset
DeepSdf - INFO - start_epoch:1, current_rank:0
DeepSdf - INFO - epoch:1, current_rank:0
Traceback (most recent call last):
  File "train.py", line 715, in <module>
    main_function(exp_cfg, args.continue_from, args.local_rank, args.opt_level, args.slurm)
  File "train.py", line 465, in main_function
    for i, (input_iter, label_iter, meta_iter) in enumerate(sdf_loader):
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gaeun/mount4t/AlignSDF/utils/data.py", line 162, in __getitem__
    hand_samples, hand_labels = unpack_sdf_samples(self.data_source, data_key, num_sample, hand=True, clamp=self.clamp, filter_dist=self.filter_dist)
  File "/home/gaeun/mount4t/AlignSDF/utils/sdf_utils.py", line 172, in unpack_sdf_samples
    npz = np.load(npz_path)
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/numpy/lib/npyio.py", line 405, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'data/obman/train/sdf_hand/00018168.npz'

Killing subprocess 12576
Traceback (most recent call last):
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/gaeun/anaconda3/envs/alignsdf/bin/python', '-u', 'train.py', '--local_rank=0', '-e', 'experiments/obman/30k_1e2d_mlp5.json']' returned non-zero exit status 1.

zerchen commented 1 year ago

Hi,

I created this split only so that students can run experiments with limited computing resources. I did not run any experiments with sdf_hand_mini and sdf_obj_mini myself. To use this split, you need to generate a new json file like this and reference it in your config file (https://github.com/zerchen/AlignSDF/blob/master/experiments/obman/30k_1e2d_mlp5.json). Hope it helps.
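As a rough illustration of the suggestion above, the sketch below collects the sample IDs that have a .npz file in both mini folders and writes them to a json file, so the loader is never asked for a sample such as 00018168.npz that is not on disk. The function names, the folder paths in the usage comment, and the flat-list json layout are all assumptions made for illustration. The layout the repository actually expects should be checked against its existing split file before using this.

```python
import json
from pathlib import Path


def common_sample_ids(hand_dir, obj_dir):
    """Return the sorted sample IDs that have a .npz file in both folders."""
    hand_ids = {p.stem for p in Path(hand_dir).glob("*.npz")}
    obj_ids = {p.stem for p in Path(obj_dir).glob("*.npz")}
    return sorted(hand_ids & obj_ids)


def write_split(sample_ids, out_path):
    """Write the IDs to a json file.

    Assumption: the split is stored as a flat json list of ID strings.
    Adapt this to the layout of the repository's own split file.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w") as f:
        json.dump(list(sample_ids), f, indent=2)
    return out_path


# Hypothetical usage (paths are assumptions, not taken from the repository):
# ids = common_sample_ids("data/obman/train/sdf_hand_mini",
#                         "data/obman/train/sdf_obj_mini")
# write_split(ids, "data/obman/splits/train_mini.json")
```

Intersecting the two folders keeps only samples for which both the hand and the object SDF are present, which matters here because the config enables both branches.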

shqmffl486 commented 1 year ago

Thank you for your reply. Training now runs, but there seems to be an error in the final step, where the meshes are generated. What do you think the problem is? The Eval_obman folder and several files inside it were created, but the files have no contents.

DeepSdf - INFO - time used: 85.93944382667542
DeepSdf - INFO - save at 100
DeepSdf - INFO - Distributing BatchNorm running means and vars
Traceback (most recent call last):
  File "train.py", line 715, in <module>
    main_function(exp_cfg, args.continue_from, args.local_rank, args.opt_level, args.slurm)
  File "train.py", line 669, in main_function
    reconstruct(encoderDecoder, specs, split_filename, output_path, start_point=start_points[local_rank], end_point=end_points[local_rank], task=task, device=device, cube_dim=128, label_out=use_optim_mano, eval_mode=use_eval_mode)
  File "/home/gaeun/mount4t/AlignSDF/reconstruct.py", line 95, in reconstruct
    utils.mesh.create_mesh_combined_decoder(hand_branch, obj_branch, cls_branch, loaded_model.module.decoder, latent, mano_results, obj_results, cam_intr, specs, mesh_filename, N=cube_dim, max_batch=int(2 ** 18), scale=scale, device=device, label_out=label_out, viz=viz, eval_mode=eval_mode, task=task)
  File "/home/gaeun/mount4t/AlignSDF/utils/mesh.py", line 157, in create_mesh_combined_decoder
    out_labels[head: min(head + max_batch, num_out_vertices)] = predicted_class.argmax(dim=1).detach().cpu()
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
Killing subprocess 19631
Traceback (most recent call last):
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/home/gaeun/anaconda3/envs/alignsdf/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/gaeun/anaconda3/envs/alignsdf/bin/python', '-u', 'train.py', '--local_rank=0', '-e', 'experiments/obman/30k_1e2d_mlp5.json']' returned non-zero exit status 1.

shqmffl486 commented 1 year ago

I think this line is the problem (utils/mesh.py, line 157):

out_labels[head: min(head + max_batch, num_out_vertices)] = predicted_class.argmax(dim=1).detach().cpu()
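The IndexError message says the tensor has only one dimension (valid dims are -1 and 0), so argmax(dim=1) cannot be applied to it. The sketch below reproduces that failure on a small 1-D tensor and shows one defensive workaround. The helper name safe_class_argmax is hypothetical, and treating a 1-D tensor as the class scores of a single point is an assumption. Why predicted_class is 1-D in this run is not established in this thread, so printing predicted_class.shape just before line 157 would be a sensible first check.

```python
import torch


def safe_class_argmax(predicted_class: torch.Tensor) -> torch.Tensor:
    """Return class indices, tolerating a missing leading (point) dimension."""
    if predicted_class.dim() == 1:
        # Assumption: a 1-D tensor holds the class scores of a single point,
        # so a leading dimension of size 1 is added before taking argmax.
        predicted_class = predicted_class.unsqueeze(0)
    return predicted_class.argmax(dim=1).detach().cpu()


# Reproduce the failure: argmax over dim=1 on a 1-D tensor raises IndexError.
scores_1d = torch.tensor([0.1, 0.7, 0.2])
try:
    scores_1d.argmax(dim=1)
    raised_index_error = False
except IndexError:
    raised_index_error = True

# The usual 2-D case (points x classes) behaves the same as before.
scores_2d = torch.tensor([[0.9, 0.1], [0.2, 0.8]])
labels_2d = safe_class_argmax(scores_2d)
labels_1d = safe_class_argmax(scores_1d)
```

This only guards against the shape mismatch. If the tensor is 1-D for another reason, for example because the classifier output is not produced as expected in this configuration, the underlying cause still needs to be fixed in the model or config.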