zju3dv / EasyVolcap

[SIGGRAPH Asia 2023 (Technical Communications)] EasyVolcap: Accelerating Neural Volumetric Video Research

Error training on custom data #23

briancantwe opened this issue 5 months ago

briancantwe commented 5 months ago

Hello! I've managed to get my test data to a point where it trains most of the way, but I'm getting this error.

2024-01-27 09:43:10.564616 easyvolcap.utils.net_utils -> save_npz: Saved model data/trained_model/l3mhet_test/latest.npz at epoch 59 net_utils.py:449
            eta     epoch iter  prop_loss psnr      img_loss loss     data   batch  lr       max_mem
l3mhet_test 0:00:02 59    29959 0.029935  13.920692 0.020401 0.050336 0.0011 0.0598 0.002006 3361
            0:00:01 59    29969 0.029938  13.879657 0.020609 0.050547 0.0010 0.0600 0.002005 3361
            0:00:01 59    29979 0.029935  13.817179 0.020912 0.050847 0.0008 0.0614 0.002003 3361
            0:00:00 59    29989 0.029906  13.771573 0.021099 0.051005 0.0010 0.0579 0.002002 3361
            0:00:00 59    29999 0.029902  13.829483 0.020795 0.050696 0.0011 0.0494 0.002000 3361
2024-01-27 09:43:12.650359 easyvolcap.runners.evaluators.volumetric_video_evaluator -> evaluate: camera: 0 frame: 0 volumetric_video_evaluator.py:46
{'psnr': 10.659576416015625, 'ssim': 0.08238379, 'lpips': 0.6307356953620911}
2024-01-27 09:43:13.742178 easyvolcap.runners.volumetric_video_runner -> test_generator: 100% ━━━━━━━━━━ 1/1 0:00:03 < 0:00:00 ? it/s
2024-01-27 09:43:13.744583 easyvolcap.runners.evaluators.volumetric_video_evaluator -> summarize: volumetric_video_evaluator.py:72
{
    'psnr_mean': 10.659576416015625,
    'psnr_std': 0.0,
    'ssim_mean': 0.08238379657268524,
    'ssim_std': 7.450580596923828e-09,
    'lpips_mean': 0.6307356953620911,
    'lpips_std': 0.0
}
2024-01-27 09:43:13.748727 easyvolcap.runners.volumetric_video_runner -> train: Error in validation pass, ignored and continuing volumetric_video_runner.py:308
Traceback (most recent call last):
  /mnt/c/Users/User/Documents/Github/EasyVolcap/easyvolcap/runners/volumetric_video_runner.py:306 in train
    self.test_epoch(epoch + 1)  # will this provoke a live display?
  /mnt/c/Users/User/Documents/Github/EasyVolcap/easyvolcap/runners/volumetric_video_runner.py:405 in test_epoch
    for _ in test_generator: pass  # the actual calling
  /mnt/c/Users/User/Documents/Github/EasyVolcap/easyvolcap/runners/volumetric_video_runner.py:432 in test_generator
    scalar_stats = self.evaluator.summarize()
  /mnt/c/Users/User/Documents/Github/EasyVolcap/easyvolcap/runners/evaluators/volumetric_video_evaluator.py:80 in summarize
    json.dump(metric, f, indent=4)
TypeError: 0.08238379 is not JSON serializable
(easyvolcap) User@home:/mnt/c/Users/User/Documents/Github/EasyVolcap$

Any ideas?

Thanks!

dendenxu commented 5 months ago

Hi, I wrapped a try block around the JSON export command (it shouldn't have thrown the TypeError, since we should already have converted numpy scalars to Python scalars before writing the JSON). So now, if metrics.json can't be saved, training and evaluation will just continue.
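
For reference, a minimal sketch of that kind of guard (not the actual EasyVolcap code; the function and file names here are just for illustration):

# Minimal sketch: cast numpy scalars to plain Python types before json.dump,
# and keep training even if the metrics file still can't be written.
import json
import numpy as np

def save_metrics(metrics: dict, path: str = 'metrics.json'):
    # np.float32 and friends are not JSON serializable, so convert them first
    cleaned = {k: (v.item() if isinstance(v, np.generic) else v) for k, v in metrics.items()}
    try:
        with open(path, 'w') as f:
            json.dump(cleaned, f, indent=4)
    except TypeError as e:
        print(f'Could not save {path}: {e}, continuing')  # don't let logging kill the run

save_metrics({'psnr': np.float32(10.66), 'ssim': np.float32(0.0824), 'lpips': 0.63})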

By the way, the PSNR looks too low. This seems like a bounding box issue similar to this one: https://github.com/zju3dv/EasyVolcap/issues/21#issuecomment-1913585826, so maybe take a look there too. If the camera poses and bounding box are set up correctly, running the generalizable ENeRFi model should already give you a reasonable result.

briancantwe commented 5 months ago

I don't seem to get any POINT output after training completes. Is my custom data absolutely broken/not converging? Or possibly a bug? Maybe related?

Thanks for the tip on the bounding box! My data is indeed at a very different scale than the example.


dendenxu commented 5 months ago

The POINT folder will be created after running this command (in the running 3DGS section of the guide):

# Extract geometry (point cloud) for initialization from the l3mhet model
# Tune image sample rate and resizing ratio for a denser or sparser estimation
python scripts/tools/volume_fusion.py -- -c configs/exps/l3mhet/l3mhet_${expname}.yaml val_dataloader_cfg.dataset_cfg.ratio=0.15

This command essentially renders depth maps and fuses them into an initialization point cloud for 3DGS. Note that you might need to tune the image resolution (using ratio) or other parameters to get a reasonably sized result. There's also a --skip_geometry_consistency switch to disable the "fusion" step, which can otherwise prune out too many points.
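
For intuition, here's a rough numpy sketch of what that fusion step conceptually does (back-projecting per-view depth maps into world-space points and concatenating them). This is not EasyVolcap's actual implementation, and the world-to-camera convention x_cam = R @ x_world + T is an assumption you'd adapt to your data:

# Conceptual sketch of depth-map fusion (not volume_fusion.py itself).
import numpy as np

def unproject_depth(depth, K, R, T):
    # depth: (H, W) z-depth, K: (3, 3), R: (3, 3), T: (3, 1), with x_cam = R @ x_world + T
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                 # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T   # (3, H*W) homogeneous pixels
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)          # rays scaled by depth -> camera space
    world = R.T @ (cam - T.reshape(3, 1))                          # camera space -> world space
    return world.T                                                 # (H*W, 3)

def fuse_views(depths, Ks, Rs, Ts, keep_every=100):
    # "Fusion" here is just concatenation plus subsampling to keep the point
    # cloud a reasonable size; the real script additionally checks geometry
    # consistency across views before keeping a point.
    pts = np.concatenate([unproject_depth(d, K, R, T)
                          for d, K, R, T in zip(depths, Ks, Rs, Ts)], axis=0)
    return pts[::keep_every]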

briancantwe commented 5 months ago

Of course, you're right. Sorry about that. There are a lot of steps. :-)

I'm facing another problem now though in that my results are all black. I'm sure it's with my custom data setup, but I'm not sure where to start debugging.

Thanks!

briancantwe commented 5 months ago

Actually, I've found my output in the /results dir and am getting a very sparse pointcloud after conversion. So, I'm getting closer! Stay tuned.

dendenxu commented 5 months ago

The rendering being all black might be a bug. Would it be possible for you to send me a one-frame sample of your custom data so I can try to reproduce it? Maybe through email?

briancantwe commented 5 months ago

I'm now past the black rendering issue, but getting poor results from NGP-T. Is there possibly a way to control or output test RENDER/DEPTH/ALPHA images from more cameras? That would be useful in debugging my data.

Sorry, I unfortunately can't share the data I'm using.

dendenxu commented 5 months ago

The ngpt models can also be viewed from the GUI, which I also often use for debugging.

Simply adding -t gui to the training command should do the trick. Note that ngpt can be very slow to render, so it may help to run it in fp16 mode by appending configs/specs/fp16.yaml to the command and setting viewer_cfg.render_ratio=0.1 for faster visualization. You could also append configs/specs/superm.yaml to skip the image loading process, since we only want to visualize the model.

briancantwe commented 5 months ago

OK, after a bit of debugging on my data, I'm now getting a reasonable RENDER. However my DEPTH images are pretty much garbage. Any suggestions? I tried messing with the near and far settings, but it didn't seem to do much. Unfortunately I'm getting a Segmentation fault running the gui, which doesn't happen on the examples, so that can't be good. Investigation continues!

rexainn commented 4 months ago

Hi, when you ran l3mhet to optimize the calibration, did it converge?

dendenxu commented 4 months ago

> OK, after a bit of debugging on my data, I'm now getting a reasonable RENDER. However my DEPTH images are pretty much garbage. Any suggestions? I tried messing with the near and far settings, but it didn't seem to do much. Unfortunately I'm getting a Segmentation fault running the gui, which doesn't happen on the examples, so that can't be good. Investigation continues!

Looks like a near-far problem. Could you try setting near a little bit bigger?
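
If it helps, here's a quick back-of-the-envelope sketch (my own, not an EasyVolcap utility) for picking near/far from your camera centers and scene bounding box, just to make sure the values bracket the scene:

# Rough near/far suggestion from camera centers and a scene bounding box.
import numpy as np

def suggest_near_far(cam_centers, bbox_min, bbox_max):
    # near: a bit in front of the closest point of the box as seen from any camera
    closest = np.clip(cam_centers, bbox_min, bbox_max)               # closest box point per camera
    near = 0.8 * np.linalg.norm(cam_centers - closest, axis=-1).min()
    # far: a bit beyond the farthest box corner from any camera
    corners = np.array([[x, y, z] for x in (bbox_min[0], bbox_max[0])
                                  for y in (bbox_min[1], bbox_max[1])
                                  for z in (bbox_min[2], bbox_max[2])])
    far = 1.2 * np.linalg.norm(cam_centers[:, None] - corners[None], axis=-1).max()
    return max(float(near), 1e-3), float(far)                        # keep near strictly positive

# Example with made-up numbers: 8 cameras on a ring of radius 3 around a unit box.
cams = np.array([[3 * np.cos(a), 3 * np.sin(a), 1.5] for a in np.linspace(0, 2 * np.pi, 8, endpoint=False)])
print(suggest_near_far(cams, np.array([-0.5, -0.5, 0.0]), np.array([0.5, 0.5, 1.0])))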

dendenxu commented 4 months ago

Hi @briancantwe, I updated the metric logging mechanism to make errors more verbose. Could you try the same training command and see whether we can locate the root cause of the TypeError, or check whether the error has gone away?

briancantwe commented 4 months ago

Hi @dendenxu! It does appear that the TypeError went away with the update.

I played around a lot with the box and clipping sizes, but still no luck with the depth maps. I've used the input COLMAP data with other research projects, but I could have hit a snag with the EasyVolcap conversion/requirements.

I was considering trying to use my own dense point clouds, or perhaps trying Im4D (since it appears to take the same input format), just to see if there's something specific about NGP-T that doesn't like my scene.

briancantwe commented 4 months ago

Actually, there's no requirement in EasyVolcap for the cameras to all be the same focal length or the images all the same aspect ratio, is there? I have a mix of formats.

dendenxu commented 4 months ago

> Actually, there's no requirement in EasyVolcap for the cameras to all be the same focal length or the images all the same aspect ratio, is there? I have a mix of formats.

Yes, we took special care in the data loading process to support differently sized images (or images with different intrinsics).

dendenxu commented 4 months ago

> Hi @dendenxu! It does appear that the TypeError went away with the update.
>
> I played around a lot with the box and clipping sizes, but still no luck with the depth maps. I've used the input COLMAP data with other research projects, but I could have hit a snag with the EasyVolcap conversion/requirements.
>
> I was considering trying to use my own dense point clouds, or perhaps trying Im4D (since it appears to take the same input format), just to see if there's something specific about NGP-T that doesn't like my scene.

There's a visualize_camera script that outputs a PLY file of your converted camera parameters. You could check whether this visualization, the COLMAP visualization, and your actual setup all match up.
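
As a complementary quick check, you could also print the recovered camera centers and their spread and compare them with the COLMAP reconstruction. A small sketch (not an EasyVolcap script; the YAML key names 'names', 'Rot_{cam}', 'T_{cam}' are assumptions you'd adjust to match your converted extri.yml):

# Print camera centers recovered from a converted extri.yml for a scale sanity check.
import cv2
import numpy as np

def load_camera_centers(extri_path='extri.yml'):
    fs = cv2.FileStorage(extri_path, cv2.FILE_STORAGE_READ)
    names_node = fs.getNode('names')                       # assumed key holding the camera name list
    names = [names_node.at(i).string() for i in range(names_node.size())]
    centers = []
    for name in names:
        R = fs.getNode(f'Rot_{name}').mat()                # assumed key: 3x3 world-to-camera rotation
        T = fs.getNode(f'T_{name}').mat()                  # assumed key: 3x1 translation
        centers.append((-R.T @ T).ravel())                 # camera center in world space: C = -R^T @ T
    return names, np.array(centers)

names, C = load_camera_centers()
for n, c in zip(names, C):
    print(n, c)
print('camera rig extent:', C.max(0) - C.min(0))           # compare this scale against COLMAP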

dendenxu commented 4 months ago

> I was considering trying to use my own dense point clouds, or perhaps trying Im4D (since it appears to take the same input format), just to see if there's something specific about NGP-T that doesn't like my scene.

Aside from Im4D, you could also try visualizing the dataset with the ENeRFi inference model as mentioned here. It's also a good way to check whether the camera poses are reasonable (aside from visualizing the cameras).

briancantwe commented 4 months ago

OK, I tried out the visualize_cameras script. The cameras all appear to be in the right spot. Not sure if the link for ENeRFi usage above is correct? I get an all-grey screen if I use evc -t gui, though.

dendenxu commented 4 months ago

Ah, indeed, the link was wrong. Updated links: https://github.com/zju3dv/EasyVolcap?tab=readme-ov-file#inferencing-with-enerfi and https://github.com/zju3dv/EasyVolcap/blob/main/docs/projects/enerf.md