Numeric instability with outpainting

mattgara commented 5 months ago

I've been able to run the example (teddy) image up until the outpainting step, but repeatedly come across the following error:

[INFO] Start outpainting.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:02<00:00, 16.79it/s]
Pipelines loaded with `dtype=torch.float16` cannot run with `cpu` device. It is not recommended to move them to `cpu` as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support for`float16` operations on this device in PyTorch. Please, remove the `torch_dtype=torch.float16` argument, or use another device for inference.
Pipelines loaded with `dtype=torch.float16` cannot run with `cpu` device. It is not recommended to move them to `cpu` as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support for`float16` operations on this device in PyTorch. Please, remove the `torch_dtype=torch.float16` argument, or use another device for inference.
Pipelines loaded with `dtype=torch.float16` cannot run with `cpu` device. It is not recommended to move them to `cpu` as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support for`float16` operations on this device in PyTorch. Please, remove the `torch_dtype=torch.float16` argument, or use another device for inference.
Pipelines loaded with `dtype=torch.float16` cannot run with `cpu` device. It is not recommended to move them to `cpu` as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support for`float16` operations on this device in PyTorch. Please, remove the `torch_dtype=torch.float16` argument, or use another device for inference.
[INFO] Number of points at merging:68449
Traceback (most recent call last):
  File "/home/dreamer/threestudio/launch.py", line 301, in <module>
    main(args, extras)
  File "/home/dreamer/threestudio/launch.py", line 244, in main
    trainer.fit(system, datamodule=dm, ckpt_path=cfg.resume)
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 252, in advance
    batch_output = self.manual_optimization.run(kwargs)
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/manual.py", line 94, in run
    self.advance(kwargs)
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/manual.py", line 114, in advance
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/dreamer/.local/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 391, in training_step
    return self.lightning_module.training_step(*args, **kwargs)
  File "/home/dreamer/threestudio/custom/threestudio-3dgs/system/scene_lang.py", line 124, in training_step
    self.outpaint()
  File "/home/dreamer/threestudio/custom/threestudio-3dgs/system/scene_lang.py", line 325, in outpaint
    output = self(sample)
  File "/home/dreamer/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dreamer/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dreamer/threestudio/custom/threestudio-3dgs/system/scene_lang.py", line 107, in forward
    outputs = self.renderer.batch_forward(batch)
  File "/home/dreamer/threestudio/custom/threestudio-3dgs/renderer/gaussian_batch_renderer.py", line 38, in batch_forward
    render_pkg = self.forward(
  File "/home/dreamer/threestudio/custom/threestudio-3dgs/renderer/diff_gaussian_rasterizer.py", line 126, in forward
    result_list = rasterizer(
  File "/home/dreamer/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dreamer/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dreamer/.local/lib/python3.10/site-packages/diff_gaussian_rasterization/__init__.py", line 222, in forward
    return rasterize_gaussians(
  File "/home/dreamer/.local/lib/python3.10/site-packages/diff_gaussian_rasterization/__init__.py", line 33, in rasterize_gaussians
    return _RasterizeGaussians.apply(
  File "/home/dreamer/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/dreamer/.local/lib/python3.10/site-packages/diff_gaussian_rasterization/__init__.py", line 97, in forward
    num_rendered, color, language_feature, radii, geomBuffer, binningBuffer, imgBuffer = _C.rasterize_gaussians(*args)
RuntimeError: numel: integer multiplication overflow

Any help would be appreciated.

Note, I'm running in a docker container.
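For readers unfamiliar with the message: as far as I can tell, "numel: integer multiplication overflow" is raised when the product of a tensor's dimension sizes no longer fits in a signed 64-bit integer, which usually means a garbage or absurdly large size reached an allocation. The sketch below only imitates that kind of check in plain Python. `checked_numel` is a hypothetical helper for illustration, not PyTorch's or the rasterizer's actual code.

```python
# Illustrative only: mimics an int64 overflow check on a tensor's element count.
# checked_numel is a hypothetical name, not a PyTorch API.
INT64_MAX = 2**63 - 1


def checked_numel(sizes):
    """Multiply dimension sizes, failing once the product exceeds int64."""
    total = 1
    for size in sizes:
        if size < 0:
            raise ValueError(f"negative dimension: {size}")
        total *= size
        if total > INT64_MAX:
            raise OverflowError("numel: integer multiplication overflow")
    return total


if __name__ == "__main__":
    # A normal image-sized buffer is fine.
    print(checked_numel([3, 512, 512]))  # 786432

    # A corrupted size (for example, read from an uninitialized value) is not.
    try:
        checked_numel([2**40, 2**40])
    except OverflowError as err:
        print("failed:", err)
```

If this reading is right, the question to debug is where the bad size comes from (for example, a failed allocation or a mismatched CUDA build), not the multiplication itself.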

zqh0253 commented 5 months ago

Can you share your docker file with me?

mattgara commented 5 months ago

This is the docker file https://huggingface.co/spaces/qihang/3Dit-Scene/blob/main/Dockerfile

zqh0253 commented 5 months ago

Hi, can you tell me:

Which OS are you working with?
Which GPU card are you using?

mattgara commented 5 months ago

Here is the output from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 8000                Off | 00000000:1A:00.0 Off |                  Off |
+---------------------------------------------------------------------------------------+

and the host is

cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

zqh0253 commented 5 months ago

Hi, I apologize for the inconvenience caused, but it is hard for me to reproduce the error. Maybe you can check the following:

I notice that your CUDA version is 12.2, but the Dockerfile specifies cuda11.8. I'm not sure whether this mismatch is causing the problem. Could you try modifying the Dockerfile, or installing the environment in a Python virtual environment instead?
Running out of memory can also cause this error: https://github.com/graphdeco-inria/gaussian-splatting/issues/24. Is any other program running at the same time as the example?
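One way to check the memory question is to look at GPU usage just before launching the example. The sketch below is a rough helper, assuming `nvidia-smi` is on the PATH and reports plain integers for memory. `parse_gpu_memory` and `query_gpu_memory` are hypothetical names, and the sample row in the usage note is made up. On the version question: as far as I know, the "CUDA Version" in the nvidia-smi header is the newest CUDA runtime the driver supports, not the toolkit inside the container, so a cuda11.8 image would normally be expected to run on that driver.

```python
# Rough sketch: report used/total memory per GPU via nvidia-smi's CSV query mode.
# Function names are hypothetical; values are in MiB because of "nounits".
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'name, used, total' rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split from the right so a comma in the GPU name cannot break parsing.
        name, used, total = (field.strip() for field in line.rsplit(",", 2))
        rows.append({"name": name, "used_mib": int(used), "total_mib": int(total)})
    return rows


def query_gpu_memory():
    """Run nvidia-smi if available; return an empty list otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpu_memory():
        free = gpu["total_mib"] - gpu["used_mib"]
        print(f"{gpu['name']}: {free} MiB free of {gpu['total_mib']} MiB")
```

For example, `parse_gpu_memory("Quadro RTX 8000, 1200, 49152")` yields one entry with `used_mib` 1200 and `total_mib` 49152. If a large share of memory is already in use before the run starts, another process is probably competing for the card.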

mattgara commented 5 months ago

Okay, thanks for looking into this.

If I have time, I'll attempt to get this working in the docker again.

FWIW, I've been able to get the Dockerfile for threestudio to work after several rounds of debugging dependency issues, and AFAICT it looks like the Dockerfile above is based off that Dockerfile, so I can probably apply the same fixes.

The main issue in the threestudio Dockerfile is that not all dependencies are pinned to version numbers, so certain dependencies, when installed, cause the base versions of torch and other core libraries to be overridden (with newer versions installed), and this causes downstream errors.
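A quick way to spot this kind of drift inside the container is to compare the installed versions against the intended pins after the build. This is a minimal sketch using only the standard library. `find_version_drift` is a hypothetical helper, and the versions in the usage note are placeholders, not the versions this project needs.

```python
# Minimal sketch: detect packages whose installed version differs from a pin.
# find_version_drift is a hypothetical helper; the pins are supplied by the caller.
from importlib import metadata


def find_version_drift(pins, get_version=metadata.version):
    """Return {package: (wanted, installed)} for every package that drifted.

    installed is None when the package is not installed at all.
    """
    drift = {}
    for package, wanted in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        compared = installed
        # Ignore a local build suffix such as "+cu118" unless the pin includes one.
        if installed is not None and "+" not in wanted:
            compared = installed.split("+", 1)[0]
        if compared != wanted:
            drift[package] = (wanted, installed)
    return drift


if __name__ == "__main__":
    # Placeholder pins for illustration only.
    example_pins = {"torch": "0.0.0", "numpy": "0.0.0"}
    for name, (wanted, installed) in find_version_drift(example_pins).items():
        print(f"{name}: pinned {wanted}, installed {installed}")
```

Running this as the last step of the image build (with the real pins) would show immediately if a later `pip install` replaced torch or another core library.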

zqh0253 commented 5 months ago

Yes, I encountered the same issue with threestudio's Dockerfile. That's why I specified the versions for several packages. You'll need to determine the exact versions compatible with your hardware.

Overall, it seems promising now that you've resolved the issues with threestudio's Dockerfile. If you encounter any further problems, feel free to reach out for a discussion.

zqh0253 / 3DitScene

Numeric instability with outpainting #7