xiliu8006 / 3DGS-Enhancer


Questions about the 3DGS finetune #2

Open zhanghaoyu816 opened 1 day ago

zhanghaoyu816 commented 1 day ago

Thank you for your wonderful work!

  1. I have a question about the video rendering results on your project webpage. In the second video ("3DGS-Enhancer can enhance the low quality images from 3D Gaussian splatting"), the enhanced images appear significantly better than the results of the fine-tuned 3DGS. Do you have any insights into this experimental phenomenon?

  2. I'm still a little confused about how to use 3DGS-Enhancer to improve the quality of sparse-view NVS. My understanding is: use the video diffusion model to enhance the low-quality rendered images (obtained by camera interpolation?) and then use them as pseudo training views, with the confidence-weighted loss function correcting the errors possibly introduced by the generated images. Is that correct?

  3. What does "real image as reference views" mean in the ablation study? I don't understand the meaning of "However, due to its native restrictions, we directly feed the original input views into the 3DGS fine-tuning process. This results in more reliable and view-consistent information from the input domain to facilitate 3DGS fine-tuning" in your paper.

I'm looking forward to your reply. Thanks a lot!

xiliu8006 commented 18 hours ago

Hi, thanks for your good questions.

  1. It seems that the enhanced images are significantly better than the results of the fine-tuned 3DGS.

The video diffusion result is smooth and free of artifacts, but that does not guarantee accuracy; in fact, the PSNR of the fine-tuned 3DGS is higher. For instance, even when 3DGS is trained using only real images, artifacts still appear in some novel views. Video diffusion can remove all the significant artifacts, but it also introduces other errors and biases (such as VAE degradation). Overall, video diffusion can provide a smoother video without artifacts, but it cannot achieve a better PSNR.
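This smooth-but-biased vs. artifact-but-accurate trade-off can be illustrated with a quick PSNR comparison. A minimal NumPy sketch (the images and error magnitudes are invented purely for illustration, not taken from the paper):

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio between an image and a reference."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((64, 64))

# "Diffusion-style" error: a small global bias that looks visually clean.
biased = ref + 0.05

# "3DGS-style" error: a sparse, localized artifact that looks bad but
# leaves most pixels exact.
artifact = ref.copy()
artifact[:4, :4] = 0.5

psnr_biased = psnr(biased, ref)      # small error everywhere
psnr_artifact = psnr(artifact, ref)  # large error on few pixels
# The visually "worse" artifact image scores the higher PSNR.
```

The point of the sketch: PSNR averages squared error over all pixels, so a mild global bias (like VAE degradation) can cost more than a visually obvious but localized artifact.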

  2. I'm still a little confused about how to use 3DGS-Enhancer to improve the quality of sparse NVS.

Your understanding is correct. The confidence-weighted loss function rests primarily on an assumption: the main objective is to ensure that the reference views contribute more to 3DGS than the views generated by video diffusion. We then need a strategy that helps the model identify the areas that should rely more heavily on reference views rather than on generated views.
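As a rough illustration of that weighting idea (this is not the paper's actual loss; the constant 0.5 and the per-pixel confidence map are assumptions for the sketch):

```python
import numpy as np

def confidence_weighted_l1(render, target, confidence):
    # Per-pixel L1 loss scaled by a confidence map in [0, 1]:
    # high confidence where the supervision is trustworthy (reference
    # views), low confidence where it comes from generated views.
    return float(np.mean(confidence * np.abs(render - target)))

rng = np.random.default_rng(0)
render = rng.random((8, 8))
target = rng.random((8, 8))

full_conf = np.ones((8, 8))        # supervision from a real reference view
low_conf = 0.5 * np.ones((8, 8))   # supervision from a generated view

loss_ref = confidence_weighted_l1(render, target, full_conf)
loss_gen = confidence_weighted_l1(render, target, low_conf)
# The same photometric error contributes half as much when it comes
# from a generated view, so 3DGS fits the reference views more tightly.
```

In practice the confidence would vary per pixel rather than being a single constant per view, which is exactly the "strategy" mentioned above: deciding where the model should trust references over generations.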

  3. What is "real image as reference views" in the ablation study?

This means that we use not only the views generated by video diffusion but also the real images (the real sparse-view images). This ablation aims to demonstrate that while video diffusion can produce visually impressive videos, it still falls short in achieving high PSNR. It also serves as a partial answer to your first question.
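Concretely, the fine-tuning set then mixes both sources of supervision. A minimal sketch (the data structures and names here are illustrative assumptions, not the repository's actual API):

```python
# Build the 3DGS fine-tuning view set from the real sparse input views
# plus the diffusion-enhanced interpolated views.
real_views = [
    {"name": f"real_{i}", "is_reference": True} for i in range(3)
]
enhanced_views = [
    {"name": f"gen_{i}", "is_reference": False} for i in range(9)
]
train_views = real_views + enhanced_views

# Downstream, the `is_reference` flag lets the fine-tuning loop give
# reference views a higher loss weight than generated views.
num_ref = sum(v["is_reference"] for v in train_views)
```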