sirisian opened this issue 3 years ago
Hi, thanks!
(1) Are you saying that training on our provided example gives much worse rendering quality? There are a few keys to getting high-quality renderings from the model: (1) keep the rendered image resolution around 512x288; higher resolutions require many more sample points (if you double the image width and height, you need at least 4x the sample points according to light field theory). (2) Keep the number of training frames to 20~50 (the default 30 gives the best quality); longer sequences make the network produce over-smooth renderings unless you double the network capacity. (3) Very high-speed motion in the video can also cause the model to produce over-smoothed or incorrect results.
(2) Yes. If you go to the script "run_nerf.py", you can see writer.add_image("val/depth_map_ref", normalize_depth(ret['depth_map_ref']), global_step=i, dataformats='HW'), which visualizes the rendered depth map in TensorBoard.
You can also write out the depth in evaluation mode. Just uncomment these lines: https://github.com/zhengqili/Neural-Scene-Flow-Fields/blob/698421da99610d483bc0d1f44d1cd0f16b69e583/nsff_exp/render_utils.py#L182-L193 Pay attention to which render function you use; there are several methods in there.
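If it helps, here is a minimal sketch of how the uncommented depth output could be dumped to disk during evaluation. The `depth_map_ref` key follows the TensorBoard call above; the simple min-max normalization, output directory, and file naming are placeholders, not the repo's exact helpers:

```python
import os
import imageio
import numpy as np

def save_depth_png(depth, out_dir, frame_idx):
    """Normalize a rendered (H, W) depth map to [0, 255] and save it as a PNG.

    The repo's own normalize_depth / colorization may differ from this
    simple min-max scaling; this is just for quick inspection.
    """
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)
    os.makedirs(out_dir, exist_ok=True)
    imageio.imwrite(os.path.join(out_dir, f'depth_{frame_idx:03d}.png'),
                    (255 * d).astype(np.uint8))

# e.g. after rendering one frame in evaluation:
# save_depth_png(ret['depth_map_ref'].cpu().numpy(), 'out/depth', i)
```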
Regarding the render quality: the first two frames from training on the example look like this:
They seemed blurrier than the GIF at first glance.
Thank you for the depth recommendations.
Regarding my data, I have none yet, but it would be similar to the examples. I was going to compare the generated depth to other projects when I have time. I'm mostly interested in photogrammetry.
Hi, it seems that either you trained with too many frames, or you rendered using only the dynamic representation (i.e. without the static scene representation), because the static ground region seems to lose most of its texture and it should mostly belong to the background. Did you use the default configuration or modify some hyperparameters?
I used the default config file with just the model name changed and the data directories pointed to the right folders.
The motion_masks folder has images that are 1024x576. They look like: The images and images_512x288 folders have images that are 512x288. If that's an error, it might be related to running the scripts on Windows. I can fix those and try again if that's the issue.
Yes. I guess it might be related to the other issue mentioned about running on the Windows platform (I believe the motion masks should be the same size as the images, i.e. 512x288 rather than 1024x576).
@zhengqili I also wonder why the network performs worse when the sequence is too long?
longer sequences make the network produce over-smooth renderings
Do you have a theoretical reason why this happens, or is it based on experiments?
My feeling is that the reason comes from the limited capacity of the MLPs and the NDC coordinates we build the reconstruction in (NDC assumes the entire scene lies inside the reference view frustum), although some concurrent work shows this might not be the case: https://video-nerf.github.io/.
Hmm, I see. I work with forward-moving scenes, so I don't encounter this problem; everything lies inside the frustum of the first camera. Indeed, NDC might not be a good choice for long, laterally-moving scenes. By the way, the paper you mention doesn't use NDC:
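For reference, this is roughly the standard forward-facing NDC warp used in NeRF-style code (a sketch following the original NeRF derivation, not necessarily this repo's exact function). It maps the reference camera's frustum between the near plane and infinity into the cube [-1, 1]^3, which is why geometry outside that frustum is poorly represented:

```python
import torch

def ndc_rays(H, W, focal, near, rays_o, rays_d):
    """Warp rays into NDC space (forward-facing scenes only).

    Follows the derivation in the original NeRF appendix: the reference
    frustum from the near plane out to infinity maps into [-1, 1]^3.
    """
    # Shift ray origins onto the near plane first
    t = -(near + rays_o[..., 2]) / rays_d[..., 2]
    rays_o = rays_o + t[..., None] * rays_d

    o0 = -focal / (0.5 * W) * rays_o[..., 0] / rays_o[..., 2]
    o1 = -focal / (0.5 * H) * rays_o[..., 1] / rays_o[..., 2]
    o2 = 1. + 2. * near / rays_o[..., 2]

    d0 = -focal / (0.5 * W) * (rays_d[..., 0] / rays_d[..., 2] - rays_o[..., 0] / rays_o[..., 2])
    d1 = -focal / (0.5 * H) * (rays_d[..., 1] / rays_d[..., 2] - rays_o[..., 1] / rays_o[..., 2])
    d2 = -2. * near / rays_o[..., 2]

    return torch.stack([o0, o1, o2], -1), torch.stack([d0, d1, d2], -1)
```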
While we do not use the normalized device coordinates, we sample each ray uniformly in inverse depth.
And I think it's better to adapt the coordinate system to each scene (selected by the user). For example, the running-kid scene only contains objects at finite depth, so I think Euclidean coordinates also work. NDC would be a better choice if the scene contains faraway geometry such as the sky or distant buildings.
In my experience, sampling in inverse-depth space produces worse results for NeRF MLPs (I think that's the reason the dog sequence in the video-nerf paper is very blurry even though it is close to the camera).
If you want to work in Euclidean space, I found that NeRF-W and Nerfies both sample points uniformly in depth rather than in disparity, and they increase the number of sample points to 256~512 to mitigate over-smoothing issues. I suspect the underlying reason is that, to make positional encoding work, the input coordinates must be uniformly distributed, because an MLP with Fourier features can be thought of as a stationary neural tangent kernel, as described in https://bmild.github.io/fourfeat/
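To make the difference concrete, here is a small, purely illustrative sketch of drawing ray depths uniformly in metric depth versus uniformly in inverse depth (disparity); the function and variable names are not from the repo:

```python
import torch

def sample_z_vals(near, far, n_samples, in_disparity=False):
    """Return n_samples depth values between near and far along a ray."""
    t = torch.linspace(0., 1., n_samples)
    if in_disparity:
        # Uniform in inverse depth: dense near the camera, sparse far away
        z_vals = 1. / (1. / near * (1. - t) + 1. / far * t)
    else:
        # Uniform in metric depth, as in NeRF-W / Nerfies discussed above
        z_vals = near * (1. - t) + far * t
    return z_vals
```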
Before I try this: have you tried running this with https://github.com/IBM/pytorch-large-model-support to train a full-resolution 1270x714 model (for the example)? It definitely uses more than 24 GB of GPU memory, as I ran out. My naive goal would be to set N_samples to something like 1152. I should have asked this in my original question: how much memory is required to train that setup at high quality? (Training time isn't a concern.)
Additionally, is there a way to estimate the memory usage for 1920x1080 images with, I assume, N_samples = 1792?
I don't think you need to increase the number of samples for higher resolutions, either theoretically or empirically. I have trained lots of NeRFs at high resolution with only 64 samples, and they work very well.
Concerning memory: the training memory shouldn't actually vary no matter how large your image is, because training only uses a fixed number of rays (1024) per step. The problem comes from validation (done every 500 steps). The current code keeps every ray's result on the GPU, so memory grows with image size, but you can either 1. move each ray chunk's result to the CPU (see the sketch below), or 2. disable validation entirely if you can tell from the metrics that your model is training well.
For high-resolution rendering, you either need to add importance sampling similar to the original NeRF or increase the uniform sampling rate; otherwise you will start to see aliasing. There is light field theory stating that the rendering range and resolution are proportional to the number of sampled planes/points along the rays: https://arxiv.org/abs/1905.00413
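A minimal sketch of option 1, assuming a generic `render_rays` callable that returns a dict of per-ray tensors (the names and chunk size here are placeholders, not this repo's exact API): each chunk's outputs are moved to the CPU before concatenation, so validation memory stays bounded regardless of image size.

```python
import torch

@torch.no_grad()
def render_image_chunked(render_rays, rays, chunk=1024 * 32):
    """Render a full image's rays in chunks, accumulating results on the CPU."""
    outputs = {}
    for i in range(0, rays.shape[0], chunk):
        ret = render_rays(rays[i:i + chunk])           # dict of GPU tensors
        for k, v in ret.items():
            outputs.setdefault(k, []).append(v.cpu())  # offload each chunk
    return {k: torch.cat(v, 0) for k, v in outputs.items()}
```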
Yes, the paper you mention uses an MPI, which has fixed planes, so I agree that in that case the number of samples should be proportional to the image size. The implementation you provide here only uses a fixed step size (1/128), so I assume you're right that he needs to increase the number of samples.
However, if we follow the original NeRF, which has importance sampling that allocates samples dynamically according to the scene (according to xyz and t), the number of samples need not be increased. I have experimented on many dynamic scenes with this strategy and did not find that I had to increase the sample count.
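For context, a condensed sketch of that importance-sampling step, essentially the original NeRF's `sample_pdf` (simplified here for illustration): the coarse pass's weights define a piecewise-constant PDF over the depth intervals, and fine samples are drawn by inverting its CDF so they concentrate where the scene actually has density.

```python
import torch

def sample_pdf(bins, weights, n_importance):
    """Draw n_importance depths per ray from the piecewise-constant PDF
    defined by `weights` ([..., M]) over interval edges `bins` ([..., M+1])."""
    weights = weights + 1e-5  # avoid division by zero
    pdf = weights / torch.sum(weights, -1, keepdim=True)
    cdf = torch.cumsum(pdf, -1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], -1)  # [..., M+1]

    # Invert the CDF at uniformly random locations
    u = torch.rand(list(cdf.shape[:-1]) + [n_importance], device=cdf.device)
    inds = torch.searchsorted(cdf, u, right=True)
    below = torch.clamp(inds - 1, 0)
    above = torch.clamp(inds, max=cdf.shape[-1] - 1)

    cdf_g0 = torch.gather(cdf, -1, below)
    cdf_g1 = torch.gather(cdf, -1, above)
    bins_g0 = torch.gather(bins, -1, below)
    bins_g1 = torch.gather(bins, -1, above)

    # Linearly interpolate within the selected interval
    denom = torch.where(cdf_g1 - cdf_g0 < 1e-5,
                        torch.ones_like(cdf_g0), cdf_g1 - cdf_g0)
    t = (u - cdf_g0) / denom
    return bins_g0 + t * (bins_g1 - bins_g0)
```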
I ran the example training on an RTX 3090 and it took 41 hours to complete. Are there any recommended settings for maximum quality on a 24 GB card, other than trial and error, given that training time isn't a concern? I noticed that when I run the example, the quality is much lower than the GIFs on your project page. Are you using more samples in your configuration?
Also, can this generate depth map output, or am I misunderstanding what the video was describing? It looked like you had a depth visualization per frame. Is extracting that information easily supported in the released implementation?