Mainly for performance reasons. In our experiments (Sec. 5.3), our depth-peeling implementation can be 10-20x faster than the CUDA-based implementation. We also tested Pulsar, which is likewise slower than our implementation. A comparison was also made with 3DGS, where we obtain similar speeds (along with the other advantages brought by our other components, App. B.3). My personal take is that the hardware-accelerated rasterization pipeline is reasonably fast (faster than CUDA-based software rasterization) as long as the number of points stays moderate (around 250K). Switching from depth peeling to 3DGS might make our method scale better to extremely large point counts, but we did not observe the need for such a large point count in our experiments.
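To make the per-pass logic concrete, here is a toy CPU sketch of depth peeling over a point cloud. It is only an illustration of the idea, not our implementation: the actual renderer runs these passes through the hardware rasterizer with per-point radii, whereas this sketch splats each point to a single pixel, and all function and variable names here are made up for the example.

```python
import numpy as np

def depth_peel(px, py, depth, H, W, K=4):
    """Toy K-pass depth peeling over a point cloud (CPU sketch).

    px, py : integer pixel coordinates of the projected points
    depth  : per-point camera-space depth
    Returns a (K, H, W) array of point indices (-1 = empty), where
    pass k stores, per pixel, the k-th closest point.
    """
    layers = np.full((K, H, W), -1, dtype=np.int64)
    prev = np.full((H, W), -np.inf)          # depth already peeled at each pixel
    for k in range(K):
        best = np.full((H, W), np.inf)       # closest depth strictly beyond `prev`
        for i in range(len(depth)):
            x, y, d = px[i], py[i], depth[i]
            if d > prev[y, x] and d < best[y, x]:
                best[y, x] = d
                layers[k, y, x] = i
        prev = np.where(np.isfinite(best), best, prev)
    return layers
```

The resulting per-pixel layers are then blended front to back; on the GPU, each pass is a single hardware-rasterized draw with the previous pass's depth map bound as input, which is where the speedup over software rasterization comes from.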
We could (and did) use other initialization techniques for generating the initial point clouds. In the paper, when foreground masks are not available (e.g., when modeling the static background), we trained an Instant-NGP model instead, as stated in Sec. 4. Space carving was a consistently reliable and good-enough choice for our setting.
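For reference, a minimal sketch of mask-based space carving (visual hull) might look like the following; it assumes calibrated cameras and binary foreground masks, and the surviving voxel centers would serve as the initial point cloud. Names and the exact projection convention are just for illustration.

```python
import numpy as np

def carve_visual_hull(grid_xyz, masks, projections):
    """Keep voxels whose projection lands inside the foreground mask in every view.

    grid_xyz    : (N, 3) voxel centers in world space
    masks       : list of (H, W) boolean foreground masks, one per camera
    projections : list of (3, 4) camera projection matrices (world -> pixels)
    Returns a boolean (N,) array marking voxels inside the visual hull.
    """
    homo = np.concatenate([grid_xyz, np.ones((len(grid_xyz), 1))], axis=1)  # (N, 4)
    inside = np.ones(len(grid_xyz), dtype=bool)
    for mask, P in zip(masks, projections):
        uvw = homo @ P.T                                  # (N, 3) homogeneous pixel coords
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        H, W = mask.shape
        valid = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        hit = np.zeros(len(grid_xyz), dtype=bool)
        hit[valid] = mask[v[valid], u[valid]]
        inside &= hit                                     # carve away anything outside any mask
    return inside
```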
I do think this is an interesting direction for improving convergence (although it would restrict the setting a little).
Do you mean applying temporal priors like optical flow? I do think this might work and make training easier (with the added benefit of producing correspondences). However, in our early experiments we found optical flow estimators not consistent or reliable enough when the motion is fast and extremely complex, so this could improve the initial training but might limit the final quality.
It does handle the view dependency in the dynamic multi-view datasets we experimented on well enough (as shown in the videos on the project page). IBR + SH provides a fairly robust view-dependency model for our 4D view synthesis method.
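Roughly, the two components compose as in the sketch below: an image-based blend of colors sampled from nearby source views plus a low-order SH residual evaluated at the viewing direction. This is a simplified illustration only; in the actual model the blending weights and SH coefficients are predicted by networks, and the function names and degree-1 basis here are assumptions for the example.

```python
import torch

def sh_degree1(dirs):
    """Real spherical-harmonics basis up to degree 1 at unit view directions (N, 3)."""
    x, y, z = dirs.unbind(-1)
    c0 = 0.28209479177387814   # Y_0^0
    c1 = 0.4886025119029199    # degree-1 constant (sign convention may vary)
    return torch.stack([torch.full_like(x, c0), -c1 * y, c1 * z, -c1 * x], dim=-1)  # (N, 4)

def point_color(src_colors, blend_logits, sh_coeffs, view_dirs):
    """Compose the per-point color from an IBR blend plus an SH residual.

    src_colors   : (N, V, 3) colors sampled for each point from V nearby source images
    blend_logits : (N, V) predicted blending scores over the source views
    sh_coeffs    : (N, 3, 4) per-point SH coefficients (RGB x degree-1 basis)
    view_dirs    : (N, 3) unit viewing directions
    """
    weights = torch.softmax(blend_logits, dim=-1)                 # (N, V) blend weights
    c_ibr = (weights.unsqueeze(-1) * src_colors).sum(dim=1)       # (N, 3) image-based color
    basis = sh_degree1(view_dirs)                                 # (N, 4) SH basis values
    c_sh = torch.einsum('nck,nk->nc', sh_coeffs, basis)           # (N, 3) view-dependent residual
    return c_ibr + c_sh
```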
For now, we're focusing on making 4K4D able to handle longer videos and produce correspondences.
Thanks for sharing the result. I have a couple of follow-up questions: