zju3dv / 4K4D

[CVPR 2024] 4K4D: Real-Time 4D View Synthesis at 4K Resolution
https://zju3dv.github.io/4k4d/

Nice Work! #1

bigdimboom opened 1 year ago

bigdimboom commented 1 year ago

Thanks for sharing the results! I have a few follow-up questions:

  1. Why use depth peeling? There are many other ways of sorting by depth.
  2. Why use space carving to generate the point cloud? What is the advantage?
  3. Do you think it would converge faster if you included shape-aware terms, like human body detection?
  4. Using temporal info is expensive, but what if you just used it as a simple consistency check between adjacent frames?
  5. Does it handle reflections well, since you enabled view-dependent training?
  6. What other methods might be tried in the future?
dendenxu commented 1 year ago

Thanks for the questions!

  1. Mainly for performance reasons. In our experiments (Sec. 5.3), our depth-peeling implementation can be 10-20x faster than the CUDA-based implementation. We also tested Pulsar, and it is likewise slower than our implementation. A comparison was also made with 3DGS, and similar speeds were obtained (along with many other advantages from our other components (App. B.3)). My personal take is that the hardware-accelerated rasterization pipeline is reasonably fast (faster than CUDA-based software rasterization) when the number of points is reasonable (around 250K). Switching from depth peeling to 3DGS might make our method scale better to an extremely large number of points, but we did not observe the need for such a large point count in our experiments. (A minimal sketch of the peel-and-composite idea follows this list.)
  2. We could (and did) use other initialization techniques to generate the initial point clouds. In the paper, when foreground masks are not available (e.g., when modeling the static background), we trained an Instant-NGP model instead, as stated in Sec. 4. Space carving was simply a consistently reliable and good-enough choice for our setting. (A space-carving sketch also follows this list.)
  3. I do think this is an interesting direction for improving convergence (although it would restrict the setting a little).
  4. Do you mean applying temporal priors like optical flow? I do think this might work and make training easier (with the added benefit of producing correspondences). However, in our early experiments we found optical-flow estimators neither consistent nor reliable enough when the motion is fast and extremely complex, so this could help the initial training but might limit the final quality. (A simple forward-backward flow consistency check is sketched after this list.)
  5. It handles the view dependency present in the dynamic multi-view datasets we experimented on well enough (as shown in the videos on the project page). IBR + SH provides a quite robust view-dependency model for our 4D view synthesis method. (A minimal SH sketch follows this list.)
  6. For now, we're focusing on making 4K4D able to handle longer videos and produce correspondences.
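To make answer 1 concrete, here is a minimal NumPy sketch of K-layer depth peeling followed by front-to-back alpha compositing, emulating on the CPU what the hardware passes do per pixel. The fragment layout, shapes, and function name are illustrative assumptions, not the repository's actual OpenGL implementation.

```python
# Minimal CPU sketch of K-layer depth peeling + front-to-back compositing.
# Shapes and names are illustrative; the real pipeline peels layers in
# hardware rasterization passes, not with dense NumPy arrays.
import numpy as np

def depth_peel_composite(depths, colors, alphas, K=4):
    """depths: (P, F), colors: (P, F, 3), alphas: (P, F) for F candidate
    fragments per pixel; returns (P, 3) colors from the K nearest layers."""
    P, F = depths.shape
    out = np.zeros((P, 3))
    trans = np.ones(P)               # accumulated transmittance per pixel
    last = np.full(P, -np.inf)       # depth of the previously peeled layer
    rows = np.arange(P)
    for _ in range(K):
        # One peeling pass: pick the nearest fragment strictly behind
        # the layer peeled in the previous pass.
        masked = np.where(depths > last[:, None], depths, np.inf)
        idx = masked.argmin(axis=1)
        d = masked[rows, idx]
        valid = np.isfinite(d)       # pixels that still have fragments left
        a = alphas[rows, idx] * valid
        out += (trans * a)[:, None] * colors[rows, idx]
        trans *= 1.0 - a
        last = np.where(valid, d, last)
    return out
```

Each pass touches every fragment once, which is why a small, fixed number of peeled layers K keeps the hardware version fast.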
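For answer 2, a minimal space-carving sketch under standard pinhole assumptions: a voxel survives only if it projects inside the foreground mask of every view. The camera conventions (world-to-camera R, t and a 3x3 intrinsic matrix K) and all names are assumptions for illustration, not the repository's code.

```python
# Minimal space-carving sketch: keep voxels whose projections land inside
# every view's foreground mask. Conventions are assumed, not 4K4D's own.
import numpy as np

def space_carve(grid_pts, Ks, Rs, ts, masks):
    """grid_pts: (N, 3) voxel centers; Ks/Rs/ts: per-view intrinsics and
    world-to-camera extrinsics; masks: list of (H, W) bool foreground masks.
    Returns an (N,) bool mask of surviving voxels."""
    keep = np.ones(len(grid_pts), dtype=bool)
    for K, R, t, m in zip(Ks, Rs, ts, masks):
        cam = grid_pts @ R.T + t                 # world -> camera frame
        z = np.clip(cam[:, 2], 1e-6, None)       # avoid division by zero
        uvw = cam @ K.T                          # apply intrinsics
        u, v = uvw[:, 0] / z, uvw[:, 1] / z      # perspective divide
        H, W = m.shape
        inside = (cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        ui = np.clip(u.astype(int), 0, W - 1)
        vi = np.clip(v.astype(int), 0, H - 1)
        keep &= inside & m[vi, ui]               # carve away background voxels
    return keep
```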
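For answer 4, the lightweight check the question hints at could be a forward-backward flow consistency test: follow the forward flow, add the backward flow at the warped location, and trust only pixels whose round trip stays near zero. The threshold and array layout are illustrative.

```python
# Minimal forward-backward optical-flow consistency check between two
# adjacent frames. Threshold and layout are illustrative assumptions.
import numpy as np

def fb_consistency(flow_fw, flow_bw, thresh=1.0):
    """flow_fw: (H, W, 2) flow t -> t+1; flow_bw: (H, W, 2) flow t+1 -> t.
    Returns an (H, W) bool mask of pixels passing the round-trip test."""
    H, W, _ = flow_fw.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Follow the forward flow, then look up the backward flow there.
    x2 = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, W - 1)
    y2 = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, H - 1)
    round_trip = flow_fw + flow_bw[y2, x2]       # ~0 where flows agree
    return np.linalg.norm(round_trip, axis=-1) < thresh
```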
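For answer 5, the SH part of the view-dependency model can be illustrated with a degree-1 real spherical-harmonics term evaluated along the viewing direction; the constants follow the standard real SH basis, but the coefficient shapes and names are assumptions, not the paper's exact formulation.

```python
# Minimal degree-1 real spherical-harmonics view-dependence sketch:
# per-point SH coefficients turned into an RGB term for a view direction.
import numpy as np

SH_C0 = 0.28209479177387814      # Y_0^0 constant
SH_C1 = 0.4886025119029199       # scale of the three Y_1^m basis functions

def sh_color(coeffs, view_dir):
    """coeffs: (P, 4, 3) SH coefficients per point; view_dir: (P, 3) unit
    vectors from point to camera. Returns (P, 3) view-dependent RGB."""
    x, y, z = view_dir[:, 0:1], view_dir[:, 1:2], view_dir[:, 2:3]
    basis = np.concatenate(
        [np.full_like(x, SH_C0), -SH_C1 * y, SH_C1 * z, -SH_C1 * x],
        axis=1,                  # (P, 4): degree-0 then degree-1 basis
    )
    return np.einsum('pb,pbc->pc', basis, coeffs)
```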