zhengqili / Neural-Scene-Flow-Fields

PyTorch implementation of paper "Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes"
MIT License

poor results on a new dataset #15

Open yaseryacoob opened 3 years ago

yaseryacoob commented 3 years ago

I trained on a new dataset (the full 500,000 iterations), but had to change netwidth to 128 (from 256) to be able to train on a 16GB GPU. See the attached video; the result is poor. The scene is obviously complex, with the original camera moving forward into moving windmills. Do you have any advice? Is the 128-wide MLP the cause of the poor results, or is the dataset too complex for the approach?

https://user-images.githubusercontent.com/8571131/115993511-17a5ed80-a5a1-11eb-98ce-c7b654591c04.mp4

thanks

owang commented 3 years ago

Can you post the input footage that you used?

yaseryacoob commented 3 years ago

Sure, I left the full experiment as a tarfile at ftp://ftp.umiacs.umd.edu/pub/yaser/myfiles/Semafor/AMS/ams3.tgz (all the subfolders that would go under Neural-Scene-Flow-Fields), and also the trained model ftp://ftp.umiacs.umd.edu/pub/yaser/myfiles/Semafor/AMS/490000.tar for rendering.

zhengqili commented 3 years ago

Hi, it seems that I am not able to download the sequence from FTP. Could you share the data somewhere else, like Google Drive?

But my feeling from your description is that, since the original sequence has forward motion, it might not work well if the synthesized camera movement is horizontal. Also, our reconstruction is in NDC space like the original NeRF, so the results can be poor if the forward motion is very large with respect to the reference viewpoint, or if scene content lies behind the reference viewpoint (in that case, the recovered geometry can be wrong).
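For reference, a rough sketch of the NDC warp used by the original NeRF (from its appendix), assuming a reference camera looking down the -z axis with focal length f, image size W x H, and near plane n (the exact constants used in this codebase may differ):

```latex
x' = -\frac{f}{W/2}\cdot\frac{x}{z}, \qquad
y' = -\frac{f}{H/2}\cdot\frac{y}{z}, \qquad
z' = 1 + \frac{2n}{z}
```

As z goes to negative infinity, z' approaches 1, and at z = -n, z' = -1, so the visible frustum of the reference camera maps into a bounded cube. Any content with z > -n (in front of the near plane, or behind the reference camera) falls outside that cube, which is why geometry behind the reference viewpoint cannot be represented correctly.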

yaseryacoob commented 3 years ago

Here it is: https://drive.google.com/drive/folders/1PO-F8CqNFDERv7sM3plG5_lSfLB5JwW0?usp=sharing I sampled the motion tightly, taking every 4th frame from a 30 FPS video, but it is wide angle.

I also suspect that using an MLP width of 128 may have reduced the quality. Can you give it a try on your setup? Also, how much GPU memory are you using? My impression is that even 24GB is not enough for this.

kwea123 commented 3 years ago

@yaseryacoob Can you try to generate --render_lockcam_slowmo? In that scenario the camera doesn't move and only the reconstruction moves, which I think eliminates the "scene content is behind the reference viewpoint" effect that @zhengqili mentions. On the other hand, I do observe in my experiments (also with large forward motion) that even the visible geometry is reconstructed very incorrectly. I had to lengthen the hard mining steps a lot to make the geometry converge before I could introduce the flow training.

In terms of memory, you can reduce N_rand (the batch size). I can successfully train with N_rand=256 (all other settings untouched) on my 11GB 2080 Ti.
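For reference, a minimal sketch of what those two knobs might look like in a nerf-pytorch-style config file (the file name and exact option spelling are assumptions; check the configs shipped with this repo):

```text
# hypothetical training config excerpt, e.g. configs/config_example.txt
N_rand = 256      # rays per gradient step; lowering this is the main memory lever
netwidth = 256    # MLP width; keep the default 256 and lower N_rand instead
```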

yaseryacoob commented 3 years ago

I tried lockcam_slowmo; see below, it is still problematic. One can see that the visible geometry is not stabilized. I want to try another run after I understand what @kwea123 means by "I had to lengthen the hard mining steps a lot to make the geometry converge." Can you please explain?

https://user-images.githubusercontent.com/8571131/116088894-4ccd4100-a670-11eb-9b31-d8ba47c09c02.mp4

zhengqili commented 3 years ago

The problem seems to come from the static region; the dynamic region looks fine. Have you tried the original NeRF code on the video to see if there are any artifacts on the ground and river?

yaseryacoob commented 3 years ago

I see your point. I haven't tried the original NeRF because I didn't think it would be able to handle this clip. I tend to agree with @kwea123 that the geometry is reconstructed incorrectly to begin with; the lockcam result seems to say so. You know your work best, so your feedback is much appreciated. Figuring out where the problem is coming from should give us a good chance at producing a visually appealing clip. Thanks.

yaseryacoob commented 3 years ago

I also looked at your project page: all the videos are narrow angle, though you do have some featureless backgrounds, like the skateboarder.

owang commented 3 years ago

I believe one issue with this scene is that z camera translation (i.e., moving into the scene) is not well supported by the NDC representation, where the scene is warped into a cube (because the camera would translate "into" the NDC cube). As a result, I think some other kind of ray sampling (maybe linear in disparity) could be helpful for this scene. Also see the mip-NeRF paper for another reason why NeRFs struggle to reconstruct scenes observed from different camera distances. I agree that I would first try to get a static NeRF working on the static parts and then look into the dynamics.
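To illustrate that suggestion (a toy sketch, not the sampling code actually used in this repo), sampling linearly in disparity rather than depth concentrates samples near the camera while still covering far content:

```python
import torch

def sample_depths(near: float, far: float, n_samples: int, lindisp: bool = True):
    """Toy sketch: place sample depths between near and far.

    lindisp=False -> spacing linear in depth z
    lindisp=True  -> spacing linear in disparity 1/z (denser near the camera)
    """
    t = torch.linspace(0.0, 1.0, n_samples)
    if lindisp:
        # interpolate in inverse depth, then map back to depth
        return 1.0 / ((1.0 - t) / near + t / far)
    return near * (1.0 - t) + far * t

# e.g. sample_depths(0.5, 100.0, 8, lindisp=True) puts most samples close to the camera
```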

kwea123 commented 3 years ago

@yaseryacoob This is the --render_lockcam_slowmo video? Why is the static scene moving then? Normally only the windmill should move and everything else should stay fixed... Could you also render the depth to see whether the geometry is reconstructed correctly? The "lengthen the hard mining steps" I mentioned is for training the dynamic region better, but for your data that part already looks good; the problem is the reconstruction of the static region. I have one doubt: maybe everything is reconstructed as dynamic, so it moves even though the camera is fixed. If that is the problem, I'm afraid the current implementation can do nothing about it (see #3); you would need to devise other strategies to separate static/dynamic more correctly.

@owang No, I don't think z translation is an issue; rather, it is much more suitable for NDC than horizontal movement, because basically all of the scene lies inside the frustum of the first pose. It becomes an issue only if the translation is too large (e.g., the scene is too long), but I don't think his scene is too long. I have applied NeRF (and other extensions) to a lot of vehicle scenes where forward motion is dominant and have had no issues with NDC space at all. Sampling from 0 to 1 linearly is also fine.

zhengqili commented 3 years ago

Yes, I believe the geometry is incorrect in the static region. This can be due to many reasons: either the representation cannot properly separate the scene and the dynamic model performs badly, or the representation does properly separate the scene but the static model performs badly.

The easiest way to check is to try the original NeRF on this sequence and see whether it works for the static region. If it does, that suggests the dynamic model is somehow modeling a static region with very wrong geometry (although in most cases the dynamic model itself should also faithfully reconstruct the static region, just with less accurate appearance).

yaseryacoob commented 3 years ago

Yes, the video uses --render_lockcam_slowmo. And you are all right: the background motion is the problem, and I'm not sure how to fix it. Perhaps the features in the background are too hard for COLMAP to register correctly to begin with?

Anyhow, I am retraining the clip to see if it comes out better (I should know by tomorrow; I will stop at around 300,000 iterations).

I will then retrain on other scenes with different properties to see what I get. Here is the one I will try next https://www.pexels.com/video/young-people-playing-basketball-5390849/

Too bad it takes so long to train and there are too few resources for this.

yaseryacoob commented 3 years ago

I tried two more scenes: one close-up with small motions, and one wide area with a mix of excellent and poor features. Both flopped. See the lockcam_slowmo renders below. I'm beginning to think one has to be very selective about the input videos.

https://user-images.githubusercontent.com/8571131/117215586-6b4bde80-adcc-11eb-912f-8b1e19427691.mp4

https://user-images.githubusercontent.com/8571131/117215642-828acc00-adcc-11eb-9a70-703ef1087b60.mp4

zhengqili commented 3 years ago

The results look strange, since the geometry of the static region is completely wrong. My feeling is that some type of camera motion or some other factor is breaking our system, but I cannot tell exactly what...

owang commented 3 years ago

Could there be a problem with the camera calibration? How are you validating that the camera poses are correct? If you project fixed 3D scene points back into the video, you can check whether the points drift across the surfaces or move with them.
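A rough sketch of that check, assuming you have already exported the COLMAP poses and sparse points into plain arrays (the helper below is hypothetical, not something provided by this repo):

```python
import numpy as np

def project_points(points_w: np.ndarray, R_w2c: np.ndarray,
                   t_w2c: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project fixed world-space 3D points into one frame (COLMAP convention).

    points_w : (N, 3) world coordinates from the sparse model
    R_w2c    : (3, 3) world-to-camera rotation
    t_w2c    : (3,)   world-to-camera translation
    K        : (3, 3) camera intrinsics
    Returns (N, 2) pixel coordinates.
    """
    p_cam = points_w @ R_w2c.T + t_w2c   # world -> camera
    p_img = p_cam @ K.T                  # camera -> image plane
    return p_img[:, :2] / p_img[:, 2:3]  # perspective divide

# Overlay the returned pixels on each video frame: if the dots stay attached to
# the same physical surfaces over time, the poses are consistent; if they drift
# across the surfaces, the calibration is suspect.
```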

kwea123 commented 3 years ago

I want to know the problem too. Can you also post the training images as a video?

yaseryacoob commented 3 years ago

I left the datasets and the trained checkpoints here https://drive.google.com/drive/u/0/folders/1PO-F8CqNFDERv7sM3plG5_lSfLB5JwW0

Likely the 3D reconstruction from COLMAP is incorrect, but I tried 4 different scenes and they all failed.

zhengqili commented 3 years ago

Hi, I tried one of the videos you provided, and the results look good to me. The one difference is that I subsampled the frames by 2 so that the number of inputs is around 30 (our default hyperparameters are validated/optimized for 30 frames). You can try this trick (especially if the object motion is not large) to see if it works for you.

https://user-images.githubusercontent.com/7671455/119886369-e2265400-bf00-11eb-9d70-4cb0619bc896.mp4

https://user-images.githubusercontent.com/7671455/119886376-e3f01780-bf00-11eb-9b23-8a9bc784e63e.mp4

yaseryacoob commented 3 years ago

Thanks for following up. The results look better, but I am not sure I understand what you mean by subsampling the frames by 2, since I trained on 30 frames as well, and the spatial resolution of the rendering appears the same. Can you please clarify? Thanks.

zhengqili commented 3 years ago

What I suggest is that if adjacent frames in the input video have a small camera baseline, it is worth extracting and using every other frame of the original video to train the model.
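In case it is useful, a minimal sketch of that subsampling step, assuming the training images sit in a flat folder (the paths below are placeholders, and you would presumably rerun the COLMAP/preprocessing step on the subsampled folder afterwards):

```python
import shutil
from pathlib import Path

src = Path("data/my_scene/images")       # placeholder: original extracted frames
dst = Path("data/my_scene_sub2/images")  # placeholder: subsampled copy
dst.mkdir(parents=True, exist_ok=True)

# keep every other frame so that ~60 inputs become ~30
for i, frame in enumerate(sorted(src.glob("*.png"))):
    if i % 2 == 0:
        shutil.copy(frame, dst / frame.name)
```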

But I also tried to train the model on the original image sequence you provided (i.e., on all 50 frames), and the result still looks good to me, as shown below.

If you still see ghosting artifacts in the static region, you might also try the suggestions I provide in this thread: https://github.com/zhengqili/Neural-Scene-Flow-Fields/issues/18. I found those modifications significantly reduce artifacts for most videos.

For the other dance video you provided, I found the problem comes from the static NeRF (the rendering from the original NeRF model is wrong), so I believe either the original NeRF model fails for this video or the camera parameters estimated by COLMAP are not accurate (possibly due to very far scene content causing unstable reconstruction). In this case, I don't think there is a simple way to fix it.

https://user-images.githubusercontent.com/7671455/120902952-30c6a300-c611-11eb-84c2-2867c09d4e58.mp4