Thanks! Most of the videos we showed come from Adobe Stock (but I am no longer at Adobe), so I don't think I can release all the data due to licensing issues.
In terms of masks, in summary, I believe mask quality will not have a strong influence. They are used only for hard-mining the data-driven priors, not for directly telling the network which regions are non-rigid. For videos without thin moving objects such as limbs or hands far from the camera, turning off the coarse mask initialization can still give us good results (even if there are moving shadows).
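Roughly, the idea is something like the sketch below (the function name and weighting are only illustrative, not the exact loss code in the repo):

```python
import torch

def masked_prior_loss(pred, prior, coarse_mask, hard_weight=5.0):
    """L1 loss against a data-driven prior (e.g. monocular depth or optical flow),
    up-weighted inside the coarse motion mask. The mask only re-weights the
    supervision; it never tells the network which regions are non-rigid."""
    residual = (pred - prior).abs()
    if residual.dim() == 3:                  # broadcast an (H, W) mask over channels
        coarse_mask = coarse_mask.unsqueeze(-1)
    weight = 1.0 + (hard_weight - 1.0) * coarse_mask
    return (weight * residual).mean()
```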
Thanks. I will try training on the running kid scene to see how it performs without masks.
Do you still have the training log (tensorboard summary) of the pretrained model (kid-running_ndc_5f_sv_of_sm_unify3)? I would like to compare with my mask-free version.
In order to see how it decomposes the fg/bg, using your pretrained model, I manually set `raw_blend_w` in this line
https://github.com/zhengqili/Neural-Scene-Flow-Fields/blob/7d8a336919b2f0b0dfe458dfd35bee1ffa04bac0/nsff_exp/render_utils.py#L1020
to either 0 or 1 (`raw_blend_w*0` or `raw_blend_w*0+1`). If I understand the code correctly, 0 means render only the bg and 1 means only the fg.
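Concretely, the override at that line looks like this (a sketch of my local change, not the original repo code):

```python
# probe the fg/bg decomposition: override the learned blending weight
raw_blend_w = raw_blend_w * 0        # 0 -> composite only the static (bg) model
# raw_blend_w = raw_blend_w * 0 + 1  # 1 -> composite only the dynamic (fg) model
```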
The command I use to render is:
`python run_nerf.py --config configs/config_kid-running.txt --render_bt --target_idx 0`
The bg (`raw_blend_w*0`):
The fg (`raw_blend_w*0+1`):
And the composed (original code with `raw_blend_w`):
In terms of image quality, the bg/fg separation doesn't matter much, so I agree the mask likely has no strong influence there, as you said. But I'm focusing on the model's capability of separating static regions from dynamic ones, which is also a claim in the paper (Fig. 5) and in the video. However, my result doesn't seem to separate bg and fg correctly; it outputs almost everything as fg. Did I misunderstand anything? If so, how do I generate bg-only and fg-only images correctly?
I believe for the foreground, you need to use blend_alpha to mask out possible static regions from the dynamic model: i.e., you need to render the fg through alpha with the blending weight, using this line:
`alpha_dy = (1. - torch.exp(-opacity_dy * dists)) * raw_blend_w`
FYI, I believe a more principled way is to remove the blending weight for training and rendering, but we always found this strategy causes many more artifacts.
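A minimal sketch of what I mean, assuming standard NeRF-style compositing (variable names follow this thread, not necessarily the exact repo code):

```python
import torch

def fg_only_weights(opacity_dy, dists, raw_blend_w):
    # dynamic alpha scaled by the learned blending weight, so regions the model
    # treats as static are suppressed in the fg-only rendering
    alpha_dy = (1. - torch.exp(-opacity_dy * dists)) * raw_blend_w
    # standard volume-rendering transmittance and per-sample weights along the ray
    T = torch.cumprod(
        torch.cat([torch.ones_like(alpha_dy[..., :1]),
                   1. - alpha_dy + 1e-10], dim=-1), dim=-1)[..., :-1]
    return alpha_dy * T  # use these weights to composite the fg color/depth
```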
> I believe for the foreground, you need to use blend_alpha to mask out possible static regions from the dynamic model: i.e., you need to render the fg through alpha with the blending weight, using this line:
> `alpha_dy = (1. - torch.exp(-opacity_dy * dists)) * raw_blend_w`
I'm not sure what you mean; that line is in the `raw2outputs_blending` function, and if I set `raw_blend_w=1` as input, it sets `alpha_dy` to what it should be and sets `alpha_rig` to zero. What else am I supposed to do?
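For reference, my reading of that blending step is roughly the following (simplified, not the verbatim repo code):

```python
import torch

def blended_alphas(opacity_dy, opacity_rig, dists, raw_blend_w):
    # with w = raw_blend_w: w = 1 keeps alpha_dy unchanged and zeroes alpha_rig,
    # so only the dynamic model gets composited, which matches what I observe
    alpha_dy = (1. - torch.exp(-opacity_dy * dists)) * raw_blend_w
    alpha_rig = (1. - torch.exp(-opacity_rig * dists)) * (1. - raw_blend_w)
    return alpha_dy, alpha_rig
```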
Hi, thanks for the code! Do you plan to publish the full data (running kid, and the other data used in the paper besides the NVIDIA ones) as well?
In fact, the thing I'd like to check the most is your motion masks' accuracy. I'd like to know if it's really possible to let the network learn to separate the background and the foreground by providing only the "coarse mask" you mentioned in the supplementary.
For example, for the bubble scene on the project page, how accurate does the mask need to be to clearly separate the bubbles from the background as you showed? Have you also experimented with the influence of mask quality, i.e. if the masks are coarser (larger), how well can the model separate bg/fg?