zju3dv / 4K4D

[CVPR 2024] 4K4D: Real-Time 4D View Synthesis at 4K Resolution
https://zju3dv.github.io/4k4d/

training speed #31

Open hhhddddddd opened 5 months ago

hhhddddddd commented 5 months ago

Hello, I have a strange problem with training time.

I ran `evc-train -c configs/exps/4k4d/4k4d_0013_01_r4.yaml,configs/specs/static.yaml,configs/specs/tiny.yaml exp_name=4k4d_0013_01_r4_static` on an NVIDIA GeForce RTX 4090, but it takes about 40 minutes to train a single frame. [image]

It is even worse when I run `evc-train -c configs/exps/4k4d/4k4d_0013_01_r4.yaml`: training all frames takes about 4 days (NVIDIA GeForce RTX 4090). I also noticed something strange during training. When I ran a 4K4D training experiment on the 4090, the gpustat command showed two experiments running. [image] (The same is true on 4090)

In addition, the training PSNR of 4k4d_0013_01_r4_static did not reach about 30. [image]

Can you give me any advice? Thank you so much for all your help!

dendenxu commented 4 months ago

Hi, first of all, thanks for using our code! Sorry for the late reply.

For the dynamic dataset, the released default config trains for 800k iterations (set in r4dv.yaml via the epochs parameter). Training typically needs only 400k iterations (epochs=800) to converge. Also note that we measure training speed without evaluation during training (runner_cfg.eval_ep=800) and report training metrics only every 100 iterations (runner_cfg.log_interval=100), so that the numbers reflect the actual training time.

The same goes for the static scene: it only takes 2-3k iterations to converge.
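As a rough illustration of the timing setup described above, here is a small Python sketch that assembles an `evc-train` command line with those overrides. It assumes the overrides are passed as trailing `key=value` arguments, the same way `exp_name=...` is passed in the original command. `build_train_command` is a hypothetical helper, not part of the codebase. The override for `epochs` is left out because its full dotted key is not spelled out in this thread.

```python
import shlex


def build_train_command(configs, overrides):
    """Compose an evc-train command line (hypothetical helper for illustration).

    configs:   list of YAML config paths, joined with commas after -c
    overrides: dict of key=value overrides appended after the configs
    """
    cmd = ["evc-train", "-c", ",".join(configs)]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return shlex.join(cmd)


# Overrides named in this thread for measuring training time.
timing_overrides = {
    "runner_cfg.eval_ep": 800,        # evaluate only every 800 epochs
    "runner_cfg.log_interval": 100,   # report training metrics every 100 iterations
}

command = build_train_command(
    ["configs/exps/4k4d/4k4d_0013_01_r4.yaml"],
    timing_overrides,
)
print(command)
```

The printed string can be pasted into a shell. Any further override should be checked against the actual key names in r4dv.yaml first.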

The iteration speed looks fine (60-70 ms), though. I'm not sure why two experiments show up; the VRAM usage seems OK.
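For a back-of-the-envelope check, the iteration counts and the 60-70 ms per iteration mentioned in this thread can be turned into an estimate of pure optimization time. This is only a sketch. It assumes the per-iteration time is constant and applies to both the dynamic and static runs, and it ignores evaluation, logging and data loading. `estimate_hours` is a hypothetical helper.

```python
def estimate_hours(iterations: int, ms_per_iter: float) -> float:
    """Pure training time in hours, ignoring evaluation, logging and data loading."""
    return iterations * ms_per_iter / 1000.0 / 3600.0


# Iteration counts quoted in this thread.
SCENARIOS = {
    "dynamic, released default (800k iterations)": 800_000,
    "dynamic, typical convergence (400k iterations)": 400_000,
    "static, upper end (3k iterations)": 3_000,
}

for name, iterations in SCENARIOS.items():
    fast = estimate_hours(iterations, 60.0)  # at 60 ms per iteration
    slow = estimate_hours(iterations, 70.0)  # at 70 ms per iteration
    print(f"{name}: {fast:.2f} h to {slow:.2f} h")
```

Under these assumptions, 800k iterations come to roughly 13 to 16 hours and 3k iterations to a few minutes. Both are well below the 4 days and 40 minutes reported above, which would be consistent with much of the wall-clock time going to evaluation and logging rather than to the training iterations themselves.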

Another way to speed up training is to use our latest CUDA-backend implementation, which you can enable via this option: https://github.com/zju3dv/4K4D/blob/712eccb0e0eeef744c19eb221cfb424a2915b474/easyvolcap/models/samplers/r4dv_sampler.py#L43C18-L43C27

As for the training PSNR, the 0013_01 scene is the hardest of the four scenes in the DNA-Rendering dataset, so its training PSNR is slightly lower.