whwjdqls / 4D-Gaussian-Head

YAICON 3rd project page - 4D Gaussian for Head Reconstruction

Some issue #5

Open amerssun opened 10 months ago

amerssun commented 10 months ago

Hi! 😊 I've run some tests with the algorithm but I'm still hitting a few roadblocks, and I could use your expertise.

train_sample_rate and test_sample_rate don't seem to be specified anywhere, so I simply set them to 1 and None. Could that be affecting the results?

I'm training on an A10 GPU with 22GB of VRAM, but it's slower than I expected: with the Yufeng dataset, training takes about 5 hours. On top of that, VRAM overflows in the second phase and training crashes, so I limited the number of points to 100,000. The result does show a recognizable face shape, but the quality is poor. Could the hardware be the bottleneck for the final result?

Any insights would be greatly appreciated! 💻🚀

whwjdqls commented 10 months ago

As far as I know, the Yufeng dataset does not contain FLAME parameters. You should use a reconstruction method to extract FLAME expression parameters and use them as the condition. Also, I recommend using a shorter dataset, as the Yufeng dataset is quite long. The results in the README were obtained with the NerFace dataset, cut down to the 100~150 frames we used for training.
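
If it helps, something like this is enough to cut a clip down (just a sketch, assuming a NeRF-style transforms JSON with a top-level "frames" list; adjust the paths and keys to the actual NerFace layout):

```python
# Minimal sketch: trim a NeRF-style transforms JSON to the first N frames.
# The "frames" key and the file names are assumptions about the dataset layout.
import json

def trim_transforms(src_path: str, dst_path: str, max_frames: int = 150) -> None:
    with open(src_path, "r") as f:
        meta = json.load(f)
    meta["frames"] = meta["frames"][:max_frames]  # keep only the first short clip
    with open(dst_path, "w") as f:
        json.dump(meta, f, indent=2)

trim_transforms("transforms_train.json", "transforms_train_150.json", max_frames=150)
```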

amerssun commented 9 months ago

Hello, I've been running tests intensively these days, but I still ran into many problems, so I have to ask for your help again. Have you ever encountered this kind of issue during training? I followed your advice and switched to the NerFace dataset, but its original JSON lacks the "time" field. I substituted it with index / max_time_float, but this results in flickering and jittering of the subject's head. I'm not sure what the reason is.

https://github.com/whwjdqls/4D-Gaussian-Head/assets/14248498/25eaafdb-6a0f-4b2e-bd5c-d285e7d652ee
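
For reference, this is roughly how I filled in the missing field (just a sketch; the JSON keys here are only what my copy of the data uses):

```python
# Sketch of the substitution described above: fill a missing per-frame "time"
# value with index / (num_frames - 1), so time is normalized to [0, 1].
# The "frames" and "time" keys are assumptions about the JSON layout.
import json

with open("transforms_train.json", "r") as f:
    meta = json.load(f)

num_frames = len(meta["frames"])
for idx, frame in enumerate(meta["frames"]):
    frame["time"] = idx / max(num_frames - 1, 1)

with open("transforms_train_with_time.json", "w") as f:
    json.dump(meta, f, indent=2)
```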

amerssun commented 9 months ago

Ah, sorry, that was my fault! Thank you for your advice~

sstzal commented 7 months ago

Hi, whwjdqls!

I was wondering whether training with more data (1000+ frames) could lead to poor performance. I observed this phenomenon in my experiments, but I was confused about why, since in my past experience more data should lead to better results.

Can you help me with this puzzle?

whwjdqls commented 7 months ago

Hi sstzal,

You are correct: longer clips lead to poorer performance. This is because the model is not a generalizable model; rather, we are overfitting a single scene to it. Longer clips mean more images to overfit, so the model's performance drops. I recommend using a generalizable model, or a model with stronger facial priors (e.g., rigging the Gaussians to a mesh).

I think tuning the hyperparameters of the deformation model will help as well, as they are currently set for short clips :)
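
For reference, these are the kinds of knobs I mean (the names below are only illustrative, not the actual arguments in this repo):

```python
# Illustrative only -- these are NOT the actual config names in this repo.
# The idea: longer clips usually need more deformation capacity, more
# temporal positional-encoding frequencies, and more training iterations.
from dataclasses import dataclass

@dataclass
class DeformHyperParams:
    net_width: int = 128      # hidden width of the deformation MLP
    net_depth: int = 6        # number of hidden layers
    time_pe_freqs: int = 6    # positional-encoding frequencies for the time input
    iterations: int = 30_000  # total training iterations

def scale_for_clip_length(num_frames: int, base_frames: int = 150) -> DeformHyperParams:
    """Rough heuristic: grow capacity and iterations with clip length."""
    scale = max(1.0, num_frames / base_frames)  # defaults assume ~150-frame clips
    return DeformHyperParams(
        net_width=int(128 * min(scale, 2.0)),
        net_depth=6 if scale < 4 else 8,
        time_pe_freqs=6 + int(scale ** 0.5),
        iterations=int(30_000 * scale),
    )

print(scale_for_clip_length(1000))
```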

sstzal commented 7 months ago

Thank you for your quick reply! It was very helpful.

Another puzzle, then: with such a small amount of data (100~150 frames), can the network really learn the mapping from expression to facial motion well? I would expect such a mapping to be hard to learn.

whwjdqls commented 7 months ago

Assuming that by "expression" you mean the FLAME expression parameters and by "facial motion" the deformation of each Gaussian, I frankly think we are not learning such a mapping. The main goal of dynamic Gaussians is to overfit a scene using Gaussians and a deformation model. A vanilla dynamic Gaussian model uses the timestamp T, which is not correlated with the motion but is used only as a timestamp, to differentiate one moment from another. Using FLAME expression parameters as the condition, since they actually are correlated with the facial motion (from the canonical space to a certain timestamp), helps the model reconstruct better images, because it gives additional information about what expression the person has at that timestamp. Since we used the FLAME parameters of the test set as the condition at test time, from a reconstruction-task perspective this could be seen as cheating; from a controllable-avatar perspective, however, it is controllability.

So to sum up: the mapping from expression to facial motion matters for controllable avatar tasks, not for reconstruction. In this repo we obtain better reconstructions by using that additional information, but it is not really a fair comparison.
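
To make the conditioning concrete, here is a minimal sketch (not this repo's exact model; the 50-dim expression vector is just a common FLAME choice) of a deformation MLP that sees the expression parameters in addition to position and time:

```python
# Illustrative sketch of an expression-conditioned deformation model.
# The canonical Gaussian position, the timestamp, and the FLAME expression
# vector are concatenated and mapped to a per-Gaussian xyz offset.
import torch
import torch.nn as nn

class ExprConditionedDeform(nn.Module):
    def __init__(self, expr_dim: int = 50, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 1 + expr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # offset from canonical space at this timestamp
        )

    def forward(self, xyz: torch.Tensor, t: torch.Tensor, expr: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3), t: (N, 1), expr: (N, expr_dim)  ->  (N, 3) deformation
        return self.mlp(torch.cat([xyz, t, expr], dim=-1))

# Usage: deform N canonical Gaussians to the frame at timestamp t with expression expr.
deform = ExprConditionedDeform()
offsets = deform(torch.randn(1024, 3), torch.rand(1024, 1), torch.randn(1024, 50))
```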

These are my thoughts; feel free to discuss more avatar-related topics, as they are a lot of fun :)