zhangzc21 / DynTet


Unstable head talking. #11

Open hereTac opened 2 months ago

hereTac commented 2 months ago

Here is the original video file (approximately 16 MB). I have uploaded it as a temporary file that expires on May 9, 2024. Download here.

The generated video has a shaky head. Are there any suggestions on how to stabilize the head position? https://github.com/zhangzc21/DynTet/assets/8244097/ad22e569-ac15-4044-b482-9d9877749194

hereTac commented 2 months ago
The model was trained for 20,000 iterations.

| PSNR | LPIPS (alex) | LMD mouth (fan) | LMD eye (fan) | id similarity (arcface) | name |
| --- | --- | --- | --- | --- | --- |
| 34.864698 | 0.012997 | 1.827579 | 1.575581 | 0.992161 | |
zhangzc21 commented 2 months ago

Hi @hereTac, since the quantitative results look good, I think the shaky visual results and the black color around the joint largely result from poor pose estimation during data pre-processing. You can first check the pose estimation by changing https://github.com/zhangzc21/DynTet/blob/87c580894cdcf77223409c37dbcdab66770042c9/tools/viz_tracking.py#L86 to the validation frame index and https://github.com/zhangzc21/DynTet/blob/87c580894cdcf77223409c37dbcdab66770042c9/tools/viz_tracking.py#L187 to the video_dir, and then running tools/viz_tracking.py to draw the estimated 3DMM head.
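For reference, a minimal sketch of those two edits, assuming the `Selected` slice at line 86 controls which frames are visualized and `video_dir` at line 187 points to the pre-processed video folder (exact names may differ in the current script):

```python
# tools/viz_tracking.py -- hypothetical sketch of the two changes
# (line references are to commit 87c5808; variable names are assumed).

# Around L86: restrict visualization to the validation frames,
# e.g. the last ~70 frames of a 771-frame video.
Selected = slice(700, 771, 1)

# Around L187: point the script at your pre-processed video directory.
video_dir = 'data/<your_video>'
```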

If the pose estimation goes well, you can try changing the strength of the smoothing (blur) applied to the pose path used for inference:
https://github.com/zhangzc21/DynTet/blob/87c580894cdcf77223409c37dbcdab66770042c9/dataset/dataset_face.py#L53
https://github.com/zhangzc21/DynTet/blob/87c580894cdcf77223409c37dbcdab66770042c9/infer.py#L147
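For illustration only, here is a minimal, hypothetical sketch of how a stronger temporal filter on the pose trajectory reduces jitter. The actual smoothing parameters live at the two lines referenced above; the function below is not part of the repository.

```python
import torch
import torch.nn.functional as F

def smooth_pose_path(poses: torch.Tensor, kernel_size: int = 11) -> torch.Tensor:
    """Moving-average filter over time for a (T, C) pose trajectory.

    A larger kernel_size gives a smoother (but less responsive) head path.
    """
    assert kernel_size % 2 == 1, "use an odd kernel so the output stays aligned"
    pad = kernel_size // 2
    x = poses.t().unsqueeze(1)                   # (C, 1, T)
    x = F.pad(x, (pad, pad), mode='replicate')   # replicate ends to avoid shrinkage
    kernel = torch.ones(1, 1, kernel_size,
                        dtype=poses.dtype, device=poses.device) / kernel_size
    return F.conv1d(x, kernel).squeeze(1).t()    # back to (T, C)
```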

hereTac commented 2 months ago

How can I determine if the results are good?

https://github.com/zhangzc21/DynTet/assets/8244097/adaf1e0d-ea26-4c1d-a783-cd41f4c49b0c

My video contains 770 images in total, and I used `Selected = slice(0, 771, 1)` in the code. Here's the program's output log:

```
(dyntet) root@fb4ccd87f92c:/mnt/xxxx/DynTet# python tools/viz_tracking.py
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 771/771 [06:46<00:00,  1.90it/s]
ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 31.100 / 56. 31.100
  libavcodec     58. 54.100 / 58. 54.100
  libavformat    58. 29.100 / 58. 29.100
  libavdevice    58.  8.100 / 58.  8.100
  libavfilter     7. 57.100 /  7. 57.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  5.100 /  5.  5.100
  libswresample   3.  5.100 /  3.  5.100
  libpostproc    55.  5.100 / 55.  5.100
Input #0, image2, from 'tools/viz_tracking/head/*.png':
  Duration: 00:00:30.84, start: 0.000000, bitrate: N/A
    Stream #0:0: Video: png, rgb24(pc), 1072x1440, 25 fps, 25 tbr, 25 tbn, 25 tbc
Stream mapping:
  Stream #0:0 -> #0:0 (png (native) -> h264 (libx264))
Press [q] to stop, [?] for help
[libx264 @ 0x55ff264a0780] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
[libx264 @ 0x55ff264a0780] profile High, level 4.0
[libx264 @ 0x55ff264a0780] 264 - core 155 r2917 0a84d98 - H.264/MPEG-4 AVC codec - Copyleft 2003-2018 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=45 lookahead_threads=7 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to 'tools/viz_tracking/head/output.mp4':
  Metadata:
    encoder         : Lavf58.29.100
    Stream #0:0: Video: h264 (libx264) (avc1 / 0x31637661), yuv420p, 1072x1440, q=-1--1, 25 fps, 12800 tbn, 25 tbc
    Metadata:
      encoder         : Lavc58.54.100 libx264
    Side data:
      cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: -1
frame=  771 fps= 72 q=-1.0 Lsize=    5959kB time=00:00:30.72 bitrate=1589.1kbits/s speed=2.86x    
video:5949kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.166714%
[libx264 @ 0x55ff264a0780] frame I:4     Avg QP:18.03  size: 36898
[libx264 @ 0x55ff264a0780] frame P:194   Avg QP:21.64  size: 15019
[libx264 @ 0x55ff264a0780] frame B:573   Avg QP:24.07  size:  5288
[libx264 @ 0x55ff264a0780] consecutive B-frames:  0.9%  0.0%  0.0% 99.1%
[libx264 @ 0x55ff264a0780] mb I  I16..4: 10.3% 84.6%  5.1%
[libx264 @ 0x55ff264a0780] mb P  I16..4:  2.0%  5.9%  0.3%  P16..4: 31.6%  7.4%  3.3%  0.0%  0.0%    skip:49.4%
[libx264 @ 0x55ff264a0780] mb B  I16..4:  0.1%  0.3%  0.1%  B16..8: 23.9%  1.6%  0.2%  direct: 0.6%  skip:73.1%  L0:47.5% L1:50.9% BI: 1.6%
[libx264 @ 0x55ff264a0780] 8x8 transform intra:72.6% inter:85.0%
[libx264 @ 0x55ff264a0780] coded y,uvDC,uvAC intra: 32.6% 34.3% 8.7% inter: 5.4% 9.5% 1.9%
[libx264 @ 0x55ff264a0780] i16 v,h,dc,p: 29% 17% 11% 43%
[libx264 @ 0x55ff264a0780] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 30% 17% 38%  2%  2%  3%  3%  2%  2%
[libx264 @ 0x55ff264a0780] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 18% 24% 27%  6%  6%  5%  6%  3%  4%
[libx264 @ 0x55ff264a0780] i8c dc,h,v,p: 69% 15% 14%  2%
[libx264 @ 0x55ff264a0780] Weighted P-Frames: Y:0.0% UV:0.0%
[libx264 @ 0x55ff264a0780] ref P L0: 56.1%  8.1% 26.3%  9.6%
[libx264 @ 0x55ff264a0780] ref B L0: 82.9% 13.8%  3.3%
[libx264 @ 0x55ff264a0780] ref B L1: 92.8%  7.2%
[libx264 @ 0x55ff264a0780] kb/s:1580.07
```
zhangzc21 commented 2 months ago

The semi-transparent masks and red landmarks are generated based on 3DMM pose estimation. You can see that the masks and landmarks are also shaky (just like the shaky head in your inference video) and not precise, which indicates the pose estimation is not very good.

hereTac commented 2 months ago

For now, the pose estimation is not good enough. Do you have any advice on how to improve it?

Yes, the tracking is a bit unstable in the video. I also tried training the model on the Obama video, but the results are still not as good as those in your project video: the tracking is still shaky and not as stable as in your demonstration. I'm unable to reproduce results like your generated video by following the steps in your guide. Here's the generated Obama video.

https://github.com/zhangzc21/DynTet/assets/8244097/1ff42662-d1b1-4548-9ccc-fdf87a3e279d

By the way, I also tried changing the background, but the tracking is still unstable.

zhangzc21 commented 2 months ago

Could you kindly provide the quantitative results of the Obama video?

By the way, I have not encountered half-closed eyes or a shaky ear region after using the elastic scores. If you only trained the model for 20k iterations, I recommend training for 40k iterations; it seems the training hasn't converged yet, and more iterations will help ensure convergence.