hereTac opened this issue 2 months ago
The model was trained with iter: 20000.

| PSNR | LPIPS (alex) | LMD mouth (fan) | LMD eye (fan) | ID similarity (ArcFace) | name |
|---|---|---|---|---|---|
| 34.864698 | 0.012997 | 1.827579 | 1.575581 | 0.992161 | |
Hi, @hereTac, since the quantitative results seem good, I think the shaky visual results and black color around the joint largely result from the poor pose estimation in data pre-processing. You can first check the pose estimation by modifying
https://github.com/zhangzc21/DynTet/blob/87c580894cdcf77223409c37dbcdab66770042c9/tools/viz_tracking.py#L86 to the validation frame index, and
https://github.com/zhangzc21/DynTet/blob/87c580894cdcf77223409c37dbcdab66770042c9/tools/viz_tracking.py#L187 to the video_dir, then running the script tools/viz_tracking.py
to draw the estimated 3DMM head.
If the pose estimation goes well, you can then try changing the blur strength used to smooth the pose path for inference:
https://github.com/zhangzc21/DynTet/blob/87c580894cdcf77223409c37dbcdab66770042c9/dataset/dataset_face.py#L53
https://github.com/zhangzc21/DynTet/blob/87c580894cdcf77223409c37dbcdab66770042c9/infer.py#L147
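The "blur strength" above controls how strongly the per-frame pose parameters are low-pass filtered before inference. As a rough illustration of the idea only (a sketch with a hypothetical `smooth_pose` helper, not the repository's actual code), Gaussian smoothing of a pose trajectory can be written as:

```python
import numpy as np

def smooth_pose(poses: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Smooth a (T, D) pose trajectory along the time axis with a Gaussian kernel."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    # Pad with edge values so the first/last frames are not pulled toward zero.
    padded = np.pad(poses, ((radius, radius), (0, 0)), mode="edge")
    out = np.empty_like(poses, dtype=float)
    for d in range(poses.shape[1]):
        out[:, d] = np.convolve(padded[:, d], kernel, mode="valid")
    return out
```

A larger `sigma` gives a steadier head but lags behind fast motion; in DynTet the equivalent knob is set at the two linked lines above.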
How can I determine if the results are good?
https://github.com/zhangzc21/DynTet/assets/8244097/adaf1e0d-ea26-4c1d-a783-cd41f4c49b0c
My video contains 770 images in total, and I used `Selected = slice(0, 771, 1)`
in the code. Here's the program's output log:
```
(dyntet) root@fb4ccd87f92c:/mnt/xxxx/DynTet# python tools/viz_tracking.py
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 771/771 [06:46<00:00, 1.90it/s]
ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil 56. 31.100 / 56. 31.100
libavcodec 58. 54.100 / 58. 54.100
libavformat 58. 29.100 / 58. 29.100
libavdevice 58. 8.100 / 58. 8.100
libavfilter 7. 57.100 / 7. 57.100
libavresample 4. 0. 0 / 4. 0. 0
libswscale 5. 5.100 / 5. 5.100
libswresample 3. 5.100 / 3. 5.100
libpostproc 55. 5.100 / 55. 5.100
Input #0, image2, from 'tools/viz_tracking/head/*.png':
Duration: 00:00:30.84, start: 0.000000, bitrate: N/A
Stream #0:0: Video: png, rgb24(pc), 1072x1440, 25 fps, 25 tbr, 25 tbn, 25 tbc
Stream mapping:
Stream #0:0 -> #0:0 (png (native) -> h264 (libx264))
Press [q] to stop, [?] for help
[libx264 @ 0x55ff264a0780] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
[libx264 @ 0x55ff264a0780] profile High, level 4.0
[libx264 @ 0x55ff264a0780] 264 - core 155 r2917 0a84d98 - H.264/MPEG-4 AVC codec - Copyleft 2003-2018 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=45 lookahead_threads=7 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to 'tools/viz_tracking/head/output.mp4':
Metadata:
encoder : Lavf58.29.100
Stream #0:0: Video: h264 (libx264) (avc1 / 0x31637661), yuv420p, 1072x1440, q=-1--1, 25 fps, 12800 tbn, 25 tbc
Metadata:
encoder : Lavc58.54.100 libx264
Side data:
cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: -1
frame= 771 fps= 72 q=-1.0 Lsize= 5959kB time=00:00:30.72 bitrate=1589.1kbits/s speed=2.86x
video:5949kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.166714%
[libx264 @ 0x55ff264a0780] frame I:4 Avg QP:18.03 size: 36898
[libx264 @ 0x55ff264a0780] frame P:194 Avg QP:21.64 size: 15019
[libx264 @ 0x55ff264a0780] frame B:573 Avg QP:24.07 size: 5288
[libx264 @ 0x55ff264a0780] consecutive B-frames: 0.9% 0.0% 0.0% 99.1%
[libx264 @ 0x55ff264a0780] mb I I16..4: 10.3% 84.6% 5.1%
[libx264 @ 0x55ff264a0780] mb P I16..4: 2.0% 5.9% 0.3% P16..4: 31.6% 7.4% 3.3% 0.0% 0.0% skip:49.4%
[libx264 @ 0x55ff264a0780] mb B I16..4: 0.1% 0.3% 0.1% B16..8: 23.9% 1.6% 0.2% direct: 0.6% skip:73.1% L0:47.5% L1:50.9% BI: 1.6%
[libx264 @ 0x55ff264a0780] 8x8 transform intra:72.6% inter:85.0%
[libx264 @ 0x55ff264a0780] coded y,uvDC,uvAC intra: 32.6% 34.3% 8.7% inter: 5.4% 9.5% 1.9%
[libx264 @ 0x55ff264a0780] i16 v,h,dc,p: 29% 17% 11% 43%
[libx264 @ 0x55ff264a0780] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 30% 17% 38% 2% 2% 3% 3% 2% 2%
[libx264 @ 0x55ff264a0780] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 18% 24% 27% 6% 6% 5% 6% 3% 4%
[libx264 @ 0x55ff264a0780] i8c dc,h,v,p: 69% 15% 14% 2%
[libx264 @ 0x55ff264a0780] Weighted P-Frames: Y:0.0% UV:0.0%
[libx264 @ 0x55ff264a0780] ref P L0: 56.1% 8.1% 26.3% 9.6%
[libx264 @ 0x55ff264a0780] ref B L0: 82.9% 13.8% 3.3%
[libx264 @ 0x55ff264a0780] ref B L1: 92.8% 7.2%
[libx264 @ 0x55ff264a0780] kb/s:1580.07
```
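On the `Selected = slice(0, 771, 1)` setting mentioned above: Python slices are clipped to the sequence length, so a stop index past the end selects everything without raising an error (the names below are illustrative, not from the repository):

```python
frames = list(range(770))            # 770 images, indices 0..769
selected = frames[slice(0, 771, 1)]  # equivalent to frames[0:771:1]
print(len(selected))                 # slicing past the end is clipped, prints 770
```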
The semi-transparent masks and red landmarks are generated based on 3DMM pose estimation. You can see that the masks and landmarks are also shaky (just like the shaky head in your inference video) and not precise, which indicates the pose estimation is not very good.
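To put a rough number on that shakiness (a hypothetical helper for illustration, not part of DynTet), one can measure the mean frame-to-frame displacement of the tracked landmarks:

```python
import numpy as np

def landmark_jitter(landmarks: np.ndarray) -> float:
    """Mean frame-to-frame displacement of a (T, N, 2) landmark track.

    Larger values mean shakier tracking; a well-tracked static head
    should score close to zero.
    """
    diffs = np.diff(landmarks, axis=0)           # (T-1, N, 2) displacements
    return float(np.linalg.norm(diffs, axis=-1).mean())
```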
For now, the pose estimation is not good enough. Do you have any advice on how to improve it?
Yes, the tracking is a bit unstable in the video. I also tried training the model on the Obama video, but the results were still not as good as those from your project video. The tracking remains shaky, not as stable as in your demonstration, and I'm unable to replicate results like your generated video by following the steps in your guide. Here's the generated Obama video.
https://github.com/zhangzc21/DynTet/assets/8244097/1ff42662-d1b1-4548-9ccc-fdf87a3e279d
By the way, I also tried changing the background, but the tracking is still unstable.
Could you kindly provide the quantitative results of the Obama video?
By the way, I have not encountered the half-closed eyes or the shaky ear region after using elastic scores. If you trained the model for only 20k iterations, I recommend training it for 40k iterations; it seems the training hasn't converged yet, and more iterations will ensure convergence.
Here is the original video file (approximately 16 MB). I added it as a temporary file that expires on May 9, 2024. Download here.
The generated video has a shaky head. Are there any suggestions on how to stabilize the head position? https://github.com/zhangzc21/DynTet/assets/8244097/ad22e569-ac15-4044-b482-9d9877749194