yochaiye / LipVoicer

Official Code implementation for the ICLR paper "LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading"
MIT License
40 stars 7 forks

Windows 10 deployment error: No module named 'ctcdecode' #2

Open jacksinofn opened 5 months ago

jacksinofn commented 5 months ago

E:\AudioText\CycleLip-Project-main\LipVoicer>python inference_real_video.py
No module named 'ctcdecode'
melgen:
  name: melgen
  in_channels: 80
  out_channels: 80
  diffusion_step_embed_dim_in: 128
  diffusion_step_embed_dim_mid: 512
  diffusion_step_embed_dim_out: 512
  res_channels: 512
  skip_channels: 512
  num_res_layers: 12
  dilation_cycle: 1
  mel_upsample:

Lipreading configuration file loaded. AudioVisualModel Parameters: 57.309632M Successfully loaded MelGen checkpoint saving to output directory results\221 Loading ASR, tokenizer and decoder Rank 0: Model loaded at step 12720 Rank 0: Model loaded at step 0 saving to output directory results\221\w1=2_w2=1.5_asr_start=270 Cropping lip region and predicting text Converting fps to 25Hz ffmpeg version 2023-04-06-git-b564ad8eac-essentials_build-www.gyan.dev Copyright (c) 2000-2023 the FFmpeg developers built with gcc 12.2.0 (Rev10, Built by MSYS2 project) configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-zlib --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-sdl2 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-libfreetype --enable-libfribidi --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-ffnvcodec --enable-nvdec --enable-nvenc --enable-d3d11va --enable-dxva2 --enable-libvpl --enable-libgme --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libtheora --enable-libvo-amrwbenc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-librubberband libavutil 58. 6.100 / 58. 6.100 libavcodec 60. 9.100 / 60. 9.100 libavformat 60. 4.101 / 60. 4.101 libavdevice 60. 2.100 / 60. 2.100 libavfilter 9. 5.100 / 9. 5.100 libswscale 7. 2.100 / 7. 2.100 libswresample 4. 11.100 / 4. 11.100 libpostproc 57. 2.100 / 57. 2.100 [mov,mp4,m4a,3gp,3g2,mj2 @ 000001962323e6c0] Unknown cover type: 0x1. 
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'C:/Users/Administrator/Desktop/221.mp4': Metadata: major_brand : isom minor_version : 512 compatible_brands: isomiso2avc1mp41 creation_time : 2024-04-16T05:38:43.000000Z Hw : 1 bitrate : 12000000 maxrate : 0 te_is_reencode : 1 encoder : Lavf58.76.100 Duration: 00:00:03.32, start: 0.000000, bitrate: 8762 kb/s Stream #0:00x1: Video: h264 (Main) (avc1 / 0x31637661), yuv420p(tv, bt709, progressive), 1440x1080 [SAR 1:1 DAR 4:3], 8799 kb/s, 30 fps, 30 tbr, 30 tbn (default) Metadata: creation_time : 2024-04-16T05:38:43.000000Z handler_name : VideoHandler vendor_id : [0][0][0][0] Stream #0:10x2: Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 2 kb/s (default) Metadata: creation_time : 2024-04-16T05:38:43.000000Z handler_name : SoundHandler vendor_id : [0][0][0][0] Stream mapping: Stream #0:0 -> #0:0 (h264 (native) -> h264 (libx264)) Stream #0:1 -> #0:1 (aac (native) -> aac (native)) Press [q] to stop, [?] for help [libx264 @ 000001962330e940] using SAR=1/1 [libx264 @ 000001962330e940] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512 [libx264 @ 000001962330e940] profile High, level 4.0, 4:2:0, 8-bit [libx264 @ 000001962330e940] 264 - core 164 r3106 eaa68fa - H.264/MPEG-4 AVC codec - Copyleft 2003-2023 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=18 lookahead_threads=3 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00 Output #0, mp4, to 'C:/Users/Administrator/Desktop/22125fps.mp4': Metadata: major_brand : isom minor_version : 512 
compatible_brands: isomiso2avc1mp41 te_is_reencode : 1 Hw : 1 bitrate : 12000000 maxrate : 0 encoder : Lavf60.4.101 Stream #0:0(und): Video: h264 (avc1 / 0x31637661), yuv420p(tv, bt709, progressive), 1440x1080 [SAR 1:1 DAR 4:3], q=2-31, 25 fps, 12800 tbn (default) Metadata: creation_time : 2024-04-16T05:38:43.000000Z handler_name : VideoHandler vendor_id : [0][0][0][0] encoder : Lavc60.9.100 libx264 Side data: cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 128 kb/s (default) Metadata: creation_time : 2024-04-16T05:38:43.000000Z handler_name : SoundHandler vendor_id : [0][0][0][0] encoder : Lavc60.9.100 aac frame= 83 fps= 76 q=-1.0 Lsize= 794kB time=00:00:03.29 bitrate=1973.8kbits/s speed=3.01x video:789kB audio:1kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.518973% [libx264 @ 000001962330e940] frame I:1 Avg QP:22.21 size: 67758 [libx264 @ 000001962330e940] frame P:22 Avg QP:22.13 size: 23326 [libx264 @ 000001962330e940] frame B:60 Avg QP:26.27 size: 3780 [libx264 @ 000001962330e940] consecutive B-frames: 1.2% 2.4% 14.5% 81.9% [libx264 @ 000001962330e940] mb I I16..4: 7.5% 78.5% 14.0% [libx264 @ 000001962330e940] mb P I16..4: 0.7% 2.2% 0.2% P16..4: 53.1% 20.0% 8.1% 0.0% 0.0% skip:15.7% [libx264 @ 000001962330e940] mb B I16..4: 0.1% 0.1% 0.0% B16..8: 34.0% 1.5% 0.1% direct: 0.3% skip:63.9% L0:49.7% L1:48.7% BI: 1.6% [libx264 @ 000001962330e940] 8x8 transform intra:73.1% inter:83.0% [libx264 @ 000001962330e940] coded y,uvDC,uvAC intra: 70.4% 78.8% 19.3% inter: 10.1% 13.1% 0.0% [libx264 @ 000001962330e940] i16 v,h,dc,p: 16% 23% 25% 36% [libx264 @ 000001962330e940] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 17% 27% 18% 5% 7% 5% 10% 5% 7% [libx264 @ 000001962330e940] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 21% 29% 13% 4% 9% 6% 11% 4% 4% [libx264 @ 000001962330e940] i8c dc,h,v,p: 50% 22% 19% 8% [libx264 @ 000001962330e940] Weighted P-Frames: Y:18.2% UV:4.5% [libx264 @ 
000001962330e940] ref P L0: 71.8% 11.1% 15.1% 1.9% 0.1%
[libx264 @ 000001962330e940] ref B L0: 95.7% 3.7% 0.6%
[libx264 @ 000001962330e940] ref B L1: 99.1% 0.9%
[libx264 @ 000001962330e940] kb/s:1946.32
[aac @ 00000196237db200] Qavg: 65536.000
'mv' is not recognized as an internal or external command, operable program or batch file.
Cropping mouth region
Error executing job with overrides: []
Traceback (most recent call last):
  File "E:\AudioText\CycleLip-Project-main\LipVoicer\inference_real_video.py", line 270, in main
    generate(
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "E:\AudioText\CycleLip-Project-main\LipVoicer\inference_real_video.py", line 198, in generate
    mouthroi, text = crop_and_infer.main(generate_cfg["video_path"], output_directory)
  File "E:\AudioText\CycleLip-Project-main\LipVoicer\mouthroi_processing\crop_and_infer.py", line 45, in main
    pipeline = InferencePipeline(config_filename, device='cuda', detector=detector, face_track=True)
  File "E:\AudioText\CycleLip-Project-main\LipVoicer\mouthroi_processing\pipelines\pipeline.py", line 45, in __init__
    self.model = AVSR(modality, model_path, model_conf, rnnlm, rnnlm_conf, penalty, ctc_weight, lm_weight, beam_size, device)
  File "E:\AudioText\CycleLip-Project-main\LipVoicer\mouthroi_processing\pipelines\model.py", line 43, in __init__
    self.token_list = [''] + [word.split()[0] for word in open(file_path).read().splitlines()] + ['']
UnicodeDecodeError: 'gbk' codec can't decode byte 0x81 in position 4416: illegal multibyte sequence

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
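Note on the log above: the run actually stops on two Windows-specific problems rather than on the ctcdecode message. The `mv` shell command does not exist in cmd.exe, and `open(file_path)` without an explicit encoding falls back to the system locale codec (here `gbk`), which cannot decode the token file. Below is a minimal sketch of portable replacements, assuming the token file is UTF-8 encoded. `move_file` and `load_token_list` are hypothetical helper names, not functions from the repository, and the sentinel entries that the original line adds at both ends of the list are left out.

```python
import shutil


def move_file(src: str, dst: str) -> None:
    """Portable stand-in for shelling out to `mv`, which cmd.exe does not provide."""
    shutil.move(src, dst)


def load_token_list(file_path: str) -> list[str]:
    """Return the first whitespace-separated field of every non-empty line.

    Assumption: the file is UTF-8. Passing the encoding explicitly keeps Python
    from falling back to the locale codec (e.g. gbk on a Chinese Windows install).
    """
    with open(file_path, encoding="utf-8") as f:
        return [line.split()[0] for line in f.read().splitlines() if line.strip()]
```

If the file turns out to use a different encoding, only the `encoding=` argument needs to change. Setting the environment variable `PYTHONUTF8=1` before launching Python (3.7 or later) is another way to make UTF-8 the default without touching the code.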

yochaiye commented 5 months ago

Did you install the ctcdecode package? It seems to be missing.
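One quick way to answer this from the same Python environment that runs the script is to ask the import system whether it can locate the module. This is only a sketch; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """True if a top-level module with this name can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("ctcdecode installed:", is_installed("ctcdecode"))
```

If this prints `False`, the interpreter in use cannot see the package, whatever may be installed in other environments.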

newgenai79 commented 1 month ago

The ctcdecode package is not supported on Windows.

yochaiye commented 1 month ago

Please have a look at the open Windows issues here https://github.com/parlance/ctcdecode/issues?q=is%3Aissue+is%3Aopen+windows and see if any of them helps.

newgenai79 commented 1 month ago

@yochaiye Yes, I checked that repository. Several open issues report that it can't be installed on Windows, and I couldn't find a solution in any of them.

So because of ctcdecode, Windows users won't be able to try LipVoicer.

Could this package be replaced by fast-ctc-decode?

https://github.com/nanoporetech/fast-ctc-decode/

Python wheels for it are available for Windows: https://github.com/nanoporetech/fast-ctc-decode/issues/21

yochaiye commented 1 month ago

I hope that the repositories that you mentioned can resolve the issue. I've also noticed that there is a relevant PyTorch module https://pytorch.org/audio/main/generated/torchaudio.models.decoder.CTCDecoder.html

Unfortunately, I currently don't have the capacity to check these alternatives as I have a few deadlines coming up. Hopefully I will find the time afterwards.
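Until one of these alternatives has been checked, a dependency-free stopgap would be plain greedy (best-path) CTC decoding. The sketch below is not a drop-in replacement for ctcdecode, which performs beam search, and it is not how LipVoicer decodes. It only illustrates the CTC collapse rule (merge repeated labels, then drop blanks). The function name, the `labels` argument, and the choice of index 0 for the blank are all assumptions made for illustration.

```python
from typing import Sequence


def greedy_ctc_decode(
    frame_scores: Sequence[Sequence[float]],
    labels: Sequence[str],
    blank: int = 0,
) -> str:
    """Best-path CTC decoding: per-frame argmax, merge repeats, remove blanks."""
    # Index of the highest-scoring label in every frame.
    best = [max(range(len(frame)), key=lambda i: frame[i]) for frame in frame_scores]

    decoded = []
    previous = None
    for index in best:
        # Emit a label only when it differs from the previous frame and is not blank.
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)
```

For example, with `labels = ["_", "a", "b"]` and frames whose argmax sequence is `1, 1, 0, 1, 2, 2, 0`, the function returns `"aab"`: the blank between the two runs of `a` keeps them as separate characters. Greedy decoding is usually less accurate than beam search, so results would likely differ from the original pipeline.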