whoiswennie / AI-Vtuber

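A highly flexible, end-to-end, customizable AI-VTuber. It supports connecting to Bilibili live rooms, uses the Zhipu API as its base language model, and provides intent recognition and long- and short-term memory (direct memory and associative memory). It supports building a knowledge base and a song library, integrates several currently popular voice conversion, speech synthesis, image generation, and digital-human driving projects, and ships with an easy-to-use client.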
MIT License
289 stars 36 forks

Could you share some trained voice models? #8

Open 1136623363 opened 5 days ago

1136623363 commented 5 days ago

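Could you share some trained voice models? I'm not sure whether this is an SVC version issue, but the pretrained models I downloaded from elsewhere throw an error when I use them: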

Active code page: 65001
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
IMPORTANT: You are using gradio version 3.39.0, however version 4.29.0 is available, please upgrade.
--------
G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\torch\nn\utils\weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
load
load model(s) from pretrain/checkpoint_best_legacy_500.pt
spks: ['ikaros']
Running TTS...
Text: 在此输入要转译的文字。注意,使用该功能建议打开F0预测,不然会很怪, Language: zh-cn, Gender: Female, Rate: +0%, Volume: +0%
Using random zh-cn voice: Microsoft Server Speech Text to Speech Voice (zh-CN, XiaoxiaoNeural)
#=====segment start, 7.6s======
G:\AI-Vtuber\so-vits-svc\inference\infer_tool.py:270: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\torch\nn\functional.py:5476: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
Traceback (most recent call last):
  File "G:\AI-Vtuber\so-vits-svc\webUI.py", line 238, in vc_fn2
    output_file_path = vc_infer(output_format, sid, input_audio, "tts", vc_transform, auto_f0, cluster_ratio, slice_db, noise_scale, pad_seconds, cl_num, lg_num, lgr_num, f0_predictor, enhancer_adaptive_key, cr_threshold, k_step, use_spk_mix, second_encoding, loudness_envelope_adjustment)
  File "G:\AI-Vtuber\so-vits-svc\webUI.py", line 145, in vc_infer
    _audio = model.slice_inference(
  File "G:\AI-Vtuber\so-vits-svc\inference\infer_tool.py", line 470, in slice_inference
    out_audio, out_sr, out_frame = self.infer(spk, tran, raw_path,
  File "G:\AI-Vtuber\so-vits-svc\inference\infer_tool.py", line 297, in infer
    audio,f0 = self.net_g_ms.infer(c, f0=f0, g=sid, uv=uv, predict_f0=auto_predict_f0, noice_scale=noice_scale,vol=vol)
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "G:\AI-Vtuber\so-vits-svc\models.py", line 520, in infer
    x = self.pre(c) * x_mask + self.emb_uv(uv.long()).transpose(1, 2) + vol
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\torch\nn\modules\conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\torch\nn\modules\conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [192, 256, 5], expected input[1, 768, 740] to have 256 channels, but got 768 channels instead

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\gradio\routes.py", line 442, in run_predict
    output = await app.get_blocks().process_api(
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\gradio\blocks.py", line 1392, in process_api
    result = await self.call_function(
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\gradio\blocks.py", line 1097, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\anyio\to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\anyio\_backends\_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\gradio\utils.py", line 703, in wrapper
    response = f(*args, **kwargs)
  File "G:\AI-Vtuber\so-vits-svc\webUI.py", line 243, in vc_fn2
    raise gr.Error(e)
gradio.exceptions.Error: RuntimeError('Given groups=1, weight of size [192, 256, 5], expected input[1, 768, 740] to have 256 channels, but got 768 channels instead')
Running TTS...
Text: 在此输入要转译的文字。注意,使用该功能建议打开F0预测,不然会很怪, Language: zh-cn, Gender: Female, Rate: +0%, Volume: +0%
Using random zh-cn voice: Microsoft Server Speech Text to Speech Voice (zh-CN, XiaoyiNeural)
#=====segment start, 7.3s======
Traceback (most recent call last):
  File "G:\AI-Vtuber\so-vits-svc\webUI.py", line 238, in vc_fn2
    output_file_path = vc_infer(output_format, sid, input_audio, "tts", vc_transform, auto_f0, cluster_ratio, slice_db, noise_scale, pad_seconds, cl_num, lg_num, lgr_num, f0_predictor, enhancer_adaptive_key, cr_threshold, k_step, use_spk_mix, second_encoding, loudness_envelope_adjustment)
  File "G:\AI-Vtuber\so-vits-svc\webUI.py", line 145, in vc_infer
    _audio = model.slice_inference(
  File "G:\AI-Vtuber\so-vits-svc\inference\infer_tool.py", line 470, in slice_inference
    out_audio, out_sr, out_frame = self.infer(spk, tran, raw_path,
  File "G:\AI-Vtuber\so-vits-svc\inference\infer_tool.py", line 297, in infer
    audio,f0 = self.net_g_ms.infer(c, f0=f0, g=sid, uv=uv, predict_f0=auto_predict_f0, noice_scale=noice_scale,vol=vol)
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "G:\AI-Vtuber\so-vits-svc\models.py", line 520, in infer
    x = self.pre(c) * x_mask + self.emb_uv(uv.long()).transpose(1, 2) + vol
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\torch\nn\modules\conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\torch\nn\modules\conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [192, 256, 5], expected input[1, 768, 714] to have 256 channels, but got 768 channels instead

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\gradio\routes.py", line 442, in run_predict
    output = await app.get_blocks().process_api(
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\gradio\blocks.py", line 1392, in process_api
    result = await self.call_function(
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\gradio\blocks.py", line 1097, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\anyio\to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\anyio\_backends\_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "G:\AI-Vtuber\so-vits-svc\miniconda3\lib\site-packages\gradio\utils.py", line 703, in wrapper
    response = f(*args, **kwargs)
  File "G:\AI-Vtuber\so-vits-svc\webUI.py", line 243, in vc_fn2
    raise gr.Error(e)
gradio.exceptions.Error: RuntimeError('Given groups=1, weight of size [192, 256, 5], expected input[1, 768, 714] to have 256 channels, but got 768 channels instead')
whoiswennie commented 4 days ago

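so-vits-svc switched to a different speech encoder in version 4.1, and the model you are using is an old 4.0 model. You can make a 4.0 model work by editing its config.json: add a speech_encoder field to the model section, as follows: "model": { ......... "ssl_dim": 768, "n_speakers": 200, "speech_encoder":"vec768l12" }

If you need voice models but don't want to train them yourself, search for sovits models on Bilibili, for example the full Honkai: Star Rail character model set trained by 白菜工厂1145号员工: https://pan.baidu.com/s/1ltCbnJhR03kHeFQpMIz53Q?pwd=1145 . You can also find related models on Hugging Face; just don't confuse them with GPT-SoVITS models.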
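
The config.json edit described above could be scripted roughly as follows. This is a minimal sketch, not part of the project: the helper names `add_speech_encoder` and `patch_config_file` and the path `configs/config.json` are hypothetical. The encoder name and `ssl_dim` are passed in as arguments because they have to match the checkpoint. The traceback in this issue shows a layer with weights of size [192, 256, 5] receiving 768-channel features, so this particular checkpoint appears to expect 256-dimensional input; for it, the pair `"vec256l9"` / `256` may be the one to try (that is my understanding of the upstream so-vits-svc README's 4.0 compatibility note, so please verify it there).

```python
import json
from pathlib import Path


def add_speech_encoder(config: dict, speech_encoder: str, ssl_dim: int) -> dict:
    """Return a copy of a so-vits-svc config dict with the encoder fields set.

    Hypothetical helper: only touches config["model"]["speech_encoder"]
    and config["model"]["ssl_dim"]; every other key is left unchanged.
    """
    patched = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    model = patched.setdefault("model", {})
    model["ssl_dim"] = ssl_dim
    model["speech_encoder"] = speech_encoder
    return patched


def patch_config_file(path: str, speech_encoder: str, ssl_dim: int) -> None:
    """Rewrite a config.json in place, saving the original next to it as .bak."""
    config_path = Path(path)
    original_text = config_path.read_text(encoding="utf-8")
    backup_path = config_path.with_name(config_path.name + ".bak")
    backup_path.write_text(original_text, encoding="utf-8")
    patched = add_speech_encoder(json.loads(original_text), speech_encoder, ssl_dim)
    config_path.write_text(
        json.dumps(patched, indent=2, ensure_ascii=False), encoding="utf-8"
    )


# Example call (hypothetical path; values assumed from the 256-channel
# expectation visible in the traceback):
# patch_config_file("configs/config.json", "vec256l9", 256)
```

After patching, restart the web UI so the model is reloaded with the new config. If the same channel-mismatch error persists, the encoder and dimension probably still do not match the checkpoint.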