will-rice / denoisers

Simple PyTorch Denoisers for Waveform Audio
Apache License 2.0
32 stars · 1 fork

RuntimeError: Given groups=1, weight of size [24, 1, 15], expected input[1, 2, 163840] to have 1 channels, but got 2 channels instead #21

Closed · magnusviri closed this 7 months ago

magnusviri commented 7 months ago

I'm on macOS 13 with Python 3.11, ffmpeg 6.1.1, and torch/torchaudio 2.2.0. The audio file is almost 21 minutes long and was extracted from this video: https://www.youtube.com/watch?v=8Wdz1Tj5084. I tried both mp3 and wav versions. Your gradio demo errors out on the file too.

This is the full error.

Traceback (most recent call last):
  File "/Users/james/denoisers/test.py", line 20, in <module>
    clean_chunk = model(audio_chunk[None]).audio
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/.local/share/virtualenvs/denoisers-4WN3pNTX/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/.local/share/virtualenvs/denoisers-4WN3pNTX/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/.local/share/virtualenvs/denoisers-4WN3pNTX/lib/python3.11/site-packages/denoisers/modeling/waveunet/model.py", line 156, in forward
    noise = self.model(inputs)
            ^^^^^^^^^^^^^^^^^^
  File "/Users/james/.local/share/virtualenvs/denoisers-4WN3pNTX/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/.local/share/virtualenvs/denoisers-4WN3pNTX/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/.local/share/virtualenvs/denoisers-4WN3pNTX/lib/python3.11/site-packages/denoisers/modeling/waveunet/model.py", line 234, in forward
    out = self.in_conv(inputs)
          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/.local/share/virtualenvs/denoisers-4WN3pNTX/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/.local/share/virtualenvs/denoisers-4WN3pNTX/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/.local/share/virtualenvs/denoisers-4WN3pNTX/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/.local/share/virtualenvs/denoisers-4WN3pNTX/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Given groups=1, weight of size [24, 1, 15], expected input[1, 2, 163840] to have 1 channels, but got 2 channels instead
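For context on what the error is saying: a Conv1d weight of size [24, 1, 15] means the layer was built with in_channels=1 (mono), while the input [1, 2, 163840] is a batch of one 2-channel (stereo) signal. A minimal standalone repro of the same mismatch (the layer here is an assumption modeled on the weight shape in the traceback, not the library's actual code):

```python
import torch

# A layer whose weight has size [24, 1, 15]:
# out_channels=24, in_channels=1, kernel_size=15
conv = torch.nn.Conv1d(in_channels=1, out_channels=24, kernel_size=15)

mono = torch.randn(1, 1, 163840)    # (batch, channels, samples): accepted
stereo = torch.randn(1, 2, 163840)  # 2 channels: triggers the error

out = conv(mono)
print(out.shape)  # 24 output channels, length shrunk by kernel_size - 1

try:
    conv(stereo)
except RuntimeError as err:
    print(err)  # "...expected input[1, 2, 163840] to have 1 channels..."
```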

The code is taken from the project README.md:

import torch
import torchaudio
from denoisers import WaveUNetModel
from tqdm import tqdm

model = WaveUNetModel.from_pretrained("wrice/waveunet-vctk-24khz")

# load the noisy file and resample to the model's rate if needed
audio, sr = torchaudio.load("noisy_audio.wav")
if sr != model.config.sample_rate:
    audio = torchaudio.functional.resample(audio, sr, model.config.sample_rate)

# pad the signal so its length is a whole number of chunks
chunk_size = model.config.max_length
print(model.config)
padding = abs(audio.size(-1) % chunk_size - chunk_size)
padded = torch.nn.functional.pad(audio, (0, padding))

# denoise chunk by chunk, then trim back to the original length
clean = []
for i in tqdm(range(0, padded.shape[-1], chunk_size)):
    audio_chunk = padded[:, i:i + chunk_size]
    with torch.no_grad():
        clean_chunk = model(audio_chunk[None]).audio
    clean.append(clean_chunk.squeeze(0))

denoised = torch.concat(clean, 1)[:, :audio.shape[-1]]
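As a side note, the README's padding line always rounds the length up to the next multiple of chunk_size, and pads a full extra chunk when the length already divides evenly. A quick check of the arithmetic in plain Python:

```python
chunk_size = 163840  # stands in for model.config.max_length in this repro

for length in (163839, 163840, 200000):
    padding = abs(length % chunk_size - chunk_size)
    padded = length + padding
    # the padded length is always an exact number of chunks
    assert padded % chunk_size == 0
    print(length, padding)
```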
will-rice commented 7 months ago

At first glance it looks like the audio is stereo rather than mono. Unfortunately, the models are only trained on mono audio, but I can add support for converting stereo audio to mono.
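Until such support lands, a minimal workaround is to downmix to mono right after loading, before resampling and chunking. A sketch, assuming the model expects a (1, samples) tensor as the error message suggests:

```python
import torch

# stand-in for the (channels, samples) tensor torchaudio.load returns;
# here a fake 2-channel (stereo) signal
audio = torch.randn(2, 163840)

# average the channels into one, keeping the channel dimension
if audio.size(0) > 1:
    audio = audio.mean(dim=0, keepdim=True)

print(audio.shape)  # torch.Size([1, 163840])
```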