zh-plus / openlrc

Transcribe and translate voice into LRC files using Whisper and LLMs (GPT, Claude, et al.).
https://zh-plus.github.io/openlrc/
MIT License

Improve error handling: No active speech found in audio #58

MaleicAcid commented 1 week ago

openlrc version: 1.5.2

When trying to transcribe a video that has no human voice, I get the exception `RuntimeError: stack expects a non-empty TensorList`. I found the following text in the log:

 [2024-09-19 22:48:52] INFO     [Producer_0] Audio length: /home/user00/gitspace/video_tools/.data/no-speech/preprocessed/no-speech_preprocessed.wav: 00:25:14,243
No active speech found in audio

Is it possible for openlrc to handle this situation and end the transcription task early? Generating an empty subtitle file and returning its path as usual might be a reasonable way to handle it; a rough caller-side workaround is sketched below.
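Until then, a caller-side guard could serve as a stopgap. Here is a minimal sketch of what I mean (the `safe_transcribe` wrapper is hypothetical, and I am assuming `LRCer.run` returns the path of the generated subtitle file):

```python
from pathlib import Path

from openlrc import LRCer

def safe_transcribe(audio_path: str) -> str:
    """Hypothetical guard: fall back to an empty subtitle file when
    faster-whisper finds no speech and openlrc raises RuntimeError."""
    lrcer = LRCer()
    try:
        # Assumption: LRCer.run returns the generated subtitle path.
        return lrcer.run(audio_path, skip_trans=True)
    except RuntimeError as e:
        if 'stack expects a non-empty TensorList' not in str(e):
            raise  # unrelated failure, re-raise
        empty = Path(audio_path).with_suffix('.lrc')
        empty.write_text('')  # no speech -> empty subtitle file
        return str(empty)
```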

2024-09-19 22:48:16.532 | INFO     | video_tools.transcribe.base_transcriber:preview:93 - preview transcribe task:
TranscribeMetadata(
│   params=TranscribeParams(model='tiny', device='cpu', compute_type='int8'),
│   audios=[
│   │   AudioMetadata(path=PosixPath('/home/user00/gitspace/video_tools/.data/no-speech/no-speech.mp4'), hash='6e8b9718e3f6c6f60be6c25f766e3da885995f557d541989a341896feff6d505', subtitle=None, error=None)
│   ]
)
2024-09-19 22:48:16.622 | INFO     | video_tools.transcribe.base_transcriber:preview:95 - total audios num: 1
Do you want to continue? [y/N]: y
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.4.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint .venv/lib/python3.11/site-packages/faster_whisper/assets/pyannote_vad_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.2+cu121. Bad things might happen unless you revert torch to 1.x.
 [2024-09-19 22:48:18] INFO     [MainThread] File /home/user00/gitspace/video_tools/.data/no-speech/no-speech.mp4: Audio sample rate: 44100
 [2024-09-19 22:48:19] INFO     [MainThread] Loudness normalizing...
 [2024-09-19 22:48:19] INFO     [MainThread] Normalizing file no-speech.wav (1 of 1)
 [2024-09-19 22:48:19] INFO     [MainThread] Running first pass loudnorm filter for stream 0
 [2024-09-19 22:48:48] INFO     [MainThread] Running second pass for /home/user00/gitspace/video_tools/.data/no-speech/no-speech.wav
 [2024-09-19 22:48:52] INFO     [MainThread] Normalized file written to /home/user00/gitspace/video_tools/.data/no-speech/preprocessed/no-speech_ln.wav
 [2024-09-19 22:48:52] INFO     [MainThread] Preprocessed audio saved to /home/user00/gitspace/video_tools/.data/no-speech/preprocessed/no-speech_preprocessed.wav
 [2024-09-19 22:48:52] INFO     [MainThread] Working on 1 audio files: [PosixPath('/home/user00/gitspace/video_tools/.data/no-speech/preprocessed/no-speech_preprocessed.wav')]
 [2024-09-19 22:48:52] INFO     [MainThread] Start Transcription (Producer) and Translation (Consumer) process
 [2024-09-19 22:48:52] INFO     [Producer_0] Start Transcription process
 [2024-09-19 22:48:52] INFO     [Producer_0] Audio length: /home/user00/gitspace/video_tools/.data/no-speech/preprocessed/no-speech_preprocessed.wav: 00:25:14,243
No active speech found in audio
 [2024-09-19 22:49:24] INFO     [Producer_0] Detected language: en (0.58) in first 30s of audio...
 [2024-09-19 22:49:24] INFO     [Producer_0] Transcription process Elapsed: 31.53s
 [2024-09-19 22:49:24] INFO     [MainThread] Transcription (Producer) and Translation (Consumer) process Elapsed: 31.53s
Traceback (most recent call last):
  File "/home/user00/gitspace/video_tools/video_tools/main.py", line 6, in <module>
    fire.Fire(OpenLRCTranscriber)
  File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user00/gitspace/video_tools/video_tools/transcribe/base_transcriber.py", line 125, in run
    return self._transcribe()
           ^^^^^^^^^^^^^^^^^^
  File "/home/user00/gitspace/video_tools/video_tools/transcribe/openlrc_transcriber.py", line 13, in _transcribe
    return self._lrcer.run(self._audios, skip_trans=True, clear_temp=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/openlrc/openlrc.py", line 370, in run
    producer.result()
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/openlrc/openlrc.py", line 122, in produce_transcriptions
    segments, info = self.transcriber.transcribe(audio_path, language=src_lang)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/openlrc/transcribe.py", line 81, in transcribe
    seg_gen, info = self.whisper_model.transcribe(str(audio_path), language=language, **self.asr_options)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/faster_whisper/transcribe.py", line 523, in transcribe
    features = torch.stack(
               ^^^^^^^^^^^^
RuntimeError: stack expects a non-empty TensorList
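For reference, the crash itself is just `torch.stack` being called on an empty list once VAD has filtered out every chunk (see `faster_whisper/transcribe.py` line 523 above). A minimal reproduction:

```python
import torch

# Stacking an empty list fails; this is what happens inside faster-whisper
# when VAD leaves no speech chunks to build features from.
torch.stack([])  # RuntimeError: stack expects a non-empty TensorList
```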
zh-plus commented 1 week ago

There is an existing PR for faster-whisper that implements early stopping for non-voice audio: https://github.com/SYSTRAN/faster-whisper/pull/1014. Until it is merged, there seems to be no straightforward way to stop early without adding an extra VAD pass, which is computationally expensive and unnecessary for most users.

As a workaround, you could run voice activity detection with pyannote on your local machine before sending the audio to openlrc; a rough sketch follows.
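A minimal sketch of such a pre-check, assuming pyannote.audio 3.x and its `VoiceActivityDetection` pipeline (the model name, hyperparameters, and skip logic here are illustrative, not part of openlrc):

```python
from pyannote.audio import Model
from pyannote.audio.pipelines import VoiceActivityDetection

# Illustrative VAD pre-check: skip transcription when no speech is found.
# Assumes pyannote.audio 3.x; 'pyannote/segmentation-3.0' requires a
# Hugging Face access token.
model = Model.from_pretrained('pyannote/segmentation-3.0',
                              use_auth_token='HF_TOKEN')
pipeline = VoiceActivityDetection(segmentation=model)
pipeline.instantiate({'min_duration_on': 0.0, 'min_duration_off': 0.0})

speech = pipeline('audio.wav')  # pyannote Annotation of speech regions
if len(speech.get_timeline()) == 0:
    print('No active speech found; skipping transcription.')
else:
    ...  # hand the file to openlrc as usual
```

This keeps the extra VAD cost out of openlrc itself and only runs for callers that expect speechless inputs.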