[Feature] Add translation support for SRT subtitle files

billyphuse commented 1 year ago

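Right now I convert the file (to epub) and translate that, but the timestamps get translated too. It would be great if timestamps were handled specially.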

yihong0618 commented 1 year ago

Feasible in theory.

hadwinn commented 1 year ago

I need this too 😂

MIBlue119 commented 1 year ago

I tried to implement some basic code to process .srt and .vtt files and am sharing it with @yihong0618.

It processes the file line by line, but that runs into the API rate limit. For now I use the ratelimiter and retry packages as a workaround, but line-by-line processing of .srt/.vtt files is slow.
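A minimal sketch of the retry half of such a workaround, written with the standard library only rather than the ratelimiter and retry packages named above. The name `translate_with_retry`, its parameters, and the exponential backoff schedule are assumptions for illustration; they are not taken from the project.

```python
import time
from typing import Callable


def translate_with_retry(
    translate: Callable[[str], str],
    text: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call translate(text), waiting longer after each failed attempt.

    `translate` is any callable that sends text to a translation API
    (hypothetical here). The delay doubles after every failure:
    base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return translate(text)
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("retry loop exited unexpectedly")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. Retrying does not make line-by-line translation faster; it only keeps a run from aborting on a rate-limit error.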

I also tried extracting whole blocks from the .srt/.vtt file and feeding them to the API, but they may come back translated without the timecode. Maybe a better prompt would solve this.
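One way to avoid losing timecodes, sketched here under stated assumptions, is to never send them to the API: parse each block into number, timecode, and text, then translate only the text in batches. The names `Cue`, `parse_srt`, and `translate_cues`, the batch size, and the one-line-per-cue batching scheme are all hypothetical and not part of the project.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Matches "00:00:04,000 --> 00:00:15,000" (SRT) and the "." variant (VTT).
TIMECODE = re.compile(r"^\d+:\d{2}:\d{2}[.,]\d{3} --> \d+:\d{2}:\d{2}[.,]\d{3}")


@dataclass
class Cue:
    index: str
    timecode: str
    text: str


def parse_srt(content: str) -> List[Cue]:
    """Split subtitle content into cues; blocks are separated by blank lines."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        # Expect: cue number, timecode, then one or more text lines.
        if len(lines) >= 3 and TIMECODE.match(lines[1]):
            cues.append(Cue(lines[0].strip(), lines[1].strip(), "\n".join(lines[2:])))
    return cues


def translate_cues(
    cues: List[Cue], translate: Callable[[str], str], batch_size: int = 10
) -> List[str]:
    """Translate only the cue text, several cues per request."""
    translated: List[str] = []
    for start in range(0, len(cues), batch_size):
        batch = cues[start : start + batch_size]
        # Collapse each cue to a single line so one line maps to one cue.
        texts = [" ".join(cue.text.split()) for cue in batch]
        result = translate("\n".join(texts)).split("\n")
        if len(result) != len(texts):
            # The model merged or split lines: fall back to one request per cue.
            result = [translate(t) for t in texts]
        translated.extend(result)
    return translated
```

Because the numbers and timecodes stay in the `Cue` objects, the output file can be rebuilt from them regardless of what the model returns for the text.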

yihong0618 commented 1 year ago

@MIBlue119 I am working on this too; it seems we need to batch.

yihong0618 commented 1 year ago

thanks

yihong0618 commented 1 year ago

@MIBlue119 I compared the amount of text in one video's .srt file with that of a book, and it is not much, so I feel the speed is acceptable.

MIBlue119 commented 1 year ago

Thanks for your reply~!

Rich-999 commented 1 year ago

I need this too.

guavashine commented 1 year ago

I am working on this too; it seems we need to batch.

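If you rename the .srt file to .txt, it can be translated directly. The timestamps inside look like "00:00:04,000 --> 00:00:15,000". I tested the resulting subtitles and how they display: (1) The best case is when the timestamps are not translated into Chinese; renaming the file back to .srt then gives bilingual subtitles that work right away. (2) If the timestamp format stays unchanged during translation, the result is one line of timestamp plus English, and the translation adds another line with the same timestamp plus Chinese; this also displays as normal bilingual subtitles in the video.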
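Case (1) can be sketched as follows: check that a line is an untouched SRT timestamp, then write one block with the original text followed by its translation. The helper names `is_timestamp` and `bilingual_block` are hypothetical, and the pattern assumes the comma-separated SRT format quoted above.

```python
import re

# SRT timestamps: "HH:MM:SS,mmm --> HH:MM:SS,mmm"
SRT_TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def is_timestamp(line: str) -> bool:
    """Return True if the line is an unmodified SRT timestamp line."""
    return bool(SRT_TIMESTAMP.match(line.strip()))


def bilingual_block(index: int, timestamp: str, original: str, translated: str) -> str:
    """Build one subtitle block: number, timestamp, original text, translation."""
    if not is_timestamp(timestamp):
        raise ValueError(f"not a valid SRT timestamp: {timestamp!r}")
    return f"{index}\n{timestamp}\n{original}\n{translated}\n"


block = bilingual_block(1, "00:00:04,000 --> 00:00:15,000", "Hello", "你好")
print(block)
```

A timestamp line that picked up translated characters fails the check, which makes the problem visible before the file is renamed back to .srt.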

Royhowtohack commented 1 year ago

I modified txt_loader.py so that it keeps the timeline and the original format. Here is the result:

import re
import sys
from pathlib import Path

def make_bilingual_book(self):
    index = 0
    p_to_save_len = len(self.p_to_save)

    try:
        sliced_list = [
            self.origin_book[i : i + self.batch_size]
            for i in range(0, len(self.origin_book), self.batch_size)
        ]
        for i in sliced_list:
            batch_text = "\n".join(i)
            if self._is_special_text(batch_text):
                continue
            if not self.resume or index >= p_to_save_len:
                try:
                    temp = self.translate_model.translate(batch_text)
                except Exception as e:
                    print(e)
                    raise Exception("Something is wrong when translate") from e
                self.p_to_save.append(temp)

                # Split the original and translated text into lines
                original_lines = batch_text.split("\n")
                translated_lines = temp.split("\n")

                # Timestamps use "," (SRT) or "." (VTT) before the milliseconds
                timestamp_re = r"^\d+:\d{2}:\d{2}[.,]\d{3} --> \d+:\d{2}:\d{2}[.,]\d{3}$"

                # Always keep the original line; add the translation only for
                # lines that are neither a cue number nor a timestamp
                for orig_line, trans_line in zip(original_lines, translated_lines):
                    self.bilingual_result.append(orig_line)
                    if not (re.match(r"^\d+$", orig_line) or re.match(timestamp_re, orig_line)):
                        self.bilingual_result.append(trans_line)

            index += self.batch_size
            if self.is_test and index > self.test_num:
                break

        self.save_file(
            f"{Path(self.txt_name).parent}/{Path(self.txt_name).stem}_bilingual.txt",
            self.bilingual_result,
        )

    except (KeyboardInterrupt, Exception) as e:
        print(e)
        print("you can resume it next time")
        self._save_progress()
        self._save_temp_book()
        sys.exit(0)
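One caveat about the pairing step: `zip` stops at the shorter input, so if the model returns fewer lines than it was given, the trailing original lines are silently dropped. A minimal sketch of a more forgiving pairing, assuming a hypothetical helper named `pair_lines` that is not part of txt_loader.py:

```python
from itertools import zip_longest
from typing import List, Tuple


def pair_lines(original: str, translated: str) -> List[Tuple[str, str]]:
    """Pair original and translated lines, padding the shorter side with ""."""
    orig_lines = original.split("\n")
    trans_lines = translated.split("\n")
    # zip_longest keeps every line from both sides instead of truncating.
    return list(zip_longest(orig_lines, trans_lines, fillvalue=""))


pairs = pair_lines("a\nb\nc", "A\nB")
print(pairs)  # [('a', 'A'), ('b', 'B'), ('c', '')]
```

This keeps every original line, including timestamps, in the output, but it does not repair a misaligned translation; the paired lines can still be off by one after the point where the model merged or dropped a line.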

yihong0618 commented 1 year ago

@Royhowtohack cool

guavashine commented 1 year ago

Awesome! Thanks!

yihong0618 commented 1 year ago

@Royhowtohack If we can keep the format, would you mind opening a PR? Or I can do it based on your code.

Royhowtohack commented 1 year ago

I just tried to open a PR. I'm still quite new to GitHub, so I hope I did it correctly.

yihong0618 / bilingual_book_maker

[Feature] Add translation support for SRT subtitle files #39