Open billyphuse opened 1 year ago
Feasible in theory.
I need this too 😂
I tried to implement some basic code to process srt and vtt files, and I'm sharing it with @yihong0618.
Processing line by line runs into the rate limit. Currently I use the ratelimiter and retry packages as a workaround, but processing .srt/.vtt files line by line is slow.
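For illustration, here is a minimal sketch of the retry idea using only the standard library. It is not the code from the comment above and does not use the ratelimiter or retry packages; the name `call_with_backoff` and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying with exponentially growing delays on any exception."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

A translation call could then be wrapped as `call_with_backoff(lambda: translate(line))`, where `translate` stands for whatever function sends one request to the API. Retrying only hides the rate limit; it does not make line-by-line processing faster.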
I tried extracting blocks from the .srt/.vtt file and feeding them to the API, but they may come back translated without the timecode. Maybe a better prompt would solve this.
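One way to avoid losing timecodes, sketched here under assumptions, is to keep them out of the request entirely: parse the file into cues, send only the cue text in batches, and reattach the index and timecode afterwards. The names `Cue`, `parse_srt`, `translate_srt`, and the `translate_batch` callback are hypothetical, and the parser handles plain SRT only (VTT headers and cues without an index are skipped).

```python
import re
from typing import Callable, List, NamedTuple


class Cue(NamedTuple):
    index: str
    timecode: str
    text: str


def parse_srt(content: str) -> List[Cue]:
    """Split SRT content into cues; blocks are separated by blank lines."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        # A cue needs an index line, a timecode line, and at least one text line.
        if len(lines) >= 3 and "-->" in lines[1]:
            cues.append(Cue(lines[0].strip(), lines[1].strip(), "\n".join(lines[2:])))
    return cues


def translate_srt(
    content: str,
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 10,
) -> str:
    """Translate only the cue text; index and timecode are copied verbatim."""
    cues = parse_srt(content)
    out = []
    for start in range(0, len(cues), batch_size):
        batch = cues[start : start + batch_size]
        translated = translate_batch([cue.text for cue in batch])
        if len(translated) != len(batch):
            raise ValueError("translator returned a different number of cues")
        for cue, new_text in zip(batch, translated):
            # Bilingual cue: original text followed by its translation.
            out.append(f"{cue.index}\n{cue.timecode}\n{cue.text}\n{new_text}")
    return "\n\n".join(out) + "\n"
```

How `translate_batch` packs several texts into one request, and how it splits the response back into a list, is left open here; that is the part that still depends on the prompt.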
@MIBlue119 I am working on this too. It seems it needs batching.
thanks
@MIBlue119 Just comparing the number of srt lines in a typical video with a book, it is not that many, so I feel the speed is acceptable.
Thanks for your reply~!
I need this too.
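If you rename the srt file's extension to txt, it can be translated directly. The timestamps inside look like "00:00:04,000 --> 00:00:15,000". I tested the subtitles and how they display: (1) ideally the timestamps are not translated into Chinese characters, so that renaming the file back to srt gives directly usable bilingual subtitles; (2) if the timestamp format stays unchanged during translation, you get one timestamp line plus the English, followed by an extra identical timestamp line plus the Chinese, and this also displays as normal bilingual subtitles in the video.
I modified txt_loader.py so that it keeps the timeline and the original format. The result is as follows: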
import re
import sys
from pathlib import Path


def make_bilingual_book(self):
    index = 0
    p_to_save_len = len(self.p_to_save)

    try:
        # Group the source lines into batches of batch_size lines each
        sliced_list = [
            self.origin_book[i : i + self.batch_size]
            for i in range(0, len(self.origin_book), self.batch_size)
        ]
        for i in sliced_list:
            batch_text = "\n".join(i)
            if self._is_special_text(batch_text):
                continue
            if not self.resume or index >= p_to_save_len:
                try:
                    temp = self.translate_model.translate(batch_text)
                except Exception as e:
                    print(e)
                    raise Exception("Something is wrong when translate") from e
                self.p_to_save.append(temp)

                # Split the original and translated text by newline characters
                original_lines = batch_text.split("\n")
                translated_lines = temp.split("\n")

                # Append the original and translated lines to self.bilingual_result
                # (zip stops at the shorter list if the line counts differ)
                for orig_line, trans_line in zip(original_lines, translated_lines):
                    # Check if the line is a line number or a timestamp;
                    # [,.] accepts both the srt (",") and vtt (".") separators
                    if re.match(r"^\d+$", orig_line) or re.match(
                        r"^\d+:\d{2}:\d{2}[,.]\d{3} --> \d+:\d{2}:\d{2}[,.]\d{3}$",
                        orig_line,
                    ):
                        # Keep numbering and timecodes once, untranslated
                        self.bilingual_result.append(orig_line)
                    else:
                        self.bilingual_result.append(orig_line)
                        self.bilingual_result.append(trans_line)
            index += self.batch_size
            if self.is_test and index > self.test_num:
                break

        self.save_file(
            f"{Path(self.txt_name).parent}/{Path(self.txt_name).stem}_bilingual.txt",
            self.bilingual_result,
        )
    except (KeyboardInterrupt, Exception) as e:
        print(e)
        print("you can resume it next time")
        self._save_progress()
        self._save_temp_book()
        sys.exit(0)
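The line-matching step in the function above can be tried in isolation. The following is a minimal standalone sketch, not part of txt_loader.py; the name `interleave` is hypothetical. It differs from the code above in two labeled ways: it raises an error when the translation has a different number of lines instead of letting `zip` silently drop the extras, and it does not duplicate blank lines.

```python
import re
from typing import List

# SRT uses "," before the milliseconds, VTT uses "."
TIMESTAMP = re.compile(r"^\d+:\d{2}:\d{2}[,.]\d{3} --> \d+:\d{2}:\d{2}[,.]\d{3}$")
CUE_INDEX = re.compile(r"^\d+$")


def interleave(original: str, translated: str) -> List[str]:
    """Return bilingual lines, keeping cue numbers and timestamps only once."""
    orig_lines = original.split("\n")
    trans_lines = translated.split("\n")
    # Assumption: a mismatch means the translator merged or dropped lines.
    if len(orig_lines) != len(trans_lines):
        raise ValueError(
            f"line count mismatch: {len(orig_lines)} vs {len(trans_lines)}"
        )
    result = []
    for orig_line, trans_line in zip(orig_lines, trans_lines):
        result.append(orig_line)
        is_meta = CUE_INDEX.match(orig_line) or TIMESTAMP.match(orig_line)
        # Only real subtitle text gets a translated line after it.
        if not is_meta and orig_line.strip():
            result.append(trans_line)
    return result
```

For example, `interleave("1\n00:00:04,000 --> 00:00:15,000\nHello", "1\n00:00:04,000 --> 00:00:15,000\nBonjour")` keeps the cue number and timestamp once and places "Bonjour" directly after "Hello".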
@Royhowtohack cool
Awesome! Thanks!
@Royhowtohack if we can keep the format, would you mind opening a PR? Or I can do it based on your code.
I just tried to open a PR. I'm still quite new to GitHub, so I hope I did it correctly.
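The current approach of converting the file (to epub) also translates the timestamps. I hope the timestamp handling can be specifically optimized.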