yihong0618 / bilingual_book_maker

Make bilingual epub books using AI translate
MIT License

Merge short paragraphs where appropriate #360

Closed aqssxlzc closed 5 months ago

aqssxlzc commented 6 months ago

I noticed that when paragraphs are very short, e.g. dialogue where each paragraph is a single sentence, each sentence is sent to the translation engine as a separate request. But for GPT, translation quality is better when each request carries a reasonable amount of text. Could short paragraphs be merged appropriately in this case, so that several lines of dialogue are sent to the engine together?
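
A rough sketch of what such merging could look like (not the project's actual code; `translate` is a hypothetical wrapper around the engine, and the separator token is an assumption):

```python
# Sketch only: batch short paragraphs into one request, then split the result.
# `translate` is a hypothetical function wrapping the translation engine.
SEP = "\n\n%%%%\n\n"   # separator the engine is asked to keep verbatim
MAX_CHARS = 1500       # rough per-request budget; tune for the model

def translate_batched(paragraphs, translate):
    batches, current, size = [], [], 0
    for p in paragraphs:
        if current and size + len(p) > MAX_CHARS:
            batches.append(current)
            current, size = [], 0
        current.append(p)
        size += len(p)
    if current:
        batches.append(current)

    translated = []
    for batch in batches:
        result = translate(SEP.join(batch))
        parts = result.split("%%%%")
        # If the engine broke the separator format, fall back to one-by-one
        if len(parts) != len(batch):
            parts = [translate(p) for p in batch]
        translated.extend(part.strip() for part in parts)
    return translated
```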

yihong0618 commented 6 months ago

Sure, interested in sending a PR?

yihong0618 commented 6 months ago

> Sure, interested in sending a PR?

That said, this part of the code was written in a hurry and not vetted very carefully, so it may be a bit tricky. If it's left to me, I'd probably get to it later.

Ninzore commented 5 months ago

I gave it a try and found that after merging paragraphs it is hard to preserve the original formatting through translation; it may only be usable in single_translate mode.

yihong0618 commented 5 months ago

> I gave it a try and found that after merging paragraphs it is hard to preserve the original formatting through translation; it may only be usable in single_translate mode.

Thanks. We also felt the formatting was hard to handle. Could this be made an options parameter?

Ninzore commented 5 months ago

Yes. When accepting this new parameter we also need to remind users to enable single_translate. Shall I send a PR?

yihong0618 commented 5 months ago

> Yes. When accepting this new parameter we also need to remind users to enable single_translate. Shall I send a PR?

If this parameter is enabled but single_translate is not, it's better to raise directly. Welcome!
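
Roughly, with a hypothetical name for the new option:

```python
# "batch_paragraphs" is a hypothetical flag name; reject the bad combination early
if batch_paragraphs and not single_translate:
    raise ValueError(
        "batch_paragraphs requires single_translate, otherwise the merged "
        "translation cannot be mapped back onto the original paragraphs"
    )
```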

Ninzore commented 5 months ago

Another thought: we could send the surrounding context to the LLM but instruct it to translate only the current sentence. That would preserve both translation quality and formatting. The problem is that token usage explodes; unless you host the model yourself, most people probably can't afford it.
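
A sketch of that prompt shape (assuming a hypothetical `llm(prompt)` call):

```python
# Sketch: send neighbours as read-only context, translate only the target.
# `llm` is a hypothetical function that sends a prompt and returns text.
def translate_with_context(paragraphs, idx, llm, window=2):
    before = "\n".join(paragraphs[max(0, idx - window):idx])
    after = "\n".join(paragraphs[idx + 1:idx + 1 + window])
    prompt = (
        "Context before (do not translate):\n" + before + "\n\n"
        "Context after (do not translate):\n" + after + "\n\n"
        "Translate ONLY the following paragraph, keeping its formatting:\n"
        + paragraphs[idx]
    )
    return llm(prompt)
```

With a window of 2 on each side, every paragraph gets re-sent about five times, which is where the token blow-up comes from.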

yihong0618 commented 5 months ago

> Another thought: we could send the surrounding context to the LLM but instruct it to translate only the current sentence. That would preserve both translation quality and formatting. The problem is that token usage explodes; unless you host the model yourself, most people probably can't afford it.

That would work, but it would slow things down badly and cost more. If we do it, it should be off by default.

29988122 commented 5 months ago

I know this ticket is about not sending just one sentence at a time to the API for translation.

Going slightly off topic: I have a problem with the following point; is there a solution?

> I gave it a try and found that after merging paragraphs it is hard to preserve the original formatting through translation; it may only be usable in single_translate mode.

In my case my language skills are decent, and the current epub presentation of one original sentence followed by one translated sentence slows down reading and comprehension. Reading in larger blocks, a whole original passage followed by its translation, suits me better.

To be honest, I can't follow the original code well enough to submit a PR directly, but I post-processed the xhtml files inside the generated bilingual epub like this:

```python
import os
from bs4 import BeautifulSoup

# Number of translated <p> tags to aggregate into one block
num_p_tags_to_aggregate = 5

# Process a single XHTML file in place
def process_xhtml_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    soup = BeautifulSoup(content, 'html.parser')
    body = soup.find('body')

    # Get all lines within the <body> tag
    body_lines = str(body).split('\n')

    # Selected translated <p> tags and the line positions they came from
    selected_p_tags = []
    positions = []

    # On each line holding an original/translation pair, detach the translation
    for i, line in enumerate(body_lines):
        line_soup = BeautifulSoup(line, 'html.parser')
        p_tags = line_soup.find_all('p')
        if len(p_tags) >= 2:
            # The second <p> is the translation; remember it and its position
            selected_p_tags.append(str(p_tags[1]))
            positions.append(i)
            # Remove the selected <p> tag from the line
            body_lines[i] = str(line_soup).replace(str(p_tags[1]), '')

    # Re-attach the translations in groups of num_p_tags_to_aggregate,
    # each group after the last original paragraph it belongs to
    for i in range(0, len(selected_p_tags), num_p_tags_to_aggregate):
        group = ''.join(selected_p_tags[i:i + num_p_tags_to_aggregate])
        if i + num_p_tags_to_aggregate - 1 < len(positions):
            pos_to_append = positions[i + num_p_tags_to_aggregate - 1]
        else:
            # Edge case: the last, partial group goes after the final position
            pos_to_append = positions[-1]
        body_lines[pos_to_append] = body_lines[pos_to_append].strip() + group

    # Reconstruct the <body> content with the modified lines
    modified_body_content = '\n'.join(body_lines)

    # Replace the old <body> content with the modified content
    soup.body.replace_with(BeautifulSoup(modified_body_content, 'html.parser'))

    # Write the modified content back to the file
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(str(soup))

# Process every .xhtml file in the current directory
for xhtml_file in [f for f in os.listdir('.') if f.endswith('.xhtml')]:
    process_xhtml_file(xhtml_file)
```

The epub unpack/repack part isn't written yet (a rough sketch of that step is below), but as tested, this code can correctly pick out the ordinary translated-text <p> tags inside the body and keep the translations in the same format as the originals.

Of course this is still a band-aid solution... I hope the code above can at least offer some ideas for a runtime improvement. Thanks!
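
For the unpack/repack step, a minimal sketch with the standard library (note the epub spec requires the mimetype entry to be the archive's first member, stored uncompressed):

```python
import os
import zipfile

def unpack_epub(epub_path, out_dir):
    # An epub is an ordinary zip archive
    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(out_dir)

def repack_epub(src_dir, epub_path):
    with zipfile.ZipFile(epub_path, "w") as zf:
        # mimetype must be the first entry and uncompressed
        zf.write(os.path.join(src_dir, "mimetype"), "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        for root, _, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                rel = os.path.relpath(full, src_dir)
                if rel != "mimetype":
                    zf.write(full, rel, compress_type=zipfile.ZIP_DEFLATED)
```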

yihong0618 commented 5 months ago

> I know this ticket is about not sending just one sentence at a time to the API for translation. [...]

In theory single can work. I'll look into it, thank you~