yihong0618 / bilingual_book_maker

Make bilingual epub books using AI translate
MIT License

Merge short paragraphs where appropriate #360

Closed aqssxlzc closed 5 months ago

aqssxlzc commented 6 months ago

I noticed that when paragraphs are very short, e.g. dialogue where each paragraph is a single sentence, each sentence is sent to the translation engine as a separate request. But for GPT, translation quality is better when each request carries a reasonable amount of text. Could short paragraphs be merged appropriately in this case, so that several lines of dialogue are sent to the engine together?
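
A rough sketch of what such merging could look like (not the project's actual code; `translate` is a hypothetical wrapper around the engine, and the separator token is an assumption):

```python
# Sketch only: batch short paragraphs into one request, then split the result.
# `translate` is a hypothetical function wrapping the translation engine.
SEP = "\n\n%%%%\n\n"   # separator the engine is asked to keep verbatim
MAX_CHARS = 1500       # rough per-request budget; tune for the model

def translate_batched(paragraphs, translate):
    batches, current, size = [], [], 0
    for p in paragraphs:
        if current and size + len(p) > MAX_CHARS:
            batches.append(current)
            current, size = [], 0
        current.append(p)
        size += len(p)
    if current:
        batches.append(current)

    translated = []
    for batch in batches:
        result = translate(SEP.join(batch))
        parts = result.split("%%%%")
        # If the engine broke the separator format, fall back to one-by-one
        if len(parts) != len(batch):
            parts = [translate(p) for p in batch]
        translated.extend(part.strip() for part in parts)
    return translated
```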

yihong0618 commented 6 months ago

Sure, interested in sending a PR?

yihong0618 commented 6 months ago

> Sure, interested in sending a PR?

That said, this part of the code was written in a hurry and not vetted very carefully, so it may be a bit tricky. If it's left to me, I'd probably get to it later.

Ninzore commented 5 months ago

I gave it a try and found that after merging paragraphs it is hard to preserve the original formatting through translation; it may only be usable in single_translate mode.

yihong0618 commented 5 months ago

> I gave it a try and found that after merging paragraphs it is hard to preserve the original formatting through translation; it may only be usable in single_translate mode.

Thanks. We also felt the formatting was hard to handle. Could this be made an options parameter?

Ninzore commented 5 months ago

Yes. When accepting this new parameter we also need to remind users to enable single_translate. Shall I send a PR?

yihong0618 commented 5 months ago

> Yes. When accepting this new parameter we also need to remind users to enable single_translate. Shall I send a PR?

If this parameter is enabled but single_translate is not, it's better to raise directly. Welcome!
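
Roughly, with a hypothetical name for the new option:

```python
# "batch_paragraphs" is a hypothetical flag name; reject the bad combination early
if batch_paragraphs and not single_translate:
    raise ValueError(
        "batch_paragraphs requires single_translate, otherwise the merged "
        "translation cannot be mapped back onto the original paragraphs"
    )
```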

Ninzore commented 5 months ago

Another thought: we could send the surrounding context to the LLM but instruct it to translate only the current sentence. That would preserve both translation quality and formatting. The problem is that token usage explodes; unless you host the model yourself, most people probably can't afford it.
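
A sketch of that prompt shape (assuming a hypothetical `llm(prompt)` call):

```python
# Sketch: send neighbours as read-only context, translate only the target.
# `llm` is a hypothetical function that sends a prompt and returns text.
def translate_with_context(paragraphs, idx, llm, window=2):
    before = "\n".join(paragraphs[max(0, idx - window):idx])
    after = "\n".join(paragraphs[idx + 1:idx + 1 + window])
    prompt = (
        "Context before (do not translate):\n" + before + "\n\n"
        "Context after (do not translate):\n" + after + "\n\n"
        "Translate ONLY the following paragraph, keeping its formatting:\n"
        + paragraphs[idx]
    )
    return llm(prompt)
```

With a window of 2 on each side, every paragraph gets re-sent about five times, which is where the token blow-up comes from.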

yihong0618 commented 5 months ago

> Another thought: we could send the surrounding context to the LLM but instruct it to translate only the current sentence. That would preserve both translation quality and formatting. The problem is that token usage explodes; unless you host the model yourself, most people probably can't afford it.

That would work, but it would slow things down badly and cost more. If we do it, it should be off by default.

29988122 commented 5 months ago

I know this ticket is about not sending just one sentence at a time to the API for translation.

Going slightly off topic: I have a problem with the following point; is there a solution?

> I gave it a try and found that after merging paragraphs it is hard to preserve the original formatting through translation; it may only be usable in single_translate mode.

In my case my language skills are decent, and the current epub presentation of one original sentence followed by one translated sentence slows down reading and comprehension. Reading in larger blocks, a whole original passage followed by its translation, suits me better.

To be honest, I can't follow the original code well enough to submit a PR directly, but I post-processed the xhtml files inside the generated bilingual epub like this:

```python
import os
from bs4 import BeautifulSoup

# Number of translated <p> tags to aggregate into one block
num_p_tags_to_aggregate = 5

# Process a single XHTML file in place
def process_xhtml_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    soup = BeautifulSoup(content, 'html.parser')
    body = soup.find('body')

    # Get all lines within the <body> tag
    body_lines = str(body).split('\n')

    # Selected translated <p> tags and the line positions they came from
    selected_p_tags = []
    positions = []

    # On each line holding an original/translation pair, detach the translation
    for i, line in enumerate(body_lines):
        line_soup = BeautifulSoup(line, 'html.parser')
        p_tags = line_soup.find_all('p')
        if len(p_tags) >= 2:
            # The second <p> is the translation; remember it and its position
            selected_p_tags.append(str(p_tags[1]))
            positions.append(i)
            # Remove the selected <p> tag from the line
            body_lines[i] = str(line_soup).replace(str(p_tags[1]), '')

    # Re-attach the translations in groups of num_p_tags_to_aggregate,
    # each group after the last original paragraph it belongs to
    for i in range(0, len(selected_p_tags), num_p_tags_to_aggregate):
        group = ''.join(selected_p_tags[i:i + num_p_tags_to_aggregate])
        if i + num_p_tags_to_aggregate - 1 < len(positions):
            pos_to_append = positions[i + num_p_tags_to_aggregate - 1]
        else:
            # Edge case: the last, partial group goes after the final position
            pos_to_append = positions[-1]
        body_lines[pos_to_append] = body_lines[pos_to_append].strip() + group

    # Reconstruct the <body> content with the modified lines
    modified_body_content = '\n'.join(body_lines)

    # Replace the old <body> content with the modified content
    soup.body.replace_with(BeautifulSoup(modified_body_content, 'html.parser'))

    # Write the modified content back to the file
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(str(soup))

# Process every .xhtml file in the current directory
for xhtml_file in [f for f in os.listdir('.') if f.endswith('.xhtml')]:
    process_xhtml_file(xhtml_file)
```

The epub unpack/repack part isn't written yet (a rough sketch of that step is below), but as tested, this code can correctly pick out the ordinary translated-text <p> tags inside the body and keep the translations in the same format as the originals.

Of course this is still a band-aid solution... I hope the code above can at least offer some ideas for a runtime improvement. Thanks!
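
For the unpack/repack step, a minimal sketch with the standard library (note the epub spec requires the mimetype entry to be the archive's first member, stored uncompressed):

```python
import os
import zipfile

def unpack_epub(epub_path, out_dir):
    # An epub is an ordinary zip archive
    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(out_dir)

def repack_epub(src_dir, epub_path):
    with zipfile.ZipFile(epub_path, "w") as zf:
        # mimetype must be the first entry and uncompressed
        zf.write(os.path.join(src_dir, "mimetype"), "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        for root, _, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                rel = os.path.relpath(full, src_dir)
                if rel != "mimetype":
                    zf.write(full, rel, compress_type=zipfile.ZIP_DEFLATED)
```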

yihong0618 commented 5 months ago

> I know this ticket is about not sending just one sentence at a time to the API for translation. [...]

In theory single can work. I'll look into it, thank you~