zhijianma / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0

How to use language id filter? #2

Open zhijianma opened 1 week ago

zhijianma commented 1 week ago

Before Asking

Search before asking

Question

How do I use the language ID filter?

Additional

No response

dosubot[bot] commented 1 week ago

I found a similar open issue titled "How to use language id filter?" which was last updated on October 25, 2024. The issue currently has no comments or solution [1].

To use the language ID filter in the data-juicer project, you can utilize the LanguageIDScoreFilter class from the data_juicer.ops.filter.language_id_score_filter module. Here's a brief guide on how to apply this filter:

  1. Create a Dataset: Prepare your dataset with the text data you want to filter.
  2. Initialize the Filter: Set up the LanguageIDScoreFilter with the desired language and minimum score.
  3. Run the Filter: Apply the filter to your dataset.

Here's a code example:

from datasets import Dataset
from data_juicer.ops.filter.language_id_score_filter import LanguageIDScoreFilter

# Example dataset
ds_list = [
    {'text': 'a=1\nb\nc=1+2+3+5\nd=6'},
    {'text': "Today is Sund Sund Sund Sunda and it's a happy day!\nYou know"},
    {'text': '我出生于2023年12月15日'},
    {'text': ',。、„”“«»1」「《》´∶:?!();–—'},
    {'text': '他的英文名字叫Harry Potter'},
    {'text': '这是一个测试'}
]

# Create a Dataset object
dataset = Dataset.from_list(ds_list)

# Initialize the LanguageIDScoreFilter for English language with a minimum score of 0.8
op = LanguageIDScoreFilter(lang='en', min_score=0.8)

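# Add the stats column if not present. The column name is read from
# Fields.stats rather than hardcoded as 'stats', on the assumption that the
# installed data-juicer version may use a prefixed field name.
from data_juicer.utils.constant import Fields
if Fields.stats not in dataset.features:
    dataset = dataset.add_column(name=Fields.stats, column=[{}] * dataset.num_rows)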

# Compute stats and filter the dataset
dataset = dataset.map(op.compute_stats)
filtered_dataset = dataset.filter(op.process)

# Select only the 'text' column
filtered_dataset = filtered_dataset.select_columns(column_names=['text'])

# Convert the filtered dataset to a list
result_list = filtered_dataset.to_list()

# Output the filtered text
print(result_list)

This code filters the dataset to include only texts identified as English with a score of at least 0.8. Adjust the lang and min_score parameters as needed for different languages or score thresholds [2][3].
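To see how lang and min_score interact, here is a minimal sketch of the keep/drop decision described above. It does not call data-juicer: keep_sample is a hypothetical stand-in written for illustration, and the (language, score) pairs are invented, not real model predictions.

```python
from typing import List, Optional, Tuple


def keep_sample(pred_lang: str, pred_score: float,
                lang: Optional[str] = "en", min_score: float = 0.8) -> bool:
    """Hypothetical stand-in for the filter decision, not the library's code.

    A sample is kept only if its predicted language matches `lang`
    (when `lang` is set) and its confidence score reaches `min_score`.
    """
    if lang and pred_lang != lang:
        return False
    return pred_score >= min_score


# Invented (predicted language, confidence score) pairs for illustration.
predictions: List[Tuple[str, float]] = [
    ("en", 0.95),
    ("en", 0.60),
    ("zh", 0.99),
]

# Same threshold, different target language.
kept_en = [p for p in predictions if keep_sample(*p, lang="en", min_score=0.8)]
kept_zh = [p for p in predictions if keep_sample(*p, lang="zh", min_score=0.8)]

# Same target language, lower threshold.
kept_loose = [p for p in predictions if keep_sample(*p, lang="en", min_score=0.5)]

print(kept_en)
print(kept_zh)
print(kept_loose)
```

Under these assumptions, switching lang changes which language survives, while lowering min_score admits lower-confidence samples of the same language.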


Written by @dosu, approved by zhijianma