How to use language id filter?

I found a similar open issue titled "How to use language id filter?" which was last updated on October 25, 2024. The issue currently has no comments or solution ^[1].

To use the language ID filter in the data-juicer project, you can utilize the LanguageIDScoreFilter class from the data_juicer.ops.filter.language_id_score_filter module. Here's a brief guide on how to apply this filter:

1. Create a Dataset: Prepare your dataset with the text data you want to filter.
2. Initialize the Filter: Set up the LanguageIDScoreFilter with the desired language and minimum score.
3. Run the Filter: Compute the per-sample stats, then apply the filter to your dataset.

Here's a code example:

from datasets import Dataset
from data_juicer.ops.filter.language_id_score_filter import LanguageIDScoreFilter
from data_juicer.utils.constants import Fields

# Example dataset
ds_list = [
    {'text': 'a=1\nb\nc=1+2+3+5\nd=6'},
    {'text': "Today is Sund Sund Sund Sunda and it's a happy day!\nYou know"},
    {'text': '我出生于2023年12月15日'},
    {'text': '，。、„”“«»１」「《》´∶：？！（）；–—'},
    {'text': '他的英文名字叫Harry Potter'},
    {'text': '这是一个测试'}
]

# Create a Dataset object
dataset = Dataset.from_list(ds_list)

# Initialize the LanguageIDScoreFilter for English language with a minimum score of 0.8
op = LanguageIDScoreFilter(lang='en', min_score=0.8)

# Add the stats column if not present.
# The filter reads and writes its stats under Fields.stats, not a plain 'stats' key.
if Fields.stats not in dataset.features:
    dataset = dataset.add_column(name=Fields.stats, column=[{}] * dataset.num_rows)

# Compute stats and filter the dataset
dataset = dataset.map(op.compute_stats)
filtered_dataset = dataset.filter(op.process)

# Select only the 'text' column
filtered_dataset = filtered_dataset.select_columns(column_names=['text'])

# Convert the filtered dataset to a list
result_list = filtered_dataset.to_list()

# Output the filtered text
print(result_list)

This code filters the dataset to include only texts identified as English with a score of at least 0.8. Adjust the lang and min_score parameters as needed for different languages or score thresholds ^[2]^[3].
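
To make the keep/drop rule concrete, here is a minimal standalone sketch of the decision described above. It does not call data-juicer: keep_sample is a hypothetical helper, the 'lang' and 'lang_score' keys are assumed names for the stats the filter computes, and the scores are made-up illustrative values rather than real model output.

```python
from typing import Dict, Optional


def keep_sample(stats: Dict[str, object],
                lang: Optional[str] = "en",
                min_score: float = 0.8) -> bool:
    """Hypothetical helper: return True when a sample should be kept."""
    # Drop the sample if a target language is set and the prediction differs.
    if lang and stats.get("lang") != lang:
        return False
    # Otherwise keep it only if the confidence reaches the threshold.
    return float(stats.get("lang_score", 0.0)) >= min_score


# Illustrative samples with made-up language predictions and scores.
samples = [
    {"text": "Today is a happy day!", "stats": {"lang": "en", "lang_score": 0.97}},
    {"text": "这是一个测试", "stats": {"lang": "zh", "lang_score": 0.99}},
    {"text": "a=1 b c=1+2", "stats": {"lang": "en", "lang_score": 0.41}},
]

kept = [s["text"] for s in samples if keep_sample(s["stats"])]
print(kept)  # ['Today is a happy day!']
```

Under these assumptions, a sample is dropped either because the predicted language differs from lang or because the score falls below min_score, so lowering min_score only helps samples already predicted as the target language.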

Written by @dosu, approved by zhijianma
