Open zhijianma opened 1 week ago
I found a similar open issue titled "How to use language id filter?" which was last updated on October 25, 2024. The issue currently has no comments or solution [1].
To use the language ID filter in the data-juicer project, you can use the LanguageIDScoreFilter class from the data_juicer.ops.filter.language_id_score_filter module. In brief: initialize LanguageIDScoreFilter with the desired language and minimum score, compute the stats for each sample, then filter the dataset. Here's a code example:
from datasets import Dataset
from data_juicer.ops.filter.language_id_score_filter import LanguageIDScoreFilter
from data_juicer.utils.constant import Fields

# Example dataset
ds_list = [
    {'text': 'a=1\nb\nc=1+2+3+5\nd=6'},
    {'text': "Today is Sund Sund Sund Sunda and it's a happy day!\nYou know"},
    {'text': '我出生于2023年12月15日'},
    {'text': ',。、„”“«»1」「《》´∶:?!();–—'},
    {'text': '他的英文名字叫Harry Potter'},
    {'text': '这是一个测试'}
]

# Create a Dataset object
dataset = Dataset.from_list(ds_list)

# Initialize the LanguageIDScoreFilter for English language with a minimum score of 0.8
op = LanguageIDScoreFilter(lang='en', min_score=0.8)

# Add the stats column if not present. Fields.stats is the column name the
# operators read and write; it is not necessarily the literal string 'stats'.
if Fields.stats not in dataset.features:
    dataset = dataset.add_column(name=Fields.stats, column=[{}] * dataset.num_rows)

# Compute stats and filter the dataset
dataset = dataset.map(op.compute_stats)
filtered_dataset = dataset.filter(op.process)

# Select only the 'text' column
filtered_dataset = filtered_dataset.select_columns(column_names=['text'])

# Convert the filtered dataset to a list
result_list = filtered_dataset.to_list()

# Output the filtered text
print(result_list)
This code filters the dataset to include only texts identified as English with a score of at least 0.8. Adjust the lang and min_score parameters as needed for different languages or score thresholds [2][3].
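To make the effect of lang and min_score concrete, here is a minimal pure-Python sketch of the keep/drop decision. It does not call data-juicer or any language-ID model: the stats values are written by hand for illustration, the key names lang and lang_score are assumptions, and keep_sample and filter_texts are hypothetical helpers rather than part of the library.

```python
# Hand-written per-sample stats for illustration only.
# They are NOT real model output, and the key names are assumptions.
samples = [
    {"text": "Today is a happy day!", "stats": {"lang": "en", "lang_score": 0.95}},
    {"text": "这是一个测试", "stats": {"lang": "zh", "lang_score": 0.99}},
    {"text": "a=1\nb\nc=1+2+3+5", "stats": {"lang": "en", "lang_score": 0.31}},
]


def keep_sample(sample, lang, min_score):
    """Keep a sample whose predicted language is `lang` with score >= min_score."""
    stats = sample["stats"]
    return stats["lang"] == lang and stats["lang_score"] >= min_score


def filter_texts(samples, lang, min_score):
    """Return the texts of all samples that pass the language and score check."""
    return [s["text"] for s in samples if keep_sample(s, lang, min_score)]


# Same threshold, different target language.
english = filter_texts(samples, lang="en", min_score=0.8)
chinese = filter_texts(samples, lang="zh", min_score=0.8)

# Same language, lower threshold: the low-confidence sample now passes too.
lenient = filter_texts(samples, lang="en", min_score=0.3)

print(english)
print(chinese)
print(lenient)
```

Raising min_score trades recall for precision: short or symbol-heavy texts tend to get low-confidence predictions, so a strict threshold drops them even when the predicted language matches.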
Written by @dosu, approved by zhijianma
Before Asking
[X] I have read the README carefully.
[X] I have pulled the latest code of main branch to run again and the problem still existed.
Search before asking
Question
How to use language id filter?
Additional
No response