zhijianma / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0

[Chinese Homepage] | [Docs] | [API] | [DJ-SORA]

Data-Juicer: A One-Stop Data Processing System for Large Language Models

Data-Juicer is a one-stop multimodal data processing system to make data higher-quality, juicier, and more digestible for LLMs.

Data-Juicer (including DJ-SORA) is being actively updated and maintained. We will periodically enhance and add more features, data recipes and datasets. We welcome you to join us in promoting LLM data development and research!

If you find Data-Juicer useful for your research or development, please kindly cite our work. You are welcome to join our Slack channel, DingDing group, or WeChat group (scan the QR code below with WeChat) for discussion.

QR Code for WeChat group

News

Table of Contents

Features

Overview

Documentation Index

Demos

Prerequisites

Installation

From Source

cd <path_to_data_juicer>
pip install -v -e .  # install the minimal dependencies, which support the basic functions
pip install -v -e .[tools] # install the dependencies for the dedicated tools

The dependency options are listed below:

Tag             Description
. or .[mini]    Install minimal dependencies for basic Data-Juicer.
.[all]          Install all optional dependencies (including minimal dependencies and all of the following).
.[sci]          Install all dependencies for all OPs.
.[dist]         Install dependencies for distributed data processing. (Experimental)
.[dev]          Install dependencies for developing the package as contributors.
.[tools]        Install dependencies for dedicated tools, such as quality classifiers.

Using pip

pip install py-data-juicer

Using Docker

Installation check

import data_juicer as dj
print(dj.__version__)
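
The installed version can also be read from package metadata without importing the package, which is handy in a lightweight environment check. The following is a minimal sketch, assuming the distribution is registered under the PyPI name `py-data-juicer` used in the pip command above; `installed_version` is a hypothetical helper, not part of Data-Juicer.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "py-data-juicer") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


found = installed_version()
print(found if found is not None else "py-data-juicer is not installed")
```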

Quick Start

Data Processing

# only for installation from source
python tools/process_data.py --config configs/demo/process.yaml

# use command line tool
dj-process --config configs/demo/process.yaml

# optional: change the cache home directory
export DATA_JUICER_CACHE_HOME="/path/to/another/directory"
# optional: change the cache directory for models
export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models"
# optional: change the cache directory for assets
export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets"
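
The same three variables can be set from Python when a job is launched from a script. The following is a minimal sketch, assuming the models and assets caches should live in `models` and `assets` under the cache home, as in the example paths above; `cache_env` is a hypothetical helper, not part of Data-Juicer.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def cache_env(cache_home: str,
              base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the three cache variables set."""
    home = Path(cache_home)
    # Copy so the current process environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["DATA_JUICER_CACHE_HOME"] = str(home)
    env["DATA_JUICER_MODELS_CACHE"] = str(home / "models")
    env["DATA_JUICER_ASSETS_CACHE"] = str(home / "assets")
    return env


env = cache_env("/path/to/another/directory")
print(env["DATA_JUICER_MODELS_CACHE"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` together with the `dj-process` command shown above.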

Distributed Data Processing

We have implemented multi-machine distributed data processing based on Ray. The corresponding demos can be run with the following commands:

# Run text data processing
python tools/process_data.py --config ./demos/process_on_ray/configs/demo.yaml
# Run video data processing
python tools/process_data.py --config ./demos/process_video_on_ray/configs/demo.yaml

Data Analysis

# only for installation from source
python tools/analyze_data.py --config configs/demo/analyser.yaml

# use command line tool
dj-analyze --config configs/demo/analyser.yaml

Data Visualization

streamlit run app.py

Build Up Config Files

python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang=en
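
The override in the command above addresses a nested option by a dotted path: `language_id_score_filter.lang=en` targets the `lang` option under `language_id_score_filter`. The sketch below illustrates that idea on a plain dictionary. It is not Data-Juicer's actual argument parser; `apply_override` is a hypothetical helper, and the starting values are made up for illustration.

```python
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> Dict[str, Any]:
    """Apply one '--a.b=value' style override to a nested dict, in place."""
    dotted_key, sep, value = override.lstrip("-").partition("=")
    if not sep or not dotted_key:
        raise ValueError(f"expected '--key.path=value', got {override!r}")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate levels on demand.
        node = node.setdefault(name, {})
    node[leaf] = value  # values stay strings in this sketch
    return config


# Illustrative config; only the 'lang' entry is changed by the override.
config = {"language_id_score_filter": {"lang": "zh", "min_score": 0.8}}
apply_override(config, "--language_id_score_filter.lang=en")
print(config["language_id_score_filter"])  # {'lang': 'en', 'min_score': 0.8}
```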

Preprocess Raw Data (Optional)

For Docker Users

# run the data processing directly:
#   --rm    remove the container after the processing
#   --name  name of the container
#   -v      mount the data or config directory into the container; mounting
#           ~/.cache/ lets the container reuse caches and models (recommended)
docker run --rm \
  --name dj \
  -v <host_data_path>:<image_data_path> \
  -v ~/.cache/:/root/.cache/ \
  datajuicer/data-juicer:<version_tag> \
  dj-process --config /path/to/config.yaml

# start the container in the background (-d)
docker run -dit \
  --rm \
  --name dj \
  -v <host_data_path>:<image_data_path> \
  -v ~/.cache/:/root/.cache/ \
  datajuicer/data-juicer:latest /bin/bash

# enter the container; data-juicer can then be used in editable mode
docker exec -it <container_id> bash
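
When the container is launched from a script, passing the arguments as a list sidesteps shell line-continuation and quoting pitfalls. The following is a minimal sketch that assembles the same one-off `docker run` invocation as above; `docker_run_args` is a hypothetical helper, and the paths in the usage lines are placeholders.

```python
from pathlib import Path
from typing import List


def docker_run_args(host_data: str, container_data: str, config: str,
                    tag: str = "latest", name: str = "dj") -> List[str]:
    """Build the argument list for a one-off dj-process run in a container."""
    cache = Path.home() / ".cache"
    return [
        "docker", "run", "--rm",
        "--name", name,
        "-v", f"{host_data}:{container_data}",  # data or config directory
        "-v", f"{cache}:/root/.cache/",         # reuse caches and models
        f"datajuicer/data-juicer:{tag}",
        "dj-process", "--config", config,
    ]


# Placeholder paths; adjust to the local setup.
args = docker_run_args("/data/dj", "/data", "/data/config.yaml")
print(" ".join(args))
```

To start the container, the list can be handed to `subprocess.run(args, check=True)`, which requires Docker on the host.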

Data Recipes

License

Data-Juicer is released under Apache License 2.0.

Contributing

We are in a rapidly developing field and greatly welcome contributions of new features, bug fixes, and better documentation. Please refer to the How-to Guide for Developers.

If you have any questions, please join our discussion groups.

Acknowledgement

Data-Juicer is used across various LLM products and research initiatives, including industrial LLMs from Alibaba Cloud's Tongyi, such as Dianjin for financial analysis and Zhiwen for reading assistance, as well as Alibaba Cloud's Platform for AI (PAI). We look forward to hearing more of your experience, suggestions, and ideas for collaboration!

Data-Juicer thanks and refers to several community projects, such as Huggingface-Datasets, Bloom, RedPajama, Pile, Alpaca-Cot, Megatron-LM, DeepSpeed, Arrow, Ray, Beam, LM-Harness, HELM, and more.

References

If you find our work useful for your research or development, please kindly cite the following paper.

@inproceedings{chen2024datajuicer,
  title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
  author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
  booktitle={International Conference on Management of Data},
  year={2024}
}