
Fix redpajama links #201

Merged
merged 37 commits into from Jan 30, 2024

Changes from all commits (37 commits)
573a704
fix opencc serialization error
chenhesen Nov 16, 2023
988224f
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Nov 20, 2023
0ac51cc
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Nov 21, 2023
4fee9a1
support audio-text data reading
chenhesen Nov 21, 2023
d856a80
update multimodal_README
chenhesen Nov 22, 2023
29de3de
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Nov 23, 2023
fc98733
fix pre-commit error
chenhesen Nov 23, 2023
539f099
modify audio_special_token
chenhesen Nov 23, 2023
ca34dfc
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Nov 23, 2023
4c27643
support only one target_field
chenhesen Nov 23, 2023
e54d197
fix pre-commit
chenhesen Nov 23, 2023
4743e62
add id for log
chenhesen Nov 24, 2023
6c58bee
fix conflict
chenhesen Nov 29, 2023
6449f15
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Dec 4, 2023
70b6c73
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Dec 8, 2023
4105d1e
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Dec 15, 2023
027af2b
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Dec 21, 2023
8a4a0e7
add remove_repeat_sentences_mapper
chenhesen Dec 21, 2023
65bee64
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Dec 22, 2023
05627ed
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Dec 27, 2023
021d14b
modify mapper op number
chenhesen Dec 27, 2023
584783e
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Dec 27, 2023
1688099
update image_blur
chenhesen Jan 5, 2024
5b72309
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Jan 5, 2024
fb3188a
add image_blur_mapper
chenhesen Jan 11, 2024
8b3d87d
add image_blur_mapper
chenhesen Jan 17, 2024
d80d868
precommit
chenhesen Jan 17, 2024
3e264b6
update __init__
chenhesen Jan 17, 2024
1bf1c7d
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Jan 17, 2024
58ae2c6
fix conflicts
chenhesen Jan 17, 2024
22021b7
fix conficts
chenhesen Jan 18, 2024
344e240
replaced by the latest load_data_with_context
chenhesen Jan 19, 2024
a52311f
fix docs conflicts
chenhesen Jan 19, 2024
5a6b552
fix Operators_ZH
chenhesen Jan 19, 2024
4794a74
fix docs conflicts
chenhesen Jan 19, 2024
124c656
Merge branch 'main' of github.com:alibaba/data-juicer
chenhesen Jan 22, 2024
402f4bf
fix redpajama link
chenhesen Jan 30, 2024
2 changes: 1 addition & 1 deletion LICENSE
@@ -251,7 +251,7 @@ Code in data_juicer/ops/mapper/clean_copyright_mapper.py, data_juicer/ops/mapper
data_juicer/ops/mapper/expand_macro_mapper.py, data_juicer/ops/mapper/remove_bibliography_mapper.py,
data_juicer/ops/mapper/remove_comments_mapper.py, data_juicer/ops/mapper/remove_header_mapper.py,
is adapted from
-https://github.com/togethercomputer/RedPajama-Data
+https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/

Copyright 2023 RedPajama authors.

2 changes: 1 addition & 1 deletion README.md
@@ -350,7 +350,7 @@ Cloud's platform for AI (PAI).
We look forward to more of your experiences, suggestions, and discussions for collaboration!

Data-Juicer thanks and refers to several community projects, such as
-[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....
+[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....



2 changes: 1 addition & 1 deletion README_ZH.md
@@ -328,7 +328,7 @@ Data-Juicer 被各种 LLM产品和研究工作使用,包括来自阿里云-通


Data-Juicer thanks and refers to community open-source projects:
-[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....
+[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....



10 changes: 5 additions & 5 deletions configs/reproduced_redpajama/README.md
@@ -1,9 +1,9 @@
# Redpajama Config Files

-This folder contains example configuration files to easily and quickly reproduce the processing flow of [Redpajama](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep).
+This folder contains example configuration files to easily and quickly reproduce the processing flow of [Redpajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep).

## arXiv
-The raw data files can be downloaded from the same AWS link as in [Redpajama/arXiv](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv).
+The raw data files can be downloaded from the same AWS link as in [Redpajama/arXiv](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv).

Once downloaded, use [raw_arxiv_to_jsonl.py](../../tools/preprocess/raw_arxiv_to_jsonl.py) to convert the original format to `jsonl` files that Data-Juicer can handle easily:

@@ -30,7 +30,7 @@ python tools/process_data.py --config configs/reproduced_redpajama/redpajama-arx

## Books

-The raw data files can be downloaded from the same HuggingFace datasets as in [Redpajama/Books](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/book).
+The raw data files can be downloaded from the same HuggingFace datasets as in [Redpajama/Books](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/book).

Once downloaded, modify the path configurations in [redpajama-books.yaml](redpajama-books.yaml) and execute the following command to reproduce the processing flow of RedPajama.

@@ -47,7 +47,7 @@ python tools/process_data.py --config configs/reproduced_redpajama/redpajama-boo

## Code

-The raw data files can be downloaded from Google BigQuery as in [Redpajama/Code](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/github).
+The raw data files can be downloaded from Google BigQuery as in [Redpajama/Code](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/github).

Once downloaded, unzip and delete files whose extensions are not in the following whitelist:
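The whitelist itself is elided from this excerpt. As a hedged sketch of the deletion step in Python, with a hypothetical stand-in whitelist (the real list lives in the RedPajama/Code instructions):

```python
import os

# Hypothetical whitelist for illustration only; substitute the actual
# extension list from the RedPajama/Code preprocessing docs.
WHITELIST = {".py", ".java", ".cpp", ".c", ".md"}

def filter_by_extension(root):
    """Delete every file under `root` whose extension is not whitelisted,
    and return the paths that were removed."""
    removed = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() not in WHITELIST:
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return removed
```

Running this once over the unzipped tree leaves only the whitelisted source files for the later processing step.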

@@ -70,7 +70,7 @@ python tools/process_data.py --config configs/redpajama/redpajama-code.yaml

## StackExchange

-The raw data files can be downloaded from the same Archive link as in [Redpajama/Stack_exchange](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange).
+The raw data files can be downloaded from the same Archive link as in [Redpajama/Stack_exchange](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange).

Once downloaded, use [raw_stackexchange_to_jsonl.py](../../tools/preprocess/raw_stackexchange_to_jsonl.py) to convert the original format to `jsonl` files that Data-Juicer can handle easily:
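As a rough illustration of what such a conversion involves (this is not the actual logic of `raw_stackexchange_to_jsonl.py`, which also cleans HTML and pairs questions with answers), one could extract each row's `Body` attribute from a Stack Exchange `Posts.xml` dump and emit one JSON object per line:

```python
import json
import xml.etree.ElementTree as ET

def posts_xml_to_jsonl(xml_path, jsonl_path):
    """Minimal sketch: read a Stack Exchange Posts.xml dump and write
    each row's Body as a {"text": ...} JSON object, one per line."""
    root = ET.parse(xml_path).getroot()
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for row in root.iter("row"):
            body = row.attrib.get("Body", "")
            out.write(json.dumps({"text": body}) + "\n")
```

The resulting `jsonl` files are in the line-per-sample shape that Data-Juicer's loaders expect.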

10 changes: 5 additions & 5 deletions configs/reproduced_redpajama/README_ZH.md
@@ -1,10 +1,10 @@
# Redpajama Config Files

-This folder contains configuration files to easily reproduce the processing flow of [Redpajama](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep).
+This folder contains configuration files to easily reproduce the processing flow of [Redpajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep).

## arXiv

-The raw data files can be downloaded from the same AWS link as in [Redpajama/arXiv](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv).
+The raw data files can be downloaded from the same AWS link as in [Redpajama/arXiv](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv).

Once downloaded, use [raw_arxiv_to_jsonl.py](../../tools/preprocess/raw_arxiv_to_jsonl.py) to convert the original format into one that Data-Juicer can easily process:

@@ -31,7 +31,7 @@ python tools/process_data.py --config configs/reproduced_redpajama/redpajama-arx

## Books

-The raw data files can be downloaded from the same HuggingFace link as in [Redpajama/Books](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/book).
+The raw data files can be downloaded from the same HuggingFace link as in [Redpajama/Books](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/book).

Once downloaded, modify the data paths in [redpajama-books.yaml](redpajama-books.yaml) and run the following command to reproduce the RedPajama processing flow:

@@ -48,7 +48,7 @@ python tools/process_data.py --config configs/reproduced_redpajama/redpajama-boo

## Code

-The raw data files can be obtained from the same Google BigQuery source as in [Redpajama/Code](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/github).
+The raw data files can be obtained from the same Google BigQuery source as in [Redpajama/Code](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/github).

Once downloaded, unzip and delete files whose extensions are not in the following whitelist:

@@ -71,7 +71,7 @@ python tools/process_data.py --config configs/redpajama/redpajama-code.yaml

## StackExchange

-The raw data files can be obtained from the same Archive link as in [Redpajama/Stack_exchange](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange).
+The raw data files can be obtained from the same Archive link as in [Redpajama/Stack_exchange](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange).

Once downloaded, use [raw_stackexchange_to_jsonl.py](../../tools/preprocess/raw_stackexchange_to_jsonl.py) to convert the original format into one that Data-Juicer can easily process:

2 changes: 1 addition & 1 deletion data_juicer/ops/mapper/clean_copyright_mapper.py
@@ -1,5 +1,5 @@
# Some code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/
# --------------------------------------------------------

import regex as re
2 changes: 1 addition & 1 deletion data_juicer/ops/mapper/clean_html_mapper.py
@@ -1,5 +1,5 @@
# Some code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/
# --------------------------------------------------------

from data_juicer.utils.availability_utils import AvailabilityChecking
2 changes: 1 addition & 1 deletion data_juicer/ops/mapper/expand_macro_mapper.py
@@ -1,5 +1,5 @@
# Some code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/arxiv/arxiv_cleaner.py
+# https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/arxiv/arxiv_cleaner.py
# --------------------------------------------------------

import regex as re
2 changes: 1 addition & 1 deletion data_juicer/ops/mapper/remove_bibliography_mapper.py
@@ -1,5 +1,5 @@
# Some code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/
# --------------------------------------------------------

import regex as re
2 changes: 1 addition & 1 deletion data_juicer/ops/mapper/remove_comments_mapper.py
@@ -1,5 +1,5 @@
# Some code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/
# --------------------------------------------------------

from typing import List, Union
2 changes: 1 addition & 1 deletion data_juicer/ops/mapper/remove_header_mapper.py
@@ -1,5 +1,5 @@
# Some code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/
# --------------------------------------------------------

import regex as re
4 changes: 2 additions & 2 deletions tools/preprocess/README.md
@@ -49,7 +49,7 @@ python tools/preprocess/raw_arxiv_to_jsonl.py --help

**Note:**

-* For the downloading process, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv).
+* For the downloading process, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv).

* Before downloading, converting, or processing, make sure that your drive space is large enough to store the raw data (over 3TB), the converted data (over 3TB), at least the processed data (about 500-600GB), and even more cache data generated during processing.

@@ -71,7 +71,7 @@ python tools/preprocess/raw_arxiv_stackexchange_to_jsonl.py \
# get help
python tools/preprocess/raw_stackexchange_to_jsonl.py --help
```
-- `src_dir`: if you download the raw Stack Exchange data as RedPajama did, you will get a directory `src` containing hundreds of 7z files with filenames like `*.*.com.7z`. You need to unzip these files, rename each `Posts.xml` to the name of its corresponding compressed package, and place it in that directory. For more details, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange).
+- `src_dir`: if you download the raw Stack Exchange data as RedPajama did, you will get a directory `src` containing hundreds of 7z files with filenames like `*.*.com.7z`. You need to unzip these files, rename each `Posts.xml` to the name of its corresponding compressed package, and place it in that directory. For more details, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange).
- `target_dir`: result directory to store the converted jsonl files.
- `topk` (optional): select the top-k sites with the most content. Default is 28.
- `num_proc` (optional): number of worker processes. Default is 1.
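The unzip-and-rename bookkeeping described for `src_dir` can be sketched as follows, assuming each 7z archive has already been extracted into a directory named after the site (the extraction step itself and the exact on-disk layout are assumptions, not the converter's actual code):

```python
import shutil
from pathlib import Path

def arrange_posts(src_dir):
    """For each extracted site directory under src_dir, move its Posts.xml
    up into src_dir as <site-name>.xml, matching the naming the converter
    expects. Returns the list of new file paths."""
    src = Path(src_dir)
    moved = []
    for site_dir in (p for p in src.iterdir() if p.is_dir()):
        posts = site_dir / "Posts.xml"
        if posts.exists():
            target = src / f"{site_dir.name}.xml"
            shutil.move(str(posts), str(target))
            moved.append(target)
    return moved
```

For example, `math.stackexchange.com/Posts.xml` would become `math.stackexchange.com.xml` directly under `src_dir`.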
4 changes: 2 additions & 2 deletions tools/preprocess/README_ZH.md
@@ -48,7 +48,7 @@ python tools/preprocess/raw_arxiv_to_jsonl.py --help

**Note:**

-* For the downloading process, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv).
+* For the downloading process, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv).

* Before downloading, converting, or processing, you need to make sure that your drive space is large enough to store the raw data (over 3TB), the converted data (over 3TB), at least the processed data (about 500-600GB), and the cache data generated during processing.

@@ -69,7 +69,7 @@ python tools/preprocess/raw_arxiv_stackexchange_to_jsonl.py \
python tools/preprocess/raw_stackexchange_to_jsonl.py --help
```

-- `src_dir`: if you download the raw Stack Exchange data as RedPajama did, you will get a directory `src` containing hundreds of 7z files with filenames like `*.*.com.7z`. You need to unzip these files, rename each `Posts.xml` to the name of its corresponding compressed package, and place it in that directory. For more details, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange).
+- `src_dir`: if you download the raw Stack Exchange data as RedPajama did, you will get a directory `src` containing hundreds of 7z files with filenames like `*.*.com.7z`. You need to unzip these files, rename each `Posts.xml` to the name of its corresponding compressed package, and place it in that directory. For more details, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange).
- `target_dir`: result directory to store the converted jsonl files.
- `topk` (optional): select the top-k sites with the most content. Default is 28.
- `num_proc` (optional): number of worker processes. Default is 1.
4 changes: 2 additions & 2 deletions tools/preprocess/raw_arxiv_to_jsonl.py
@@ -1,12 +1,12 @@
# Part of the code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/arxiv/arxiv_cleaner.py
+# https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/arxiv/arxiv_cleaner.py
# --------------------------------------------------------
#
# This tool is used for converting the raw arxiv data downloaded from S3
# (ref: https://info.arxiv.org/help/bulk_data_s3.html) to several jsonl files.
#
# For downloading process, please refer to:
-# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv
#
# Notice: before downloading, converting or processing, make sure
# that your drive space is large enough to store the raw data (over 3TB),
4 changes: 2 additions & 2 deletions tools/preprocess/raw_stackexchange_to_jsonl.py
@@ -1,13 +1,13 @@
# Part of the code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange
# --------------------------------------------------------
#
# This tool is used for converting the raw Stack Exchange data downloaded from
# from Archive (ref: https://archive.org/download/stackexchange) to several
# jsonl files.
#
# For downloading process, please refer to:
-# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange
#
# Notice: before downloading, converting or processing, make sure
# that your drive space is large enough to store the raw data (over 100GB),