Fix redpajama links (#201)
* fix redpajama links
chenhesen authored Jan 30, 2024
1 parent ab4d3c8 commit 1e512c5
Showing 15 changed files with 27 additions and 27 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -251,7 +251,7 @@ Code in data_juicer/ops/mapper/clean_copyright_mapper.py, data_juicer/ops/mapper
data_juicer/ops/mapper/expand_macro_mapper.py, data_juicer/ops/mapper/remove_bibliography_mapper.py,
data_juicer/ops/mapper/remove_comments_mapper.py, data_juicer/ops/mapper/remove_header_mapper.py,
is adapted from
-https://github.com/togethercomputer/RedPajama-Data
+https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/

Copyright 2023 RedPajama authors.

2 changes: 1 addition & 1 deletion README.md
@@ -350,7 +350,7 @@ Cloud's platform for AI (PAI).
We look forward to more of your experience, suggestions and discussions for collaboration!

Data-Juicer thanks and refers to several community projects, such as
-[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....
+[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....



2 changes: 1 addition & 1 deletion README_ZH.md
@@ -328,7 +328,7 @@ Data-Juicer 被各种 LLM产品和研究工作使用,包括来自阿里云-通


Data-Juicer 感谢并参考了社区开源项目:
-[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....
+[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....



10 changes: 5 additions & 5 deletions configs/reproduced_redpajama/README.md
@@ -1,9 +1,9 @@
# Redpajama Config Files

-This folder contains example configuration files to easily and quickly reproduce the processing flow of [Redpajama](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep).
+This folder contains example configuration files to easily and quickly reproduce the processing flow of [Redpajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep).

## arXiv
-The raw data files can be downloaded from the same AWS link as in [Redpajama/arXiv](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv).
+The raw data files can be downloaded from the same AWS link as in [Redpajama/arXiv](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv).

Once downloaded, use [raw_arxiv_to_jsonl.py](../../tools/preprocess/raw_arxiv_to_jsonl.py) to convert from the original format to `jsonl` that Data-Juicer can handle easily:

@@ -30,7 +30,7 @@ python tools/process_data.py --config configs/reproduced_redpajama/redpajama-arx

## Books

-The raw data files can be downloaded from the same HuggingFace datasets as in [Redpajama/Books](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/book).
+The raw data files can be downloaded from the same HuggingFace datasets as in [Redpajama/Books](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/book).

Once downloaded, modify the path configurations in [redpajama-books.yaml](redpajama-books.yaml) and execute the following command to reproduce the processing flow of RedPajama.

@@ -47,7 +47,7 @@ python tools/process_data.py --config configs/reproduced_redpajama/redpajama-boo

## Code

-The raw data files can be downloaded from Google BigQuery as in [Redpajama/Code](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/github).
+The raw data files can be downloaded from Google BigQuery as in [Redpajama/Code](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/github).

Once downloaded, unzip and delete files whose extensions are not in the following whitelist:
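The whitelist itself is elided in this diff excerpt, but the filtering step it describes can be sketched as follows; the extensions in `WHITELIST` here are stand-ins for illustration, not RedPajama's actual list:

```python
import os
import tempfile

# Stand-in whitelist; the real RedPajama extension list is not reproduced here.
WHITELIST = {".py", ".md", ".java"}

# Build a small throwaway directory to demonstrate the deletion pass.
root = tempfile.mkdtemp()
for name in ("keep.py", "keep.md", "drop.exe", "drop.bin"):
    open(os.path.join(root, name), "w").close()

# Walk the tree and delete every file whose extension is not whitelisted.
for dirpath, _, filenames in os.walk(root):
    for name in filenames:
        if os.path.splitext(name)[1] not in WHITELIST:
            os.remove(os.path.join(dirpath, name))

print(sorted(os.listdir(root)))  # → ['keep.md', 'keep.py']
```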

@@ -70,7 +70,7 @@ python tools/process_data.py --config configs/redpajama/redpajama-code.yaml

## StackExchange

-The raw data files can be downloaded from the same Archive link as in [Redpajama/Stack_exchange](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange).
+The raw data files can be downloaded from the same Archive link as in [Redpajama/Stack_exchange](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange).

Once downloaded, use [raw_stackexchange_to_jsonl.py](../../tools/preprocess/raw_stackexchange_to_jsonl.py) to convert from the original format to `jsonl` that Data-Juicer can handle easily:
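The `jsonl` target format that these converters emit can be sketched minimally as one JSON object per line; the `"text"` field name below is an assumption for illustration, not necessarily the converter's exact schema:

```python
import json
import os
import tempfile

# Two toy records standing in for converted documents.
samples = [
    {"text": "First document body ..."},
    {"text": "Second document body ..."},
]

# Writing jsonl: serialize each record on its own line.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Reading is symmetric: each line parses independently, which is what
# makes the format easy to stream and shard.
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # → 2
```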

10 changes: 5 additions & 5 deletions configs/reproduced_redpajama/README_ZH.md
@@ -1,10 +1,10 @@
# Redpajama 配置文件

-此文件夹包含的配置文件用于轻松复现 [Redpajama](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep) 的处理流程。
+此文件夹包含的配置文件用于轻松复现 [Redpajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep) 的处理流程。

## arXiv

-原始数据文件从 [Redpajama/arXiv](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv) 中相同的 AWS 链接下载。
+原始数据文件从 [Redpajama/arXiv](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv) 中相同的 AWS 链接下载。

下载完成后,使用 [raw_arxiv_to_jsonl.py](../../tools/preprocess/raw_arxiv_to_jsonl.py) 将原始格式转换为 Data-Juicer 易于处理的格式:

@@ -31,7 +31,7 @@ python tools/process_data.py --config configs/reproduced_redpajama/redpajama-arx

## Books

-原始数据文件从 [Redpajama/Books](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/book) 中相同的 HuggingFace 链接下载。
+原始数据文件从 [Redpajama/Books](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/book) 中相同的 HuggingFace 链接下载。

下载完成后,修改 [redpajama-books.yaml](redpajama-books.yaml) 中的数据路径,执行以下命令复现 RedPajama 的处理流程:

@@ -48,7 +48,7 @@ python tools/process_data.py --config configs/reproduced_redpajama/redpajama-boo

## Code

-原始数据文件从 [Redpajama/Code](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/github) 中相同的 Google BigQuery 获取。
+原始数据文件从 [Redpajama/Code](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/github) 中相同的 Google BigQuery 获取。

下载完成后,解压缩并删除扩展名不在以下白名单中的其他文件:

@@ -71,7 +71,7 @@ python tools/process_data.py --config configs/redpajama/redpajama-code.yaml

## StackExchange

-原始数据文件从 [Redpajama/Stack_exchange](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange) 中相同的 Archive 链接获取。
+原始数据文件从 [Redpajama/Stack_exchange](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange) 中相同的 Archive 链接获取。

下载完成后,使用 [raw_stackexchange_to_jsonl.py](../../tools/preprocess/raw_stackexchange_to_jsonl.py) 将原始格式转换为 Data-Juicer 易于处理的格式:

2 changes: 1 addition & 1 deletion data_juicer/ops/mapper/clean_copyright_mapper.py
@@ -1,5 +1,5 @@
# Some code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/
# --------------------------------------------------------

import regex as re
2 changes: 1 addition & 1 deletion data_juicer/ops/mapper/clean_html_mapper.py
@@ -1,5 +1,5 @@
# Some code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/
# --------------------------------------------------------

from data_juicer.utils.availability_utils import AvailabilityChecking
2 changes: 1 addition & 1 deletion data_juicer/ops/mapper/expand_macro_mapper.py
@@ -1,5 +1,5 @@
# Some code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/arxiv/arxiv_cleaner.py
+# https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/arxiv/arxiv_cleaner.py
# --------------------------------------------------------

import regex as re
2 changes: 1 addition & 1 deletion data_juicer/ops/mapper/remove_bibliography_mapper.py
@@ -1,5 +1,5 @@
# Some code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/
# --------------------------------------------------------

import regex as re
2 changes: 1 addition & 1 deletion data_juicer/ops/mapper/remove_comments_mapper.py
@@ -1,5 +1,5 @@
# Some code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/
# --------------------------------------------------------

from typing import List, Union
2 changes: 1 addition & 1 deletion data_juicer/ops/mapper/remove_header_mapper.py
@@ -1,5 +1,5 @@
# Some code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/
# --------------------------------------------------------

import regex as re
4 changes: 2 additions & 2 deletions tools/preprocess/README.md
@@ -49,7 +49,7 @@ python tools/preprocess/raw_arxiv_to_jsonl.py --help

**Note:**

-* For downloading process, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv).
+* For downloading process, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv).

* Before downloading, converting, or processing, make sure your drive space is large enough to store the raw data (over 3TB), the converted data (over 3TB), the processed data (about 500-600GB), and even more cache data generated during processing.

@@ -71,7 +71,7 @@ python tools/preprocess/raw_arxiv_stackexchange_to_jsonl.py \
# get help
python tools/preprocess/raw_stackexchange_to_jsonl.py --help
```
-- `src_dir`: if you download raw Stack Exchange data as Redpajama did, you will get a directory src which includes hundreds of 7z files whose filenames are like `*.*.com.7z `. You need to unzip these files and rename the POSTs.xml to the corresponding compressed package name and place it in that dir. For more details, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange).
+- `src_dir`: if you download raw Stack Exchange data as Redpajama did, you will get a directory src which includes hundreds of 7z files whose filenames are like `*.*.com.7z `. You need to unzip these files and rename the POSTs.xml to the corresponding compressed package name and place it in that dir. For more details, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange).
- `target_dir`: result directory to store the converted jsonl files.
- `topk` (optional): select the topk sites with the most content. Default it's 28.
- `num_proc` (optional): number of process workers. Default it's 1.
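The `topk` selection described above boils down to keeping the k sites with the most content; a minimal sketch, with invented site names and counts purely for illustration:

```python
from heapq import nlargest

# Hypothetical per-site content counts; the real tool derives these
# from the downloaded Stack Exchange dumps.
site_counts = {
    "stackoverflow.com": 120000,
    "math.stackexchange.com": 40000,
    "tiny.example.com": 12,
}

# Keep only the `topk` sites with the most content (default 28 in the tool;
# 2 here to keep the toy example small).
topk = 2
selected = nlargest(topk, site_counts, key=site_counts.get)
print(selected)  # → ['stackoverflow.com', 'math.stackexchange.com']
```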
4 changes: 2 additions & 2 deletions tools/preprocess/README_ZH.md
@@ -48,7 +48,7 @@ python tools/preprocess/raw_arxiv_to_jsonl.py --help

**注意事项:**

-* 下载过程请参考[这里](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv)。
+* 下载过程请参考[这里](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv)。

* 在下载、转换或处理之前,您需要确保您的硬盘空间足够大,可以存储原始数据(超过 3TB)、转换后的数据(超过 3TB)、最小处理后的数据(大约 500-600GB),以及处理期间的缓存数据。

@@ -69,7 +69,7 @@ python tools/preprocess/raw_arxiv_stackexchange_to_jsonl.py \
python tools/preprocess/raw_stackexchange_to_jsonl.py --help
```

-- `src_dir`: 如果像 Redpajama 一样下载原始 Stack Exchange 数据,你将得到一个目录 src,其中包含数百个 7z 文件,其文件名类似于 `*.*.com.7z`。 您需要解压这些文件并将 POSTs.xml 重命名为相应的压缩包名称并将其放在该目录中。更多详情请参考[这里](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange)。
+- `src_dir`: 如果像 Redpajama 一样下载原始 Stack Exchange 数据,你将得到一个目录 src,其中包含数百个 7z 文件,其文件名类似于 `*.*.com.7z`。 您需要解压这些文件并将 POSTs.xml 重命名为相应的压缩包名称并将其放在该目录中。更多详情请参考[这里](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange)。
- `target_dir`: 用于存储转换后的 jsonl 文件的结果目录。
- `topk` (可选): 选择内容最多的 k 个站点,默认为 28.
- `num_proc` (可选): worker 进程数量,默认为 1。
4 changes: 2 additions & 2 deletions tools/preprocess/raw_arxiv_to_jsonl.py
@@ -1,12 +1,12 @@
# Part of the code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/arxiv/arxiv_cleaner.py
+# https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/arxiv/arxiv_cleaner.py
# --------------------------------------------------------
#
# This tool is used for converting the raw arxiv data downloaded from S3
# (ref: https://info.arxiv.org/help/bulk_data_s3.html) to several jsonl files.
#
# For downloading process, please refer to:
-# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv
#
# Notice: before downloading, converting, or processing, make sure
# that your drive space is large enough to store the raw data (over 3TB),
4 changes: 2 additions & 2 deletions tools/preprocess/raw_stackexchange_to_jsonl.py
@@ -1,13 +1,13 @@
# Part of the code here has been modified from:
-# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange
# --------------------------------------------------------
#
# This tool is used for converting the raw Stack Exchange data downloaded
# from Archive (ref: https://archive.org/download/stackexchange) to several
# jsonl files.
#
# For downloading process, please refer to:
-# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange
+# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange
#
# Notice: before downloading, converting, or processing, make sure
# that your drive space is large enough to store the raw data (over 100GB),