Merge pull request #11 from alibaba/main
merge
BeachWang authored Feb 23, 2024
2 parents cb1d163 + 55001cb commit eacd6ca
Showing 6 changed files with 23 additions and 12 deletions.
13 changes: 12 additions & 1 deletion Dockerfile
@@ -3,11 +3,22 @@

FROM python:3.8.18

+# prepare the java env
+WORKDIR /opt
+# download jdk
+RUN wget https://aka.ms/download-jdk/microsoft-jdk-17.0.9-linux-x64.tar.gz -O jdk.tar.gz && \
+tar -xzf jdk.tar.gz && \
+rm -rf jdk.tar.gz && \
+mv jdk-17.0.9+8 jdk
+
+# set the environment variable
+ENV JAVA_HOME=/opt/jdk
+
WORKDIR /data-juicer

# install requirements first to better reuse installed library cache
COPY environments/ environments/
-RUN cat environments/* | xargs pip install
+RUN cat environments/* | xargs pip install --default-timeout 1000

# install data-juicer then
COPY . .
6 changes: 3 additions & 3 deletions README.md
@@ -169,7 +169,7 @@ pip install py-data-juicer
latest `data-juicer` with provided [Dockerfile](Dockerfile):

```shell
-docker build -t data-juicer:<version_tag> .
+docker build -t datajuicer/data-juicer:<version_tag> .
```

### Installation check
@@ -276,7 +276,7 @@ docker run --rm \ # remove container after the processing
--name dj \ # name of the container
-v <host_data_path>:<image_data_path> \ # mount data or config directory into the container
-v ~/.cache/:/root/.cache/ \ # mount the cache directory into the container to reuse caches and models (recommended)
-data-juicer:<version_tag> \ # image to run
+datajuicer/data-juicer:<version_tag> \ # image to run
dj-process --config /path/to/config.yaml # similar data processing commands
```

@@ -289,7 +289,7 @@ docker run -dit \ # run the container in the background
--name dj \
-v <host_data_path>:<image_data_path> \
-v ~/.cache/:/root/.cache/ \
-data-juicer:latest /bin/bash
+datajuicer/data-juicer:latest /bin/bash
# enter into this container and then you can use data-juicer in editable mode
docker exec -it <container_id> bash
6 changes: 3 additions & 3 deletions README_ZH.md
@@ -154,7 +154,7 @@ pip install py-data-juicer
- 或者运行如下命令用我们提供的 [Dockerfile](Dockerfile) 来构建包括最新版本的 `data-juicer` 的 docker 镜像:

```shell
-docker build -t data-juicer:<version_tag> .
+docker build -t datajuicer/data-juicer:<version_tag> .
```

### 安装校验
@@ -254,7 +254,7 @@ docker run --rm \ # 在处理结束后将容器移除
--name dj \ # 容器名称
-v <host_data_path>:<image_data_path> \ # 将本地的数据或者配置目录挂载到容器中
-v ~/.cache/:/root/.cache/ \ # 将 cache 目录挂载到容器以复用 cache 和模型资源(推荐)
-data-juicer:<version_tag> \ # 运行的镜像
+datajuicer/data-juicer:<version_tag> \ # 运行的镜像
dj-process --config /path/to/config.yaml # 类似的数据处理命令
```

@@ -267,7 +267,7 @@ docker run -dit \ # 在后台启动容器
--name dj \
-v <host_data_path>:<image_data_path> \
-v ~/.cache/:/root/.cache/ \
-data-juicer:latest /bin/bash
+datajuicer/data-juicer:latest /bin/bash
# 进入这个容器,然后您可以在编辑模式下使用 data-juicer
docker exec -it <container_id> bash
4 changes: 2 additions & 2 deletions data_juicer/ops/common/helper_func.py
@@ -32,7 +32,7 @@ def strip(document, strip_characters):
emojis).
:param document: document to be processed
-:param strip_characters: characters uesd for stripping document
+:param strip_characters: characters used for stripping document
:return: stripped document
"""
if not document:
@@ -76,7 +76,7 @@ def split_on_newline_tab_whitespace(document):
First split on "\\\\n", then on "\\\\t", then on " ".
:param document: document to be splited
-:return: setence list obtained after splitting document
+:return: sentence list obtained after splitting document
"""
sentences = document.split('\n')
sentences = [sentence.split('\t') for sentence in sentences]
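
Review note: the hunk stops after the tab split, so the rest of the function is not visible here. As a rough illustration of the splitting order the docstring describes (newline, then tab, then space), here is a minimal self-contained sketch. The names `split_nested_sketch` and `split_on_spaces` are hypothetical, and the final space-splitting step is an assumption rather than the code in `helper_func.py`.

```python
def split_on_spaces(text):
    """Hypothetical helper: split on single spaces and drop empty pieces."""
    return [piece for piece in text.split(' ') if piece]


def split_nested_sketch(document):
    """Split a document into lines, then tab-separated cells, then words.

    Returns a three-level nested list: lines -> cells -> words.
    This nesting is an assumption made for illustration.
    """
    lines = document.split('\n')
    cells_per_line = [line.split('\t') for line in lines]
    return [[split_on_spaces(cell) for cell in cells]
            for cells in cells_per_line]


if __name__ == '__main__':
    print(split_nested_sketch('a b\tc\nd'))
```

Under these assumptions, the result keeps one list level per delimiter, so a caller can still tell which words shared a line or a tab-separated cell.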
4 changes: 2 additions & 2 deletions data_juicer/ops/filter/image_size_filter.py
@@ -8,7 +8,7 @@

@OPERATORS.register_module('image_size_filter')
class ImageSizeFilter(Filter):
-"""Keep data samples whose image size (in bytes/kb/MB/...) within a
+"""Keep data samples whose image size (in Bytes/KB/MB/...) within a
specific range.
"""

@@ -24,7 +24,7 @@ def __init__(self,
:param min_size: The min image size to keep samples. set to be "0" by
default for no size constraint
:param max_size: The max image size to keep samples. set to be
-"1Tb" by default, an approximate for un-limited case
+"1TB" by default, an approximate for un-limited case
:param any_or_all: keep this sample with 'any' or 'all' strategy of
all images. 'any': keep this sample if any images meet the
condition. 'all': keep this sample only if all images meet the
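
Review note: the docstring above describes size bounds given as strings such as "0" and "1TB", plus an 'any' or 'all' strategy over a sample's images. The sketch below illustrates that logic only; it is not the code in `image_size_filter.py`. The names `parse_size_sketch` and `keep_sample_sketch` are hypothetical, and the use of 1024-based units, case-insensitive unit matching, and keeping samples that have no images are all assumptions.

```python
import re

# Assumption: binary multiples (1 KB = 1024 bytes). The real operator may differ.
_UNIT_FACTORS = {'B': 1, 'KB': 1024, 'MB': 1024 ** 2,
                 'GB': 1024 ** 3, 'TB': 1024 ** 4}


def parse_size_sketch(size):
    """Convert a string such as '0', '512KB' or '1TB' to a byte count."""
    match = re.fullmatch(r'\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*', size)
    if match is None:
        raise ValueError(f'unrecognized size string: {size!r}')
    number, unit = match.groups()
    unit = unit.upper() or 'B'  # a bare number is read as bytes
    if unit not in _UNIT_FACTORS:
        raise ValueError(f'unknown unit in size string: {size!r}')
    return int(float(number) * _UNIT_FACTORS[unit])


def keep_sample_sketch(image_sizes, min_size='0', max_size='1TB',
                       any_or_all='any'):
    """Decide whether to keep a sample given its image sizes in bytes."""
    if any_or_all not in ('any', 'all'):
        raise ValueError("any_or_all must be 'any' or 'all'")
    low, high = parse_size_sketch(min_size), parse_size_sketch(max_size)
    in_range = [low <= size <= high for size in image_sizes]
    if not in_range:
        return True  # assumption: a sample without images is kept
    return any(in_range) if any_or_all == 'any' else all(in_range)
```

With these assumptions, a sample holding one 10-byte image and one 5000-byte image passes a "1KB" upper bound under 'any' but not under 'all'.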
@@ -16,7 +16,7 @@ def __init__(self,
"""
Initialization method.
-:param keep_alphabet: whether to keep alpabet
+:param keep_alphabet: whether to keep alphabet
:param keep_number: whether to keep number
:param keep_punc: whether to keep punctuation
:param args: extra args
