feat(script): generate tpch data set #6024

Merged: 5 commits, Jun 17, 2022
3 changes: 3 additions & 0 deletions .gitignore
@@ -57,3 +57,6 @@ venv/
__pycache__/

*.zip

# tpch data set
benchmark/tpch/data
11 changes: 11 additions & 0 deletions benchmark/tpch/README.md
@@ -0,0 +1,11 @@
# Databend TPCH-Benchmark

### TPCH DataSet
Run the following command to generate the TPC-H dataset:
```shell
# scale_factor: scale of the database population. scale 1.0 represents ~1 GB of data
../../scripts/setup/dev_setup.sh -t <scale_factor>
```

### TPCH Benchmark
**TBD**
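
As a quick sanity check after generation, the sketch below verifies that one `.tbl` file exists for each of the eight TPC-H tables. It assumes the files land in `benchmark/tpch/data`, as the script in this PR arranges; `check_tpch_data` is a hypothetical helper name and is not part of the PR.

```shell
#!/bin/bash
# Hypothetical helper (not part of this PR): confirm that dbgen produced
# one .tbl file for each of the eight TPC-H tables.
check_tpch_data() {
    local dir="${1:-benchmark/tpch/data}"
    local missing=0
    local table
    for table in customer lineitem nation orders part partsupp region supplier; do
        if [[ ! -f "$dir/$table.tbl" ]]; then
            echo "missing: $dir/$table.tbl" >&2
            missing=1
        fi
    done
    # Returns 0 when every table file is present, 1 otherwise.
    return "$missing"
}
```

It could be called as `check_tpch_data benchmark/tpch/data && echo "dataset complete"`.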
33 changes: 32 additions & 1 deletion scripts/setup/dev_setup.sh
@@ -329,6 +329,7 @@ function usage {
-d Install development tools
-p Install profile
-s Install codegen tools
-t Install tpch data set
-v Verbose mode
EOF
}
@@ -353,6 +354,7 @@ Build tools (since -b or no option was provided):
* protobuf-compiler
* thrift-compiler
* openjdk
* tpch dataset for benchmark
EOF
fi

@@ -379,6 +381,12 @@ Moreover, ~/.profile will be updated (since -p was provided).
EOF
fi

if [[ "$INSTALL_TPCH_DATA" == "true" ]]; then
cat <<EOF
Tpch dataset (since -t was provided):
EOF
fi

cat <<EOF
If you'd prefer to install these dependencies yourself, please exit this script
now with Ctrl-C.
@@ -391,9 +399,10 @@ INSTALL_BUILD_TOOLS=false
INSTALL_DEV_TOOLS=false
INSTALL_PROFILE=false
INSTALL_CODEGEN=false
INSTALL_TPCH_DATA=false

# parse args
-while getopts "ybdpsv" arg; do
+while getopts "ybdpstv" arg; do
case "$arg" in
y)
AUTO_APPROVE="true"
@@ -413,6 +422,10 @@ while getopts "ybdpsv" arg; do
v)
VERBOSE="true"
;;
t)
INSTALL_TPCH_DATA="true"
;;

*)
usage
exit 0
@@ -427,6 +440,7 @@ fi
if [[ "$INSTALL_BUILD_TOOLS" == "false" ]] &&
[[ "$INSTALL_DEV_TOOLS" == "false" ]] &&
[[ "$INSTALL_PROFILE" == "false" ]] &&
[[ "$INSTALL_TPCH_DATA" == "false" ]] &&
[[ "$INSTALL_CODEGEN" == "false" ]]; then
INSTALL_BUILD_TOOLS="true"
fi
@@ -559,6 +573,23 @@ if [[ "$INSTALL_CODEGEN" == "true" ]]; then
"${PRE_COMMAND[@]}" python3 -m pip install --quiet coscmd PyYAML
fi

if [[ "$INSTALL_TPCH_DATA" == "true" ]]; then
# Build a Docker image that generates the TPC-H data
if [[ -z "$2" ]]; then
docker build -f scripts/setup/tpchdata.dockerfile -t databend:latest .
else
docker build -f scripts/setup/tpchdata.dockerfile -t databend:latest --build-arg scale_factor="$2" .
fi
# Generate data into benchmark/tpch/data unless it already exists
FILE=benchmark/tpch/data/customer.tbl
if test -f "$FILE"; then
echo "$FILE exists."
else
mkdir -p "$(pwd)/benchmark/tpch/data"
docker run -v "$(pwd)/benchmark/tpch/data:/data" --rm databend:latest
fi
fi

[[ "${AUTO_APPROVE}" == "false" ]] && cat <<EOF
Finished installing all dependencies.

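
The script reads the scale factor from `$2`, which only holds the intended value when `-t` appears in the first argument (for example `-t 10` or `-yt 10`); with `-y -t 10`, `$2` is `-t`. One possible alternative, shown here as a sketch and not what this PR implements, is to declare `t:` in the `getopts` string so the value arrives in `OPTARG` regardless of position. `parse_tpch_args` and `TPCH_SCALE_FACTOR` are hypothetical names, and only two of the script's options are reproduced.

```shell
#!/bin/bash
# Sketch only: hypothetical names, not the PR's implementation.
# "t:" tells getopts that -t takes a value, delivered in OPTARG.
parse_tpch_args() {
    local OPTIND=1 arg
    AUTO_APPROVE=false
    INSTALL_TPCH_DATA=false
    TPCH_SCALE_FACTOR=1
    while getopts "yt:" arg; do
        case "$arg" in
        y) AUTO_APPROVE="true" ;;
        t)
            INSTALL_TPCH_DATA="true"
            TPCH_SCALE_FACTOR="$OPTARG"
            ;;
        *) return 1 ;;
        esac
    done
}
```

The trade-off is that `-t` would then require an explicit value, so the Dockerfile's default of `1` would no longer apply when the option is given.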
5 changes: 5 additions & 0 deletions scripts/setup/run-tpch-dbgen.sh
@@ -0,0 +1,5 @@
#!/bin/bash

# Generate the TPC-H tables at the requested scale factor and move them to /data
cd /tpch-dbgen || exit 1
./dbgen -vf -s "$1"
mv ./*.tbl /data
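
The run script hands `$1` to `dbgen` unchecked. A small guard, sketched below with a hypothetical `is_valid_scale_factor` helper, could reject empty or non-numeric values before `dbgen` is invoked. Accepting positive decimals is an assumption based on the README's "scale 1.0" wording.

```shell
#!/bin/bash
# Hypothetical guard (not part of this PR): accept only positive
# integers or decimals such as 1, 10, or 0.1.
is_valid_scale_factor() {
    local sf="$1"
    # Must look like a plain number: digits with an optional fraction.
    [[ "$sf" =~ ^[0-9]+(\.[0-9]+)?$ ]] || return 1
    # Reject zero in any spelling: at least one non-zero digit is required.
    [[ "$sf" =~ [1-9] ]]
}
```

In the run script it might be used as `is_valid_scale_factor "$1" || { echo "invalid scale factor: $1" >&2; exit 1; }`.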
17 changes: 17 additions & 0 deletions scripts/setup/tpchdata.dockerfile
@@ -0,0 +1,17 @@
FROM ubuntu:22.04
ARG scale_factor=1
ENV scale_factor=$scale_factor
RUN apt-get update && \
apt-get install -y git build-essential

# Use https://github.com/databricks/tpch-dbgen to generate data
RUN git clone https://github.com/databricks/tpch-dbgen.git && cd tpch-dbgen && make

WORKDIR /tpch-dbgen
ADD scripts/setup/run-tpch-dbgen.sh /tpch-dbgen/

VOLUME /data

SHELL ["/bin/bash", "-c"]
RUN chmod +x run-tpch-dbgen.sh
ENTRYPOINT ./run-tpch-dbgen.sh $scale_factor