PaquetBenchmarkGenerator

This project aims to generate Parquet data files for TPC-H and ClickBench datasets using parquet-java. The ParquetWriter in pyarrow.parquet depends on pyarrow-cpp, which is outdated and does not support specifying rowGroupSize.

This project is part of PixelsDB.

Type Conversion

Parquet Logical Types are used to construct the message schema.

Multi-threaded File Reading

To enable multi-threaded file reading, files need to be split into smaller chunks. The Linux split command is used for this purpose:

Copy code
# Test example
split -l 30 test1.csv ./test1/test1-csv-
split -l 40 test2.csv ./test2/test2-csv-

# ClickBench
split -l 156250 hits.tsv ./hits/hits-tsv-

# TPC-H
split -l 319150 customer.tbl ./customer/customer-tbl-
split -l 600040 lineitem.tbl ./lineitem/lineitem-tbl-
split -l 100 nation.tbl ./nation/nation-tbl-
split -l 638300 orders.tbl ./orders/orders-tbl-
split -l 769240 part.tbl ./part/part-tbl-
split -l 360370 partsupp.tbl ./partsupp/partsupp-tbl-
split -l 10 region.tbl ./region/region-tbl-
split -l 333340 supplier.tbl ./supplier/supplier-tbl-

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.idea		.idea
input		input
src/main		src/main
.gitignore		.gitignore
README-zn.md		README-zn.md
README.md		README.md
convert.py		convert.py
convert_tpch.py		convert_tpch.py
gen_data.py		gen_data.py
pom.xml		pom.xml
spilt.sh		spilt.sh
test.parquet		test.parquet

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PaquetBenchmarkGenerator

Type Conversion

Multi-threaded File Reading

References

About

Releases

Packages

Languages

whzruc/PaquetBenchmarkGenerator

Folders and files

Latest commit

History

Repository files navigation

PaquetBenchmarkGenerator

Type Conversion

Multi-threaded File Reading

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages