Got error when converting TPC-DS in .dat format to parquet format #6

Open
linqinluli opened this issue Jan 5, 2024 · 2 comments


@linqinluli

After I run

cargo run --release -- generate --benchmark tpcds \
  --scale 1000 \
  --partitions 48 \
  --generator-path /path/to/DSGen-software-code-3.2.0rc1/tools \
  --output /tmp/tpcds/sf1000/

the data is generated in the folder /tmp/tpcds/sf1000/. Then I run

mkdir /tmp/tpcds/sf1000-parquet

cargo run --release -- convert --benchmark tpcds \
  --input /tmp/tpcds/sf1000/ \
  --output /tmp/tpcds/sf1000-parquet/

and get the error below:
ArrowError(CsvError("incorrect number of fields for line 1, expected 31 got more than 31"))
I think the code that triggers the error might be
df.write_parquet(&output_filename, Some(props)).await?;
in lib.rs.

After I deleted the first number in call_center.dat/part-1.dat, the error changed to
ArrowError(CsvError("incorrect number of fields for line 2, expected 31 got 32"))

However, converting the TPCH data works fine. I obtained the TPCH and TPC-DS generators as described in your repo.

@capoolebugchat

Just leaving this comment here as a sub-optimal solution. The problem lies in the dataset: every line of every .dat file ends with a trailing delimiter, in this case the "|" character. This causes a mismatch between the schema defined in lib/tpcds.rs and the fields actually read from the file. An ugly workaround would be to add a blank column to every schema definition and drop it after the CSV file has been read. This also exposes a second problem, with encoding: DataFusion does not currently support latin-1, which is the encoding used in TPC-DS 3.0.1rc. I would love to see this resolved one day, and I don't mind creating a PR for this codebase.
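
To make the two problems concrete, here is a minimal sketch of a pre-processing step that could be run on the raw .dat bytes before handing them to the CSV reader. It is only an illustration, not the tool's actual code: the helper names (`field_count`, `strip_trailing_delimiter`, `latin1_to_utf8`, `clean_dat`) are hypothetical, and it assumes that every row ends with exactly one trailing "|" and that the input really is latin-1 (ISO-8859-1).

```rust
/// Count the fields a naive "|" split sees on one line.
/// A trailing "|" produces one extra, empty field.
fn field_count(line: &str) -> usize {
    line.split('|').count()
}

/// Remove a single trailing "|" from a line, if present.
/// Assumption: the generator terminates every row with one delimiter.
fn strip_trailing_delimiter(line: &str) -> &str {
    line.strip_suffix('|').unwrap_or(line)
}

/// Decode latin-1 (ISO-8859-1) bytes into a UTF-8 `String`.
/// Each latin-1 byte maps to the Unicode code point with the same value.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Hypothetical clean-up pass: transcode to UTF-8 and drop the
/// trailing delimiter from every row.
fn clean_dat(raw: &[u8]) -> String {
    let text = latin1_to_utf8(raw);
    let mut out = String::with_capacity(text.len());
    for line in text.lines() {
        out.push_str(strip_trailing_delimiter(line));
        out.push('\n');
    }
    out
}
```

With this, a row such as `1|AAAA|` is seen as three fields before cleaning and two after, which is the same kind of off-by-one as "expected 31 got 32". The trade-off is an extra pass over the data, but the schema definitions in lib/tpcds.rs would not need a dummy column.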

@capoolebugchat

@andygrove hello, author of the tool. These are the two problems I identified in the code, and I would love to see them resolved.
