Add examples from TPC-H #666

Merged
merged 30 commits into from
May 13, 2024

Conversation

timsaucer
Contributor

Which issue does this PR close?

This PR does not close #440 but it helps to address one part of it.

Rationale for this change

One of the difficulties for new users of DataFusion is finding helpful examples. This PR adds a series of examples based on the queries in the TPC-H benchmark. These are well-known queries, and we already have a script in place to generate data for users to work with. By adding these examples, we give new users both data to work with and a series of examples showing DataFusion in operation.

What changes are included in this PR?

This PR makes one change to the generator script to update its Docker image location.

All other changes are within the examples folder.

Are there any user-facing changes?

No user-facing changes.

@andygrove
Member

These examples are looking really nice @timsaucer. Don't feel that you have to wait until all of them are implemented before we start merging into main. We could do this in stages if you like.

@timsaucer
Contributor Author

timsaucer commented May 10, 2024

Thanks for the feedback. I am seeing a few differences between a couple of the results I'm getting and what's in the answers file, so I want to get those resolved before merging. I also want to put something in the readme pointing out which examples contain different features to make it easy for people to find things. At the rate I'm going, I'll probably have the last 10 done before mid week.

@timsaucer changed the title from "Draft: Add examples from TPC-H" to "Add examples from TPC-H" on May 12, 2024
Contributor

@Michael-J-Ward Michael-J-Ward left a comment

Only got through part of it so far, but awesome work.

examples/tpch/README.md
examples/tpch/q01_pricing_summary_report.py
examples/tpch/q04_order_priority_checking.py
examples/tpch/q05_local_supplier_volume.py
@timsaucer
Contributor Author

I've added to the main readme in the examples folder, so I think this PR is good to go pending review.

("C_ADDRESS", pyarrow.string()),
("C_NATIONKEY", pyarrow.int32()),
("C_PHONE", pyarrow.string()),
("C_ACCTBAL", pyarrow.float32()),
Member

nit: not important for example usage, but the numeric fields should be decimal not float
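
To illustrate why the comment above prefers decimal over float for monetary fields, here is a minimal sketch using only the Python standard library. The amount is a hypothetical value chosen to make the rounding visible, and `as_float32` is an illustrative helper, not part of the PR. It round-trips a value through 32-bit floating point (the width of `pyarrow.float32()`) and compares the result with an exact decimal.

```python
import struct
from decimal import Decimal


def as_float32(value: float) -> float:
    """Round-trip a Python float through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Hypothetical monetary amount, chosen only to make the rounding visible.
amount_text = "1234567.89"

stored_float32 = as_float32(float(amount_text))  # what a 32-bit float column would hold
stored_decimal = Decimal(amount_text)            # exact base-10 representation

print(stored_float32)  # 1234567.875
print(stored_decimal)  # 1234567.89
```

In a pyarrow schema, the corresponding change would presumably be to swap `pyarrow.float32()` for `pyarrow.decimal128(precision, scale)`. The precision and scale to use are not stated in this thread.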

Member

@andygrove andygrove left a comment

Incredible work @timsaucer. This is super helpful to help new users understand how to use the DataFrame API.

@andygrove andygrove merged commit d71c436 into apache:main May 13, 2024
13 checks passed
@timsaucer timsaucer deleted the examples/tpch branch August 1, 2024 12:12
Development

Successfully merging this pull request may close these issues.

[DISCUSSION] We need a Hero for datafusion-python
3 participants