
feat: tpch + tpcds GHA launcher #3619

Open · wants to merge 8 commits into main from tpcds-wrapper
Conversation

@raunakab (Contributor) commented Dec 19, 2024

Overview

This PR adds "tpch" and "tpcds" launchers to the available tools. They let you easily spin up a Ray cluster and run queries against it.

Usage

To run tpch, run the following:

uv run tools/tpch.py --scale-factor=2 --num-partitions=2 --questions='1-10'

To run tpcds, run the following:

uv run tools/tpcds.py --scale-factor=100 --questions='1-10'

As always, if you want help, run uv run tools/tpch.py --help or uv run tools/tpcds.py --help.

github-actions bot added the feat label Dec 19, 2024
raunakab marked this pull request as ready for review Dec 19, 2024 18:15
raunakab mentioned this pull request Dec 19, 2024
codspeed-hq bot commented Dec 19, 2024

CodSpeed Performance Report

Merging #3619 will improve performance by 35.68%

Comparing tpcds-wrapper (ff772e1) with main (f6002f9)

Summary

⚡ 1 improvement
✅ 26 untouched benchmarks

Benchmarks breakdown

| Benchmark | main | tpcds-wrapper | Change |
| --- | --- | --- | --- |
| test_iter_rows_first_row[100 Small Files] | 308.4 ms | 227.3 ms | +35.68% |

codecov bot commented Dec 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.88%. Comparing base (f6002f9) to head (ff772e1).

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3619      +/-   ##
==========================================
- Coverage   77.88%   77.88%   -0.01%     
==========================================
  Files         719      719              
  Lines       88410    88410              
==========================================
- Hits        68861    68860       -1     
- Misses      19549    19550       +1     

see 1 file with indirect coverage changes

@universalmind303 (Contributor) commented:

@raunakab It doesn't look like the prompt properly gets printed unless you have a really wide terminal:

[screenshot: prompt output truncated in a narrow terminal]

@raunakab (Contributor, Author) commented Dec 19, 2024

> @raunakab It doesn't look like the prompt properly gets printed unless you have a really wide terminal:
>
> [screenshot: prompt output truncated in a narrow terminal]

@universalmind303 Oh, that's strange. I can push an edit for that soon. To get past it for now, just type y (yes) or n (no).
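A width-independent prompt could be as simple as this minimal sketch (the `confirm` helper and its wording are hypothetical, not the tool's actual code):

```python
# A minimal sketch of a confirmation prompt that does not depend on terminal
# width; the prompt wording below is hypothetical.
def confirm(message: str) -> bool:
    while True:
        answer = input(f"{message} [y/n] ").strip().lower()
        if answer in ("y", "yes"):
            return True
        if answer in ("n", "no"):
            return False

if not confirm("Launch the cluster and submit the job?"):
    raise SystemExit("Aborted.")
```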

@universalmind303 (Contributor) left a comment

One thing I'm worried about is discoverability for this.

I know I'm not going to remember uv run tools/tpcds.py --scale-factor=100 --questions='1-10' --cluster-profile='medium-x86'

Does uv have any built-in discovery for scripts?

@raunakab (Contributor, Author) commented:

> One thing I'm worried about is discoverability for this.
>
> I know I'm not going to remember uv run tools/tpcds.py --scale-factor=100 --questions='1-10' --cluster-profile='medium-x86'
>
> Does uv have any built-in discovery for scripts?

Hmm, that is a good point. This is something @samster25 might know about. I'll see if something can be fashioned to help with discoverability.

@universalmind303 (Contributor) commented:

One other improvement that could be made: when I run the command, there's no easy-to-use output, and I need to go dig through the logs to find out what even happened.

[screenshot: GitHub Actions build summary]

Is it possible to customize the "build summary" with basic information about the run?

@raunakab (Contributor, Author) commented Dec 19, 2024

> One other improvement that could be made: when I run the command, there's no easy-to-use output, and I need to go dig through the logs to find out what even happened.
>
> [screenshot: GitHub Actions build summary]
>
> Is it possible to customize the "build summary" with basic information about the run?

@universalmind303 Yes, I found that annoying as well. I'm working on it right now.

My current thought is to produce an output CSV file which can be downloaded and viewed. It would list the queries, how long each one took, and any failures observed.
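For illustration, writing such a CSV could look like this minimal sketch (the column names and the shape of `results` are assumptions, not the final format):

```python
import csv

# Hypothetical per-question results: (question number, seconds, error or None).
results = [(1, 12.4, None), (2, 8.9, None), (3, None, "worker ran out of memory")]

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "duration_s", "status", "error"])
    for question, duration, error in results:
        writer.writerow([question, duration, "failed" if error else "passed", error or ""])
```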

raunakab changed the title from "feat: tpcds GHA launcher" to "feat: tpch + tpcds GHA launcher" Dec 19, 2024
@raunakab (Contributor, Author) commented:

@universalmind303, here is another PR which aims to make the outputs of runs nicer to visualize: #3625.

The first run is still running right now, but you should be able to see an output.csv file uploaded to GitHub for you to download and view.

The run is here:
https://github.com/Eventual-Inc/Daft/actions/runs/12420945783

Comment on lines +1 to +8
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "ray[default]",
# "getdaft",
# ]
# ///

raunakab (Contributor, Author):

This is just to make it possible to run this via uv run benchmarking/tpch/ray_job_runner.py.

@jaychia (Contributor) commented Dec 20, 2024

WRT discoverability, once we have more concrete workflows we can start organizing things as uv tools

https://docs.astral.sh/uv/guides/tools/#running-tools

We can probably have a daft-bench tool, which would be its own CLI invoked from uv. It could include things such as data generation, running benchmarks, etc.
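As a rough illustration, such a CLI could be sketched like this (the daft-bench name, subcommands, and flags are all hypothetical; the entry point would be registered as a console script so that `uv tool run daft-bench` / `uvx daft-bench` can discover it):

```python
# A hypothetical `daft-bench` entry point; names and flags are illustrative only.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="daft-bench", description="Daft benchmarking tools")
    subparsers = parser.add_subparsers(dest="command", required=True)

    tpch = subparsers.add_parser("tpch", help="Run TPC-H benchmarks")
    tpch.add_argument("--scale-factor", type=int, default=2)
    tpch.add_argument("--num-partitions", type=int, default=2)
    tpch.add_argument("--questions", type=str, default="*")

    tpcds = subparsers.add_parser("tpcds", help="Run TPC-DS benchmarks")
    tpcds.add_argument("--scale-factor", type=int, default=100)
    tpcds.add_argument("--questions", type=str, default="*")

    args = parser.parse_args()
    print(f"would run {args.command} with {vars(args)}")  # dispatch to the actual runners here

if __name__ == "__main__":
    main()
```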

@jaychia (Contributor) left a comment

Mostly LGTM, some comments

"--questions", type=str, required=False, default="*", help="A comma separated list of questions to run"
)
parser.add_argument("--scale-factor", type=int, required=False, default=2, help="The scale factor to run on")
parser.add_argument("--cluster-profile", type=str, required=False, help="The ray cluster configuration to run on")
jaychia (Contributor):

Consider using argparse `choices` here to constrain the accepted values.
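If that's the suggestion, a minimal sketch (every profile name other than `medium-x86` is made up here):

```python
import argparse

parser = argparse.ArgumentParser()
# Constrain --cluster-profile to a fixed set of values; only "medium-x86"
# appears in this thread, the other names are hypothetical.
parser.add_argument(
    "--cluster-profile",
    type=str,
    required=False,
    choices=["small-x86", "medium-x86", "large-x86"],
    help="The ray cluster configuration to run on",
)
```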

parser.add_argument(
"--questions", type=str, required=False, default="*", help="A comma separated list of questions to run"
)
parser.add_argument("--scale-factor", type=int, required=False, default=2, help="The scale factor to run on")
jaychia (Contributor):

Same here: `choices` could constrain the scale factor, as in the sketch above.

type=str,
required=False,
help="A comma separated list of environment variables to pass to ray job",
)
jaychia (Contributor):

Usually CLIs take a list like so:

--env-var A=1 --env-var B=1
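A minimal sketch of that pattern with argparse's `action="append"` (the flag name follows the example above; the rest is illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
# Each occurrence of --env-var appends one KEY=VALUE entry to a list.
parser.add_argument(
    "--env-var",
    action="append",
    default=[],
    metavar="KEY=VALUE",
    help="An environment variable to pass to the ray job; repeat for multiple",
)

args = parser.parse_args(["--env-var", "A=1", "--env-var", "B=1"])
env = dict(pair.split("=", 1) for pair in args.env_var)  # {"A": "1", "B": "1"}
```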

commit_hash = (
subprocess.check_output(["git", "rev-parse", branch_name], stderr=subprocess.STDOUT).strip().decode("utf-8")
)
return name, commit_hash
jaychia (Contributor):

Rename this file to git_utils.py?

except ValueError:
raise ValueError(f"Invalid question item; expected a number or a range, instead got {item}")

return nums
jaychia (Contributor):

I didn't realize your question parsing logic was so complex until reading the code.

I think to keep things simple, your workflows can just:

  1. Take in a comma-separated list of questions
  2. Otherwise, run all questions

This utility function should just be 4 lines:

if questions is None:
    return list(range(1, total_number_of_questions + 1))
else:
    return [int(q) for q in questions.split(",")]

You can perform validation logic somewhere else with a regex if you want
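A minimal sketch of that regex check (assuming `questions` is the raw string from the CLI):

```python
import re

questions = "1,5,10"  # example input; in the tool this would come from the CLI

# Accept only a comma-separated list of question numbers.
if questions is not None and not re.fullmatch(r"\d+(,\d+)*", questions):
    raise ValueError(f"Expected a comma-separated list of question numbers, got {questions!r}")
```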

...

if "-" not in item:
raise ValueError("...")
jaychia (Contributor):

What's this?

nums.append(str(num))
continue
except ValueError:
...
jaychia (Contributor):

What's this?
