Add Flux transformer benchmarking #870
base: main
Conversation
This PR depends on iree-org/iree-turbine#418.
It also adds more general functionality to check benchmark results against baseline results. This requires the Google Benchmark compare.py tool, which is not part of the pip package, so I added the repo as a git submodule. The tool does a statistical comparison between benchmarks with proper p-value calculation; I don't think we should roll our own. It also adds a new nightly CI job that should contain nightly tests and benchmarks that do not have their own category the way Llama does now.
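For illustration, here is a minimal sketch of how such a baseline check could be driven from a Python test, assuming the submodule is checked out at third_party/benchmark and that baseline/contender results are stored in Google Benchmark's JSON format. The paths, file layout, and helper name are hypothetical, and compare.py needs its own Python dependencies (e.g. scipy) installed:

# Hypothetical helper: compare contender benchmark results against a stored
# baseline using the compare.py tool from the google/benchmark submodule.
import subprocess
import sys
from pathlib import Path

# Assumed location of the submodule relative to this file.
REPO_ROOT = Path(__file__).resolve().parents[1]
COMPARE_PY = REPO_ROOT / "third_party" / "benchmark" / "tools" / "compare.py"


def compare_against_baseline(baseline_json: Path, contender_json: Path) -> str:
    # "benchmarks" is the documented compare.py mode for comparing two result
    # files. Given enough repetitions in the JSON, it performs a Mann-Whitney
    # U-test and reports p-values next to the time/CPU deltas.
    result = subprocess.run(
        [
            sys.executable,
            str(COMPARE_PY),
            "benchmarks",
            str(baseline_json),
            str(contender_json),
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout

A test can then call the helper and decide, based on the report, whether a regression is significant enough to fail.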
I am not sure if we should put the benchmarks in this repo or at https://github.com/nod-ai/SHARK-TestSuite.
.github/workflows/ci-sharktank.yml
-pytest -n 4 sharktank/ --durations=10
+pytest \
+  -n 4 \
+  --durations=10 \
+  -m "not expensive" \
+  sharktank/
Be careful when adding new tests, particularly if they are expensive. Developers should be able to run pytest and expect a reasonable default set of tests to run. Developers will not remember opt-out marks like this.
The filtering also didn't work here? https://github.com/nod-ai/shark-ai/actions/runs/13079961026/job/36501056330?pr=870 is still running on standard github-hosted runners after 2h+.
The test failure is due to this required PR iree-org/iree-turbine#418.
I changed the default marker selection in pyproject.toml to exclude expensive.
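For context on how the opt-out works on the test side: benchmark tests carry an expensive marker, and the default selection (now configured in pyproject.toml) filters it out, matching the -m "not expensive" expression used in the workflow. A minimal sketch, with a hypothetical test name:

import pytest


# "expensive" matches the marker filtered out by default; the body is a
# placeholder for an actual benchmark test.
@pytest.mark.expensive
def test_flux_transformer_benchmark():
    # Selected only when asked for explicitly, e.g.:
    #   pytest -m expensive sharktank/tests
    ...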
[submodule "third_party/benchmark"]
    path = third_party/benchmark
    url = https://github.com/google/benchmark
Is this actually used? You have google-benchmark in the Python test requirements already.
A source dependency on the C++ library could be added to shortfin as needed, probably via FetchContent here:
shark-ai/shortfin/CMakeLists.txt
Lines 326 to 340 in 4eac34e
if(SHORTFIN_BUILD_TESTS)
  if (NOT SHORTFIN_BUNDLE_DEPS AND NOT SHORTFIN_IREE_SOURCE_DIR)
    # For now we use gtest shipped alongside with IREE.
    FetchContent_Declare(
      googletest
      URL https://github.com/google/googletest/archive/03597a01ee50ed33e9dfd640b249b4be3799d395.zip
    )
    # For Windows: Prevent overriding the parent project's compiler/linker settings
    set(gtest_force_shared_crt ON CACHE BOOL "" FORCE)
    FetchContent_MakeAvailable(googletest)
  endif()
  include(GoogleTest)
  enable_testing()
  add_custom_target(shortfin_testdata_deps)
endif()
I had wrongly left the pip package dependency in; it is removed now. Unfortunately, the script that compares benchmark results is not part of the pip package, and it is used in the Python test.
See https://github.com/google/benchmark/blob/main/docs/tools.md#comparepy
That is why the submodule is used.
I'd much prefer we consolidate existing nightly workflows (eval, sglang benchmark, llama large, etc.) before adding a new one. That being said, I like the "sharktank-nightly" name better than model-specific names...
I think the CI jobs are due for a refactoring to reorganize them and reduce code duplication. I want to do that next.
The advantage of keeping them separate is the ease of tracking the various nightlies across the LLM and Flux models and sharktank/shortfin regressions. If there are tools to do this, or a CI summary in GitHub Pages, that would be great.
--verbose \
--capture=no \
--iree-hip-target=gfx942 \
--iree-device=hip://6 \
Match what other workflows do with hip devices: #725
Do we actually allocate GPUs to CI jobs?
I changed it to use hip://0.
Thanks. Generally, CI jobs should be written to run on any compatible runner machine and be easy to reference and run locally. The runners themselves should be responsible for assigning devices.
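As a sketch of that direction: the tests already accept an --iree-device option, so a conftest.py could default it to hip://0 and leave the mapping to a physical GPU up to the runner (e.g. by restricting visible devices through HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES). The implementation below is an assumption for illustration, not the repo's actual conftest:

import pytest


def pytest_addoption(parser):
    # Default to the first visible device; the CI runner decides which
    # physical GPU hip://0 resolves to, so the workflow stays device-agnostic
    # and the same command works locally.
    parser.addoption(
        "--iree-device",
        action="store",
        default="hip://0",
        help="IREE device URI to run benchmarks on.",
    )


@pytest.fixture
def iree_device(request):
    return request.config.getoption("--iree-device")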
      --html=out/benchmark/index.html \
      sharktank/tests

- name: Deploy to GitHub Pages
  uses: peaceiris/actions-gh-pages@4f9cc6602d3f66b9c108549d475ec49e8ef4d45e # v4.0.0
  with:
    github_token: ${{ secrets.SHARK_PLATFORM_GH_TOKEN }}
    publish_dir: ./out/benchmark
    destination_dir: ./benchmark
"benchmark" isn't a very specific name, and it also doesn't match "sharktank nightly". Consider how you want this to slot in to
Here is a PR for the GH pages #899.
with:
  github_token: ${{ secrets.SHARK_PLATFORM_GH_TOKEN }}
  publish_dir: ./out/sharktank-nightly
  destination_dir: ./sharktank-nightly
I believe publish_dir should be ./out/sharktank_nightly/benchmark. Same for destination_dir. Also, we need to update sharktank-nightly to sharktank_nightly.
I wanted to have one step that pushes all sharktank_nightly artifacts, since this workflow may gain more jobs that produce artifacts in the future. I would be surprised if the action does not support whole directory trees.
Sure, were you able to test this nightly on this PR?