Add Flux transformer benchmarking #870
base: main
Conversation
This PR depends on iree-org/iree-turbine#418.
It also adds more general functionality to check benchmark results against baseline results. This requires the Google Benchmark compare.py tool, which is not part of the pip package, so I added the repo as a git submodule. The tool does a statistical comparison between benchmarks with proper p-value calculation; I don't think we should roll our own. It also adds a new nightly CI job that should contain nightly tests and benchmarks that do not have their own category the way Llama does now.
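For illustration, here is a minimal sketch of how such a baseline check could be driven from a Python test, assuming the submodule is checked out at third_party/benchmark and that baseline/contender results are stored in Google Benchmark's JSON format. The paths, file layout, and helper name are hypothetical, and compare.py needs its own Python dependencies (e.g. scipy) installed:

# Hypothetical helper: compare contender benchmark results against a stored
# baseline using the compare.py tool from the google/benchmark submodule.
import subprocess
import sys
from pathlib import Path

# Assumed location of the submodule relative to this file.
REPO_ROOT = Path(__file__).resolve().parents[1]
COMPARE_PY = REPO_ROOT / "third_party" / "benchmark" / "tools" / "compare.py"


def compare_against_baseline(baseline_json: Path, contender_json: Path) -> str:
    # "benchmarks" is the documented compare.py mode for comparing two result
    # files. Given enough repetitions in the JSON, it performs a Mann-Whitney
    # U-test and reports p-values next to the time/CPU deltas.
    result = subprocess.run(
        [
            sys.executable,
            str(COMPARE_PY),
            "benchmarks",
            str(baseline_json),
            str(contender_json),
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout

A test can then call the helper and decide, based on the report, whether a regression is significant enough to fail.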
I am not sure if we should put the benchmarks in this repo or at https://github.com/nod-ai/SHARK-TestSuite.
.github/workflows/ci-sharktank.yml
-pytest -n 4 sharktank/ --durations=10
+pytest \
+  -n 4 \
+  --durations=10 \
+  -m "not expensive" \
+  sharktank/
Be careful when adding new tests, particularly if they are expensive. Developers should be able to run pytest and expect a reasonable default set of tests to run. Developers will not remember opt-out marks like this.
The filtering also didn't work here? https://github.com/nod-ai/shark-ai/actions/runs/13079961026/job/36501056330?pr=870 is still running on standard github-hosted runners after 2h+.
The test failure is due to this required PR iree-org/iree-turbine#418.
I changed the default marker selection in pyproject.toml to exclude expensive.
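For context on how the opt-out works on the test side: benchmark tests carry an expensive marker, and the default selection (now configured in pyproject.toml) filters it out, matching the -m "not expensive" expression used in the workflow. A minimal sketch, with a hypothetical test name:

import pytest


# "expensive" matches the marker filtered out by default; the body is a
# placeholder for an actual benchmark test.
@pytest.mark.expensive
def test_flux_transformer_benchmark():
    # Selected only when asked for explicitly, e.g.:
    #   pytest -m expensive sharktank/tests
    ...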
[submodule "third_party/benchmark"]
    path = third_party/benchmark
    url = https://github.com/google/benchmark
Is this actually used? You have google-benchmark in the Python test requirements already.
A source dependency on the C++ library could be added to shortfin as needed, probably via FetchContent here:
shark-ai/shortfin/CMakeLists.txt
Lines 326 to 340 in 4eac34e
if(SHORTFIN_BUILD_TESTS)
  if (NOT SHORTFIN_BUNDLE_DEPS AND NOT SHORTFIN_IREE_SOURCE_DIR)
    # For now we use gtest shipped alongside with IREE.
    FetchContent_Declare(
      googletest
      URL https://github.com/google/googletest/archive/03597a01ee50ed33e9dfd640b249b4be3799d395.zip
    )
    # For Windows: Prevent overriding the parent project's compiler/linker settings
    set(gtest_force_shared_crt ON CACHE BOOL "" FORCE)
    FetchContent_MakeAvailable(googletest)
  endif()
  include(GoogleTest)
  enable_testing()
  add_custom_target(shortfin_testdata_deps)
endif()
I had wrongly left the pip package dependency in; it is removed now. Unfortunately, the script that compares benchmark results is not part of the pip package, and it is used in the Python test.
See https://github.com/google/benchmark/blob/main/docs/tools.md#comparepy
That is why the submodule is used.
I'd much prefer we consolidate existing nightly workflows (eval, sglang benchmark, llama large, etc.) before adding a new one. That being said, I like the "sharktank-nightly" name better than model-specific names...
I think the CI jobs are due for a refactoring to reorganize them and reduce code duplication. I want to do that next.
The advantage of keeping them separate is the ease of tracking the various nightlies across the LLM and Flux models and sharktank/shortfin regressions. If there are tools to do this, or a CI summary in GitHub Pages, that would be great.
--verbose \
--capture=no \
--iree-hip-target=gfx942 \
--iree-device=hip://6 \
Match what other workflows do with hip devices: #725
Do we actually allocate GPUs to CI jobs?
I changed it to use hip://0.
Thanks. Generally, CI jobs should be written to run on any compatible runner machine and be easy to reference and run locally. The runners themselves should be responsible for assigning devices.
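As a sketch of that direction: the tests already accept an --iree-device option, so a conftest.py could default it to hip://0 and leave the mapping to a physical GPU up to the runner (e.g. by restricting visible devices through HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES). The implementation below is an assumption for illustration, not the repo's actual conftest:

import pytest


def pytest_addoption(parser):
    # Default to the first visible device; the CI runner decides which
    # physical GPU hip://0 resolves to, so the workflow stays device-agnostic
    # and the same command works locally.
    parser.addoption(
        "--iree-device",
        action="store",
        default="hip://0",
        help="IREE device URI to run benchmarks on.",
    )


@pytest.fixture
def iree_device(request):
    return request.config.getoption("--iree-device")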
      --html=out/benchmark/index.html \
      sharktank/tests

- name: Deploy to GitHub Pages
  uses: peaceiris/actions-gh-pages@4f9cc6602d3f66b9c108549d475ec49e8ef4d45e # v4.0.0
  with:
    github_token: ${{ secrets.SHARK_PLATFORM_GH_TOKEN }}
    publish_dir: ./out/benchmark
    destination_dir: ./benchmark
"benchmark" isn't a very specific name, and it also doesn't match "sharktank nightly". Consider how you want this to slot in to
Here is a PR for the GH pages #899.
with:
  github_token: ${{ secrets.SHARK_PLATFORM_GH_TOKEN }}
  publish_dir: ./out/sharktank-nightly
  destination_dir: ./sharktank-nightly
I believe publish_dir should be ./out/sharktank_nightly/benchmark. Same for destination_dir. Also, we need to update sharktank-nightly to sharktank_nightly.
I wanted to have one step that pushes all sharktank_nightly artifacts, since this workflow may gain more jobs that produce artifacts in the future. I would be surprised if the action does not support whole directory trees.
Sure, were you able to test this nightly on this PR?