
[FEA] Add total core seconds into top candidate view #1342

Merged

Conversation


@cindyyuanjiang cindyyuanjiang commented Sep 11, 2024

Contributes to #1307

PR Changes

After an offline discussion with Felix, here are the changes we want to proceed with:

  1. Filter out apps whose total core seconds fall below a threshold of 691200. This is the total core seconds of an n1-standard-8 instance (8 vCPUs) running for one full day. Running an n1-standard-8 for one day costs less than $10, i.e. roughly $3,650 a year, so such apps have little cost-savings potential (see the sketch after this list).
  2. In the qual tool console output, the table is currently sorted by Estimated GPU Speedup and Unsupported Operators Stage Duration Percent. This PR changes the sort columns to Speedup Category Order (Large > Medium > Small > Not Recommended) and Total Core Seconds, so that within each speedup category the apps are sorted by total core seconds.
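
For reference, a minimal sketch of where the threshold and the cost estimate come from (plain arithmetic; only the resulting constant is what the filter uses):

# n1-standard-8: 8 vCPUs running for one full day
cores_per_node = 8
seconds_per_day = 24 * 60 * 60               # 86,400
print(cores_per_node * seconds_per_day)      # 691200 core-seconds

# At < $10/day, a year of daily runs costs roughly:
print(10 * 365)                              # $3,650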

Output Changes

  • Added a Total Core Seconds column to qualification_summary.csv

Example:

,Vendor,Driver Host,App Name,App ID,Estimated GPU Speedup,Estimated GPU Duration,App Duration,SQL Stage Durations Sum,Unsupported Operators Stage Duration,Unsupported Operators Stage Duration Percent,Total Core Seconds,Skip by Heuristics,Estimated GPU Speedup Category
0,onprem,10.110.46.117,sanity_test,application_1673577383569_0004,1.89,35853,67725,41980,0.00,0.00,1692,False,Large
1,onprem,ip-172-31-56-248.us-west-2.compute.internal,sanity_test,application_1673577383569_0003,1.86,35759,66374,40168,0.00,0.00,1626,False,Large
2,onprem,ip-172-31-56-248.us-west-2.compute.internal,sanity_test,application_1673503196578_0001,1.76,41158,72507,41659,0.00,0.00,1718,False,Medium
3,onprem,ip-172-31-56-248.us-west-2.compute.internal,sanity_test,application_1673577383569_0001,1.76,41292,72649,41647,0.00,0.00,1675,False,Medium
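
The rows above reflect the new sort order: categories rank Large > Medium > Small > Not Recommended, and rows within a category are ordered by Total Core Seconds descending. A minimal pandas sketch of that ordering, assuming a DataFrame with the columns shown above (the helper name and the temporary sort column are illustrative, not the tool's actual implementation):

import pandas as pd

CATEGORY_ORDER = {'Large': 0, 'Medium': 1, 'Small': 2, 'Not Recommended': 3}

def sort_top_candidates(df: pd.DataFrame) -> pd.DataFrame:
    # Map each category to its rank, then sort by rank ascending and by
    # Total Core Seconds descending within each category.
    order = df['Estimated GPU Speedup Category'].map(CATEGORY_ORDER)
    return (df.assign(**{'Speedup Category Order': order})
              .sort_values(['Speedup Category Order', 'Total Core Seconds'],
                           ascending=[True, False])
              .drop(columns='Speedup Category Order'))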

Testing

Cmd:
spark_rapids qualification -v -e <my-event-logs> --tools_jar <my_tools_jar>

Console output:

+----+-------------+--------------------------------+-----------------+------------------------------------+-------------------------------------+------------------------------------+
|    | App Name    | App ID                         | Estimated GPU   | Qualified                          | Full Cluster                        | GPU Config                         |
|    |             |                                | Speedup         | Cluster                            | Config                              | Recommendation                     |
|    |             |                                | Category**      | Recommendation                     | Recommendations*                    | Breakdown*                         |
|----+-------------+--------------------------------+-----------------+------------------------------------+-------------------------------------+------------------------------------|
|  0 | sanity_test | application_1673577383569_0004 | Large           | 1 x Node with 64 vCPUs (1 L4 each) | Does not exist, see log for errors  | Does not exist, see log for errors |
|  1 | sanity_test | application_1673577383569_0003 | Large           | 4 x Node with 8 vCPUs (1 L4 each)  | application_1673577383569_0003.conf | application_1673577383569_0003.log |
|  2 | sanity_test | application_1673503196578_0001 | Medium          | 4 x Node with 8 vCPUs (1 L4 each)  | application_1673503196578_0001.conf | application_1673503196578_0001.log |
|  3 | sanity_test | application_1673577383569_0001 | Medium          | 4 x Node with 8 vCPUs (1 L4 each)  | application_1673577383569_0001.conf | application_1673577383569_0001.log |
+----+-------------+--------------------------------+-----------------+------------------------------------+-------------------------------------+------------------------------------+

The total core seconds for the apps (in order of the above table) are: 1692, 1626, 1718, 1675.
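
For context on the metric itself, total core seconds presumably aggregates core usage across an app's executors; a hedged sketch of the idea (the function and the numbers are illustrative, the actual computation lives in the tools JAR):

def total_core_seconds(executors):
    # Sum, per executor, core count multiplied by uptime in seconds.
    # `executors` is a list of (cores, uptime_seconds) pairs.
    return sum(cores * uptime for cores, uptime in executors)

# e.g. four 8-core executors alive for ~53 seconds each:
print(total_core_seconds([(8, 52.9)] * 4))   # ~1692 core-seconds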

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang cindyyuanjiang marked this pull request as ready for review September 16, 2024 00:09
@cindyyuanjiang cindyyuanjiang self-assigned this Sep 16, 2024
@cindyyuanjiang cindyyuanjiang added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Sep 16, 2024
@cindyyuanjiang cindyyuanjiang changed the title WIP: [FEA] Add total core seconds into top candidate view [FEA] Add total core seconds into top candidate view Sep 16, 2024
@tgravescs

Hey Cindy,

Filter apps with total core secs less than threshold 691200. This is the total core seconds of an n1-standard-8 instance running for one day. Running an n1-standard-8 for one day takes < $10, which is $3650 a year (not much cost savings potential).

This is unclear to me. Is this just an example, or where did this requirement come from? It's the first I'm hearing of it, and I'm not sure why we would do any automatic filtering here based on what appears to be an arbitrary threshold of 691200.

@amahussein amahussein left a comment

Thanks @cindyyuanjiang !

It would be helpful to update the PR description or the issue description with the requirements we got from Felix, so that the reviewers can be on the same page.
Regarding the threshold 691200: I can understand the logic behind that number. However, I don't think it can be applied here, for the following reasons:

  1. We use absolute core-seconds from the eventlog; this value is not normalized to core-seconds per day, so we are comparing apples to oranges. For example, an application could run for 100 straight days with very low core-seconds while another runs periodically with higher core-seconds; the view in this PR would rank the slow-running app as more important, when in the real world we should be interested in the second job. IMHO the view is therefore incorrect because there is no normalization.
  2. 691200 comes from a Dataproc n1-standard-8. What if we are dealing with users running on-prem or on Databricks/Photon? How can we justify to them excluding a job based on the core-seconds of a different platform?


cindyyuanjiang commented Sep 16, 2024

This is unclear to me. Is this just an example, or where did this requirement come from? It's the first I'm hearing of it, and I'm not sure why we would do any automatic filtering here based on what appears to be an arbitrary threshold of 691200.

Thanks @tgravescs! This requirement came from a discussion with Felix. I will update the PR description.

@parthosa parthosa left a comment

Thanks @cindyyuanjiang. Similar to @amahussein's comments: we already pass a DataFrame created from the qual tool JAR output to most of the functions. We could use that (although we might be dropping certain columns).

@tgravescs

Are we not putting total core seconds into the qualification_summary.csv file?

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang

Are we not putting total core seconds into the qualification_summary.csv file?

Thanks @tgravescs, added a "Total Core Seconds" column to qualification_summary.csv. cc: @amahussein

@cindyyuanjiang cindyyuanjiang added the affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) label Sep 20, 2024
Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
parthosa previously approved these changes Sep 23, 2024
@parthosa parthosa left a comment

Thanks @cindyyuanjiang. LGTM.

@amahussein amahussein left a comment

Thanks @cindyyuanjiang
A couple of nits.
Can you please append this PR to the list of changes in our internal documentation issue created to keep track of recent changes?

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang

Can you please append this PR to the list of changes in our internal documentation issue created to keep track of recent changes?

Thanks @amahussein! Added this PR to the internal documentation issue.

@amahussein amahussein left a comment

LGTM!
Thanks @cindyyuanjiang !

@cindyyuanjiang cindyyuanjiang merged commit 55a0460 into NVIDIA:dev Sep 26, 2024
14 checks passed
@cindyyuanjiang cindyyuanjiang deleted the spark-rapids-tools-1307-python branch September 26, 2024 21:34