
[FEA] Add total core seconds into top candidate view #1342

Merged

Conversation


@cindyyuanjiang cindyyuanjiang commented Sep 11, 2024

Contributes to #1307

PR Changes

After an offline discussion with Felix, here are the changes we want to proceed with:

  1. Filter out apps whose total core seconds fall below a threshold of 691200. This is the total core seconds of an n1-standard-8 instance (8 vCPUs) running for one full day. Running an n1-standard-8 for one day costs less than $10, i.e. roughly $3,650 a year, so such apps have little cost-savings potential (see the sketch after this list).
  2. In the qual tool console output, the table is currently sorted by Estimated GPU Speedup and Unsupported Operators Stage Duration Percent. This PR changes the sort columns to Speedup Category Order (Large > Medium > Small > Not Recommended) and Total Core Seconds, so that within each speedup category the apps are sorted by total core seconds.
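
For reference, a minimal sketch of where the threshold and the cost estimate come from (plain arithmetic; only the resulting constant is what the filter uses):

# n1-standard-8: 8 vCPUs running for one full day
cores_per_node = 8
seconds_per_day = 24 * 60 * 60               # 86,400
print(cores_per_node * seconds_per_day)      # 691200 core-seconds

# At < $10/day, a year of daily runs costs roughly:
print(10 * 365)                              # $3,650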

Output Changes

  • Added a Total Core Seconds column to qualification_summary.csv

Example:

,Vendor,Driver Host,App Name,App ID,Estimated GPU Speedup,Estimated GPU Duration,App Duration,SQL Stage Durations Sum,Unsupported Operators Stage Duration,Unsupported Operators Stage Duration Percent,Total Core Seconds,Skip by Heuristics,Estimated GPU Speedup Category
0,onprem,10.110.46.117,sanity_test,application_1673577383569_0004,1.89,35853,67725,41980,0.00,0.00,1692,False,Large
1,onprem,ip-172-31-56-248.us-west-2.compute.internal,sanity_test,application_1673577383569_0003,1.86,35759,66374,40168,0.00,0.00,1626,False,Large
2,onprem,ip-172-31-56-248.us-west-2.compute.internal,sanity_test,application_1673503196578_0001,1.76,41158,72507,41659,0.00,0.00,1718,False,Medium
3,onprem,ip-172-31-56-248.us-west-2.compute.internal,sanity_test,application_1673577383569_0001,1.76,41292,72649,41647,0.00,0.00,1675,False,Medium
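
The rows above reflect the new sort order: categories rank Large > Medium > Small > Not Recommended, and rows within a category are ordered by Total Core Seconds descending. A minimal pandas sketch of that ordering, assuming a DataFrame with the columns shown above (the helper name and the temporary sort column are illustrative, not the tool's actual implementation):

import pandas as pd

CATEGORY_ORDER = {'Large': 0, 'Medium': 1, 'Small': 2, 'Not Recommended': 3}

def sort_top_candidates(df: pd.DataFrame) -> pd.DataFrame:
    # Map each category to its rank, then sort by rank ascending and by
    # Total Core Seconds descending within each category.
    order = df['Estimated GPU Speedup Category'].map(CATEGORY_ORDER)
    return (df.assign(**{'Speedup Category Order': order})
              .sort_values(['Speedup Category Order', 'Total Core Seconds'],
                           ascending=[True, False])
              .drop(columns='Speedup Category Order'))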

Testing

Cmd:
spark_rapids qualification -v -e <my-event-logs> --tools_jar <my_tools_jar>

Console output:

+----+-------------+--------------------------------+-----------------+------------------------------------+-------------------------------------+------------------------------------+
|    | App Name    | App ID                         | Estimated GPU   | Qualified                          | Full Cluster                        | GPU Config                         |
|    |             |                                | Speedup         | Cluster                            | Config                              | Recommendation                     |
|    |             |                                | Category**      | Recommendation                     | Recommendations*                    | Breakdown*                         |
|----+-------------+--------------------------------+-----------------+------------------------------------+-------------------------------------+------------------------------------|
|  0 | sanity_test | application_1673577383569_0004 | Large           | 1 x Node with 64 vCPUs (1 L4 each) | Does not exist, see log for errors  | Does not exist, see log for errors |
|  1 | sanity_test | application_1673577383569_0003 | Large           | 4 x Node with 8 vCPUs (1 L4 each)  | application_1673577383569_0003.conf | application_1673577383569_0003.log |
|  2 | sanity_test | application_1673503196578_0001 | Medium          | 4 x Node with 8 vCPUs (1 L4 each)  | application_1673503196578_0001.conf | application_1673503196578_0001.log |
|  3 | sanity_test | application_1673577383569_0001 | Medium          | 4 x Node with 8 vCPUs (1 L4 each)  | application_1673577383569_0001.conf | application_1673577383569_0001.log |
+----+-------------+--------------------------------+-----------------+------------------------------------+-------------------------------------+------------------------------------+

The total core seconds for the apps (in order of the above table) are: 1692, 1626, 1718, 1675.
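
For context on the metric itself, total core seconds presumably aggregates core usage across an app's executors; a hedged sketch of the idea (the function and the numbers are illustrative, the actual computation lives in the tools JAR):

def total_core_seconds(executors):
    # Sum, per executor, core count multiplied by uptime in seconds.
    # `executors` is a list of (cores, uptime_seconds) pairs.
    return sum(cores * uptime for cores, uptime in executors)

# e.g. four 8-core executors alive for ~53 seconds each:
print(total_core_seconds([(8, 52.9)] * 4))   # ~1692 core-seconds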

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang cindyyuanjiang marked this pull request as ready for review September 16, 2024 00:09
@cindyyuanjiang cindyyuanjiang self-assigned this Sep 16, 2024
@cindyyuanjiang cindyyuanjiang added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Sep 16, 2024
@cindyyuanjiang cindyyuanjiang changed the title WIP: [FEA] Add total core seconds into top candidate view [FEA] Add total core seconds into top candidate view Sep 16, 2024
@tgravescs

Hey Cindy,

Filter apps with total core secs less than threshold 691200. This is the total core seconds of an n1-standard-8 instance running for one day. Running an n1-standard-8 for one day takes < $10, which is $3650 a year (not much cost savings potential).

This is unclear to me. Is this just an example, or where did this requirement come from? It's the first I'm hearing of it, and I'm not sure why we would do any automatic filtering here based on what appears to be an arbitrary threshold of 691200.

@amahussein amahussein left a comment

Thanks @cindyyuanjiang !

It would be helpful to update the PR description or the issue description with the requirements we got from Felix, so that the reviewers can be on the same page.
Regarding the threshold 691200: I can understand the logic behind that number. However, I don't think it can be applied here, for the following reasons:

  1. We use absolute core-seconds from the eventlog; this value is not normalized to core-seconds per day, so we are comparing apples to oranges. For example, an application could run for 100 straight days with very low core-seconds while another runs periodically with higher core-seconds; the view in this PR would rank the slow-running app as more important, when in the real world we should be interested in the second job. IMHO the view is therefore incorrect because there is no normalization.
  2. 691200 comes from a Dataproc n1-standard-8. What if we are dealing with users running on-prem or on Databricks/Photon? How can we justify to them excluding a job based on the core-seconds of a different platform?


cindyyuanjiang commented Sep 16, 2024

This is unclear to me. Is this just an example, or where did this requirement come from? It's the first I'm hearing of it, and I'm not sure why we would do any automatic filtering here based on what appears to be an arbitrary threshold of 691200.

Thanks @tgravescs! This requirement came from a discussion with Felix. I will update the PR description.

@parthosa parthosa left a comment

Thanks @cindyyuanjiang. Similar to @amahussein's comments: we already pass a DataFrame created from the qual tool JAR output to most of the functions. We could use that (although we might be dropping certain columns).

@tgravescs

Are we not putting total core seconds into the qualification_summary.csv file?

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang

Are we not putting total core seconds into the qualification_summary.csv file?

Thanks @tgravescs, added a "Total Core Seconds" column to qualification_summary.csv. cc: @amahussein

@cindyyuanjiang cindyyuanjiang added the affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) label Sep 20, 2024
Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
parthosa previously approved these changes Sep 23, 2024
@parthosa parthosa left a comment

Thanks @cindyyuanjiang. LGTM.

@amahussein amahussein left a comment

Thanks @cindyyuanjiang
A couple of nits.
Can you please append this PR to the list of changes in our internal documentation issue created to keep track of recent changes?

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang

Can you please append this PR to the list of changes in our internal documentation issue created to keep track of recent changes?

Thanks @amahussein! Added this PR to the internal documentation issue.

@amahussein amahussein left a comment

LGTM!
Thanks @cindyyuanjiang !

@cindyyuanjiang cindyyuanjiang merged commit 55a0460 into NVIDIA:dev Sep 26, 2024
14 checks passed
@cindyyuanjiang cindyyuanjiang deleted the spark-rapids-tools-1307-python branch September 26, 2024 21:34