[FEA] Add total core seconds into top candidate view #1342
Conversation
Hey Cindy,
This is unclear to me. Is this just an example, or where did this requirement come from? It's the first I'm hearing of it, and I'm not sure why we would do any sort of automatic filtering here on what appears to be a random threshold of 691200.
Thanks @cindyyuanjiang !
It will be helpful to update the PR description or the issue description with the requirements we get from Felix so that the reviewers can be on the same page.
Regarding the threshold `691200`: I can understand the logic behind putting that number. However, I think it cannot be applied here for the following reasons:
- We use absolute core-seconds from the eventlog. This value is not mapped to core-seconds "per-day", which means we are comparing apples-to-oranges. For example, an application can run for an entire 100 days but with very low core-seconds, while another application runs periodically with higher core-seconds. The view in this PR would show the slow-running app as more important, whereas in the real world we should be interested in the second job. Therefore, IMHO the view is incorrect because there is no normalization.
- `691200` comes from Dataproc n1-standard-8. What if we are dealing with users running on-prem or on Databricks/Photon? How can we justify to them excluding a job based on the core-seconds of a different platform?
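To make the numbers concrete, here is a minimal sketch (illustrative only, not the qual tool's code; all names are hypothetical) of where `691200` comes from and of the per-day normalization described above:

```python
# Where 691200 comes from: an n1-standard-8 has 8 cores, and one day
# is 86400 seconds, so 8 * 86400 = 691200 core-seconds.
SECONDS_PER_DAY = 24 * 60 * 60        # 86400
N1_STANDARD_8_CORES = 8
THRESHOLD = N1_STANDARD_8_CORES * SECONDS_PER_DAY
assert THRESHOLD == 691200

def core_seconds_per_day(total_core_seconds: float, wall_clock_days: float) -> float:
    """Normalize absolute core-seconds by the app's wall-clock lifetime."""
    return total_core_seconds / wall_clock_days

# The apples-to-oranges case: a low-utilization app that ran for 100 days
# vs. a periodic app that packs its core-seconds into a single day.
slow_app = core_seconds_per_day(800_000, wall_clock_days=100)  # 8,000 per day
busy_app = core_seconds_per_day(700_000, wall_clock_days=1)    # 700,000 per day
# Ranking by absolute core-seconds favors slow_app (800k > 700k);
# ranking per-day favors busy_app, which is the job we actually care about.
```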
Thanks @tgravescs! This requirement is from a discussion with Felix. I will update the PR description.
Thanks @cindyyuanjiang. Similar to @amahussein's comments: I think we pass a DataFrame created from the qual tool jar output to most of the functions. We could use that (although we might be dropping certain columns).
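A hedged sketch of that suggestion (file and column names here are assumptions, not the tools' actual API):

```python
# Build one DataFrame from the qual tool jar's CSV output and hand it to
# the downstream heuristics, rather than re-reading the file per function.
import pandas as pd

def load_qual_output(csv_path: str) -> pd.DataFrame:
    """Read the qual tool jar output once; reuse the DataFrame."""
    return pd.read_csv(csv_path)

def apply_heuristics(qual_df: pd.DataFrame) -> pd.DataFrame:
    # Guard against upstream steps having dropped columns we need.
    required = {"App ID", "Total Core Seconds"}
    missing = required - set(qual_df.columns)
    if missing:
        raise ValueError(f"qual output is missing columns: {missing}")
    return qual_df

qual_df = load_qual_output("qual_core_output.csv")  # hypothetical filename
qual_df = apply_heuristics(qual_df)
```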
Are we not putting total core seconds into the qualification_summary.csv file?
Thanks @tgravescs, added the "Total Core Seconds" column in `qualification_summary.csv`.
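For context, a rough sketch of what the figure means (illustrative; the actual value comes from the tools jar, and the executor-record shape below is invented):

```python
# "Total Core Seconds" for an app: sum of executor cores multiplied by
# each executor's uptime in seconds, over all executors of the app.
def total_core_seconds(executors: list[dict]) -> int:
    # executors: [{"cores": 8, "uptime_sec": 120}, ...] (hypothetical shape)
    return sum(e["cores"] * e["uptime_sec"] for e in executors)

print(total_core_seconds([
    {"cores": 8, "uptime_sec": 120},
    {"cores": 4, "uptime_sec": 93},
]))  # 8*120 + 4*93 = 1332
```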
Thanks @cindyyuanjiang. LGTM.
Thanks @cindyyuanjiang
A couple of nits.
Can you please append this PR to the list of changes in our internal documentation issue created to keep track of recent changes?
Thanks @amahussein! Added this PR to the internal documentation issue.
LGTM!
Thanks @cindyyuanjiang !
Contributes to #1307
PR Changes
After an offline discussion with Felix, here are the changes we want to proceed with:
- Threshold `691200`: this is the total core seconds of an n1-standard-8 instance (8 cores) running for one day. Running an n1-standard-8 for one day costs < $10, which is $3650 a year (not much cost savings potential).
- The top candidates view was previously sorted by `Estimated GPU Speedup` and `Unsupported Operators Stage Duration Percent`. This PR changes the sorting columns to `Speedup Category Order` (Large > Medium > Small > Not Recommended) and `Total Core Seconds`, so that within each speedup category, the apps are sorted based on total core seconds (a pandas sketch of this ordering follows below).
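A minimal pandas sketch of the new ordering (the numeric category mapping, app names, and category assignments are assumptions for illustration):

```python
import pandas as pd

apps = pd.DataFrame({
    "App Name": ["a", "b", "c", "d"],                        # sample rows
    "Speedup Category": ["Large", "Medium", "Large", "Small"],
    "Total Core Seconds": [1692, 1626, 1718, 1675],
})

# Map categories to an explicit order: Large > Medium > Small > Not Recommended.
category_order = {"Large": 0, "Medium": 1, "Small": 2, "Not Recommended": 3}
apps["Speedup Category Order"] = apps["Speedup Category"].map(category_order)

# Best category first; within a category, largest total core seconds first.
top_candidates = apps.sort_values(
    by=["Speedup Category Order", "Total Core Seconds"],
    ascending=[True, False],
)
print(top_candidates)
```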
Output Changes
- Added `Total Core Seconds` column in `qualification_summary.csv`
Example:
Testing
Cmd:
spark_rapids qualification -v -e <my-event-logs> --tools_jar <my_tools_jar>
Console output:
The total core seconds for the apps (in order of the above table) are: `1692`, `1626`, `1718`, `1675`.
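A quick, hedged way to eyeball the new column from a run's output (the path below is an assumption; point it at the qualification output directory produced by the command above):

```python
import pandas as pd

# Illustrative path; adjust to your run's output directory.
summary = pd.read_csv("qual_output/qualification_summary.csv")
print(summary[["App Name", "Total Core Seconds"]].to_string(index=False))
```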