Cluster information should handle dynamic allocation and nodes being removed and added #1369
base: dev
Conversation
cluster information. Signed-off-by: Thomas Graves <tgraves@nvidia.com>
platform per app. Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Found an issue with the dataproc GPU recommendation; working on fixing it.
Thanks @tgravescs. Going over the PR.
QQ: For onprem, even if dynamic allocation is disabled, is the plan to disable recommending a cluster shape (since numWorkerNodes=-1)?
E.g. I tested it on a sample event log. The entry in rapids_4_spark_qualification_output_cluster_information.json contains the following:
"recommendedClusterInfo" : {
"vendor" : "onprem",
"coresPerExecutor" : 16,
"numWorkerNodes" : -1,
"numGpus" : 1,
"numExecutors" : 8,
"gpuDevice" : "l4",
"dynamicAllocationEnabled" : false,
"dynamicAllocationMaxExecutors" : "N/A",
"dynamicAllocationMinExecutors" : "N/A",
"dynamicAllocationInitialExecutors" : "N/A",
"workerNodeType" : "onprem"
}
Yes, for any onprem platform we don't know what the node configuration is, so I don't recommend a number of nodes.
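The behavior described above can be sketched as a small guard. This is an illustrative sketch only (the object and method names here are assumptions, not the tool's actual API): when the vendor is onprem the worker hardware is unknown, so the node count is reported as -1 instead of a guessed recommendation.

```scala
// Hypothetical sketch of the onprem guard described above.
object RecommendNodes {
  // Returns a recommended worker-node count, or -1 when the platform's
  // node configuration is unknown (onprem), matching the JSON output above.
  def recommendedNodes(vendor: String, numExecutors: Int, execsPerNode: Int): Int = {
    if (vendor == "onprem") {
      -1 // node hardware is unknown: do not recommend a number of nodes
    } else {
      math.ceil(numExecutors.toDouble / execsPerNode).toInt
    }
  }

  def main(args: Array[String]): Unit = {
    assert(recommendedNodes("onprem", 8, 2) == -1)
    assert(recommendedNodes("dataproc", 8, 2) == 4)
    println("ok")
  }
}
```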
Thanks @tgravescs. Minor comment.
We duplicated the platform field to make it per-app since it stores cluster information, which can vary across apps.
At a high level, instead of having platform per app, should we have cluster info per-app? This way the app and its ClusterInfo could be passed to the AutoTuner.
The challenge would come when a user provides a custom worker info file.
This could be in a future refactor if needed.
```diff
-      numExecsPerNode, activeHosts.toSet.size, sparkProperties, systemProperties)
+    platform.configureClusterInfoFromEventLog(execCoreCounts.max,
+      numExecsPerNode, maxNumExecutorsRunning, maxNumNodesRunning,
+      sparkProperties, systemProperties)
   } else {
     // if no executors do we want to qualify at all? maybe not, else we could look at
```
The warning message in the else block should be updated, since the if condition does not check for active executors.
In fact, should we throw an exception in the else block? That would indicate that no executors were found.
Yes, this should be separated out in the future. We had another issue to redo how the platform is created and figured we could do it under that one.
fixes #799
The qualification tool's cluster information output only looked at the number of nodes and executors at the end of the application. This is not correct when dynamic allocation is used or when nodes/executors are removed and re-added.
To fix this, I am changing the number of executors and the number of nodes to be high-water marks: the maximum number present at any point during the lifetime of the application. This represents the maximum resource usage.
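The high-water-mark idea above can be sketched by replaying executor add/remove events and tracking the peak of concurrently active executors. This is a minimal illustrative sketch, not the tool's actual implementation; the event type and names are assumptions:

```scala
// Sketch: compute the peak number of concurrently active executors
// from a sequence of add/remove events, instead of the end-of-app count.
object HighWaterMark {
  case class ExecEvent(execId: String, added: Boolean)

  def maxConcurrentExecutors(events: Seq[ExecEvent]): Int = {
    var active = Set.empty[String]
    var peak = 0
    events.foreach { e =>
      if (e.added) active += e.execId else active -= e.execId
      peak = math.max(peak, active.size) // high-water mark
    }
    peak
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      ExecEvent("1", added = true),
      ExecEvent("2", added = true),
      ExecEvent("1", added = false), // idle timeout removes executor 1
      ExecEvent("2", added = false),
      ExecEvent("3", added = true)   // executor re-added later
    )
    // Peak is 2 even though only 1 executor is alive at the end of the app,
    // which is exactly why the end-of-app count understates usage.
    assert(maxConcurrentExecutors(events) == 2)
    println(maxConcurrentExecutors(events))
  }
}
```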
Note this needs to be documented somewhere still.
The other change in this PR fixes a concurrency problem with the cluster information being stored in Platform. Previously there was only one instance, so if multiple applications were processed simultaneously, they could corrupt that information. This PR changes it to create a Platform instance per application.
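The per-app instance fix can be sketched as follows. This is a simplified illustration under assumed names (not the tool's actual Platform class): mutable cluster info lives inside each Platform, and a fresh instance is created per application so concurrent apps never share mutable state.

```scala
// Sketch of the concurrency fix: one Platform instance per application,
// so its mutable cluster info is never shared across concurrently
// processed apps.
class Platform(val name: String) {
  private var clusterInfo: Option[String] = None
  def configureClusterInfoFromEventLog(info: String): Unit = {
    clusterInfo = Some(info)
  }
  def getClusterInfo: Option[String] = clusterInfo
}

object PlatformFactory {
  // A new instance per app, instead of a single shared singleton.
  def createPlatform(name: String): Platform = new Platform(name)
}

object Demo {
  def main(args: Array[String]): Unit = {
    val p1 = PlatformFactory.createPlatform("dataproc")
    val p2 = PlatformFactory.createPlatform("dataproc")
    p1.configureClusterInfoFromEventLog("app-1 cluster info")
    // p2 is untouched: per-app state cannot be corrupted by app-1.
    assert(p1.getClusterInfo.contains("app-1 cluster info"))
    assert(p2.getClusterInfo.isEmpty)
    println("ok")
  }
}
```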
Update README on how to run a single test.
I added a new event log that has dynamic allocation enabled and goes through several cycles of executors being added, idle-timed out, and re-added.
The cluster information JSON file now looks like the below when dynamic allocation is enabled:
csv file: