
Cluster information should handle dynamic allocation and nodes being removed and added #1369

Open
tgravescs wants to merge 17 commits into dev

Conversation

@tgravescs (Collaborator) commented Oct 3, 2024

fixes #799

The qualification tool's cluster information output only reported the number of nodes and executors present at the end of the application. This is not correct when dynamic allocation is used or when nodes/executors are removed and re-added.

To fix this, the number of executors and number of nodes are now high water marks: the maximum number in use at any point during the lifetime of the application, i.e., the peak resource usage.
Note: this still needs to be documented somewhere.
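A minimal sketch of the high-water-mark idea (the class and method names here are hypothetical, not the actual tool code):

  // Hypothetical sketch: track the high water mark from executor
  // add/remove events. Names are illustrative only.
  class ExecutorHighWaterMark {
    private val active = scala.collection.mutable.Set[String]()
    private var maxSeen = 0

    def onExecutorAdded(execId: String): Unit = {
      active += execId
      maxSeen = math.max(maxSeen, active.size)
    }

    def onExecutorRemoved(execId: String): Unit = {
      active -= execId
    }

    // Reported as numExecutors: the most executors alive at any one time.
    def highWaterMark: Int = maxSeen
  }

For example, with exec-1..exec-4 added, all four removed on idle timeout, then exec-5..exec-7 re-added, this reports 4 rather than the 3 alive at the end.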

The other change in this PR fixes a concurrency problem with the cluster information being stored in the Platform. Previously there was only one Platform instance, so if multiple applications were processed simultaneously they could corrupt that information. This PR creates a Platform instance per application.
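A minimal sketch of the before/after, using a stand-in Platform type (not the actual spark-rapids-tools class):

  // Stand-in for the real Platform class; it holds mutable per-app state.
  class Platform(val name: String) {
    var clusterInfo: Option[String] = None
  }

  object PerAppPlatformDemo {
    def main(args: Array[String]): Unit = {
      val appIds = Seq("app-1", "app-2", "app-3")
      // Before: one shared Platform instance meant concurrently processed
      // apps could overwrite each other's clusterInfo.
      // After: each application gets its own fresh instance.
      appIds.foreach { appId =>
        val platform = new Platform("onprem")
        platform.clusterInfo = Some(s"cluster info for $appId")
      }
    }
  }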

This PR also updates the README on how to run a single test.

I added a new event log that has dynamic allocation enabled and goes through several cycles of executors being added, timing out when idle, and being re-added.
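For reference, these are the standard Spark properties involved; the values below mirror the example output (maxExecutors is left at its Int.MaxValue default, which is why 2147483647 shows up):

  import org.apache.spark.SparkConf

  object DynAllocConfExample {
    // Real Spark dynamic allocation settings; the idle timeout value is
    // shown for illustration (60s is also Spark's default).
    val conf: SparkConf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "0")
      .set("spark.dynamicAllocation.initialExecutors", "2")
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  }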

The cluster information JSON file now looks like this when dynamic allocation is enabled:

 "clusterInfo" : {
    "vendor" : "onprem",
    "coresPerExecutor" : 4,
    "numExecsPerNode" : -1,
    "numExecutors" : 7,
    "numWorkerNodes" : 5,
    "executorHeapMemory" : 20480,
    "dynamicAllocationEnabled" : true,
    "dynamicAllocationMaxExecutors" : "2147483647",
    "dynamicAllocationMinExecutors" : "0",
    "dynamicAllocationInitialExecutors" : "2",
    "driverHost" : "10.10.6.9"
  },
  "recommendedClusterInfo" : {
    "vendor" : "onprem",
    "coresPerExecutor" : 4,
    "numWorkerNodes" : -1,
    "numGpus" : 1,
    "numExecutors" : 7,
    "gpuDevice" : "l4",
    "dynamicAllocationEnabled" : true,
    "dynamicAllocationMaxExecutors" : "2147483647",
    "dynamicAllocationMinExecutors" : "0",
    "dynamicAllocationInitialExecutors" : "2",
    "workerNodeType" : "onprem"
  }

CSV file:

App ID,App Name,Event Log,Vendor,Driver Host,Cluster Id,Cluster Name,Worker Node Type,Driver Node Type,Num Worker Nodes,Num Executors Per Node,Num Executors,Executor Heap Memory,Dynamic Allocation Enabled,Dynamic Allocation Max Executors,Dynamic Allocation Min Executors,Dynamic Allocation Initial Executors,Cores Per Executor,Recommended Worker Node Type,Recommended Num Executors,Recommended Num Worker Nodes,Recommended Cores Per Executor,Recommended GPU Device,Recommended Num GPUs Per Node,Recommended Vendor,Recommended Dynamic Allocation Enabled,Recommended Dynamic Allocation Max Executors,Recommended Dynamic Allocation Min Executors,Recommended Dynamic Allocation Initial Executors
"application_1707709865217_0493","Spark shell","file:/home/tgraves/workspace/spark-rapids-tools2/user_tools/application_1707709865217_0493.zstd","onprem","10.10.6.9","","","","","5","-1","7","20480","true","2147483647","0","2","4","onprem","7","-1","4","l4","1","onprem","true","2147483647","0","2"

@tgravescs (Collaborator, Author) commented:

Found an issue with the Dataproc GPU recommendation; working on fixing it.

@parthosa (Collaborator) left a comment:


Thanks @tgravescs. Going over the PR.

QQ: For onprem, even if dynamic allocation is disabled, is the plan to disable recommending a cluster shape (since numWorkerNodes = -1)?

E.g., I tested it on a sample event log. The entry in rapids_4_spark_qualification_output_cluster_information.json contains the following:

  "recommendedClusterInfo" : {
    "vendor" : "onprem",
    "coresPerExecutor" : 16,
    "numWorkerNodes" : -1,
    "numGpus" : 1,
    "numExecutors" : 8,
    "gpuDevice" : "l4",
    "dynamicAllocationEnabled" : false,
    "dynamicAllocationMaxExecutors" : "N/A",
    "dynamicAllocationMinExecutors" : "N/A",
    "dynamicAllocationInitialExecutors" : "N/A",
    "workerNodeType" : "onprem"
  }

@tgravescs (Collaborator, Author) commented:

Yes. For any onprem platform we don't know what the node configuration is, so we don't recommend a number of nodes.

@parthosa (Collaborator) left a comment:


Thanks @tgravescs. Minor comment.

We duplicated the platform field to make it per-app, since it stores cluster information that can vary across apps.

At a high level, instead of having a Platform per app, should we have cluster info per app? That way an app and its ClusterInfo could be passed to the AutoTuner (see the sketch below).

The challenge would come when a user provides a custom worker info file.

This could be in a future refactor if needed.
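A rough sketch of what that refactor could look like (all names hypothetical, not an actual API):

  // Hypothetical shape of the suggested refactor: cluster info derived
  // per app and handed to the AutoTuner explicitly.
  case class ClusterInfo(
      numExecutors: Int,
      numWorkerNodes: Int,
      coresPerExecutor: Int)

  class AutoTuner {
    def recommend(appId: String, info: ClusterInfo): String =
      s"$appId: base recommendation on ${info.numExecutors} executors " +
        s"across ${info.numWorkerNodes} nodes"
  }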

        numExecsPerNode, activeHosts.toSet.size, sparkProperties, systemProperties)
      platform.configureClusterInfoFromEventLog(execCoreCounts.max,
        numExecsPerNode, maxNumExecutorsRunning, maxNumNodesRunning,
        sparkProperties, systemProperties)
    } else {
      // if no executors do we want to qualify at all? maybe not, else we could look at
@parthosa (Collaborator) commented on this diff:

The warning message in the else block should be updated, as the if condition does not check for active executors.

In fact, should we throw an exception in the else block? That would indicate that no executors were found.
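A sketch of the suggested fail-fast shape (names hypothetical, not the actual code):

  // Illustrative only: throw instead of logging a warning when the event
  // log contains no executor information.
  def configureClusterInfoOrFail(appId: String, execCoreCounts: Seq[Int]): Unit = {
    if (execCoreCounts.nonEmpty) {
      // configure cluster info from the event log, as in this PR
    } else {
      throw new IllegalStateException(
        s"No executors found in event log for application $appId")
    }
  }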

@tgravescs (Collaborator, Author) commented:

> At a high level, instead of having a Platform per app, should we have cluster info per app? That way an app and its ClusterInfo could be passed to the AutoTuner.

Yes, this should be separated out in the future. We have another issue open to redo how the platform is created and figured we could do it under that one.

Labels: feature request (New feature or request)
Successfully merging this pull request may close: [FEA] Handle dynamic allocation when determining the number of nodes (#799)