
Cluster information should handle dynamic allocation and nodes being removed and added #1369

Open
tgravescs wants to merge 17 commits into dev

Conversation

@tgravescs (Collaborator) commented Oct 3, 2024

fixes #799

The qualification tool's cluster information output only reported the number of nodes and executors present at the end of the application. This is not correct when dynamic allocation is used or when nodes/executors are removed and re-added.

To fix this, the number of executors and number of nodes are now high water marks: the maximum number in use at any point during the lifetime of the application, i.e., the peak resource usage.
Note: this still needs to be documented somewhere.
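A minimal sketch of the high-water-mark idea (the class and method names here are hypothetical, not the actual tool code):

  // Hypothetical sketch: track the high water mark from executor
  // add/remove events. Names are illustrative only.
  class ExecutorHighWaterMark {
    private val active = scala.collection.mutable.Set[String]()
    private var maxSeen = 0

    def onExecutorAdded(execId: String): Unit = {
      active += execId
      maxSeen = math.max(maxSeen, active.size)
    }

    def onExecutorRemoved(execId: String): Unit = {
      active -= execId
    }

    // Reported as numExecutors: the most executors alive at any one time.
    def highWaterMark: Int = maxSeen
  }

For example, with exec-1..exec-4 added, all four removed on idle timeout, then exec-5..exec-7 re-added, this reports 4 rather than the 3 alive at the end.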

The other change in this PR fixes a concurrency problem with the cluster information being stored in the Platform. Previously there was only one Platform instance, so if multiple applications were processed simultaneously they could corrupt that information. This PR creates a Platform instance per application.
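A minimal sketch of the before/after, using a stand-in Platform type (not the actual spark-rapids-tools class):

  // Stand-in for the real Platform class; it holds mutable per-app state.
  class Platform(val name: String) {
    var clusterInfo: Option[String] = None
  }

  object PerAppPlatformDemo {
    def main(args: Array[String]): Unit = {
      val appIds = Seq("app-1", "app-2", "app-3")
      // Before: one shared Platform instance meant concurrently processed
      // apps could overwrite each other's clusterInfo.
      // After: each application gets its own fresh instance.
      appIds.foreach { appId =>
        val platform = new Platform("onprem")
        platform.clusterInfo = Some(s"cluster info for $appId")
      }
    }
  }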

This PR also updates the README on how to run a single test.

I added a new event log that has dynamic allocation enabled and goes through several cycles of executors being added, timing out when idle, and being re-added.
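For reference, these are the standard Spark properties involved; the values below mirror the example output (maxExecutors is left at its Int.MaxValue default, which is why 2147483647 shows up):

  import org.apache.spark.SparkConf

  object DynAllocConfExample {
    // Real Spark dynamic allocation settings; the idle timeout value is
    // shown for illustration (60s is also Spark's default).
    val conf: SparkConf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "0")
      .set("spark.dynamicAllocation.initialExecutors", "2")
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  }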

The cluster information JSON file now looks like this when dynamic allocation is enabled:

 "clusterInfo" : {
    "vendor" : "onprem",
    "coresPerExecutor" : 4,
    "numExecsPerNode" : -1,
    "numExecutors" : 7,
    "numWorkerNodes" : 5,
    "executorHeapMemory" : 20480,
    "dynamicAllocationEnabled" : true,
    "dynamicAllocationMaxExecutors" : "2147483647",
    "dynamicAllocationMinExecutors" : "0",
    "dynamicAllocationInitialExecutors" : "2",
    "driverHost" : "10.10.6.9"
  },
  "recommendedClusterInfo" : {
    "vendor" : "onprem",
    "coresPerExecutor" : 4,
    "numWorkerNodes" : -1,
    "numGpus" : 1,
    "numExecutors" : 7,
    "gpuDevice" : "l4",
    "dynamicAllocationEnabled" : true,
    "dynamicAllocationMaxExecutors" : "2147483647",
    "dynamicAllocationMinExecutors" : "0",
    "dynamicAllocationInitialExecutors" : "2",
    "workerNodeType" : "onprem"
  }

CSV file:

App ID,App Name,Event Log,Vendor,Driver Host,Cluster Id,Cluster Name,Worker Node Type,Driver Node Type,Num Worker Nodes,Num Executors Per Node,Num Executors,Executor Heap Memory,Dynamic Allocation Enabled,Dynamic Allocation Max Executors,Dynamic Allocation Min Executors,Dynamic Allocation Initial Executors,Cores Per Executor,Recommended Worker Node Type,Recommended Num Executors,Recommended Num Worker Nodes,Recommended Cores Per Executor,Recommended GPU Device,Recommended Num GPUs Per Node,Recommended Vendor,Recommended Dynamic Allocation Enabled,Recommended Dynamic Allocation Max Executors,Recommended Dynamic Allocation Min Executors,Recommended Dynamic Allocation Initial Executors
"application_1707709865217_0493","Spark shell","file:/home/tgraves/workspace/spark-rapids-tools2/user_tools/application_1707709865217_0493.zstd","onprem","10.10.6.9","","","","","5","-1","7","20480","true","2147483647","0","2","4","onprem","7","-1","4","l4","1","onprem","true","2147483647","0","2"

@tgravescs (Collaborator, Author) commented:

Found an issue with the Dataproc GPU recommendation; working on fixing it.

@parthosa (Collaborator) left a comment:


Thanks @tgravescs. Going over the PR.

QQ: For onprem, even if dynamic allocation is disabled, is the plan to disable recommending a cluster shape (since numWorkerNodes = -1)?

E.g., I tested it on a sample event log. The entry in rapids_4_spark_qualification_output_cluster_information.json contains the following:

  "recommendedClusterInfo" : {
    "vendor" : "onprem",
    "coresPerExecutor" : 16,
    "numWorkerNodes" : -1,
    "numGpus" : 1,
    "numExecutors" : 8,
    "gpuDevice" : "l4",
    "dynamicAllocationEnabled" : false,
    "dynamicAllocationMaxExecutors" : "N/A",
    "dynamicAllocationMinExecutors" : "N/A",
    "dynamicAllocationInitialExecutors" : "N/A",
    "workerNodeType" : "onprem"
  }

@tgravescs (Collaborator, Author) commented:

Yes. For any onprem platform we don't know what the node configuration is, so we don't recommend a number of nodes.

@parthosa (Collaborator) left a comment:


Thanks @tgravescs. Minor comment.

We duplicated the platform field to make it per-app, since it stores cluster information that can vary across apps.

At a high level, instead of having a Platform per app, should we have cluster info per app? That way an app and its ClusterInfo could be passed to the AutoTuner (see the sketch below).

The challenge would come when a user provides a custom worker info file.

This could be in a future refactor if needed.
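A rough sketch of what that refactor could look like (all names hypothetical, not an actual API):

  // Hypothetical shape of the suggested refactor: cluster info derived
  // per app and handed to the AutoTuner explicitly.
  case class ClusterInfo(
      numExecutors: Int,
      numWorkerNodes: Int,
      coresPerExecutor: Int)

  class AutoTuner {
    def recommend(appId: String, info: ClusterInfo): String =
      s"$appId: base recommendation on ${info.numExecutors} executors " +
        s"across ${info.numWorkerNodes} nodes"
  }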

        numExecsPerNode, activeHosts.toSet.size, sparkProperties, systemProperties)
      platform.configureClusterInfoFromEventLog(execCoreCounts.max,
        numExecsPerNode, maxNumExecutorsRunning, maxNumNodesRunning,
        sparkProperties, systemProperties)
    } else {
      // if no executors do we want to qualify at all? maybe not, else we could look at
@parthosa (Collaborator) commented on this diff:

The warning message in the else block should be updated, as the if condition does not check for active executors.

In fact, should we throw an exception in the else block? That would indicate that no executors were found.
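A sketch of the suggested fail-fast shape (names hypothetical, not the actual code):

  // Illustrative only: throw instead of logging a warning when the event
  // log contains no executor information.
  def configureClusterInfoOrFail(appId: String, execCoreCounts: Seq[Int]): Unit = {
    if (execCoreCounts.nonEmpty) {
      // configure cluster info from the event log, as in this PR
    } else {
      throw new IllegalStateException(
        s"No executors found in event log for application $appId")
    }
  }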

@tgravescs (Collaborator, Author) commented:

> At a high level, instead of having a Platform per app, should we have cluster info per app? That way an app and its ClusterInfo could be passed to the AutoTuner.

Yes, this should be separated out in the future. We have another issue open to redo how the platform is created and figured we could do it under that one.

Labels: feature request (New feature or request)
Successfully merging this pull request may close: [FEA] Handle dynamic allocation when determining the number of nodes (#799)