This is a guide for the RAPIDS tools for Apache Spark on AWS EMR. By the end of this guide, you will be able to run the RAPIDS tools to analyze clusters and the applications running on AWS EMR.
- Install the AWS CLI version 2. Follow the instructions on aws-cli-getting-started.
- Set the configuration settings and credentials of the AWS CLI by creating `credentials` and `config` files as described in aws-cli-configure-files.
- To run tools that require SSH on the EMR nodes:
  - make sure that you have SSH access to the cluster nodes; and
  - create a key pair using Amazon EC2 through the AWS CLI command `aws ec2 create-key-pair` as instructed in aws-cli-create-key-pairs (see the setup sketch after this list).
  - the private `.pem` file can then be passed to the RAPIDS tools CLI if needed.
- Spark event logs:
  - The RAPIDS tools can process Apache Spark CPU event logs from Spark 2.0 or higher (raw, .lz4, .lzf, .snappy, .zstd).
  - For `qualification`/`profiling` commands, the event logs need to be archived to an accessible S3 folder.
- Install `spark-rapids-user-tools` with Python 3.8 to 3.11 using one of:
  - pip: `pip install spark-rapids-user-tools`
  - wheel file: `pip install <wheel-file>`
  - from source: `pip install -e .`
- Verify the command is installed correctly by running `spark_rapids_user_tools emr -- --help`.
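A minimal end-to-end setup sketch, assuming the AWS CLI is already configured with credentials; the key name and file names below are placeholders:

```
# create an EC2 key pair and save the private key locally (key name is a placeholder)
aws ec2 create-key-pair --key-name rapids-tools-key \
    --query 'KeyMaterial' --output text > rapids-tools-key.pem
chmod 400 rapids-tools-key.pem

# install the wrapper and verify it is on the PATH
pip install spark-rapids-user-tools
spark_rapids_user_tools emr -- --help
```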
Before running any command, you can set environment variables to specify configurations.
- RAPIDS variables have the naming pattern `RAPIDS_USER_TOOLS_*`:
  - `RAPIDS_USER_TOOLS_CACHE_FOLDER`: specifies the location of a local directory that the RAPIDS-cli uses to store and cache the downloaded resources. The default is `/var/tmp/spark_rapids_user_tools_cache`. Note that caching the resources locally affects the total execution time of the command (a warm cache avoids re-downloading dependencies).
  - `RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY`: specifies the location of a local directory that the RAPIDS-cli uses to generate the output. The wrapper CLI arguments override this environment variable (`local_folder` for Qualification).
- For the AWS CLI, some environment variables can be set and picked up by the RAPIDS user tools, such as `AWS_PROFILE`, `AWS_DEFAULT_REGION`, `AWS_CONFIG_FILE`, and `AWS_SHARED_CREDENTIALS_FILE`. See the full list of variables in aws-cli-configure-envvars.
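For example, a minimal sketch of exporting these variables before invoking the wrapper; the folder paths, profile name, and region below are placeholders:

```
export RAPIDS_USER_TOOLS_CACHE_FOLDER=/var/tmp/spark_rapids_user_tools_cache
export RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY=./wrapper-output
export AWS_PROFILE=my-aws-profile
export AWS_DEFAULT_REGION=us-west-2
```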
```
spark_rapids_user_tools emr qualification [options]

spark_rapids_user_tools emr qualification -- --help
```
The local deployment runs on the local development machine. It requires:

- Installing and configuring the AWS CLI
- A Java 1.8+ development environment
- Internet access to download JAR dependencies from mvn: `spark-*.jar`, `hadoop-aws-*.jar`, and `aws-java-sdk-bundle*.jar`
- Dependencies are cached on the local disk to reduce the overhead of the download.
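A quick sketch to verify the local prerequisites before running the command; the expected versions in the comments are illustrative:

```
aws --version     # expect aws-cli/2.x
java -version     # expect version 1.8 or later
```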
Option | Description | Default | Required |
---|---|---|---|
cpu_cluster | The EMR cluster on which the Apache Spark applications were executed. Accepted values are an EMR cluster name, or a valid path to the cluster properties file (JSON format) generated by the AWS CLI command `emr describe-cluster` | N/A | N |
eventlogs | A comma-separated list pointing to event logs or an S3 directory | Reads the Spark property `spark.eventLog.dir` defined in `cpu_cluster`. This property should be included in the output of `emr describe-cluster`. Note that the wrapper will raise an exception if the property is not set | N |
remote_folder | The S3 folder to which the wrapper's output is copied. If missing, the output will be available only on local disk | N/A | N |
gpu_cluster | The EMR cluster to which the Spark applications are planned to be migrated. The argument can be an EMR cluster name or a valid path to the cluster's properties file (JSON format) generated by the AWS CLI command `emr describe-cluster` | The wrapper maps the EC2 machine instances of the original cluster into EC2 instances that support GPU acceleration | N |
local_folder | Local work-directory path to store the output and to be used as a root directory for temporary folders/files. The final output will go into a subdirectory named `qual-${EXEC_ID}` where `exec_id` is an auto-generated unique identifier of the execution | If the argument is NONE, the default value is the env variable `RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY` if any; or the current working directory | N |
jvm_heap_size | The maximum heap size of the JVM in gigabytes | 24 | N |
profile | A named AWS profile that you can specify to get the settings/credentials of the AWS account | "default" if the env variable `AWS_PROFILE` is not set | N |
tools_jar | Path to a bundled jar including the RAPIDS tool. The path is a local filesystem path, or a remote S3 URL | Downloads the latest `rapids-tools_*.jar` from the mvn repo | N |
filter_apps | Filtering criteria for the applications listed in the final STDOUT table: one of `ALL`, `SPEEDUPS`, or `SAVINGS`. `ALL` means no filter is applied. `SPEEDUPS` lists all the apps that are either "Recommended" or "Strongly Recommended" based on speedups. `SAVINGS` lists all the apps that have positive estimated GPU savings, except the apps that are "Not Applicable" | SAVINGS | N |
gpu_cluster_recommendation | The type of GPU cluster recommendation to generate: one of `CLUSTER`, `JOB`, or `MATCH`. `MATCH`: keep the GPU cluster at the same number of nodes as the CPU cluster; `CLUSTER`: recommend the optimal GPU cluster by cost for the entire cluster; `JOB`: recommend the optimal GPU cluster by cost per job | MATCH | N |
cpu_discount | A percent discount for the CPU cluster cost in the form of an integer value (e.g. 30 for a 30% discount) | N/A | N |
gpu_discount | A percent discount for the GPU cluster cost in the form of an integer value (e.g. 30 for a 30% discount) | N/A | N |
global_discount | A percent discount for both the CPU and GPU cluster costs in the form of an integer value (e.g. 30 for a 30% discount) | N/A | N |
verbose | True or False to enable verbosity of the wrapper script | False unless `RAPIDS_USER_TOOLS_LOG_DEBUG` is set | N |
rapids_options | A list of valid Qualification tool options. Note that the `output-directory` and `platform` flags are ignored, and that multiple "spark-property" arguments are not supported | N/A | N |
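As an illustration, a sketch that combines several of these options in one invocation; the cluster name, bucket, and discount value below are placeholders:

```
spark_rapids_user_tools emr qualification \
  --cpu_cluster my-emr-cpu-cluster \
  --eventlogs s3://LOGS_BUCKET/eventlogs/ \
  --filter_apps SPEEDUPS \
  --gpu_cluster_recommendation CLUSTER \
  --global_discount 20
```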
A typical workflow to successfully run the `qualification` command in local mode is as follows:

1. Store the Apache Spark event logs in an S3 folder.
2. Set up the development machine:
   - configure Java
   - install the AWS CLI and configure the profile and credentials so that the AWS CLI commands can access the S3 resources at `LOGS_BUCKET`
   - install `spark_rapids_user_tools`
3. If the results of the wrapper need to be stored on S3, then another S3 URI is required: `REMOTE_FOLDER=s3://OUT_BUCKET/`
4. Define the EMR cluster on which the Spark applications were running. Note that the cluster does not have to be active, but it has to be visible to the AWS CLI (i.e., `aws emr describe-cluster` can run against it).
5. The following script runs qualification by passing an AWS profile and an S3 remote directory to store the output (the user needs to define `LOGS_BUCKET`):

   ```
   # define the wrapper cache directory if necessary
   export RAPIDS_USER_TOOLS_CACHE_FOLDER=my_cache_folder
   export EVENTLOGS=s3://LOGS_BUCKET/eventlogs/
   export CLUSTER_NAME=my-emr-cpu-cluster
   export REMOTE_FOLDER=s3://OUT_BUCKET/wrapper_output
   export MY_AWS_PROFILE=my-aws-profile

   spark_rapids_user_tools emr qualification \
     --eventlogs $EVENTLOGS \
     --cpu_cluster $CLUSTER_NAME \
     --profile $MY_AWS_PROFILE \
     --remote_folder $REMOTE_FOLDER
   ```
The wrapper generates a unique ID for each execution in the format `qual_<YYYYmmddHHmmss>_<0x%08X>`. The above command will generate a directory containing `qualification_summary.csv` in addition to the actual folder of the RAPIDS Qualification tool. The directory will be mirrored to the S3 path (`REMOTE_FOLDER`).

```
./qual_<YYYYmmddHHmmss>_<0x%08X>/qualification_summary.csv
./qual_<YYYYmmddHHmmss>_<0x%08X>/rapids_4_spark_qualification_output/
```
For each app, the command output lists the following fields:

- `App ID`: An application is referenced by its application ID, `app-id`. When running on YARN, each application may have multiple attempts, but there are attempt IDs only for applications in cluster mode, not for applications in client mode. Applications in YARN cluster mode can be identified by their attempt-id.
- `App Name`: Name of the application.
- `Speedup Based Recommendation`: Recommendation based on "Estimated Speed-up Factor". Note that an application that has job or stage failures will be labeled "Not Applicable".
- `Savings Based Recommendation`: Recommendation based on "Estimated GPU Savings":
  - "Strongly Recommended": an app with savings of 40% or more
  - "Recommended": an app with savings between (1, 40)%
  - "Not Recommended": an app with no savings
  - "Not Applicable": an app that has job or stage failures
- `Estimated GPU Speedup`: Speed-up factor estimated for the app, calculated as the ratio between "App Duration" and "Estimated GPU Duration".
- `Estimated GPU Duration`: Predicted runtime of the app if it were run on GPU.
- `App Duration`: Wall-clock time measured from when the application starts until it completes. If an app is not completed, an estimated completion time is computed.
- `Estimated GPU Savings(%)`: Percentage of cost savings of the app if it migrates to an accelerated cluster, calculated as: `estimated_saving = 100 - ((100 * gpu_cost) / cpu_cost)`
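For example, if an app's estimated GPU cost is 6 units and its CPU cost is 10 units, then `estimated_saving = 100 - ((100 * 6) / 10) = 40`, so the app would fall into the "Strongly Recommended" savings bucket.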
The command creates a directory with a UUID that contains the following:

- The directory generated by the RAPIDS Qualification tool, `rapids_4_spark_qualification_output`
- A CSV file that contains the summary of all the applications along with estimated absolute costs
- Sample directory structure:

```
.
├── qualification_summary.csv
└── rapids_4_spark_qualification_output
    ├── rapids_4_spark_qualification_output.csv
    ├── rapids_4_spark_qualification_output.log
    ├── rapids_4_spark_qualification_output_execs.csv
    ├── rapids_4_spark_qualification_output_stages.csv
    └── ui
```
```
spark_rapids_user_tools emr profiling [options]

spark_rapids_user_tools emr profiling -- --help
```
The local deployment runs on the local development machine. It requires:

- Installing and configuring the AWS CLI
- A Java 1.8+ development environment
- Internet access to download JAR dependencies from mvn: `spark-*.jar`, `hadoop-aws-*.jar`, and `aws-java-sdk-bundle*.jar`
- Dependencies are cached on the local disk to reduce the overhead of the download.
Option | Description | Default | Required |
---|---|---|---|
gpu_cluster | The EMR cluster on which the Spark applications were executed. The argument can be an EMR cluster name or a valid path to the cluster's properties file (JSON format) generated by the AWS CLI command `aws emr describe-cluster` | If missing, then the argument `worker_info` may be provided | N |
worker_info | A path pointing to a YAML file containing the system information of a worker node. It is assumed that all workers are homogeneous. The format of the file is described in the following section | None | N |
eventlogs | A comma-separated list of S3 URLs pointing to event logs or an S3 directory | Reads the Spark property `spark.eventLog.dir` defined in `gpu_cluster`. This property should be included in the output of `emr describe-cluster`. Note that the wrapper will raise an exception if the property is not set | N |
remote_folder | The S3 folder to which the wrapper's output is copied. If missing, the output will be available only on local disk | N/A | N |
local_folder | Local work-directory path to store the output and to be used as a root directory for temporary folders/files. The final output will go into a subdirectory named `prof-${EXEC_ID}` where `exec_id` is an auto-generated unique identifier of the execution | If the argument is NONE, the default value is the env variable `RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY` if any; or the current working directory | N |
profile | A named AWS profile to get the settings/credentials of the AWS account | "DEFAULT" | N |
jvm_heap_size | The maximum heap size of the JVM in gigabytes | 24 | N |
tools_jar | Path to a bundled jar including the RAPIDS tool. The path is a local filesystem path, or a remote S3 URL | Downloads the latest `rapids-4-spark-tools_*.jar` from the mvn repo | N |
verbose | True or False to enable verbosity of the wrapper script | False unless `RAPIDS_USER_TOOLS_LOG_DEBUG` is set | N |
rapids_options | A list of valid Profiling tool options. Note that the `output-directory`, `auto-tuner`, and `combined` flags are ignored | N/A | N |
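For instance, a sketch that pins a specific tools jar and enlarges the JVM heap; the jar URL, YAML file name, and bucket below are placeholders:

```
spark_rapids_user_tools emr profiling \
  --eventlogs s3://LOGS_BUCKET/eventlogs/ \
  --worker_info worker-info.yaml \
  --tools_jar s3://DEPS_BUCKET/rapids-4-spark-tools_2.12.jar \
  --jvm_heap_size 32
```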
If the CLI does not provide the argument `gpu_cluster`, then a valid path to a YAML file can be provided through the argument `worker_info`.

The `worker_info` is a YAML file that contains the hardware description of the workers. It must contain the following properties:

- `system.numCores`: number of cores of a single worker node
- `system.memory`: RAM size in MiB of a single node
- `system.numWorkers`: number of workers
- `gpu.name`: the accelerator installed on the worker node
- `gpu.memory`: memory size of the accelerator in MiB (e.g., 16 GB for an NVIDIA T4)
- `softwareProperties`: Spark default configurations of the target cluster
An example of a valid `worker_info.yaml`:

```
system:
  numCores: 32
  memory: 212992MiB
  numWorkers: 5
gpu:
  memory: 15109MiB
  count: 4
  name: T4
softwareProperties:
  spark.driver.maxResultSize: 7680m
  spark.driver.memory: 15360m
  spark.executor.cores: '8'
  spark.executor.instances: '2'
  spark.executor.memory: 47222m
  spark.executorEnv.OPENBLAS_NUM_THREADS: '1'
  spark.scheduler.mode: FAIR
  spark.sql.cbo.enabled: 'true'
  spark.ui.port: '0'
  spark.yarn.am.memory: 640m
```
A typical workflow to successfully run the `profiling` command in local mode is as follows:

1. Store the Apache Spark event logs in an S3 folder.
2. Set up the development machine:
   - configure Java
   - install the AWS CLI and configure the profile and credentials so that the AWS CLI commands can access the S3 resources at `LOGS_BUCKET`
   - install `spark_rapids_user_tools`
3. If the results of the wrapper need to be stored on S3, then another S3 URI is required: `REMOTE_FOLDER=s3://OUT_BUCKET/`
4. Depending on the accessibility of the cluster properties, choose one of the two cases below ("Case-A" and "Case-B") to trigger the CLI.
For each successful execution, the wrapper generates a new directory in the format `prof_<YYYYmmddHHmmss>_<0x%08X>`. The directory contains `profiling_summary.log` in addition to the actual folder of the RAPIDS Profiling tool. The directory will be mirrored to the S3 folder if the argument `--remote_folder` was a valid S3 path.

```
./prof_<YYYYmmddHHmmss>_<0x%08X>/profiling_summary.log
./prof_<YYYYmmddHHmmss>_<0x%08X>/rapids_4_spark_profile/
```
Case-A: A GPU cluster property file is accessible:

Cluster properties are still accessible if one of the following conditions applies:

1. The cluster is listed by the `aws emr list-clusters` command. In this case, the CLI is triggered by providing `--gpu_cluster $CLUSTER_NAME`:

   ```
   # run the command using the GPU cluster name
   export RAPIDS_USER_TOOLS_CACHE_FOLDER=my_cache_folder
   export EVENTLOGS=s3://LOGS_BUCKET/eventlogs/
   export CLUSTER_NAME=my-emr-gpu-cluster
   export REMOTE_FOLDER=s3://OUT_BUCKET/wrapper_output

   spark_rapids_user_tools emr profiling \
     --eventlogs $EVENTLOGS \
     --gpu_cluster $CLUSTER_NAME \
     --remote_folder $REMOTE_FOLDER
   ```

2. The cluster properties file is accessible on local disk or at a valid S3 path:

   ```
   $> export CLUSTER_PROPS_FILE=cluster-props.json
   $> aws emr describe-cluster \
        --cluster-id $(aws emr list-clusters --query "Clusters[?Name=='$CLUSTER_NAME'].Id" --output text) \
        > $CLUSTER_PROPS_FILE
   ```

   Trigger the CLI by providing the path to the properties file `--gpu_cluster $CLUSTER_PROPS_FILE`:

   ```
   $> spark_rapids_user_tools emr profiling \
        --eventlogs $EVENTLOGS \
        --gpu_cluster $CLUSTER_PROPS_FILE \
        --remote_folder $REMOTE_FOLDER
   ```
Case-B: GPU cluster information is missing:

In this scenario, users can write a simple YAML file to describe the shape of the worker nodes. This case is relevant to the following plans:

- Users who might want to experiment with different configurations before deciding on the final cluster shape.
- Users who have no access to the properties of the cluster.

The CLI is triggered by providing the location where the YAML file is stored: `--worker_info $WORKER_INFO_PATH`
```
# First, create a yaml file as described in previous section
$> export WORKER_INFO_PATH=worker-info.yaml
# Run the profiling cmd
$> spark_rapids_user_tools emr profiling \
--eventlogs $EVENTLOGS \
--worker_info $WORKER_INFO_PATH \
--remote_folder $REMOTE_FOLDER
```
Note that if the user does not supply a cluster or worker properties file, the autotuner will still recommend tuning settings based on the job event log.
```
spark_rapids_user_tools emr diagnostic [options]

spark_rapids_user_tools emr diagnostic -- --help
```
Run the diagnostic command to collect information from an EMR cluster, such as the OS version, the number of worker nodes, the YARN configuration, the Spark version, and error logs. The cluster has to be running and the user must have SSH access.
Option | Description | Default | Required |
---|---|---|---|
cluster | Name of the EMR cluster running an accelerated computing instance | N/A | Y |
profile | A named AWS profile that you can specify to get the settings/credentials of the AWS account | "default" if the env variable `AWS_PROFILE` is not set | N |
output_folder | Path to a local directory where the final recommendations are logged | env variable `RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY` if any; or the current working directory | N |
key_pair_path | A '.pem' file path that enables connecting to the EC2 instances using SSH. For more details on creating key pairs, visit aws-create-key-pair-guide | env variable `RAPIDS_USER_TOOLS_KEY_PAIR_PATH` if any | N |
thread_num | Number of threads to access remote cluster nodes in parallel | 3 | N |
yes | Auto-confirm interactive questions | False | N |
verbose | True or False to enable verbosity of the wrapper script | False unless `RAPIDS_USER_TOOLS_LOG_DEBUG` is set | N |
By default, the command collects information from each cluster node via SSH and creates an archive in the output folder when it finishes.

The steps to run the command:

1. Create a cluster.
2. Create a key pair access ".pem" file as instructed in aws-create-key-pair-guide.
3. Run the following command:

   ```
   spark_rapids_user_tools emr diagnostic \
     --cluster my-cluster-name \
     --key_pair_path my-file-path
   ```
If `key_pair_path` is missing, the user must set the environment variable `RAPIDS_USER_TOOLS_KEY_PAIR_PATH`. If the connection to the EC2 instances cannot be established through SSH, the command will raise an error.
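For example, a sketch that supplies the key pair through the environment variable instead of the flag; the key path and cluster name below are placeholders:

```
export RAPIDS_USER_TOOLS_KEY_PAIR_PATH=~/keys/rapids-tools-key.pem
spark_rapids_user_tools emr diagnostic --cluster my-cluster-name
```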