When creating a Google Cloud Dataproc cluster, you can specify initialization actions, in the form of executables or scripts, that Cloud Dataproc will run on all nodes in the cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.
Initialization actions are stored in a Google Cloud Storage bucket and can be passed as a parameter to the gcloud command or the clusters.create API when creating a Cloud Dataproc cluster. For example, to specify an initialization action when creating a cluster with the gcloud command, you can run:
gcloud dataproc clusters create <CLUSTER_NAME> \
[--initialization-actions [GCS_URI,...]] \
[--initialization-action-timeout TIMEOUT]
Before creating clusters, you need to copy initialization actions to your own GCS bucket. For example:
MY_BUCKET=<gcs-bucket>
gsutil cp presto/presto.sh gs://$MY_BUCKET/
gcloud dataproc clusters create my-presto-cluster \
--initialization-actions gs://$MY_BUCKET/presto.sh
Keeping your own copy lets you decide when to sync it with changes made to the initialization action in the GitHub repository. It is also useful if you want to modify initialization actions to fit your needs.
These samples show how various packages and components can be installed on Cloud Dataproc clusters. You should understand how they work before running them on your clusters. The initialization actions in this repository are provided without support, and you use them at your own risk.
This repository presently offers the following actions for use with Cloud Dataproc clusters.
- Install additional Apache Hadoop ecosystem components
- Improve data science and interactive experiences
- Configure the environment
- Configure a nice shell environment
- To switch to Python 3, use the conda initialization action
- Connect to Google Cloud Platform services
- Install alternate versions of the Cloud Storage and BigQuery connectors. Specific versions of these connectors come pre-installed on Cloud Dataproc clusters.
- Share a Google Cloud SQL Hive Metastore, or simply read/write data from Cloud SQL.
- Set up monitoring
Single Node clusters have dataproc-role set to Master and dataproc-worker-count set to 0. Most of the initialization actions in this repository should work out of the box, as they run only on the master. Examples include notebooks (such as Apache Zeppelin) and libraries (such as Apache Tez). Actions that run on all nodes of the cluster (such as cloud-sql-proxy) similarly work out of the box.
Some initialization actions are known not to work on Single Node clusters. All of these expect to have daemons on multiple nodes.
- Apache Drill
- Apache Flink
- Apache Kafka
- Apache Zookeeper
Feel free to send pull requests or file issues if you have a good use case for running one of these actions on a Single Node cluster.
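An initialization action that needs daemons on multiple nodes could detect a Single Node cluster from the two metadata values described above and skip or adjust its setup. The following is a minimal sketch, not part of this repository: the METADATA_CMD override and the is_single_node function are hypothetical names introduced for illustration, and the echo statements stand in for real work.

```shell
#!/usr/bin/env bash
set -eu

# Metadata helper used by Dataproc initialization actions. METADATA_CMD is a
# hypothetical override so the sketch can be dry-run away from a cluster node.
METADATA_CMD="${METADATA_CMD:-/usr/share/google/get_metadata_value}"

# Succeed (exit 0) when the given role and worker count describe a
# Single Node cluster: dataproc-role=Master and dataproc-worker-count=0.
is_single_node() {
  [ "$1" = "Master" ] && [ "$2" = "0" ]
}

# Fall back to "unknown" when the helper is missing or the lookup fails.
role="$("${METADATA_CMD}" attributes/dataproc-role 2>/dev/null || echo unknown)"
workers="$("${METADATA_CMD}" attributes/dataproc-worker-count 2>/dev/null || echo unknown)"

if is_single_node "${role}" "${workers}"; then
  echo "Single Node cluster detected; skipping multi-node daemon setup"
else
  echo "Continuing with regular setup"
fi
```

Whether to skip, fail, or fall back to a single-daemon configuration depends on the component being installed, so treat the branch bodies as placeholders.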
Cloud Dataproc sets special metadata values for the instances that run in your cluster. You can use these values to customize the behavior of initialization actions, for example:
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
... master specific actions ...
else
... worker specific actions ...
fi
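The placeholder branches above can be fleshed out into a script that also runs safely when the metadata helper is unavailable. This is a sketch under assumptions: METADATA_CMD, get_role, and run_for_role are hypothetical names, and the echo statements mark where master-specific or worker-specific commands would go.

```shell
#!/usr/bin/env bash
set -eu

# Helper path taken from the snippet above. METADATA_CMD is a hypothetical
# override so the script can be dry-run on a machine that is not a cluster node.
METADATA_CMD="${METADATA_CMD:-/usr/share/google/get_metadata_value}"

# Print the value of dataproc-role, or "unknown" when the lookup fails.
get_role() {
  "${METADATA_CMD}" attributes/dataproc-role 2>/dev/null || echo "unknown"
}

# Choose a branch for the given role (placeholder echo statements only).
run_for_role() {
  case "$1" in
    Master)     echo "running master-specific actions" ;;
    unknown|"") echo "role unknown; skipping role-specific actions" ;;
    *)          echo "running worker-specific actions" ;;
  esac
}

run_for_role "$(get_role)"
```

Treating an unknown or empty role as its own case avoids running worker steps by accident when the lookup fails.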
You can also use the --metadata flag of the gcloud dataproc clusters create command to provide your own custom metadata:
gcloud dataproc clusters create cluster-name \
--initialization-actions ... \
--metadata name1=value1,name2=value2... \
... other flags ...
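Inside an initialization action, custom metadata can be read with the same helper used for dataproc-role. The sketch below assumes that custom keys appear under attributes/ like the built-in ones; the get_metadata_or_default function, the METADATA_CMD override, and the fallback-value default are hypothetical and only illustrate one way to tolerate a key that was not passed at cluster creation.

```shell
#!/usr/bin/env bash
set -eu

# Metadata helper from the earlier snippet. METADATA_CMD is a hypothetical
# override that allows a dry run off-cluster.
METADATA_CMD="${METADATA_CMD:-/usr/share/google/get_metadata_value}"

# Print attributes/<key>, or <default> when the key is unset, empty,
# or the helper cannot be run.
get_metadata_or_default() {
  local key="$1"
  local default="$2"
  local value
  value="$("${METADATA_CMD}" "attributes/${key}" 2>/dev/null || true)"
  if [ -n "${value}" ]; then
    echo "${value}"
  else
    echo "${default}"
  fi
}

# name1 matches the key passed with --metadata in the command above.
NAME1="$(get_metadata_or_default name1 fallback-value)"
echo "name1=${NAME1}"
```

Supplying a default keeps the action usable on clusters created without the --metadata flag.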
For more information, review the Cloud Dataproc documentation. You can also pose questions to the Stack Overflow community with the tag google-cloud-dataproc.
See our other Google Cloud Platform GitHub repos for sample applications and scaffolding for other frameworks and use cases.
Subscribe to cloud-dataproc-discuss@google.com for announcements and discussion.
- See CONTRIBUTING.md
- See LICENSE