This Terraform template provides an end-to-end demonstration of how to connect from Azure Databricks to an HDInsight HBase cluster using the `hbase-spark` connector. In doing so it takes care of the following caveats:
- The Hortonworks `shc` connector is broken on Databricks, see this issue.
- `hbase-spark` and `shc` have subtle but important differences in package and data source names. Correct usage can be seen in this example published by Cloudera.
- Databricks and HDInsight HBase must be provisioned in the same VNET.
- Authentication to HBase is done via the config file `hbase-site.xml`. This file exists on the HDInsight head node and is copied to the attached Blob Storage. That Blob Storage container is also mounted to Databricks, i.e. the config file becomes available on all Databricks cluster nodes at `/dbfs/mnt/hdi/hbase-site.xml`.
- The Databricks cluster must be provisioned with a Scala 2.11 runtime, e.g. Runtime 6.6. Runtimes with Scala 2.12 won't work yet.
- The following three libraries must be attached to the cluster. Note the extra two in addition to `hbase-spark`:
  - `org.apache.hbase.connectors.spark:hbase-spark:1.0.0`
  - `org.apache.hbase:hbase-common:2.3.1`
  - `org.apache.hbase:hbase-server:2.3.1`
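The naming differences mentioned above are easy to trip over. As a minimal sketch, reading with `hbase-spark` looks roughly like this (the column family and qualifier names here are illustrative, not taken from this template):

```scala
// The hbase-spark connector's data source name is "org.apache.hadoop.hbase.spark".
// The shc connector instead used "org.apache.spark.sql.execution.datasources.hbase"
// together with a JSON catalog; the two are NOT interchangeable.
val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  // hbase-spark maps columns with a schema string rather than an shc catalog:
  // "<dfColumn> <TYPE> <family>:<qualifier>", with ":key" for the row key.
  .option("hbase.columns.mapping",
    "rowkey STRING :key, name STRING Personal:Name, phone STRING Office:Phone")
  .option("hbase.table", "Contacts")
  .load()
```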
The template provisions the following Azure resources:
- Virtual Network
- Blob Storage
- Azure Databricks Workspace
- HDInsight HBase cluster
Note: The HBase cluster is provisioned with the cheapest possible VMs for the Head, Region, and Zookeeper nodes. It will still cost you ~$550/month in the West Europe region.
Once `terraform apply` has succeeded, navigate to the Databricks workspace and run the notebook `/Shared/TestHBase.scala`. This notebook connects to the HBase cluster and loads the `Contacts` table into a `DataFrame`. The table was populated in HBase as part of the Terraform provisioning.
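A rough sketch of what such a notebook does, assuming the mounted `hbase-site.xml` described above (the exact column mapping in the actual notebook may differ):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// Load the cluster's HBase configuration; the file is visible on every
// Databricks node through the DBFS mount of the HDInsight Blob Storage.
val conf = HBaseConfiguration.create()
conf.addResource(new Path("file:///dbfs/mnt/hdi/hbase-site.xml"))

// Constructing an HBaseContext registers it as the default context
// that the hbase-spark data source will pick up.
new HBaseContext(spark.sparkContext, conf)

val contacts = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping",
    "rowkey STRING :key, name STRING Personal:Name, phone STRING Office:Phone")
  .option("hbase.table", "Contacts")
  .load()

contacts.show()
```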