Databricks on GCP data exfiltration protection workspace deployment #172

Open
wants to merge 8 commits into
base: main
88 changes: 88 additions & 0 deletions examples/gcp-with-psc-exfiltration-protection/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
# Provisioning a Databricks on GCP workspace with a Hub & Spoke network architecture for data exfiltration protection

This example uses the [gcp-with-psc-exfiltration-protection](../../modules/gcp-with-psc-exfiltration-protection) module.

This template provides an example deployment of Hub-Spoke networking with an egress firewall that controls all outbound traffic from Databricks subnets.

With this setup, you can define firewall rules to block or allow egress traffic from your Databricks clusters. You can also use the firewall to block all access to storage accounts and use a private endpoint connection to bypass the firewall, so that access is allowed only to specific storage accounts.
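
As a rough illustration of the kind of egress rules described above, the sketch below pairs a low-priority deny-all egress rule with a higher-priority allow rule for a single destination. It is not taken from the module: the network name, project ID, IP address, rule names, and priorities are placeholder assumptions, and the port assumes the default Hive Metastore is reached on MySQL's port 3306.

```hcl
# Placeholder values for illustration only; none of these come from the module.
locals {
  spoke_network     = "example-spoke-vpc"     # hypothetical VPC name
  spoke_project     = "example-spoke-project" # hypothetical project ID
  hive_metastore_ip = "203.0.113.10"          # documentation-range address, not a real metastore
}

# Deny all egress by default. In GCP, a higher priority number means lower precedence.
resource "google_compute_firewall" "deny_all_egress" {
  name               = "example-deny-all-egress"
  project            = local.spoke_project
  network            = local.spoke_network
  direction          = "EGRESS"
  priority           = 1100
  destination_ranges = ["0.0.0.0/0"]

  deny {
    protocol = "all"
  }
}

# Allow egress to one known destination; this rule wins because 1000 < 1100.
resource "google_compute_firewall" "allow_hive_metastore" {
  name               = "example-allow-hive-metastore"
  project            = local.spoke_project
  network            = local.spoke_network
  direction          = "EGRESS"
  priority           = 1000
  destination_ranges = ["${local.hive_metastore_ip}/32"]

  allow {
    protocol = "tcp"
    ports    = ["3306"] # assumed MySQL port for the metastore
  }
}
```

The deny rule acts as the default, and each permitted destination gets its own allow rule with a smaller priority number.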


To find the IP addresses and FQDNs for your deployment, see https://docs.gcp.databricks.com/en/resources/ip-domain-region.html

## Overall Architecture

![Architecture diagram](../../modules/gcp-with-psc-exfiltration-protection/images/architecture.png)

Resources to be created:
* Hub VPC and its subnet
* Spoke VPC and its subnets
* Peering between Hub and Spoke VPC
* Private Service Connect (PSC) endpoints
* DNS private and peering zones
* Firewall rules for Hub and Spoke VPCs
* Databricks workspace with private link to control plane, user to webapp and private link to DBFS




## How to use

1. Reference this module using one of the different [module source types](https://developer.hashicorp.com/terraform/language/modules/sources).
2. Add a `terraform.tfvars` file with values for the input variables listed in the Inputs section.
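
A filled-in `terraform.tfvars` might look like the sketch below. Every value is an illustrative placeholder: the account ID, project IDs, region, prefix, and CIDR ranges are assumptions, and the metastore IP is a documentation-range address rather than a real one. Replace them with values for your own environment.

```hcl
# Illustrative placeholders only; substitute values for your own environment.
databricks_account_id = "00000000-0000-0000-0000-000000000000"

google_region = "us-central1"

workspace_google_project = "example-workspace-project"

spoke_vpc_google_project = "example-spoke-project"
hub_vpc_google_project   = "example-hub-project"
is_spoke_vpc_shared      = true

prefix = "dbx-demo"

# Look up the real regional value in the Databricks documentation.
hive_metastore_ip = "203.0.113.10"

# Ranges chosen so that they do not overlap; adjust to your IP plan.
hub_vpc_cidr          = "10.0.0.0/24"
spoke_vpc_cidr        = "10.1.0.0/20"
psc_subnet_cidr       = "10.2.0.0/28"
gke_master_ip_range   = "10.3.0.0/28"
pod_ip_cidr_range     = "10.4.0.0/16"
service_ip_cidr_range = "10.5.0.0/20"
```

The optional `tags` variable is omitted here, so its default of `{}` applies.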

## How to fill in variable values

Required variables have no default values, in order to avoid misconfiguration.

Most of the values are related to resources managed by Databricks. The values to use can be found at: https://docs.gcp.databricks.com/en/resources/ip-domain-region.html
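
Because the variables have no defaults, a malformed value only surfaces when the underlying resources are planned or applied. One optional way to catch mistakes earlier is a `validation` block. The sketch below is a suggestion, not part of this example's `variables.tf`; it uses Terraform's built-in `can` and `cidrhost` functions to reject strings that are not valid CIDR blocks.

```hcl
# Hypothetical variant of the hub_vpc_cidr variable with input validation added.
variable "hub_vpc_cidr" {
  type        = string
  description = "CIDR for Hub VPC"

  validation {
    # cidrhost() fails on anything that is not a valid CIDR prefix; can() turns that failure into false.
    condition     = can(cidrhost(var.hub_vpc_cidr, 0))
    error_message = "The hub_vpc_cidr value must be a valid CIDR block, for example 10.0.0.0/24."
  }
}
```

The same pattern could be applied to the other CIDR and IP range inputs.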

<!-- BEGIN_TF_DOCS -->
## Requirements

| Name | Version |
|------------------------------------------------------------------|---------|
| <a name="requirement_google"></a> [google](#requirement\_google) | 6.17.0 |

## Providers

| Name | Version |
|------------------------------------------------------------------------|---------|
| <a name="provider_databricks"></a> [databricks](#provider\_databricks) | n/a |
| <a name="provider_google"></a> [google](#provider\_google) | n/a |
| <a name="provider_random"></a> [random](#provider\_random) | n/a |

## Modules

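| Name | Source | Version |
|------|--------|---------|
| <a name="module_gcp_with_data_exfiltration_protection"></a> [gcp\_with\_data\_exfiltration\_protection](#module\_gcp\_with\_data\_exfiltration\_protection) | ../../modules/gcp-with-psc-exfiltration-protection | n/a |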

## Resources

No resources.

## Inputs

| Name | Description | Type | Default | Required |
|------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|---------------|---------|:--------:|
| <a name="input_databricks_account_id"></a> [databricks\_account\_id](#input\_databricks\_account\_id) | Databricks Account ID | `string` | n/a | yes |
| <a name="input_gke_master_ip_range"></a> [gke\_master\_ip\_range](#input\_gke\_master\_ip\_range) | IP Range for GKE Master subnet | `string` | n/a | yes |
| <a name="input_google_region"></a> [google\_region](#input\_google\_region) | Google Cloud region where the resources will be created | `string` | n/a | yes |
| <a name="input_hive_metastore_ip"></a> [hive\_metastore\_ip](#input\_hive\_metastore\_ip) | Value of regional default Hive Metastore IP | `string` | n/a | yes |
| <a name="input_hub_vpc_cidr"></a> [hub\_vpc\_cidr](#input\_hub\_vpc\_cidr) | CIDR for Hub VPC | `string` | n/a | yes |
| <a name="input_hub_vpc_google_project"></a> [hub\_vpc\_google\_project](#input\_hub\_vpc\_google\_project) | Google Cloud project ID related to Hub VPC | `string` | n/a | yes |
| <a name="input_is_spoke_vpc_shared"></a> [is\_spoke\_vpc\_shared](#input\_is\_spoke\_vpc\_shared) | Whether the Spoke VPC is a Shared or a dedicated VPC | `bool` | n/a | yes |
| <a name="input_pod_ip_cidr_range"></a> [pod\_ip\_cidr\_range](#input\_pod\_ip\_cidr\_range) | IP Range for Pods subnet (secondary) | `string` | n/a | yes |
| <a name="input_prefix"></a> [prefix](#input\_prefix) | Prefix to use in generated resources name | `string` | n/a | yes |
| <a name="input_psc_subnet_cidr"></a> [psc\_subnet\_cidr](#input\_psc\_subnet\_cidr) | CIDR for the PSC subnet | `string` | n/a | yes |
| <a name="input_service_ip_cidr_range"></a> [service\_ip\_cidr\_range](#input\_service\_ip\_cidr\_range) | IP Range for Services subnet (secondary) | `string` | n/a | yes |
| <a name="input_spoke_vpc_cidr"></a> [spoke\_vpc\_cidr](#input\_spoke\_vpc\_cidr) | CIDR for Spoke VPC | `string` | n/a | yes |
| <a name="input_spoke_vpc_google_project"></a> [spoke\_vpc\_google\_project](#input\_spoke\_vpc\_google\_project) | Google Cloud project ID related to Spoke VPC | `string` | n/a | yes |
| <a name="input_tags"></a> [tags](#input\_tags) | Map of tags to add to all resources | `map(string)` | `{}` | no |
| <a name="input_workspace_google_project"></a> [workspace\_google\_project](#input\_workspace\_google\_project) | Google Cloud project ID related to Databricks workspace | `string` | n/a | yes |

## Outputs

| Name | Description |
|-------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| <a name="output_workspace_id"></a> [workspace\_id](#output\_workspace\_id) | The Databricks workspace ID |
| <a name="output_workspace_url"></a> [workspace\_url](#output\_workspace\_url) | The workspace URL which is of the format '{workspaceId}.{random}.gcp.databricks.com' |
<!-- END_TF_DOCS -->
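
The outputs can feed follow-up configuration. As a minimal sketch, a workspace-level provider could be pointed at the new workspace as shown below. The `workspace` alias is a hypothetical name, the fragment relies on the module block from this example's `main.tf`, and it assumes `workspace_url` is usable as a provider `host` value as-is (add an `https://` prefix if the module returns a bare hostname).

```hcl
# Hypothetical workspace-level provider, separate from the account-level one in providers.tf.
provider "databricks" {
  alias = "workspace"
  host  = module.gcp_with_data_exfiltration_protection.workspace_url
}
```

Resources that manage objects inside the workspace would then set `provider = databricks.workspace`. Deriving provider configuration from resources created in the same run can require a two-step apply, so treat this as a starting point.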
19 changes: 19 additions & 0 deletions examples/gcp-with-psc-exfiltration-protection/main.tf
@@ -0,0 +1,19 @@
module "gcp_with_data_exfiltration_protection" {
  source = "../../modules/gcp-with-psc-exfiltration-protection"

  databricks_account_id    = var.databricks_account_id
  hub_vpc_google_project   = var.hub_vpc_google_project
  is_spoke_vpc_shared      = var.is_spoke_vpc_shared
  prefix                   = var.prefix
  spoke_vpc_google_project = var.spoke_vpc_google_project
  workspace_google_project = var.workspace_google_project
  gke_master_ip_range      = var.gke_master_ip_range
  google_region            = var.google_region
  hive_metastore_ip        = var.hive_metastore_ip
  hub_vpc_cidr             = var.hub_vpc_cidr
  pod_ip_cidr_range        = var.pod_ip_cidr_range
  psc_subnet_cidr          = var.psc_subnet_cidr
  service_ip_cidr_range    = var.service_ip_cidr_range
  spoke_vpc_cidr           = var.spoke_vpc_cidr
  tags                     = var.tags
}
10 changes: 10 additions & 0 deletions examples/gcp-with-psc-exfiltration-protection/outputs.tf
@@ -0,0 +1,10 @@

output "workspace_url" {
  value       = module.gcp_with_data_exfiltration_protection.workspace_url
  description = "The workspace URL which is of the format '{workspaceId}.{random}.gcp.databricks.com'"
}

output "workspace_id" {
  description = "The Databricks workspace ID"
  value       = module.gcp_with_data_exfiltration_protection.workspace_id
}
7 changes: 7 additions & 0 deletions examples/gcp-with-psc-exfiltration-protection/providers.tf
@@ -0,0 +1,7 @@
provider "databricks" {
  host       = "https://accounts.gcp.databricks.com"
  account_id = var.databricks_account_id
}

provider "google" {
}
14 changes: 14 additions & 0 deletions examples/gcp-with-psc-exfiltration-protection/terraform.tf
@@ -0,0 +1,14 @@
terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
    google = {
      source  = "hashicorp/google"
      version = "6.17.0"
    }
    random = {
      source = "hashicorp/random"
    }
  }
}
19 changes: 19 additions & 0 deletions examples/gcp-with-psc-exfiltration-protection/terraform.tfvars
@@ -0,0 +1,19 @@
databricks_account_id = ""

google_region = ""

workspace_google_project = ""

spoke_vpc_google_project = ""
hub_vpc_google_project = ""
is_spoke_vpc_shared = true

prefix = ""

hive_metastore_ip = ""
hub_vpc_cidr = ""
spoke_vpc_cidr = ""
psc_subnet_cidr = ""
gke_master_ip_range = ""
pod_ip_cidr_range = ""
service_ip_cidr_range = ""
79 changes: 79 additions & 0 deletions examples/gcp-with-psc-exfiltration-protection/variables.tf
@@ -0,0 +1,79 @@
variable "databricks_account_id" {
  type        = string
  description = "Databricks Account ID"
}

variable "google_region" {
  type        = string
  description = "Google Cloud region where the resources will be created"
}

variable "workspace_google_project" {
  type        = string
  description = "Google Cloud project ID related to Databricks workspace"
}

variable "spoke_vpc_google_project" {
  type        = string
  description = "Google Cloud project ID related to Spoke VPC"
}

variable "hub_vpc_google_project" {
  type        = string
  description = "Google Cloud project ID related to Hub VPC"
}

variable "is_spoke_vpc_shared" {
  type        = bool
  description = "Whether the Spoke VPC is a Shared or a dedicated VPC"
}

variable "prefix" {
  type        = string
  description = "Prefix to use in generated resources name"
}

# For the value of the regional Hive Metastore IP, refer to the Databricks documentation
# Here - https://docs.gcp.databricks.com/en/resources/ip-domain-region.html#addresses-for-default-metastore
variable "hive_metastore_ip" {
  type        = string
  description = "Value of regional default Hive Metastore IP"
}

variable "hub_vpc_cidr" {
  type        = string
  description = "CIDR for Hub VPC"
}

variable "spoke_vpc_cidr" {
  type        = string
  description = "CIDR for Spoke VPC"
}

variable "psc_subnet_cidr" {
  type        = string
  description = "CIDR for the PSC subnet"
}

variable "gke_master_ip_range" {
  type        = string
  description = "IP Range for GKE Master subnet"
}

variable "pod_ip_cidr_range" {
  type        = string
  description = "IP Range for Pods subnet (secondary)"
}

variable "service_ip_cidr_range" {
  type        = string
  description = "IP Range for Services subnet (secondary)"
}

variable "tags" {
  type        = map(string)
  description = "Map of tags to add to all resources"

  default = {}
}
