adding privatelink guide #1084 (merged 17 commits, Feb 21, 2022)

docs/guides/aws-private-link-workspace.md
---
page_title: "Enable AWS PrivateLink for Databricks Workspace"
---

# Deploying prerequisite resources and enabling PrivateLink connections (AWS Preview)

-> **Public Preview** This feature is in [Public Preview](https://docs.databricks.com/release-notes/release-types.html). Contact your Databricks representative to request access.

Databricks PrivateLink support enables private connectivity between users and their Databricks workspaces, and between clusters on the data plane and core services on the control plane within the Databricks workspace infrastructure. You can use Terraform to deploy the underlying cloud resources and the private access settings resource automatically, using a programmatic approach. This guide assumes that you are deploying into an existing VPC and that you have set up credentials and storage configurations as in prior provisioning examples.

This guide uses the following variables in configurations:

- `databricks_account_username` - The username an account-level admin uses to log in to [https://accounts.cloud.databricks.com](https://accounts.cloud.databricks.com).
- `databricks_account_password` - The password for `databricks_account_username`.
- `databricks_account_id` - The numeric ID for your Databricks account. When you are logged in, it appears in the bottom left corner of the page.
- `vpc_id` - The ID of the existing AWS VPC.
- `network_name` - Name for your Databricks-configured network.
- `region` - AWS region.
- `existing_network_sg` - Security group set up for the existing VPC.
- `existing_network_subnets` - Existing subnets being used for the customer-managed VPC, as a comma-separated string.
- `workspace_vpce_service` - The region-specific workspace (REST API) endpoint service; choose it from the regional endpoint table in the Databricks PrivateLink documentation.
- `relay_vpce_service` - The region-specific secure cluster connectivity relay endpoint service, from the same table.
- `vpc_cidr_block` - CIDR range of the VPC being deployed into.
- `vpce_cidr` - CIDR range for the subnet chosen for the VPC endpoints.
- `credentials_id` - Databricks workspace credential ID (see the `databricks_mws_credentials` resource).
- `storage_configuration_id` - Databricks workspace storage configuration ID (see the `databricks_mws_storage_configurations` resource).

This guide is provided as-is, and you can use it as the basis for your custom Terraform module.
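As a sketch, these variables might be supplied in a `terraform.tfvars` file. Every value below is a placeholder, not a working configuration; in particular, the endpoint service names must come from the regional table in the Databricks documentation:

```hcl
databricks_account_username = "admin@example.com"                               // placeholder
databricks_account_password = "change-me"                                       // placeholder
databricks_account_id       = "00000000-0000-0000-0000-000000000000"            // placeholder
vpc_id                      = "vpc-0abc123456789def0"                           // placeholder
network_name                = "private-link-network"
region                      = "us-east-1"
existing_network_sg         = "sg-0abc123456789def0"                            // placeholder
existing_network_subnets    = "subnet-0abc123456789def0,subnet-0def9876543210"  // comma-separated
workspace_vpce_service      = "com.amazonaws.vpce.us-east-1.vpce-svc-xxxxxxxx"  // from the regional table
relay_vpce_service          = "com.amazonaws.vpce.us-east-1.vpce-svc-yyyyyyyy"  // from the regional table
vpc_cidr_block              = "10.0.0.0/16"
vpce_cidr                   = "10.0.5.0/24"
credentials_id              = "<from databricks_mws_credentials>"               // placeholder
storage_configuration_id    = "<from databricks_mws_storage_configurations>"    // placeholder
```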

To get started with AWS PrivateLink integration, this guide takes you through the following high-level steps:

- Initialize the required providers
- Configure AWS objects
  - A subnet dedicated to your VPC relay and workspace endpoints
  - A security group dedicated to your VPC endpoints
  - Two AWS VPC endpoints
- Workspace creation

## Provider initialization

Initialize [provider with `mws` alias](https://www.terraform.io/language/providers/configuration#alias-multiple-provider-configurations) to set up account-level resources. See [provider authentication](../index.md#authenticating-with-hostname,-username,-and-password) for more details.

```hcl
terraform {
  required_providers {
    databricks = {
      source  = "databrickslabs/databricks"
      version = "0.4.7"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "3.49.0"
    }
  }
}

provider "aws" {
  region = var.region
}

// initialize provider in "MWS" mode for provisioning workspace with AWS PrivateLink
provider "databricks" {
  alias    = "mws"
  host     = "https://accounts.cloud.databricks.com"
  username = var.databricks_account_username
  password = var.databricks_account_password
}
```

Define the required variables:

```hcl
variable "databricks_account_id" {}
variable "databricks_account_username" {}
variable "databricks_account_password" {}
variable "vpc_id" {}
variable "network_name" {}
variable "region" {}
variable "existing_network_sg" {}
variable "existing_network_subnets" {}

variable "workspace_vpce_service" {}
variable "relay_vpce_service" {}

variable "vpc_cidr_block" {}
variable "vpce_cidr" {}

// Use the Databricks Account API 2.0 to retrieve these two IDs below - https://docs.databricks.com/dev-tools/api/latest/account.html
variable "credentials_id" {}
variable "storage_configuration_id" {}

locals {
  prefix = "private-link-ws"
}

// existing_network_subnets is a comma-separated string; element() is zero-indexed
locals {
  workspace_subnet_1 = element(split(",", var.existing_network_subnets), 0)
  workspace_subnet_2 = element(split(",", var.existing_network_subnets), 1)
}
```

## Configure AWS objects
The first step is to create the required AWS objects:

- A subnet dedicated to your VPC endpoints
- A security group dedicated to your VPC endpoints, allowing the required inbound TCP traffic on ports 443 (REST API) and 6666 (secure cluster connectivity relay)
- Lastly, the private access settings and the workspace itself

```hcl
resource "aws_subnet" "vpce" {
  vpc_id     = var.vpc_id
  cidr_block = var.vpce_cidr

  tags = {
    Name = "vpce subnet for workspace"
  }
}
```

```hcl
resource "aws_security_group" "vpce_sg" {
  name        = "VPC endpoint security group"
  description = "Security group shared with relay and workspace endpoints"
  vpc_id      = var.vpc_id

  ingress {
    description = "Inbound HTTPS (REST API)"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr_block]
  }

  ingress {
    description = "Inbound secure cluster connectivity relay"
    from_port   = 6666
    to_port     = 6666
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr_block]
  }

  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }

  tags = {
    Name = "vpce_rules"
  }
}
```

```hcl
resource "aws_vpc_endpoint" "workspace" {
  vpc_id             = var.vpc_id
  service_name       = var.workspace_vpce_service
  vpc_endpoint_type  = "Interface"
  security_group_ids = [aws_security_group.vpce_sg.id]
  subnet_ids         = [aws_subnet.vpce.id]
  // run terraform apply twice when configuring PrivateLink.
  // Run 1 - comment the `private_dns_enabled` line
  // Run 2 - uncomment the `private_dns_enabled` line
  // private_dns_enabled = true
  depends_on = [aws_subnet.vpce]
}

resource "aws_vpc_endpoint" "relay" {
  vpc_id             = var.vpc_id
  service_name       = var.relay_vpce_service
  vpc_endpoint_type  = "Interface"
  security_group_ids = [aws_security_group.vpce_sg.id]
  subnet_ids         = [aws_subnet.vpce.id]
  // run terraform apply twice when configuring PrivateLink.
  // Run 1 - comment the `private_dns_enabled` line
  // Run 2 - uncomment the `private_dns_enabled` line
  // private_dns_enabled = true
  depends_on = [aws_subnet.vpce]
}

resource "databricks_mws_vpc_endpoint" "workspace" {
  provider            = databricks.mws
  account_id          = var.databricks_account_id
  aws_vpc_endpoint_id = aws_vpc_endpoint.workspace.id
  vpc_endpoint_name   = "Workspace endpoint for ${var.vpc_id}"
  region              = var.region
  depends_on          = [aws_vpc_endpoint.workspace]
}

resource "databricks_mws_vpc_endpoint" "relay" {
  provider            = databricks.mws
  account_id          = var.databricks_account_id
  aws_vpc_endpoint_id = aws_vpc_endpoint.relay.id
  vpc_endpoint_name   = "Relay endpoint for ${var.vpc_id}"
  region              = var.region
  depends_on          = [aws_vpc_endpoint.relay]
}
```

## Workspace creation

Once the VPC endpoints are created, they can be supplied in the `databricks_mws_networks` resource for workspace creation with AWS PrivateLink. After `terraform apply` has run once (see the comment in the `aws_vpc_endpoint` resources above), run `terraform apply` a second time with the `private_dns_enabled = true` line uncommented to set the proper DNS settings for PrivateLink.
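The two-phase apply described above can be sketched as follows (plain `terraform` CLI invocations; the edit between runs is done in your editor):

```shell
# Run 1: private_dns_enabled is still commented out in both aws_vpc_endpoint resources
terraform apply

# Now uncomment `private_dns_enabled = true` in both aws_vpc_endpoint resources

# Run 2: applies the private DNS settings required for PrivateLink
terraform apply
```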

```hcl
// Inputs are 2 subnets and one security group from the existing VPC that will be used for your Databricks workspace
resource "databricks_mws_networks" "this" {
  provider           = databricks.mws
  account_id         = var.databricks_account_id
  network_name       = var.network_name
  security_group_ids = [var.existing_network_sg]
  subnet_ids         = [local.workspace_subnet_1, local.workspace_subnet_2]
  vpc_id             = var.vpc_id
  vpc_endpoints {
    dataplane_relay = [databricks_mws_vpc_endpoint.relay.vpc_endpoint_id]
    rest_api        = [databricks_mws_vpc_endpoint.workspace.vpc_endpoint_id]
  }
  depends_on = [aws_vpc_endpoint.workspace, aws_vpc_endpoint.relay]
}

resource "databricks_mws_private_access_settings" "pas" {
  provider                     = databricks.mws
  account_id                   = var.databricks_account_id
  private_access_settings_name = "Private Access Settings for ${local.prefix}"
  region                       = var.region
  public_access_enabled        = true
}

resource "databricks_mws_workspaces" "this" {
  provider                   = databricks.mws
  account_id                 = var.databricks_account_id
  aws_region                 = var.region
  workspace_name             = local.prefix
  deployment_name            = local.prefix
  credentials_id             = var.credentials_id
  storage_configuration_id   = var.storage_configuration_id
  network_id                 = databricks_mws_networks.this.network_id
  private_access_settings_id = databricks_mws_private_access_settings.pas.private_access_settings_id
  pricing_tier               = "ENTERPRISE"
  depends_on                 = [databricks_mws_networks.this]
}
```
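Once applied, useful attributes of the new workspace can be exposed as Terraform outputs; for example, the workspace URL (a sketch using the standard attributes of `databricks_mws_workspaces` and `databricks_mws_vpc_endpoint`):

```hcl
// URL users will connect to over the front-end PrivateLink connection
output "databricks_workspace_url" {
  value = databricks_mws_workspaces.this.workspace_url
}

// Databricks-registered ID of the workspace (REST API) VPC endpoint
output "workspace_vpc_endpoint_id" {
  value = databricks_mws_vpc_endpoint.workspace.vpc_endpoint_id
}
```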