Creation of Replay Pipeline for Splunk export #11

Merged
5 changes: 5 additions & 0 deletions README.md
@@ -67,6 +67,11 @@ $ terraform output dataflow_log_export_dashboad

2. Visit the newly created Monitoring Dashboard in the Cloud Console by replacing `dashboard_id` in the following URL: https://console.cloud.google.com/monitoring/dashboards/builder/{dashboard_id}

#### Deploy replay pipeline

In the `replay.tf` file, uncomment the code under `splunk_dataflow_replay` and follow the sequence of `terraform plan` and `terraform apply`.

Once the replay pipeline is no longer needed (the number of messages in the Pub/Sub deadletter topic is 0), comment out `splunk_dataflow_replay` and follow the `plan` and `apply` sequence above.
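A minimal sketch of that sequence from the root of this repo (assuming no flags or variable overrides beyond the ones you already use):

```shell
# After uncommenting splunk_dataflow_replay in replay.tf:
$ terraform plan
$ terraform apply

# Once the deadletter topic is drained, comment splunk_dataflow_replay back out, then:
$ terraform plan
$ terraform apply   # Terraform now plans the destroy of the replay job
```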

### Cleanup

3 changes: 2 additions & 1 deletion main.tf
@@ -28,6 +28,7 @@ resource "random_id" "bucket_suffix" {
locals {
dataflow_temporary_gcs_bucket_name = "${var.project}-${var.dataflow_job_name}-${random_id.bucket_suffix.hex}"
dataflow_temporary_gcs_bucket_path = "tmp/"
dataflow_template_path = "gs://dataflow-templates/${var.dataflow_template_version}/Cloud_PubSub_to_Splunk"

subnet_name = coalesce(var.subnet, "${var.network}-${var.region}")
project_log_sink_name = "${var.dataflow_job_name}-project-log-sink"
@@ -39,7 +40,7 @@ locals {
dataflow_output_deadletter_sub_name = "${var.dataflow_job_name}-deadletter-subscription"

dataflow_replay_job_name = "${var.dataflow_job_name}-replay"

dataflow_deadletter_template_gcs_path = "gs://dataflow-templates/${var.dataflow_template_version}/Cloud_PubSub_to_Cloud_PubSub"
# dataflow job parameters (not externalized for this project)
dataflow_job_include_pubsub_message = true
}
2 changes: 1 addition & 1 deletion pipeline.tf
@@ -45,7 +45,7 @@ resource "google_storage_bucket_object" "dataflow_job_temp_object" {
resource "random_id" "dataflow_job_instance" {
byte_length = 2
keepers = {
template_gcs_path = var.dataflow_template_path
template_gcs_path = local.dataflow_template_path
}
}

39 changes: 39 additions & 0 deletions replay.tf
@@ -0,0 +1,39 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

/*
The replay job should stay commented out while the main export pipeline is initially deployed.
When the replay job needs to be run, simply uncomment the resource and deploy the replay pipeline.
From the CLI, this may look like `terraform apply -target="google_dataflow_job.splunk_dataflow_replay"`.
After the deadletter Pub/Sub topic has no more messages, comment out the resource and run a regular Terraform deployment (e.g. `terraform apply`). Terraform will automatically destroy the replay job.

`terraform apply -target` usage documentation: https://www.terraform.io/docs/cli/commands/apply.html
*/

resource "google_dataflow_job" "splunk_dataflow_replay" {
name = local.dataflow_replay_job_name
template_gcs_path = local.dataflow_deadletter_template_gcs_path
temp_gcs_location = "gs://${local.dataflow_temporary_gcs_bucket_name}/${local.dataflow_temporary_gcs_bucket_path}"
machine_type = var.dataflow_job_machine_type
Member
The machine type and count specified are specific to Pub/Sub to Splunk template sizing. I wonder if we should leave those out for the Pub/Sub to Pub/Sub template and rely on the defaults, since it's an ephemeral pipeline? I'm fine either way.

Collaborator (Author)
I remember when we did this with Tempus: if the machine type for the replay pipeline was too small, it took quite some time (depending on the number of logs) to burn down the backlog with the Pub/Sub to Pub/Sub template. Perhaps we can just rely on the default, but give the opportunity for customization?

Member
SGTM

max_workers = var.dataflow_job_machine_count
parameters = {
inputSubscription = google_pubsub_subscription.dataflow_deadletter_pubsub_sub.id
outputTopic = google_pubsub_topic.dataflow_input_pubsub_topic.id
}
region = var.region
network = var.network
subnetwork = "regions/${var.region}/subnetworks/${local.subnet_name}"
ip_configuration = "WORKER_IP_PRIVATE"
# service_account_email = ""
}
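A minimal sketch of the targeted-apply workflow described in the comment at the top of `replay.tf` (the resource address is taken from this file; `-target` is standard Terraform CLI usage):

```shell
# Deploy only the replay job after uncommenting the resource above
$ terraform apply -target="google_dataflow_job.splunk_dataflow_replay"

# Once the deadletter subscription is drained, comment the resource back out
# and run a regular apply; Terraform will destroy the replay job
$ terraform apply
```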
6 changes: 6 additions & 0 deletions variables.tf
@@ -72,6 +72,12 @@ variable "splunk_hec_token" {

# Dataflow job parameters

variable "dataflow_template_version" {
type = string
description = "Dataflow template version for the replay job."
Member
I think this version variable applies to both templates/jobs. Just a minor update to the description (and the associated README table of params).

default = "latest"
}

variable "dataflow_template_path" {
Member
@npredey Should we remove this input variable, now that you have a similarly-named local variable?

description = "Dataflow template path. Defaults to latest version of Google-hosted Pub/Sub to Splunk template"
default = "gs://dataflow-templates/latest/Cloud_PubSub_to_Splunk"