
[FEATURE] Restart running streaming job to redeploy configuration changes #389

Closed
MarekLani opened this issue Oct 30, 2020 · 20 comments · Fixed by #715


MarekLani commented Oct 30, 2020

Hello,

I am trying to force re-creation of a job, but apparently there is no field in the databricks_job resource whose edited value would cause re-creation of the job on the Databricks side. Even the job name can be edited without redeploying the job. Would it be possible to add a field that is used only by Terraform and whose changed value would force re-creation of the Databricks job?

Terraform Version

N/A

Affected Resource(s)

  • databricks_job

Expected Behavior

I want a build-specific part of the setup: usually resources are torn down and redeployed if I change their name. Of course, I understand that the Databricks Jobs API does not work that way and allows editing the name without re-creating the job. Still, if I want to re-create the job from the Terraform file, I basically have no option to do so.

The scenario is that I have a Databricks streaming job which runs continuously and is set to 1 concurrent run. If I make any change to the job's Terraform configuration or to the DBFS file on which the job is based, I have no way to get these changes into the already running job or to force the job run to restart from Terraform. I also should not use the Databricks API to re-create the job, because the Terraform state would get out of sync. And as there is no stop/restart job command in the Databricks API, I am simply unable to start a new run of the job with the new configuration without re-creating the job. I know there is a reset command in the Databricks API, but it requires the JSON configuration of the job to be passed as a parameter, which would again put me out of sync with the Terraform state.

Actual Behavior

There is no way to force re-creation of the job just from the Terraform file. A workaround is to use the terraform taint command from the command line, which forces a redeploy of the resource; nevertheless, I would like to avoid this approach.

Steps to Reproduce

N/A


nfx commented Oct 30, 2020

@MarekLani what you describe here is actually expected behavior for all Terraform resources. Are you sure you're using it the right way?

What's the use case?


stikkireddy commented Oct 30, 2020

This can be done by tainting the resource. On your next apply the resource will be recreated with a new id. As far as the API goes, all the fields are editable and updated in place, so AFAIK there is no field that forces a recreate. Your best option is terraform taint: https://www.terraform.io/docs/commands/taint.html
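
For example, assuming the job resource is addressed as databricks_job.streaming_job (the resource name used later in this thread), the workaround would look like:

terraform taint databricks_job.streaming_job
terraform apply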


MarekLani commented Oct 30, 2020

@nfx @stikkireddy thanks for the replies. I ended up using taint from the command line. Nevertheless, what I want is a build-specific part of the setup: usually resources are torn down and redeployed if I change their name. Of course, I understand that the Databricks Jobs API does not work that way and allows editing the name without re-creating the job. Still, if I want to re-create the job from the Terraform file, I basically have no option to do so.

The scenario is that I have a Databricks streaming job which runs continuously and is set to 1 concurrent run. If I make any change to the job's Terraform configuration or to the DBFS file on which the job is based, I have no way to get these changes into the already running job or to force the job run to restart from Terraform. I also should not use the Databricks API to re-create the job, because the Terraform state would get out of sync. And as there is no stop/restart job command in the Databricks API, I am simply unable to start a new run of the job with the new configuration without re-creating the job. I know there is a reset command in the Databricks API, but it requires the JSON configuration of the job to be passed as a parameter, which would again put me out of sync with the Terraform state.


nfx commented Oct 30, 2020

@MarekLani thanks for more context.

If my understanding is correct, you're building a CD process with Terraform for a streaming job. This is a feature request rather than a bug.

  1. There's an artifact that you store on DBFS. Is it spark_jar_task, spark_python_task or spark_notebook_task?
  2. Is the artifact supplied through a library?
  3. Does the artifact have a -latest version modifier, or does it get a different name each time you deploy a different version?

If the artifact has a different name per version, we can implement a graceful job restart (if it's running) with additional fields. We won't be re-creating the job, because the job run history would be lost, and that's not the logical behavior.

Please provide more information.

nfx changed the title from [ISSUE] Add terraform only field to Databricks Job resource to enable re-creation to [FEATURE] Restart running streaming job to redeploy configuration changes on Oct 30, 2020
@MarekLani

Thanks @nfx, and sorry, I probably should have filed this entry as a feature request when creating it.

I am using spark_python_task referencing a DBFS file created this way:

resource "databricks_dbfs_file" "streaming_task" {
	content              = filebase64("../../../src/streaming/processing/main.py")
	content_b64_md5      = md5(filebase64("../../../src/streaming/processing/main.py"))
	path                 = "/sri/terraformdbfs/streaming/main.py"
	overwrite            = true
	mkdirs               = true
	validate_remote_file = true
}
# Then used in the databricks_job definition the following way:

resource "databricks_job" "streaming_job" {
  ...
  spark_python_task {
    python_file = "dbfs:${databricks_dbfs_file.streaming_task.path}"
    parameters = [
      ...

Can you please share a bit more on how the version might be applied? It sounds like an ideal solution. Also, I didn't realize the run history would be lost; that is indeed not desired behavior.


nfx commented Oct 30, 2020

@MarekLani and does src/streaming/processing/main.py have all the code required for processing, or is it in a library?

@MarekLani

It makes use of PyPI and Maven libraries, but the core processing logic is only in main.py.


nfx commented Oct 30, 2020

Terraform needs to know whether something has changed or not. Can you upload each new version of the notebook with a different suffix? e.g. /Productions/streaming-stuff-${var.version}/main.py

Then the parameter I'm thinking of introducing is continuous = true or always_running = true, or something like that.
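
For illustration, a minimal sketch of such a versioned upload, reusing the databricks_dbfs_file attributes from the snippet earlier in this thread; the build_version variable name and path layout are assumptions, not prescribed by the provider:

variable "build_version" {
  type = string
}

resource "databricks_dbfs_file" "streaming_task" {
  content              = filebase64("../../../src/streaming/processing/main.py")
  content_b64_md5      = md5(filebase64("../../../src/streaming/processing/main.py"))
  # Any change to build_version produces a new path, which Terraform then
  # propagates to the job's spark_python_task reference as a detected change.
  path                 = "/sri/terraformdbfs/streaming/${var.build_version}/main.py"
  overwrite            = true
  mkdirs               = true
  validate_remote_file = true
}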

@MarekLani

Yes, I can include part of the build id as the version. If I can provide my vote, I would use always_running. So the idea is that this would enable a graceful restart?


nfx commented Oct 30, 2020

Something like that. I'll check with a couple more folks, because we might need exactly the same "virtual option" for always-on clusters (e.g. I see a pattern where clusters are auto-started for business hours).

nfx added the Medium Size label on Nov 3, 2020

nfx commented Dec 4, 2020

Hi @MarekLani, somewhere after 0.3.0 we could restart jobs when the underlying file changes (thanks to the changes in databricks_dbfs_file from #417). How graceful does the restart of a streaming job have to be? What happens if you press cancel in the streaming job UI, does the stream end gracefully? If so, then it might be relatively easy to implement.

data "databricks_node_type" "smallest" {
    local_disk = true
}

data "databricks_me" "me" {
}

resource "databricks_dbfs_file" "this" {
    path = "/home/${data.databricks_me.me.user_name}/foo.py"
    source = "t.py"
}

resource "databricks_job" "this" {
    name = "File Run"
    
    timeout_seconds = 3600
    max_retries = 1
    max_concurrent_runs = 1

    spark_python_task {
        python_file = databricks_dbfs_file.this.id
    }
    
    new_cluster  {
        num_workers   = 2
        spark_version = "6.6.x-scala2.11"
        node_type_id  = data.databricks_node_type.smallest.id
        aws_attributes {}
    }

    email_notifications {}
}

@MarekLani

Hi @nfx, thanks for the response. Pardon my delay, but as I am responsible purely for the CD part of the project, I needed to connect with the rest of my team working on the actual job logic. We need to understand the behavior of the libraries we are using to connect to the queues and how much resiliency they provide out of the box, and we will get back to you. However, my guess would be that your suggested approach should be enough.


MarekLani commented Dec 10, 2020

So @nfx, the approach with cancelling the job you described should be fine for us. We are connecting to Azure Event Hubs, and the libraries we use should have resilient checkpointing implemented: the position is not checkpointed until the DataFrame processing is finished, and thus it will meet our at-least-once delivery requirement.

However, just out of curiosity, I would be interested to hear your ideas on whether a more graceful approach would be possible and what it would take. This probably comes down to what interfaces the Databricks API offers at this point.

Thank you.


stikkireddy commented Jan 27, 2021

@MarekLani if you are still trying to restart a job after the configuration changes, what you actually want to do is:

  1. Stop an existing run of a job
  2. Wait for its life cycle to be in a terminated state
  3. Start a new run of that job

Something that you may be able to do is use a null resource and make it trigger from your databricks_job configs: https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource

With a null resource and a trigger that listens to the appropriate configs of your job that you expect to change, you can set up a local provisioner that runs a Python script executing the 3 steps above, taking advantage of the Runs API (see the sketch below): https://docs.databricks.com/dev-tools/api/latest/jobs.html#runs-cancel

At this point in time this provider does not track or manage the state of runs, only the job-level config, and AFAIK there is nothing in the platform that automatically restarts a run if the config changes.
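
A minimal sketch of that idea, assuming a hypothetical restart_run.py script that calls runs/list, runs/cancel and run-now from the Jobs API; the trigger keys and resource names are illustrative, reusing the resources from earlier in this thread:

resource "null_resource" "restart_streaming_job" {
  # Re-run the provisioner whenever the configs we expect to change do change.
  triggers = {
    job_id   = databricks_job.streaming_job.id
    file_md5 = databricks_dbfs_file.streaming_task.content_b64_md5
  }

  provisioner "local-exec" {
    # Hypothetical helper: cancels the active run, waits for it to reach a
    # terminated life cycle state, then starts a new run via run-now.
    command = "python restart_run.py ${databricks_job.streaming_job.id}"
  }
}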


MarekLani commented Jan 28, 2021

@stikkireddy thank you, this sounds like a fine approach to restart the job only when needed, instead of doing it on each Terraform run. However, I am still leaving this one open, as it would be nice to allow for a restart directly within the Terraform resource, without the need to touch the Databricks API.

nfx added this to the v0.4.0 milestone on Feb 11, 2021
@travisemichael

Hopefully I can piggyback on this thread and ask a related question.

I have the same use case as above (a streaming job with max concurrent runs = 1) but I would also like to be able to trigger a run of the job on a fresh deployment. Would this also be included in the v0.4.0 milestone? Should I file another feature request?


nfx commented Feb 14, 2021

@travisemichael just to confirm: a fresh deployment would mean a fresh JAR/Python file/notebook with a different name, or eventually a notebook project with a newer version, correct?

Tagging @lennartcl to see if we can get more graceful streaming stops than those provided by the cancel run API call.

nfx pinned this issue on Feb 14, 2021
@travisemichael

@nfx What I mean by a fresh deployment would be a brand new Job. It would be nice to have it set up so that on the first deploy the job will start a single run, and on subsequent deploys the previous run will be restarted as described above.


nfx commented Feb 16, 2021

@travisemichael makes sense for always_running jobs

nfx added a commit that referenced this issue Jul 7, 2021
* Implements feature #389
* Functionality is triggered only if `always_running` attribute is present
* Uses RunsList, RunsCancel and RunNow methods from Jobs API
nfx linked a pull request on Jul 7, 2021 that will close this issue

nfx commented Jul 7, 2021

@travisemichael @MarekLani I've started working on this feature in the linked pull request.
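
For reference, a minimal sketch of how the attribute described in the commit notes above might be used once the linked pull request lands; the exact semantics are defined by that PR, and the name and cluster settings here are only placeholders:

resource "databricks_job" "streaming_job" {
  name                = "Streaming job"
  max_concurrent_runs = 1

  # Assumption based on the commit notes: when set, the provider cancels the
  # active run and starts a new one whenever the job configuration changes.
  always_running = true

  spark_python_task {
    python_file = "dbfs:${databricks_dbfs_file.streaming_task.path}"
  }

  new_cluster {
    num_workers   = 2
    spark_version = "7.3.x-scala2.12"
    node_type_id  = data.databricks_node_type.smallest.id
  }
}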

nfx closed this as completed in #715 on Jul 8, 2021
nfx unpinned this issue on Jul 13, 2021