
[FEATURE] Restart running streaming job to redeploy configuration changes #389

Closed
MarekLani opened this issue Oct 30, 2020 · 20 comments · Fixed by #715


MarekLani commented Oct 30, 2020

Hello,

I am trying to force re-creation of a job, but apparently there is no field in the databricks_job resource whose edited value would cause re-creation of the job on the Databricks side. Even the job name can be edited without redeploying the job. Would it be possible to add a field that is used only by Terraform and whose changed value would force re-creation of the Databricks job?

Terraform Version

N/A

Affected Resource(s)

  • databricks_job

Expected Behavior

I want a build-specific part of the setup: usually resources are torn down and redeployed if I change their name. Of course, I understand that the Databricks Jobs API does not work that way and allows editing the name without re-creating the job. Still, if I want to re-create the job from the Terraform file, I basically have no option to do so.

The scenario is that I have a Databricks streaming job which runs continuously and is set to 1 concurrent run. If I make any change to the job's Terraform configuration or to the DBFS file on which the job is based, I have no way to get these changes into the already running job or to force the job run to restart from Terraform. I also should not use the Databricks API to re-create the job, because the Terraform state would get out of sync. And as there is no stop/restart job command in the Databricks API, I am simply unable to start a new run of the job with the new configuration without re-creating the job. I know there is a reset command in the Databricks API, but it requires the JSON configuration of the job to be passed as a parameter, which would again put me out of sync with the Terraform state.

Actual Behavior

There is no way to force re-creation of the job just from the Terraform file. A workaround is to use the terraform taint command from the command line, which forces a redeploy of the resource; nevertheless, I would like to avoid this approach.

Steps to Reproduce

N/A


nfx commented Oct 30, 2020

@MarekLani what you describe here is actually expected behavior for all Terraform resources. Are you sure you're using it the right way?

What's the use case?


stikkireddy commented Oct 30, 2020

This can be done by tainting the resource. On your next apply the resource will be recreated with a new id. As far as the API goes, all the fields are editable and updated in place, so AFAIK there is no field that forces a recreate. Your best option is terraform taint: https://www.terraform.io/docs/commands/taint.html
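
For example, assuming the job resource is addressed as databricks_job.streaming_job (the resource name used later in this thread), the workaround would look like:

terraform taint databricks_job.streaming_job
terraform apply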


MarekLani commented Oct 30, 2020

@nfx @stikkireddy thanks for the replies. I ended up using taint from the command line. Nevertheless, what I want is a build-specific part of the setup: usually resources are torn down and redeployed if I change their name. Of course, I understand that the Databricks Jobs API does not work that way and allows editing the name without re-creating the job. Still, if I want to re-create the job from the Terraform file, I basically have no option to do so.

The scenario is that I have a Databricks streaming job which runs continuously and is set to 1 concurrent run. If I make any change to the job's Terraform configuration or to the DBFS file on which the job is based, I have no way to get these changes into the already running job or to force the job run to restart from Terraform. I also should not use the Databricks API to re-create the job, because the Terraform state would get out of sync. And as there is no stop/restart job command in the Databricks API, I am simply unable to start a new run of the job with the new configuration without re-creating the job. I know there is a reset command in the Databricks API, but it requires the JSON configuration of the job to be passed as a parameter, which would again put me out of sync with the Terraform state.


nfx commented Oct 30, 2020

@MarekLani thanks for more context.

If my understanding is correct, you're building a CD process with Terraform for a streaming job. This is a feature request rather than a bug.

  1. There's an artifact that you store on DBFS. Is it spark_jar_task, spark_python_task or spark_notebook_task?
  2. Is the artifact supplied through a library?
  3. Does the artifact have a -latest version modifier, or does it get a different name each time you deploy a different version?

If the artifact has a different name per version, we can implement a graceful job restart (if it's running) with additional fields. We won't be re-creating the job, because the job run history would be lost, and that's not the logical behavior.

Please provide more information.

nfx changed the title from [ISSUE] Add terraform only field to Databricks Job resource to enable re-creation to [FEATURE] Restart running streaming job to redeploy configuration changes on Oct 30, 2020
@MarekLani

Thanks @nfx, and sorry, I probably should have filed this entry as a feature request when creating it.

I am using spark_python_task referencing a DBFS file created this way:

resource "databricks_dbfs_file" "streaming_task" {
	content              = filebase64("../../../src/streaming/processing/main.py")
	content_b64_md5      = md5(filebase64("../../../src/streaming/processing/main.py"))
	path                 = "/sri/terraformdbfs/streaming/main.py"
	overwrite            = true
	mkdirs               = true
	validate_remote_file = true
}
# Then used in the databricks_job definition the following way:

resource "databricks_job" "streaming_job" {
  ...
  spark_python_task {
    python_file = "dbfs:${databricks_dbfs_file.streaming_task.path}"
    parameters = [
      ...

Can you please share a bit more on how the version might be applied? It sounds like an ideal solution. Also, I didn't realize the run history would be lost; that is indeed not desired behavior.


nfx commented Oct 30, 2020

@MarekLani and does src/streaming/processing/main.py have all the code required for processing, or is it in a library?

@MarekLani

It makes use of PyPI and Maven libraries, but the core processing logic is only in main.py.


nfx commented Oct 30, 2020

Terraform needs to know whether something has changed or not. Can you upload each new version of the notebook with a different suffix? e.g. /Productions/streaming-stuff-${var.version}/main.py

Then the parameter I'm thinking of introducing is continuous = true or always_running = true, or something like that.
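
For illustration, a minimal sketch of such a versioned upload, reusing the databricks_dbfs_file attributes from the snippet earlier in this thread; the build_version variable name and path layout are assumptions, not prescribed by the provider:

variable "build_version" {
  type = string
}

resource "databricks_dbfs_file" "streaming_task" {
  content              = filebase64("../../../src/streaming/processing/main.py")
  content_b64_md5      = md5(filebase64("../../../src/streaming/processing/main.py"))
  # Any change to build_version produces a new path, which Terraform then
  # propagates to the job's spark_python_task reference as a detected change.
  path                 = "/sri/terraformdbfs/streaming/${var.build_version}/main.py"
  overwrite            = true
  mkdirs               = true
  validate_remote_file = true
}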

@MarekLani

Yes, I can include part of the build id as the version. If I can provide my vote, I would use always_running. So the idea is that this would enable a graceful restart?


nfx commented Oct 30, 2020

Something like that. I'll check with a couple more folks, because we might need exactly the same "virtual option" for always-on clusters (e.g. I see a pattern where clusters are auto-started for business hours).

nfx added the Medium Size label on Nov 3, 2020

nfx commented Dec 4, 2020

Hi @MarekLani, somewhere after 0.3.0 we could restart jobs when the underlying file changes (thanks to the changes in databricks_dbfs_file from #417). How graceful does the restart of a streaming job have to be? What happens if you press cancel in the streaming job UI, does the stream end gracefully? If so, then it might be relatively easy to implement.

data "databricks_node_type" "smallest" {
    local_disk = true
}

data "databricks_me" "me" {
}

resource "databricks_dbfs_file" "this" {
    path = "/home/${data.databricks_me.me.user_name}/foo.py"
    source = "t.py"
}

resource "databricks_job" "this" {
    name = "File Run"
    
    timeout_seconds = 3600
    max_retries = 1
    max_concurrent_runs = 1

    spark_python_task {
        python_file = databricks_dbfs_file.this.id
    }
    
    new_cluster  {
        num_workers   = 2
        spark_version = "6.6.x-scala2.11"
        node_type_id  = data.databricks_node_type.smallest.id
        aws_attributes {}
    }

    email_notifications {}
}

@MarekLani

Hi @nfx, thanks for the response. Pardon my delay, but as I am responsible purely for the CD part of the project, I needed to connect with the rest of my team working on the actual job logic. We need to understand the behavior of the libraries we are using to connect to the queues and how much resiliency they provide out of the box, and we will get back to you. However, my guess would be that your suggested approach should be enough.


MarekLani commented Dec 10, 2020

So @nfx, the approach with cancelling the job you described should be fine for us. We are connecting to Azure Event Hubs, and the libraries we use should have resilient checkpointing implemented: the position is not checkpointed until the DataFrame processing is finished, and thus it will meet our at-least-once delivery requirement.

However, just out of curiosity, I would be interested to hear your ideas on whether a more graceful approach would be possible and what it would take. This probably comes down to what interfaces the Databricks API offers at this point.

Thank you.


stikkireddy commented Jan 27, 2021

@MarekLani if you are still trying to restart a job after the configuration changes, what you actually want to do is:

  1. Stop an existing run of a job
  2. Wait for its life cycle to be in a terminated state
  3. Start a new run of that job

Something that you may be able to do is use a null resource and make it trigger from your databricks_job configs: https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource

With a null resource and a trigger that listens to the appropriate configs of your job that you expect to change, you can set up a local provisioner that runs a Python script executing the 3 steps above, taking advantage of the Runs API (see the sketch below): https://docs.databricks.com/dev-tools/api/latest/jobs.html#runs-cancel

At this point in time this provider does not track or manage the state of runs, only the job-level config, and AFAIK there is nothing in the platform that automatically restarts a run if the config changes.
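
A minimal sketch of that idea, assuming a hypothetical restart_run.py script that calls runs/list, runs/cancel and run-now from the Jobs API; the trigger keys and resource names are illustrative, reusing the resources from earlier in this thread:

resource "null_resource" "restart_streaming_job" {
  # Re-run the provisioner whenever the configs we expect to change do change.
  triggers = {
    job_id   = databricks_job.streaming_job.id
    file_md5 = databricks_dbfs_file.streaming_task.content_b64_md5
  }

  provisioner "local-exec" {
    # Hypothetical helper: cancels the active run, waits for it to reach a
    # terminated life cycle state, then starts a new run via run-now.
    command = "python restart_run.py ${databricks_job.streaming_job.id}"
  }
}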


MarekLani commented Jan 28, 2021

@stikkireddy thank you, this sounds like a fine approach to restart the job only when needed, instead of doing it on each Terraform run. However, I am still leaving this one open, as it would be nice to allow for a restart directly within the Terraform resource, without the need to touch the Databricks API.

nfx added this to the v0.4.0 milestone on Feb 11, 2021
@travisemichael

Hopefully I can piggyback on this thread and ask a related question.

I have the same use case as above (a streaming job with max concurrent runs = 1) but I would also like to be able to trigger a run of the job on a fresh deployment. Would this also be included in the v0.4.0 milestone? Should I file another feature request?


nfx commented Feb 14, 2021

@travisemichael just to confirm: a fresh deployment would mean a fresh JAR/Python file/notebook with a different name, or eventually a notebook project with a newer version, correct?

Tagging @lennartcl to see if we can get more graceful streaming stops than those provided by the cancel run API call.

nfx pinned this issue on Feb 14, 2021
@travisemichael

@nfx What I mean by a fresh deployment would be a brand new Job. It would be nice to have it set up so that on the first deploy the job will start a single run, and on subsequent deploys the previous run will be restarted as described above.


nfx commented Feb 16, 2021

@travisemichael makes sense for always_running jobs

nfx added a commit that referenced this issue Jul 7, 2021
* Implements feature #389
* Functionality is triggered only if `always_running` attribute is present
* Uses RunsList, RunsCancel and RunNow methods from Jobs API
nfx linked a pull request on Jul 7, 2021 that will close this issue

nfx commented Jul 7, 2021

@travisemichael @MarekLani I've started working on this feature in the linked pull request.
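
For reference, a minimal sketch of how the attribute described in the commit notes above might be used once the linked pull request lands; the exact semantics are defined by that PR, and the name and cluster settings here are only placeholders:

resource "databricks_job" "streaming_job" {
  name                = "Streaming job"
  max_concurrent_runs = 1

  # Assumption based on the commit notes: when set, the provider cancels the
  # active run and starts a new one whenever the job configuration changes.
  always_running = true

  spark_python_task {
    python_file = "dbfs:${databricks_dbfs_file.streaming_task.path}"
  }

  new_cluster {
    num_workers   = 2
    spark_version = "7.3.x-scala2.12"
    node_type_id  = data.databricks_node_type.smallest.id
  }
}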

nfx closed this as completed in #715 on Jul 8, 2021
nfx unpinned this issue on Jul 13, 2021