
Support Argo Resource Template in DSL #429

Closed
hongye-sun opened this issue Nov 30, 2018 · 12 comments

Comments

@hongye-sun
Contributor

Copied from: #415

Related question, how can the DSL be easily extended to orchestrate custom Kubernetes resources? It would be nice if I could just "inline" custom resources (similar to what Argo supports argoproj/argo-workflows#606).

I think the desired semantics would be something like

  1. Create a custom resource
  2. Wait for the specified conditions to be reached
@qimingj
Contributor

qimingj commented Nov 30, 2018

Yes. We can. The question is how we want to expose them and how we can gather output.

A custom resource is not container specific, so there is no output file mapping. The only thing Argo supports is querying the job status (with a jqFilter or JSON path, see https://github.com/argoproj/argo/blob/master/examples/k8s-jobs.yaml) and extracting the output from it. Not all K8s CRDs report output that way (including tf-job), so it is hard to pass the output of the step to downstream components.
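As a rough illustration of what such a JSON-path-style extraction does, here is a minimal sketch; the status payload and the dotted path are made-up examples, not Argo's actual implementation:

```python
# Minimal sketch of JSONPath-style extraction from a K8s job status.
# The status dict and the path below are hypothetical examples.

def extract(status: dict, path: str):
    """Walk a dotted path like 'status.succeeded' through nested dicts."""
    node = status
    for key in path.split("."):
        node = node[key]
    return node

job = {"status": {"succeeded": 1, "conditions": [{"type": "Complete"}]}}
print(extract(job, "status.succeeded"))  # -> 1
```

A CRD whose controller never writes results into `.status` has nothing for such a query to find, which is the gap described above.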

I would like to support two levels of K8sOp. The low level is arbitrary resource:

op = K8sOp(resource_spec, jsonPath='...')

We should also provide resource specific high level ops:

op = TfJobOp(container_image='...', ...)

This was deferred since TfJob does not output anything in job status. We can in theory customize how tf-job runs the container and then inject the output into job status, but that was non-trivial.
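A rough sketch of what the two proposed levels might look like; only the names `K8sOp` and `TfJobOp` come from the comment above, and the class bodies, the TFJob spec layout, and the JSON path are invented for illustration:

```python
# Hypothetical sketch of the two proposed op levels. Only the names
# K8sOp/TfJobOp come from the proposal; everything else is invented.

class K8sOp:
    """Low level: submit an arbitrary resource spec, extract output via a JSON path."""
    def __init__(self, resource_spec: dict, json_path: str = ""):
        self.resource_spec = resource_spec
        self.json_path = json_path

class TfJobOp(K8sOp):
    """High level: builds a TFJob resource spec from a container image."""
    def __init__(self, container_image: str, **kwargs):
        spec = {
            "apiVersion": "kubeflow.org/v1",  # illustrative, not a schema reference
            "kind": "TFJob",
            "spec": {"image": container_image, **kwargs},
        }
        super().__init__(spec, json_path="{.status.conditions}")

op = TfJobOp(container_image="gcr.io/my-project/trainer:latest")
```

The high-level op is just sugar over the low-level one: it fixes the resource kind and the place in the status where results are expected.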

@hongye-sun
Contributor Author

@jlewi Shall we instead fix tf-job to report the status to k8s?

I feel that K8sOp itself is still very useful even with the limitations.

@jlewi
Contributor

jlewi commented Dec 12, 2018

@hongye-sun I'm not sure I understand the question. TFJob already reports its status using conditions.

@hongye-sun
Contributor Author

@jlewi my question is based on Qiming's comment on tfjob's status. The goal, I think, is to pass the output model data from tfjob to subsequent steps in the pipeline. I am not familiar with tfjob. Is there an easy way today for a user to use the Argo resource template feature and pass the model data to the next step?

@swiftdiaries
Member

TFJob doesn't support passing the model on to the next step. It manages the lifecycle of the training alone.
One way would be to use Argo Artifacts (#336), that seems to be WIP.

@jlewi
Contributor

jlewi commented Jan 22, 2019

@qimingj Why do you need to customize anything about TFJob and K8s job?

These are existing resources that don't have any explicit notion of inputs/outputs. Users use environment variables and command line arguments to pass information to their code. It's up to the code to define what the inputs/outputs are. TFJob and K8s job don't try to impose structure on the code by forcing it to conform to some standard for inputs/outputs.

Can pipelines easily orchestrate K8s resources? e.g. can I specify a step in my pipeline that contains the spec of some K8s resource? I think the desired semantics would be

  1. Create the K8s object
  2. Wait for some condition to be reached
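The two steps above can be sketched generically; `create_fn` and `get_phase_fn` here are hypothetical callables standing in for the actual K8s API calls:

```python
import time

# Generic create-then-wait sketch for the two steps above. create_fn and
# get_phase_fn are hypothetical stand-ins for real K8s API calls.

def run_resource(create_fn, get_phase_fn, success_phase="Succeeded",
                 timeout_s=600, poll_s=5):
    name = create_fn()                      # 1. create the K8s object
    deadline = time.time() + timeout_s
    while time.time() < deadline:           # 2. wait for the condition
        phase = get_phase_fn(name)
        if phase == success_phase:
            return phase
        if phase == "Failed":
            raise RuntimeError(f"{name} failed")
        time.sleep(poll_s)
    raise TimeoutError(name)
```

Argo's resource template expresses the same loop declaratively via `successCondition`/`failureCondition` fields instead of client-side polling.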

From a DSL perspective, I'd expect a convenient Python library for generating the K8s resource e.g.

job = batch.Job()
job.template.containers[0].image = ....
....

Ideally the python client libraries would be autogenerated from the resource specs. I don't know what the current state is of the requisite K8s python client library tooling.
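In the absence of such generated bindings, the same Job can also be expressed as a plain dict matching the `batch/v1` schema, which is what "accepting K8s resource specs" would amount to (the name, image, and arguments below are placeholders):

```python
# A batch/v1 Job spec as a plain dict, mirroring the pseudocode above
# without depending on a generated client library. Names, image, and
# args are placeholders.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "train-my-model"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "trainer",
                    "image": "gcr.io/my-project/trainer:latest",
                    "args": ["--input=/data/in", "--output=/data/out"],
                }],
                "restartPolicy": "Never",
            }
        }
    },
}
```

A dict like this can be serialized to YAML and inlined into a pipeline step the same way Argo's resource template inlines manifests.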

@qimingj
Contributor

qimingj commented Jan 23, 2019

DSL today requires K8s python client library. Accepting K8s resource specs is a great idea.

You are right that it's up to the code to define the inputs/outputs. I think the gap (not just for tf-job, but for K8s in general) is that we need a way for a component to communicate back to the pipeline system: for example, a runtime value (such as training accuracy), some UI metadata (so the pipeline UI can visualize it), or some artifacts (materialized data). Some of these values may be known before the pipeline runs (e.g. a model dir), but some are not (e.g. accuracy).
For example, a trainer may choose to output its accuracy number and pass to downstream components or even DSL condition:

...
trainer = TfJobOp(...)
with dsl.Condition(trainer.accuracy > 0.8):
    pusher = PusherOp(trainer.model_path)
...

It is possible that "trainer.model_path" is known beforehand, but certainly not "trainer.accuracy".

Argo provides a way for this type of communication: if the results are included in the K8s job status, then Argo can parse it and extract the values. I was thinking that if TF Job were more "argo-friendly" and inserted its results into the job status, it would put such K8s specs on par with containers.

Or, as you said, we can declare support for arbitrary K8s jobs, but these jobs cannot output any runtime values or create any visualizations until we figure out something else?

@jlewi
Contributor

jlewi commented Jan 23, 2019

Why do we need a way to communicate back to pipelines components?

Let's suppose we have a TensorFlow program that takes two arguments, --input and --output, and we want to train it using TFJob or a K8s job. Why can't we do something like the following:

def train_my_model(input, output):
    job_spec = ...  # build the K8s spec, setting the --input and --output arguments
    k8s_client.create(job_spec)
    k8s_client.wait_for(job_spec)
    events = read_tf_events_file(output)
    return events.get_accuracy()

Now just use func_to_container_op to turn this into an op for the DSL.

@swiftdiaries
Member

I was looking into Argo Events and this seems like a good fit. I'm currently experimenting with monitoring TFJob resources using this.
https://github.com/argoproj/argo-events/blob/master/docs/tutorial.md#resource

As for why we need a way to communicate back to pipelines,
One use case: we can run parallel branches of the same pipeline with different sets of hyperparameters and choose the model with the best accuracy metric, but we still need a way to fetch the best trained model for a deploy step. For the deploy step to run, we need to pass it the path to the model with the best accuracy. At the bare minimum, we'll need to pass back one output parameter.
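The selection step in that use case is tiny once the branch outputs are available; the sketch below assumes hypothetical (accuracy, model path) pairs collected from the parallel branches:

```python
# Illustrative selection of the best branch's model. The (accuracy, path)
# pairs are hypothetical outputs from parallel training branches.
results = [
    (0.91, "gs://bucket/models/run-a"),
    (0.87, "gs://bucket/models/run-b"),
    (0.94, "gs://bucket/models/run-c"),
]
best_accuracy, best_model_path = max(results, key=lambda r: r[0])
# best_model_path is the single output parameter the deploy step needs.
```

The hard part is not this comparison but getting each branch's accuracy and model path back out of its K8s resource in the first place, which is exactly the status-reporting gap discussed above.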

@vicaire
Contributor

vicaire commented May 31, 2019

@hongye-sun, should we close? Are you expecting more work on this issue?

@elikatsis
Member

Hi everyone,

Looks like this issue is covered by #926, and especially by ResourceOp, which fully implements Argo's resource template.

So, pinging @hongye-sun, do you think this is covered and we should close it?

@hongye-sun
Contributor Author

Yes, it has been covered. Thanks.
