
Support Argo Resource Template in DSL #429

Closed
hongye-sun opened this issue Nov 30, 2018 · 12 comments

Comments

@hongye-sun
Contributor

Copied from: #415

Related question, how can the DSL be easily extended to orchestrate custom Kubernetes resources? It would be nice if I could just "inline" custom resources (similar to what Argo supports argoproj/argo-workflows#606).

I think the desired semantics would be something like

  1. Create a custom resource
  2. Wait for the specified conditions to be reached
@qimingj
Contributor

qimingj commented Nov 30, 2018

Yes. We can. The question is how we want to expose them and how we can gather output.

A custom resource is not container specific, so there is no output file mapping. The only thing Argo supports is querying the job status (with a jqFilter or JSON path, see https://github.com/argoproj/argo/blob/master/examples/k8s-jobs.yaml) and extracting the output from it. Not all K8s CRDs report output that way (including tf-job), so it is hard to pass the output of the step to downstream components.
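As a rough illustration of what such a JSON-path-style extraction does, here is a minimal sketch; the status payload and the dotted path are made-up examples, not Argo's actual implementation:

```python
# Minimal sketch of JSONPath-style extraction from a K8s job status.
# The status dict and the path below are hypothetical examples.

def extract(status: dict, path: str):
    """Walk a dotted path like 'status.succeeded' through nested dicts."""
    node = status
    for key in path.split("."):
        node = node[key]
    return node

job = {"status": {"succeeded": 1, "conditions": [{"type": "Complete"}]}}
print(extract(job, "status.succeeded"))  # -> 1
```

A CRD whose controller never writes results into `.status` has nothing for such a query to find, which is the gap described above.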

I would like to support two levels of K8sOp. The low level is arbitrary resource:

op = K8sOp(resource_spec, jsonPath='...')

We should also provide resource specific high level ops:

op = TfJobOp(container_image='...', ...)

This was deferred since TfJob does not output anything in job status. We can in theory customize how tf-job runs the container and then inject the output into job status, but that was non-trivial.
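A rough sketch of what the two proposed levels might look like; only the names `K8sOp` and `TfJobOp` come from the comment above, and the class bodies, the TFJob spec layout, and the JSON path are invented for illustration:

```python
# Hypothetical sketch of the two proposed op levels. Only the names
# K8sOp/TfJobOp come from the proposal; everything else is invented.

class K8sOp:
    """Low level: submit an arbitrary resource spec, extract output via a JSON path."""
    def __init__(self, resource_spec: dict, json_path: str = ""):
        self.resource_spec = resource_spec
        self.json_path = json_path

class TfJobOp(K8sOp):
    """High level: builds a TFJob resource spec from a container image."""
    def __init__(self, container_image: str, **kwargs):
        spec = {
            "apiVersion": "kubeflow.org/v1",  # illustrative, not a schema reference
            "kind": "TFJob",
            "spec": {"image": container_image, **kwargs},
        }
        super().__init__(spec, json_path="{.status.conditions}")

op = TfJobOp(container_image="gcr.io/my-project/trainer:latest")
```

The high-level op is just sugar over the low-level one: it fixes the resource kind and the place in the status where results are expected.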

@hongye-sun
Contributor Author

@jlewi Shall we instead fix tf-job to report the status to k8s?

I feel that K8sOp itself is still very useful even with the limitations.

@jlewi
Contributor

jlewi commented Dec 12, 2018

@hongye-sun I'm not sure I understand the question. TFJob already reports its status using conditions.

@hongye-sun
Contributor Author

@jlewi my question is based on Qiming's comment on tfjob's status. The goal, I think, is to pass the output model data from tfjob to subsequent steps in the pipeline. I am not familiar with tfjob. Is there an easy way today for a user to use the Argo resource template feature and pass the model data to the next step?

@swiftdiaries
Member

TFJob doesn't support passing the model on to the next step. It manages the lifecycle of the training alone.
One way would be to use Argo Artifacts (#336), that seems to be WIP.

@jlewi
Contributor

jlewi commented Jan 22, 2019

@qimingj Why do you need to customize anything about TFJob and K8s job?

These are existing resources that don't have any explicit notion of inputs/outputs. Users use environment variables and command line arguments to pass information to their code. It's up to the code to define what the inputs/outputs are. TFJob and K8s job don't try to impose structure on the code by forcing it to conform to some standard for inputs/outputs.

Can pipelines easily orchestrate K8s resources? e.g. can I specify a step in my pipeline that contains the spec of some K8s resource? I think the desired semantics would be

  1. Create the K8s object
  2. Wait for some condition to be reached
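The two steps above can be sketched generically; `create_fn` and `get_phase_fn` here are hypothetical callables standing in for the actual K8s API calls:

```python
import time

# Generic create-then-wait sketch for the two steps above. create_fn and
# get_phase_fn are hypothetical stand-ins for real K8s API calls.

def run_resource(create_fn, get_phase_fn, success_phase="Succeeded",
                 timeout_s=600, poll_s=5):
    name = create_fn()                      # 1. create the K8s object
    deadline = time.time() + timeout_s
    while time.time() < deadline:           # 2. wait for the condition
        phase = get_phase_fn(name)
        if phase == success_phase:
            return phase
        if phase == "Failed":
            raise RuntimeError(f"{name} failed")
        time.sleep(poll_s)
    raise TimeoutError(name)
```

Argo's resource template expresses the same loop declaratively via `successCondition`/`failureCondition` fields instead of client-side polling.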

From a DSL perspective, I'd expect a convenient Python library for generating the K8s resource e.g.

job = batch.Job()
job.template.containers[0].image = ....
....

Ideally the python client libraries would be autogenerated from the resource specs. I don't know what the current state is of the requisite K8s python client library tooling.
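In the absence of such generated bindings, the same Job can also be expressed as a plain dict matching the `batch/v1` schema, which is what "accepting K8s resource specs" would amount to (the name, image, and arguments below are placeholders):

```python
# A batch/v1 Job spec as a plain dict, mirroring the pseudocode above
# without depending on a generated client library. Names, image, and
# args are placeholders.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "train-my-model"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "trainer",
                    "image": "gcr.io/my-project/trainer:latest",
                    "args": ["--input=/data/in", "--output=/data/out"],
                }],
                "restartPolicy": "Never",
            }
        }
    },
}
```

A dict like this can be serialized to YAML and inlined into a pipeline step the same way Argo's resource template inlines manifests.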

@qimingj
Contributor

qimingj commented Jan 23, 2019

DSL today requires K8s python client library. Accepting K8s resource specs is a great idea.

You are right that it's up to the code to define the inputs/outputs. I think the gap (not just for tf-job, but for K8s in general) is that we need a way for a component to communicate back to the pipeline system: for example, a runtime value (such as training accuracy), some UI metadata (so the pipeline UI can visualize it), or some artifacts (materialized data). Some of these values may be known before the pipeline runs (e.g. a model dir), but some are not (e.g. accuracy).
For example, a trainer may choose to output its accuracy number and pass to downstream components or even DSL condition:

...
trainer = TfJobOp(...)
with dsl.Condition(trainer.accuracy > 0.8):
    pusher = PusherOp(trainer.model_path)
...

It is possible that "trainer.model_path" is known beforehand, but certainly not "trainer.accuracy".

Argo provides a way for this type of communication: if the results are included in the K8s job status, then Argo can parse it and extract the values. I was thinking that if TF Job were more "argo-friendly" and inserted its results into the job status, it would put such K8s specs on par with containers.

Or, as you said, we can declare support for arbitrary K8s jobs, but these jobs cannot output any runtime values or create any visualizations until we figure out something else?

@jlewi
Contributor

jlewi commented Jan 23, 2019

Why do we need a way to communicate back to pipelines components?

Let's suppose we have a TensorFlow program that takes two arguments, --input and --output, and we want to train it using TFJob or a K8s job. Why can't we do something like the following:

def train_my_model(input, output):
    job_spec = ...  # build the K8s spec, setting the --input and --output arguments
    k8s_client.create(job_spec)
    k8s_client.wait_for(job_spec)
    events = read_tf_events_file(output)
    return events.get_accuracy()

Now just use func_to_container_op to turn this into an op for the DSL.

@swiftdiaries
Member

I was looking into Argo Events and this seems like a good fit. I'm currently experimenting with monitoring TFJob resources using this.
https://github.com/argoproj/argo-events/blob/master/docs/tutorial.md#resource

As for why we need a way to communicate back to pipelines,
One use case: we can run parallel branches of the same pipeline with different sets of hyperparameters and choose the model with the best accuracy metric, but we still need a way to fetch the best trained model for a deploy step. For the deploy step to run, we need to pass it the path to the model with the best accuracy. At the bare minimum, we'll need to pass back one output parameter.
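The selection step in that use case is tiny once the branch outputs are available; the sketch below assumes hypothetical (accuracy, model path) pairs collected from the parallel branches:

```python
# Illustrative selection of the best branch's model. The (accuracy, path)
# pairs are hypothetical outputs from parallel training branches.
results = [
    (0.91, "gs://bucket/models/run-a"),
    (0.87, "gs://bucket/models/run-b"),
    (0.94, "gs://bucket/models/run-c"),
]
best_accuracy, best_model_path = max(results, key=lambda r: r[0])
# best_model_path is the single output parameter the deploy step needs.
```

The hard part is not this comparison but getting each branch's accuracy and model path back out of its K8s resource in the first place, which is exactly the status-reporting gap discussed above.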

@vicaire
Contributor

vicaire commented May 31, 2019

@hongye-sun, should we close? Are you expecting more work on this issue?

@elikatsis
Member

Hi everyone,

Looks like this issue is covered by #926, and especially by ResourceOp, which fully implements Argo's resource template.

So, pinging @hongye-sun, do you think this is covered and we should close it?

@hongye-sun
Contributor Author

Yes, it has been covered. Thanks.
