PR for Issue 795 (outdated Pipelines SDK guide) #971

Merged Aug 7, 2019 (14 commits)
content/docs/pipelines/sdk/build-component.md (76 changes: 42 additions, 34 deletions)
@@ -60,57 +60,65 @@ local file, such as `/output.txt`. In the Python class that defines your
 pipeline (see [below](#define-pipeline)) you can
 specify how to map the content of local files to component outputs.

-## Create a Python class for your component
+## Create a Python function to wrap your component

-Define a Python class to describe the interactions with the Docker container
+Define a Python function to describe the interactions with the Docker container
 image that contains your pipeline component. For example, the following
-Python class describes a component that trains an XGBoost model:
+Python function describes a component that trains an XGBoost model:

 ```python
-class TrainerOp(dsl.ContainerOp):
-
-  def __init__(self, name, project, region, cluster_name, train_data, eval_data,
-               target, analysis, workers, rounds, output, is_classification=True):
+def dataproc_train_op(
+    project,
+    region,
+    cluster_name,
+    train_data,
+    eval_data,
+    target,
+    analysis,
+    workers,
+    rounds,
+    output,
+    is_classification=True
+):
     if is_classification:
       config='gs://ml-pipeline-playground/trainconfcla.json'
     else:
       config='gs://ml-pipeline-playground/trainconfreg.json'

-    super(TrainerOp, self).__init__(
-        name=name,
-        image='gcr.io/ml-pipeline/ml-pipeline-dataproc-train:7775692adf28d6f79098e76e839986c9ee55dd61',
-        arguments=[
-            '--project', project,
-            '--region', region,
-            '--cluster', cluster_name,
-            '--train', train_data,
-            '--eval', eval_data,
-            '--analysis', analysis,
-            '--target', target,
-            '--package', 'gs://ml-pipeline-playground/xgboost4j-example-0.8-SNAPSHOT-jar-with-dependencies.jar',
-            '--workers', workers,
-            '--rounds', rounds,
-            '--conf', config,
-            '--output', output,
-        ],
-        file_outputs={'output': '/output.txt'})
+    return dsl.ContainerOp(
+        name='Dataproc - Train XGBoost model',
+        image='gcr.io/ml-pipeline/ml-pipeline-dataproc-train:ac833a084b32324b56ca56e9109e05cde02816a4',
+        arguments=[
+            '--project', project,
+            '--region', region,
+            '--cluster', cluster_name,
+            '--train', train_data,
+            '--eval', eval_data,
+            '--analysis', analysis,
+            '--target', target,
+            '--package', 'gs://ml-pipeline-playground/xgboost4j-example-0.8-SNAPSHOT-jar-with-dependencies.jar',
+            '--workers', workers,
+            '--rounds', rounds,
+            '--conf', config,
+            '--output', output,
+        ],
+        file_outputs={
+            'output': '/output.txt',
+        }
+    )

 ```

-The above class is an extract from the
+The function must return a `dsl.ContainerOp` from the
 [XGBoost Spark pipeline sample](https://github.com/kubeflow/pipelines/blob/master/samples/xgboost-spark/xgboost-training-cm.py).
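As a usage sketch, such a wrapper function is typically called from a `@dsl.pipeline`-decorated function and compiled into a deployable package. The pipeline name and parameter defaults below are illustrative, not taken from the sample, and the sketch assumes `dataproc_train_op` from the snippet above is defined in the same file:

```python
import kfp.dsl as dsl
import kfp.compiler as compiler

@dsl.pipeline(
    name='XGBoost train',
    description='Wires the trainer component into a one-step pipeline.'
)
def xgboost_train_pipeline(
    project='my-project',              # illustrative defaults only
    region='us-central1',
    cluster_name='my-cluster',
    train_data='gs://my-bucket/train.csv',
    eval_data='gs://my-bucket/eval.csv',
    target='label',
    analysis='gs://my-bucket/analysis',
    workers=2,
    rounds=200,
    output='gs://my-bucket/output'
):
    # Inside the pipeline function each argument is a dsl.PipelineParam;
    # dataproc_train_op returns the dsl.ContainerOp that becomes this
    # pipeline's single step.
    train_op = dataproc_train_op(
        project, region, cluster_name, train_data, eval_data,
        target, analysis, workers, rounds, output)

if __name__ == '__main__':
    compiler.Compiler().compile(xgboost_train_pipeline, 'xgboost_train.tar.gz')
```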

 Note:

-* Each component must inherit from
-  [`dsl.ContainerOp`](https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/dsl/_container_op.py).
-* In the `init` arguments, you can include Python native types (such as `str`
-  and `int`) and
-  [`dsl.PipelineParam`](https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/dsl/_pipeline_param.py)
-  types. Each `dsl.PipelineParam` represents a parameter whose value is usually
-  only known at run time. The parameter can be one for which the user provides
-  a value at pipeline run time, or it can be an output from an upstream
-  component.
+* Allowed arguments for `dataproc_train_op` include both Python scalar types (such as `str` and `int`) and [`dsl.PipelineParam`](https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/dsl/_pipeline_param.py) types when constructing the
+  `dsl.ContainerOp` (see the sketches after this list). Each `dsl.PipelineParam` represents a parameter whose value is usually only known at run time. The value is
+  either provided by the user at pipeline run time or received as an output from an upstream component.
 * Although the value of each `dsl.PipelineParam` is only available at run time,
   you can still use the parameters inline in the `arguments` by using `%s`
   variable substitution. At run time the argument contains the value of the
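A minimal sketch of this inline `%s` substitution (the pipeline, step, and image names are illustrative, not from the sample):

```python
import kfp.dsl as dsl

@dsl.pipeline(name='Substitution demo', description='Inline %s substitution.')
def substitution_demo(rounds=200):
    # 'rounds' is a dsl.PipelineParam here. '%s' embeds the parameter's
    # placeholder in the argument string; the placeholder is replaced with
    # the actual value when the pipeline runs.
    train = dsl.ContainerOp(
        name='Train',
        image='alpine:3.9',
        command=['sh', '-c', 'echo "training for %s rounds"' % rounds],
    )
```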
@@ -121,7 +129,7 @@ Note:
 component. To reference the output in code:

 ```python
-op = TrainerOp(...)
+op = dataproc_train_op(...)
 op.outputs['label']
 ```
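A minimal end-to-end sketch of this output-reference pattern (step names and images are hypothetical): `file_outputs` maps `/output.txt` to an output named `output`, and referencing that output from a second step both passes the value and orders the steps:

```python
import kfp.dsl as dsl

@dsl.pipeline(name='Output passing', description='Referencing an upstream output.')
def output_passing_pipeline():
    # Hypothetical producer: writes its result to /output.txt, which
    # file_outputs exposes as the component output named 'output'.
    producer = dsl.ContainerOp(
        name='Producer',
        image='alpine:3.9',
        command=['sh', '-c', 'echo "model-v1" > /output.txt'],
        file_outputs={'output': '/output.txt'},
    )
    # Consuming producer.outputs['output'] wires the value into this step's
    # command and makes it run after the producer.
    consumer = dsl.ContainerOp(
        name='Consumer',
        image='alpine:3.9',
        command=['sh', '-c', 'echo "got: %s"' % producer.outputs['output']],
    )
```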
