SDK - Lightweight - Added support for file outputs #2221

@Ark-kun Ark-kun commented Sep 24, 2019

Lightweight components now let a function mark outputs that it produces by writing data to files instead of returning them as in-memory data objects.
This is useful when the data is expected to be large.

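Example 1 (writing a large amount of data to an output file at the provided path):
```python
from kfp.components import func_to_container_op, OutputPath

@func_to_container_op
def write_big_data(big_file_path: OutputPath(str)):
    # Open in write mode; the default read mode would make write() fail.
    with open(big_file_path, 'w') as big_file:
        for i in range(1000000):
            big_file.write('Hello world\n')
```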
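Example 2 (writing a large amount of data to the provided output file stream):
```python
from kfp.components import func_to_container_op, OutputTextFile

@func_to_container_op
def write_big_data(big_file: OutputTextFile(str)):
    # big_file is an already opened, writable text stream.
    for i in range(1000000):
        big_file.write('Hello world\n')
```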
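
To make the difference between the two styles concrete, here is a minimal plain-Python sketch of what passing a stream instead of a path could look like. It does not use KFP; `run_with_output_text_file` is a hypothetical helper that stands in for whatever opens the file on the function's behalf, and the directory layout is made up.

```python
import os
import tempfile


def run_with_output_text_file(func, output_path):
    """Hypothetical helper: open the output file and hand the stream to func."""
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with open(output_path, 'w') as output_file:
        func(output_file)


def write_greetings(big_file):
    # The function only sees a writable text stream, never the path itself.
    for _ in range(3):
        big_file.write('Hello world\n')


# Made-up location; the real system decides where outputs live.
output_path = os.path.join(tempfile.mkdtemp(), 'outputs', 'big_file', 'data')
run_with_output_text_file(write_greetings, output_path)

with open(output_path) as f:
    written = f.read()
```

With the path style (Example 1) the function opens the file itself; with the stream style (Example 2) something like the helper above does it first.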
@numerology

Good job! General question: is it possible to use OutputPath/OutputTextFile/OutputBinaryFile with a return statement and type hints? Or could we merge OutputPath and InputPath into one class, say ArtifactPath, and use it in both the component producing the artifact and the component consuming it? I vaguely feel that would be a more consistent experience. WDYT?

@Ark-kun
Contributor Author

Ark-kun commented Sep 24, 2019

> is it possible to use OutputPath/OutputTextFile/OutputBinaryFile with a return statement and type hints?

This is not possible, since the input/output paths must be known to the system at compile time.
The paths must therefore be passed into the function by the system, rather than the function generating and returning them at runtime. So, a bit confusingly, the component's output paths are inputs to the program/function.
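
A rough plain-Python sketch of that idea, not KFP's actual mechanism: the command line, including the output location, is fixed before the program runs, and the program simply writes where it is told. The `--big-file-path` flag and the path layout are assumptions made up for the illustration.

```python
import argparse
import os
import tempfile


def program(argv):
    """The containerized program: the output path arrives as an argument."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--big-file-path', required=True)
    args = parser.parse_args(argv)
    with open(args.big_file_path, 'w') as big_file:
        big_file.write('Hello world\n')


# "Compile time": the output location is decided before anything runs.
output_path = os.path.join(tempfile.mkdtemp(), 'big_file_data')
command_line = ['--big-file-path', output_path]

# "Run time": the program writes to the path it was given.
program(command_line)

with open(output_path) as f:
    result = f.read()
```

Because the caller knows `output_path` up front, it can wire that location to downstream steps without asking the program anything.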

> Or could we merge OutputPath and InputPath into one class, say ArtifactPath, and use it in both the component producing the artifact and the component consuming it?

The function signature needs to tell the system which path parameters are inputs and which are outputs. InputPath and OutputPath are just dummy markers to convey that information to the system.
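
As an illustration of how marker annotations can carry that information, here is a minimal sketch using stand-in classes. `InputPathMarker`, `OutputPathMarker` and `classify_parameters` are hypothetical names for this example and are not the SDK's implementation.

```python
import inspect


class InputPathMarker:
    """Stand-in for InputPath: marks a parameter as an input file path."""
    def __init__(self, type=None):
        self.type = type


class OutputPathMarker:
    """Stand-in for OutputPath: marks a parameter as an output file path."""
    def __init__(self, type=None):
        self.type = type


def classify_parameters(func):
    """Split a function's parameters into inputs and outputs by annotation."""
    inputs, outputs = [], []
    for name, param in inspect.signature(func).parameters.items():
        if isinstance(param.annotation, OutputPathMarker):
            outputs.append(name)
        else:
            # Plain parameters and InputPathMarker parameters are both inputs.
            inputs.append(name)
    return inputs, outputs


def copy_data(source_path: InputPathMarker(str), dest_path: OutputPathMarker(str)):
    with open(source_path) as src, open(dest_path, 'w') as dst:
        dst.write(src.read())


inputs, outputs = classify_parameters(copy_data)
print(inputs, outputs)  # ['source_path'] ['dest_path']
```

A single merged class would leave `classify_parameters` with no way to tell the two parameters apart, which is the reason for keeping separate markers.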

@Ark-kun
Contributor Author

Ark-kun commented Sep 24, 2019

An example of a function using both file inputs and outputs:

@func_to_container_op
def write_big_data(input_file: InputTextFile(str), output_file: OutputTextFile(str)):
    while True:
        line = input_file.readline()
        # readline() returns '' (not None) at end of file
        if not line:
            break
        output_file.write('Hello ' + line)
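
The same line-by-line transformation can be tried locally without KFP. This sketch uses in-memory `io.StringIO` streams as stand-ins for the files the system would provide; note that `readline()` signals end of file with an empty string.

```python
import io


def prepend_hello(input_file, output_file):
    while True:
        line = input_file.readline()
        if not line:  # '' means end of file
            break
        output_file.write('Hello ' + line)


source = io.StringIO('world\nKubeflow\n')
sink = io.StringIO()
prepend_hello(source, sink)
result = sink.getvalue()
print(result)
```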

@numerology

/lgtm

@Ark-kun
Contributor Author

Ark-kun commented Sep 25, 2019

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ark-kun


@kevinbache
Contributor

In example 2, I find it weird that we're using the passed location parameter as the object we're calling .write on.

class OutputPath:
    '''When creating component from function, OutputPath should be used as function parameter annotation to tell the system that the function wants to output data by writing it into a file with the given path instead of returning the data from the function.'''
    def __init__(self, type=None):
        self.type = type
Contributor

type?

Contributor Author

It's the input/output type.

Before:
def func(data: list):
After:
def func(data_file: InputFile(list)):

@k8s-ci-robot k8s-ci-robot merged commit 3caba4e into kubeflow:master Sep 25, 2019
@timofal

timofal commented Sep 18, 2020

@Ark-kun Hello!
I spent a lot of time trying to figure out whether I can pass a big binary file between steps when I manually write the Docker image for each step (i.e. component). Can I?

In other words: you are showing how to pass a big binary file using @func_to_container_op. I need the same, but I build manually written Docker images for each step. Also, I don't want to use storage like GCS to save intermediate data. I want k8s to pass the data on its own. Is it possible?

I'd appreciate any advice.

@numerology

Hi @timofal

I think the canonical way of approaching your use case is the following:

  • When using @func_to_container_op, you can specify the image you've just built as the base_image. Make sure it's stored in a registry that KFP has access to;

  • In the producer Python function, use OutputPath to annotate the location where the data will be written; and in the consumer Python function, use InputPath to annotate the location where the data is read from.
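
A plain-Python sketch of the producer/consumer shape described in the second point, without KFP: each function only sees a local path, and a `shutil.copyfile` call stands in for whatever the system does to move the file between steps. The function names and file names are made up for the illustration.

```python
import os
import shutil
import tempfile


def produce(data_path):
    # Producer: writes binary data to the path it was given (OutputPath role).
    with open(data_path, 'wb') as f:
        for _ in range(4):
            f.write(b'\x00\x01\x02\x03')


def consume(data_path):
    # Consumer: reads from the path it was given (InputPath role).
    with open(data_path, 'rb') as f:
        return len(f.read())


workdir = tempfile.mkdtemp()
produced = os.path.join(workdir, 'producer_output')
consumed = os.path.join(workdir, 'consumer_input')

produce(produced)
# Stand-in for the system moving the artifact between steps.
shutil.copyfile(produced, consumed)
size = consume(consumed)
print(size)  # 16
```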

@timofal

timofal commented Sep 18, 2020

@numerology
Thank you for your response.
I believe using @func_to_container_op would be very inconvenient for me because this step has exotic dependencies, a bunch of bash scripts, and probably something else. All of this is already dockerized and works fine.

I checked the documentation. The Docker image for @func_to_container_op is meant to be a base image. I could probably implement a Python function that finds my entry point in the base image and starts the process, but that looks like a weird workaround. Is there a way to share big files without using @func_to_container_op?

@numerology

@timofal

Another way is perhaps to write a component YAML spec that refers to your Docker image and uses similar placeholders there.
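
To give a feel for what such placeholders do, here is a hedged sketch that writes the spec as a Python dict mirroring the YAML structure and resolves it with a toy function. The `inputPath`/`outputPath` placeholder keys are the ones component specs use; the image name, entry point, concrete path and `resolve_command` are all made up for this illustration and are not KFP code.

```python
# Mirrors the shape of a component YAML spec as a Python dict.
component_spec = {
    'name': 'Produce big file',
    'outputs': [{'name': 'big_file'}],
    'implementation': {
        'container': {
            'image': 'registry.example.com/my-step:latest',  # hypothetical image
            'command': ['/app/run.sh', '--output', {'outputPath': 'big_file'}],
        }
    },
}


def resolve_command(command, input_paths, output_paths):
    """Toy resolver: swap path placeholders for concrete paths."""
    resolved = []
    for arg in command:
        if isinstance(arg, dict) and 'outputPath' in arg:
            resolved.append(output_paths[arg['outputPath']])
        elif isinstance(arg, dict) and 'inputPath' in arg:
            resolved.append(input_paths[arg['inputPath']])
        else:
            resolved.append(arg)
    return resolved


command = resolve_command(
    component_spec['implementation']['container']['command'],
    input_paths={},
    output_paths={'big_file': '/tmp/outputs/big_file/data'},  # made-up path
)
print(command)  # ['/app/run.sh', '--output', '/tmp/outputs/big_file/data']
```

The existing entry point in the image only needs to accept the output location as an argument and write the file there.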

See examples in our first party components:
