[BUG] StructuredDataset file_format becomes an empty str through dataclass attribute access #6096

Open
2 tasks done
JiangJiaWei1103 opened this issue Dec 8, 2024 · 2 comments · May be fixed by flyteorg/flytekit#3027
Labels
bug Something isn't working flytekit FlyteKit Python related issue

Comments

@JiangJiaWei1103
Contributor

JiangJiaWei1103 commented Dec 8, 2024

Describe the bug

In this case, a workflow runs with an input dataclass that contains a StructuredDataset attribute. The following is a simple definition:

@dataclass
class DC:
    # StructuredDataset with local uri
    lsd: StructuredDataset = field(default_factory=lambda: StructuredDataset(uri=LOCAL_URI, file_format=FILE_FORMAT))

    # StructuredDataset with remote uri
    rsd: StructuredDataset = field(default_factory=lambda: StructuredDataset(uri=REMOTE_URI, file_format=FILE_FORMAT))

When we run the workflow remotely, we observe that the file_format field becomes an empty string, as illustrated in the following screenshot:

Screenshot 2024-12-08 at 12 38 13 PM

Initial Thoughts

We suspect that msgpack serialization doesn't handle file_format properly, because file_format is already an empty string right after inputs.pb is loaded as input_proto:

Screenshot 2024-12-08 at 1 10 23 PM

If I understand correctly, \240 (0xA0 in hex) is a msgpack fixstr header with a length of zero, which means file_format was serialized as an empty string.
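To double-check the reading of that byte, here is a minimal sketch of a fixstr decoder in plain Python. The helper name decode_fixstr is made up for illustration and is not part of flytekit or the msgpack package; it only handles the fixstr family (header bytes 0xA0 to 0xBF, where the low 5 bits give the length).

```python
def decode_fixstr(data: bytes, offset: int = 0) -> tuple[str, int]:
    """Decode one msgpack fixstr starting at `offset`.

    Hypothetical helper for illustration only. Returns (text, next_offset).
    """
    header = data[offset]
    # fixstr headers look like 0b101xxxxx (0xA0..0xBF);
    # the low 5 bits hold the payload length in bytes.
    if header & 0xE0 != 0xA0:
        raise ValueError(f"byte 0x{header:02X} is not a fixstr header")
    length = header & 0x1F
    start = offset + 1
    end = start + length
    if end > len(data):
        raise ValueError("truncated fixstr payload")
    return data[start:end].decode("utf-8"), end


# b"\240" is the octal escape seen in the screenshot, i.e. the single byte 0xA0.
empty_text, _ = decode_fixstr(b"\240")
# What a correctly serialized "parquet" would look like: 0xA7 followed by 7 bytes.
parquet_text, _ = decode_fixstr(b"\xa7parquet")

print(repr(empty_text))
print(repr(parquet_text))
```

So a lone 0xA0 carries no payload bytes at all, whereas a preserved "parquet" would show up as 0xA7 followed by the seven characters.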

Expected behavior

The file_format of the StructuredDataset should keep the original input value (i.e., "parquet" in this case).
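As a rough way to check this expectation against raw bytes, the sketch below looks up the value stored next to a "file_format" key in a msgpack blob. Everything here is an assumption for illustration: the helper names, the short placeholder URI, and the hand-built payloads (a two-entry fixmap with "uri" and "file_format" keys) are not taken from flytekit, and the naive byte search only works for fixstr keys and values.

```python
def fixstr(text: str) -> bytes:
    """Encode `text` as a msgpack fixstr (hypothetical helper, max 31 bytes)."""
    raw = text.encode("utf-8")
    if len(raw) > 31:
        raise ValueError("fixstr can hold at most 31 bytes")
    return bytes([0xA0 | len(raw)]) + raw


def read_fixstr_after_key(payload: bytes, key: str) -> str:
    """Return the fixstr value that directly follows `key` in `payload`.

    Naive byte search, not a real msgpack parser.
    """
    marker = fixstr(key)
    pos = payload.find(marker)
    if pos < 0:
        raise KeyError(key)
    value_at = pos + len(marker)
    header = payload[value_at]
    if header & 0xE0 != 0xA0:
        raise ValueError("value after key is not a fixstr")
    length = header & 0x1F
    return payload[value_at + 1 : value_at + 1 + length].decode("utf-8")


# Hand-built stand-ins for the serialized StructuredDataset (0x82 = fixmap, 2 entries).
expected_payload = (
    b"\x82" + fixstr("uri") + fixstr("s3://bucket/df.parquet")
    + fixstr("file_format") + fixstr("parquet")
)
observed_payload = (
    b"\x82" + fixstr("uri") + fixstr("s3://bucket/df.parquet")
    + fixstr("file_format") + fixstr("")
)

print(repr(read_fixstr_after_key(expected_payload, "file_format")))
print(repr(read_fixstr_after_key(observed_payload, "file_format")))
```

The first payload models the expected behavior; the second models what the screenshot suggests, where the key is immediately followed by 0xA0.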

Additional context to reproduce

Run the following script to trigger the remote run of the workflow:

from dataclasses import dataclass, field
from pathlib import Path

import pandas as pd
from flytekit import task, workflow, ImageSpec
from flytekit.types.structured import StructuredDataset


# Build image
flytekit_hash = "adc1061709b2cff74c2e66dd65399d6a59954023"
flytekit = f"git+https://github.com/flyteorg/flytekit.git@{flytekit_hash}"
image = ImageSpec(
    packages=[flytekit, "pandas", "pyarrow"],
    apt_packages=["git"],
    registry="localhost:30000",
)


# Define constants
LOCAL_URI = "./df.parquet"
REMOTE_URI = "s3://my-s3-bucket/s3_flyte_dir/df.parquet"
FILE_FORMAT = "parquet"


@dataclass
class DC:
    # StructuredDataset with local uri
    lsd: StructuredDataset = field(default_factory=lambda: StructuredDataset(uri=LOCAL_URI, file_format=FILE_FORMAT))

    # StructuredDataset with remote uri
    rsd: StructuredDataset = field(default_factory=lambda: StructuredDataset(uri=REMOTE_URI, file_format=FILE_FORMAT))


@task(container_image=image)
def direct_sd(sd: StructuredDataset) -> StructuredDataset:
    """Pass through a StructuredDataset without any action."""
    print(f"SD | {sd}")
    print(f"Literal SD | {sd._literal_sd}")
    print(f"DF\n{'-'*30}\n{sd.open(pd.DataFrame).all()}")
    return sd


@workflow
def wf1(sd: StructuredDataset) -> StructuredDataset:
    """Pass through a StructuredDataset without any action."""
    return direct_sd(sd=sd)


@workflow
def wf2(dc: DC) -> StructuredDataset:
    """Pass through a StructuredDataset with attr access."""
    return direct_sd(sd=dc.rsd)


if __name__ == "__main__":
    from flytekit.clis.sdk_in_container import pyflyte
    from click.testing import CliRunner

    # Configure the current run
    script_path = str(Path(__file__).absolute())
    dc = '{"dc": {"rsd": {"uri": "s3://my-s3-bucket/s3_flyte_dir/df.parquet", "file_format": "parquet"}}}'

    runner = CliRunner()
    result = runner.invoke(pyflyte.main, ["run", "--remote", script_path, "wf2", "--dc", dc])
    print(result.output)

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@JiangJiaWei1103 JiangJiaWei1103 added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Dec 8, 2024
@Future-Outlier
Member

thank you!
ping me when this is done <3

@JiangJiaWei1103
Contributor Author

No problem, thanks bro!

@eapolinario eapolinario removed the untriaged This issues has not yet been looked at by the Maintainers label Dec 12, 2024
@eapolinario eapolinario added the flytekit FlyteKit Python related issue label Dec 19, 2024
@davidmirror-ops davidmirror-ops moved this from Backlog to Assigned in Flyte Issues/PRs maintenance Jan 2, 2025