Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Fix StructuredDataset empty-str file_format in dc attr access #3027

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

JiangJiaWei1103
Copy link
Contributor

@JiangJiaWei1103 JiangJiaWei1103 commented Jan 1, 2025

Tracking issue

Closes flyteorg/flyte#6096.

Why are the changes needed?

When users create a StructuredDataset with a specified file_format (e.g., parquet), the file_format information will be accidentally discarded in this case during async_to_literal call. To be concrete, StructuredDatasetType's format attribute is set to GENERIC_FORMAT, which is an empty string "".

What changes were proposed in this pull request?

Override StructuredDatasetType's format attribute when users explicitly set file_format of python native StructuredDataset.

How was this patch tested?

This patch is tested through the newly added integration test and double checked by observing the flyte console I/O and the task pod stdout.

Setup process

For local run, the setup process is summarized as follows:

git clone https://github.com/flyteorg/flytekit.git
gh pr checkout 3027
make setup && pip install -e .

After installation, run the following command:

pytest -svvv tests/flytekit/integration/remote/test_remote.py::test_sd_attr

Screenshots

The following results are expected:

  • Flyte console input
    Screenshot 2025-01-04 at 2 44 54 PM

  • Flyte console output
    Screenshot 2025-01-04 at 2 45 03 PM

  • Task pod stdout
    Screenshot 2025-01-04 at 2 44 43 PM

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

Fixed a critical bug in StructuredDataset where user-specified file_format information was being lost during async_to_literal conversion. The fix implements format attribute override in StructuredDatasetType to preserve user-defined formats. Changes include transformer engine modifications and integration tests.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 2

Signed-off-by: JiaWei Jiang <waynechuang97@gmail.com>
@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at eduardo@union.ai.

Copy link

codecov bot commented Jan 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.79%. Comparing base (e9a7da1) to head (cdb56c0).
Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3027      +/-   ##
==========================================
+ Coverage   79.52%   82.79%   +3.26%     
==========================================
  Files         201        3     -198     
  Lines       21250      186   -21064     
  Branches     2729        0    -2729     
==========================================
- Hits        16900      154   -16746     
+ Misses       3574       32    -3542     
+ Partials      776        0     -776     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at eduardo@union.ai.

Signed-off-by: JiaWei Jiang <waynechuang97@gmail.com>
@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at eduardo@union.ai.

Signed-off-by: JiaWei Jiang <waynechuang97@gmail.com>
@JiangJiaWei1103 JiangJiaWei1103 changed the title [WIP] Fix StructuredDataset empty-str file_format in dc attr access [BUG] Fix StructuredDataset empty-str file_format in dc attr access Jan 4, 2025
@JiangJiaWei1103
Copy link
Contributor Author

Follow-up

During fixing this bug, we observe another two phenomenon, which are briefly described as follows:

  1. pyflyte run can't handle the dataclass input as a json string, e.g.,
pyflyte run --remote test.py wf --dc '{"dc": {"sd": {"uri": "s3://my-s3-bucket/df.parquet", "file_format": "parquet"}}}'

There are two sub-cases:

  • DC's sd is defined without default_factory
    • Get KeyError: 'sd'
  • DC's sd is defined with default_factory
    • wf always takes the default_factory setup as input no matter what's specified in the json string
  1. uri information loss, which is just similar to what we've solved in this patch
    Screenshot 2025-01-04 at 2 44 43 PM
    As can be observed, StructuredDataset uri becomes None.

We plan to solve these issues in the future which can make attribute access result more accurate. Thanks!

@flyte-bot
Copy link
Contributor

flyte-bot commented Jan 4, 2025

Code Review Agent Run #2ae7a3

Actionable Suggestions - 0
Review Details
  • Files reviewed - 3 · Commit Range: cdb56c0..46b7b62
    • flytekit/types/structured/structured_dataset.py
    • tests/flytekit/integration/remote/test_remote.py
    • tests/flytekit/integration/remote/workflows/basic/sd_attr.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito Logo

@flyte-bot
Copy link
Contributor

Changelist by Bito

This pull request implements the following key changes.

Key Change Files Impacted
Bug Fix - Fix StructuredDataset File Format Retention

structured_dataset.py - Added logic to retain user-specified file_format in StructuredDataset

test_remote.py - Added integration test for StructuredDataset file_format attribute

sd_attr.py - Added new test workflow for StructuredDataset attribute validation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

[BUG] StructuredDataset file_format becomes an empty str through dataclass attribute access
2 participants