
[docs] Add custom Dataset classes and Materializers in ZenML #3091

Merged: 11 commits into develop from doc/handle-big-data on Oct 18, 2024

Conversation

htahir1 (Contributor) commented Oct 16, 2024

Describe changes

Added docs for some big data use-cases

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the contribution guide on rebasing your branch to develop.
  • If my changes require changes to the dashboard, these changes are communicated/requested.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

github-actions bot added the internal label (To filter out internal PRs and issues) on Oct 16, 2024
docs/book/how-to/handle-data-artifacts/datasets.md (outdated)
@@ -0,0 +1,229 @@
---
description: Learn about how to manage big data with ZenML.
Contributor:

Same description as the other document.

docs/book/how-to/handle-data-artifacts/manage-big-data.md (outdated)

return df
```

Contributor:

You could also add a 4th section here on using numba to speed things up even more. See https://numba.pydata.org/numba-doc/0.12/tutorial_numpy_and_numba.html?external_link=true for the official docs and https://www.perplexity.ai/search/does-numba-help-make-numpy-ope-osEdiV2wSwSd4LRj55tuXg for a little example, but it can make a huge speed difference.
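For what it's worth, a minimal sketch of the kind of numba speedup meant here; the function and workload are illustrative, not taken from the docs in this PR:

```python
import numpy as np
from numba import njit


@njit
def pairwise_l2(points: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances with explicit loops, compiled to machine code by numba."""
    n, dims = points.shape
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(dims):
                d = points[i, k] - points[j, k]
                acc += d * d
            out[i, j] = np.sqrt(acc)
    return out


points = np.random.rand(500, 3)
distances = pairwise_l2(points)  # first call compiles; subsequent calls are much faster
```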

docs/book/how-to/handle-data-artifacts/manage-big-data.md (outdated)
ray_pipeline(input_data="path/to/your/data.csv")
```

As with Spark, you'll need to have Ray installed in your environment and ensure that the necessary Ray dependencies are available when running your pipeline.
Contributor:

Did you test that these work?

Contributor (author):

Not really, but in theory it should work? It's hard to test Spark.

Contributor:

Even more reason to make sure it actually works and not let users go through all that trouble of setting it up only to end up with a non-working solution?
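For context, a minimal hedged sketch of the kind of step that would exercise Ray inside a ZenML pipeline; the step name and workload are illustrative, not the docs' code:

```python
import ray
from zenml import step


@step
def count_rows_with_ray(data_path: str) -> int:
    """Count the data rows of a CSV inside a Ray task, to verify Ray runs within a step."""
    ray.init(ignore_reinit_error=True)

    @ray.remote
    def count_rows(path: str) -> int:
        with open(path) as f:
            return sum(1 for _ in f) - 1  # subtract the header line

    try:
        return ray.get(count_rows.remote(data_path))
    finally:
        ray.shutdown()
```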

return BigQueryDataset(table_id="project.dataset.transformed_table", df=transformed_df)

@pipeline
def etl_pipeline(mode: str = "develop") -> Dataset:
Contributor:

This does not actually return a Dataset; I would just remove the annotation entirely.
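For reference, a self-contained sketch of the suggested shape with the return annotation dropped; the helper steps here are simplified stand-ins, not the actual steps from the docs:

```python
from zenml import pipeline, step


@step
def extract_data_local(data_path: str = "data/raw_data.csv") -> str:
    # Stand-in for the docs' extraction step
    return data_path


@step
def transform_data(path: str) -> None:
    # Stand-in for the docs' transformation step
    print(f"Transforming {path}")


@pipeline
def etl_pipeline(mode: str = "develop"):
    # No `-> Dataset` annotation: the pipeline wires steps together rather than
    # returning a Dataset object itself.
    raw = extract_data_local()
    transform_data(raw)


if __name__ == "__main__":
    etl_pipeline()
```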

docs/book/how-to/handle-data-artifacts/datasets.md (outdated)
metadata = {"data_path": dataset.data_path}
with fileio.open(os.path.join(self.uri, "metadata.json"), "w") as f:
    json.dump(metadata, f)
if dataset.df is not None:
Contributor:

Shouldn't this call dataset.read_data() which returns a dataframe, and then save that one?

Also, by having this path somehow on the dataset, it sort of goes against ZenML materializers. This csv file is most likely not stored in the artifact store but locally somewhere, which means this materializer is useless in remote scenarios.

Contributor:

Instead, this should IMO be like this:

def save(self, dataset: CSVDataset) -> None:
  data_path = os.path.join(self.uri, "data.csv")
  # now write it to a temp location and copy to the artifact store


def load(self, ...):
  # copy somewhere locally
  return CSVDataset(local_temp_path)
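Fleshed out, that suggestion could look roughly like the sketch below. It assumes a CSVDataset that wraps a local CSV path and exposes read_data(); the stub class here is a stand-in for the docs' version, and the exact file layout under self.uri is an assumption:

```python
import os
import tempfile
from typing import Type

import pandas as pd
from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer


class CSVDataset:
    """Minimal stand-in for the docs' CSVDataset: wraps a local CSV path."""

    def __init__(self, data_path: str) -> None:
        self.data_path = data_path

    def read_data(self) -> pd.DataFrame:
        return pd.read_csv(self.data_path)


class CSVDatasetMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (CSVDataset,)

    def save(self, dataset: CSVDataset) -> None:
        # Write the dataframe to a local temp file first ...
        df = dataset.read_data()
        with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as tmp:
            df.to_csv(tmp, index=False)
            temp_path = tmp.name
        # ... then copy it into the artifact store so remote stacks can read it.
        with open(temp_path, "rb") as source_file:
            with fileio.open(os.path.join(self.uri, "data.csv"), "wb") as target_file:
                target_file.write(source_file.read())
        os.remove(temp_path)

    def load(self, data_type: Type[CSVDataset]) -> CSVDataset:
        # Copy the CSV from the artifact store to a local temp location ...
        temp_path = os.path.join(tempfile.mkdtemp(), "data.csv")
        with fileio.open(os.path.join(self.uri, "data.csv"), "rb") as source_file:
            with open(temp_path, "wb") as target_file:
                target_file.write(source_file.read())
        # ... and return a dataset pointing at that local copy.
        return CSVDataset(temp_path)
```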

htahir1 and others added 5 commits October 17, 2024 13:29
htahir1 requested review from schustmi and strickvl October 17, 2024 12:09

# Copy the CSV file from the artifact store to the temporary location
with fileio.open(os.path.join(self.uri, "data.csv"), "rb") as source_file:
    with open(temp_path, "wb") as target_file:
Contributor:

Instead of opening up the file again, you can simply nest the two `with` statements; tempfile.NamedTemporaryFile already opens a file.
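A minimal sketch of that nesting; the helper name and the artifact_uri parameter (standing in for the materializer's self.uri) are illustrative:

```python
import os
import tempfile

from zenml.io import fileio


def copy_csv_to_temp(artifact_uri: str) -> str:
    """Copy data.csv from the artifact store into a temp file using nested `with` blocks."""
    # NamedTemporaryFile already yields an open file object, so the artifact-store
    # read can be nested inside it instead of reopening the temp path afterwards.
    with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as target_file:
        with fileio.open(os.path.join(artifact_uri, "data.csv"), "rb") as source_file:
            target_file.write(source_file.read())
        return target_file.name
```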


def save(self, dataset: CSVDataset) -> None:
    # Ensure we have data to save
    if dataset.df is None:
Contributor:

df = dataset.read_data()

with ...:
  df.to_csv

Fewer lines, and it passes mypy.

return processed_data
```

2. **Create specialized steps**: Implement separate steps for different dataset types to handle specific processing requirements while keeping your code modular. This approach allows you to tailor your processing to the unique characteristics of each data source.
Contributor:

Isn't this the opposite of the first best practice?

Contributor (author):

took it out


```python
@step
def process_data(dataset: Dataset) -> ProcessedData:
Contributor:

This is not supposed to be runnable, right? I can't seem to find the ProcessedData class anywhere.

htahir1 requested a review from schustmi October 18, 2024 14:01
Comment on lines +148 to +149
@step(output_materializer=CSVDatasetMaterializer)
def extract_data_local(data_path: str = "data/raw_data.csv") -> CSVDataset:
Contributor:

Either the return type should be Dataset or the output materializer is not necessary, no?

Contributor (author):

I wanted to be clear about which materializer gets picked up... I think we need to do it?

Contributor:

In this case the output is the exact class, so I don't think the materializer needs to be specified.
I think what we really need is to actually run this code.
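For illustration, a sketch of that point, assuming CSVDataset and CSVDatasetMaterializer are defined as in the docs (or as in the sketch above) with CSVDataset listed in ASSOCIATED_TYPES:

```python
from zenml import step


@step
def extract_data_local(data_path: str = "data/raw_data.csv") -> CSVDataset:
    # Because the return annotation is the concrete CSVDataset class and the
    # materializer registers it via ASSOCIATED_TYPES, ZenML can resolve the
    # materializer automatically; `output_materializer=...` is then optional.
    return CSVDataset(data_path)
```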

htahir1 merged commit b8e4faa into develop Oct 18, 2024
8 of 9 checks passed
htahir1 deleted the doc/handle-big-data branch October 18, 2024 14:40
Labels: internal (To filter out internal PRs and issues)
3 participants