[docs] Add custom Dataset classes and Materializers in ZenML #3091
Conversation
@@ -0,0 +1,229 @@
---
description: Learn about how to manage big data with ZenML.
Same description as the other document.
return df
```
You could also add a 4th section here on using numba to speed things up even more. See https://numba.pydata.org/numba-doc/0.12/tutorial_numpy_and_numba.html for the official docs and https://www.perplexity.ai/search/does-numba-help-make-numpy-ope-osEdiV2wSwSd4LRj55tuXg for a little example; it can make a huge speed difference.
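For instance, that section could include a sketch like this (illustrative only; assumes `numba` is installed, and the function is made up for the example):

```python
import numpy as np
from numba import njit


@njit(cache=True)  # compiled to machine code on first call
def rolling_mean(values: np.ndarray, window: int) -> np.ndarray:
    """Explicit loops like this are exactly where numba shines."""
    out = np.empty(values.shape[0] - window + 1)
    for i in range(out.shape[0]):
        total = 0.0
        for j in range(window):
            total += values[i + j]
        out[i] = total / window
    return out


data = np.random.rand(1_000_000)
rolling_mean(data, 50)  # first call compiles; later calls run at C-like speed
```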
ray_pipeline(input_data="path/to/your/data.csv")
```

As with Spark, you'll need to have Ray installed in your environment and ensure that the necessary Ray dependencies are available when running your pipeline.
did you test these work?
Not really, but in theory it should work? It's hard to test Spark.
Even more reason to make sure it actually works and not let users go through all that trouble of setting it up only to end up with a non-working solution?
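A quick local smoke test along these lines (assuming `ray` is pip-installed; the task is deliberately trivial) would at least catch import and setup problems before a user hits them:

```python
import ray

ray.init()  # spins up a local Ray runtime for testing


@ray.remote
def double(x: int) -> int:
    return x * 2


# run one trivial task to confirm scheduling and serialization work
assert ray.get(double.remote(21)) == 42
ray.shutdown()
```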
return BigQueryDataset(table_id="project.dataset.transformed_table", df=transformed_df)

@pipeline
def etl_pipeline(mode: str = "develop") -> Dataset:
This does not actually return Dataset, I would just remove the annotation entirely
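That is, roughly:

```python
@pipeline
def etl_pipeline(mode: str = "develop"):  # no misleading return annotation
    ...
```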
metadata = {"data_path": dataset.data_path} | ||
with fileio.open(os.path.join(self.uri, "metadata.json"), "w") as f: | ||
json.dump(metadata, f) | ||
if dataset.df is not None: |
Shouldn't this call `dataset.read_data()`, which returns a dataframe, and then save that one?

Also, by having this path somehow on the dataset, it sort of goes against ZenML materializers. This CSV file is most likely not stored in the artifact store but locally somewhere, which means this materializer is useless in remote scenarios.
Instead, this should IMO be like this:
def save(self, dataset: CSVDataset) -> None:
    data_path = os.path.join(self.uri, "data.csv")
    # now write it to a temp location and copy to the artifact store

def load(self, ...):
    # copy somewhere locally
    return CSVDataset(local_temp_path)
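Spelled out a bit more, that could look like the following (an untested sketch: it assumes the `CSVDataset` class from this doc can be constructed from a local path and exposes `read_data()`, and it uses ZenML's `fileio.copy` to move files in and out of the artifact store):

```python
import os
import tempfile

from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer


class CSVDatasetMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (CSVDataset,)  # CSVDataset as defined earlier in the doc

    def save(self, dataset: CSVDataset) -> None:
        # write the dataframe to a local temp file first ...
        df = dataset.read_data()
        with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as tmp:
            df.to_csv(tmp, index=False)
        # ... then copy it into the artifact store
        fileio.copy(tmp.name, os.path.join(self.uri, "data.csv"))
        os.remove(tmp.name)

    def load(self, data_type: type) -> CSVDataset:
        # copy out of the artifact store to a local temp path
        temp_path = os.path.join(tempfile.mkdtemp(), "data.csv")
        fileio.copy(os.path.join(self.uri, "data.csv"), temp_path)
        return CSVDataset(temp_path)
```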
# Copy the CSV file from the artifact store to the temporary location
with fileio.open(os.path.join(self.uri, "data.csv"), "rb") as source_file:
    with open(temp_path, "wb") as target_file:
Instead of opening up the file again, you can simply nest the two `with`s; `tempfile.NamedTemporaryFile` already opens up a file.
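I.e., roughly (a sketch of the body of the materializer's `load`, where `self.uri` is the artifact URI as in the doc):

```python
with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as target_file:
    # NamedTemporaryFile already hands us an open, writable handle
    with fileio.open(os.path.join(self.uri, "data.csv"), "rb") as source_file:
        target_file.write(source_file.read())
temp_path = target_file.name
```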
def save(self, dataset: CSVDataset) -> None:
    # Ensure we have data to save
    if dataset.df is None:
df = dataset.read_data()
with ...:
    df.to_csv

Fewer lines, and it passes mypy.
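Filled in, that would read roughly (sketch):

```python
def save(self, dataset: CSVDataset) -> None:
    # read_data() always returns a DataFrame, so mypy sees no Optional here
    df = dataset.read_data()
    with fileio.open(os.path.join(self.uri, "data.csv"), "w") as f:
        df.to_csv(f, index=False)
```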
return processed_data
```

2. **Create specialized steps**: Implement separate steps for different dataset types to handle specific processing requirements while keeping your code modular. This approach allows you to tailor your processing to the unique characteristics of each data source.
Isn't this the opposite of the first best practice?
took it out
```python
@step
def process_data(dataset: Dataset) -> ProcessedData:
This is not supposed to be runnable, right? I can't seem to find the `ProcessedData` class anywhere.
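For the snippet to type-check, the doc would need at least a placeholder along these lines (hypothetical; this class does not exist in the docs today):

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class ProcessedData:
    """Hypothetical container for the output of process_data."""

    df: pd.DataFrame
```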
@step(output_materializer=CSVDatasetMaterializer)
def extract_data_local(data_path: str = "data/raw_data.csv") -> CSVDataset:
Either the return type should be `Dataset` or the output materializer is not necessary, no?
I wanted to be clear about which materializer gets picked up... I think we need to do it?
In this case the output is the exact class, so I don't think the materializer needs to be specified. I think what we really need is to actually run this code.
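To illustrate: since `CSVDatasetMaterializer` lists `CSVDataset` in its `ASSOCIATED_TYPES`, ZenML should resolve it from the return type alone, so these two definitions ought to behave identically (sketch):

```python
# explicit: pins the materializer even though the type already matches
@step(output_materializer=CSVDatasetMaterializer)
def extract_data_local_explicit(data_path: str = "data/raw_data.csv") -> CSVDataset:
    return CSVDataset(data_path)


# implicit: ZenML picks CSVDatasetMaterializer via ASSOCIATED_TYPES
@step
def extract_data_local_implicit(data_path: str = "data/raw_data.csv") -> CSVDataset:
    return CSVDataset(data_path)
```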
Describe changes

Added docs for some big data use-cases.

Pre-requisites

Please ensure you have done the following:

- I have based my branch on `develop` and the open PR is targeting `develop`. If your branch wasn't based on develop, read the Contribution guide on rebasing branch to develop.

Types of changes