
[docs] Add custom Dataset classes and Materializers in ZenML #3091

Merged: 11 commits into develop from doc/handle-big-data on Oct 18, 2024

Conversation

htahir1 (Contributor) commented Oct 16, 2024

Describe changes

Added docs for some big data use-cases

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the contribution guide on rebasing your branch to develop.
  • If my changes require changes to the dashboard, these changes are communicated/requested.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

github-actions bot added the internal label (To filter out internal PRs and issues) on Oct 16, 2024
docs/book/how-to/handle-data-artifacts/datasets.md (outdated)
@@ -0,0 +1,229 @@
---
description: Learn about how to manage big data with ZenML.
Contributor:

Same description as the other document.

docs/book/how-to/handle-data-artifacts/manage-big-data.md (outdated)

return df
```

Contributor:

You could also add a 4th section here on using numba to speed things up even more. See https://numba.pydata.org/numba-doc/0.12/tutorial_numpy_and_numba.html?external_link=true for the official docs and https://www.perplexity.ai/search/does-numba-help-make-numpy-ope-osEdiV2wSwSd4LRj55tuXg for a little example, but it can make a huge speed difference.
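For what it's worth, a minimal sketch of the kind of numba speedup meant here; the function and workload are illustrative, not taken from the docs in this PR:

```python
import numpy as np
from numba import njit


@njit
def pairwise_l2(points: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances with explicit loops, compiled to machine code by numba."""
    n, dims = points.shape
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(dims):
                d = points[i, k] - points[j, k]
                acc += d * d
            out[i, j] = np.sqrt(acc)
    return out


points = np.random.rand(500, 3)
distances = pairwise_l2(points)  # first call compiles; subsequent calls are much faster
```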

docs/book/how-to/handle-data-artifacts/manage-big-data.md (outdated)
ray_pipeline(input_data="path/to/your/data.csv")
```

As with Spark, you'll need to have Ray installed in your environment and ensure that the necessary Ray dependencies are available when running your pipeline.
Contributor:

Did you test that these work?

Contributor (author):

Not really, but in theory it should work? It's hard to test Spark.

Contributor:

Even more reason to make sure it actually works and not let users go through all that trouble of setting it up only to end up with a non-working solution?
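For context, a minimal hedged sketch of the kind of step that would exercise Ray inside a ZenML pipeline; the step name and workload are illustrative, not the docs' code:

```python
import ray
from zenml import step


@step
def count_rows_with_ray(data_path: str) -> int:
    """Count the data rows of a CSV inside a Ray task, to verify Ray runs within a step."""
    ray.init(ignore_reinit_error=True)

    @ray.remote
    def count_rows(path: str) -> int:
        with open(path) as f:
            return sum(1 for _ in f) - 1  # subtract the header line

    try:
        return ray.get(count_rows.remote(data_path))
    finally:
        ray.shutdown()
```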

return BigQueryDataset(table_id="project.dataset.transformed_table", df=transformed_df)

@pipeline
def etl_pipeline(mode: str = "develop") -> Dataset:
Contributor:

This does not actually return a Dataset; I would just remove the annotation entirely.
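For reference, a self-contained sketch of the suggested shape with the return annotation dropped; the helper steps here are simplified stand-ins, not the actual steps from the docs:

```python
from zenml import pipeline, step


@step
def extract_data_local(data_path: str = "data/raw_data.csv") -> str:
    # Stand-in for the docs' extraction step
    return data_path


@step
def transform_data(path: str) -> None:
    # Stand-in for the docs' transformation step
    print(f"Transforming {path}")


@pipeline
def etl_pipeline(mode: str = "develop"):
    # No `-> Dataset` annotation: the pipeline wires steps together rather than
    # returning a Dataset object itself.
    raw = extract_data_local()
    transform_data(raw)


if __name__ == "__main__":
    etl_pipeline()
```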

docs/book/how-to/handle-data-artifacts/datasets.md (outdated)
metadata = {"data_path": dataset.data_path}
with fileio.open(os.path.join(self.uri, "metadata.json"), "w") as f:
    json.dump(metadata, f)
if dataset.df is not None:
Contributor:

Shouldn't this call dataset.read_data() which returns a dataframe, and then save that one?

Also, by having this path somehow on the dataset, it sort of goes against ZenML materializers. This csv file is most likely not stored in the artifact store but locally somewhere, which means this materializer is useless in remote scenarios.

Contributor:

Instead, this should IMO be like this:

def save(self, dataset: CSVDataset) -> None:
  data_path = os.path.join(self.uri, "data.csv")
  # now write it to a temp location and copy to the artifact store


def load(self, ...):
  # copy somewhere locally
  return CSVDataset(local_temp_path)
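Fleshed out, that suggestion could look roughly like the sketch below. It assumes a CSVDataset that wraps a local CSV path and exposes read_data(); the stub class here is a stand-in for the docs' version, and the exact file layout under self.uri is an assumption:

```python
import os
import tempfile
from typing import Type

import pandas as pd
from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer


class CSVDataset:
    """Minimal stand-in for the docs' CSVDataset: wraps a local CSV path."""

    def __init__(self, data_path: str) -> None:
        self.data_path = data_path

    def read_data(self) -> pd.DataFrame:
        return pd.read_csv(self.data_path)


class CSVDatasetMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (CSVDataset,)

    def save(self, dataset: CSVDataset) -> None:
        # Write the dataframe to a local temp file first ...
        df = dataset.read_data()
        with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as tmp:
            df.to_csv(tmp, index=False)
            temp_path = tmp.name
        # ... then copy it into the artifact store so remote stacks can read it.
        with open(temp_path, "rb") as source_file:
            with fileio.open(os.path.join(self.uri, "data.csv"), "wb") as target_file:
                target_file.write(source_file.read())
        os.remove(temp_path)

    def load(self, data_type: Type[CSVDataset]) -> CSVDataset:
        # Copy the CSV from the artifact store to a local temp location ...
        temp_path = os.path.join(tempfile.mkdtemp(), "data.csv")
        with fileio.open(os.path.join(self.uri, "data.csv"), "rb") as source_file:
            with open(temp_path, "wb") as target_file:
                target_file.write(source_file.read())
        # ... and return a dataset pointing at that local copy.
        return CSVDataset(temp_path)
```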

htahir1 and others added 5 commits October 17, 2024 13:29
htahir1 requested review from schustmi and strickvl October 17, 2024 12:09

# Copy the CSV file from the artifact store to the temporary location
with fileio.open(os.path.join(self.uri, "data.csv"), "rb") as source_file:
    with open(temp_path, "wb") as target_file:
Contributor:

Instead of opening up the file again, you can simply nest the two `with` statements; tempfile.NamedTemporaryFile already opens a file.
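A minimal sketch of that nesting; the helper name and the artifact_uri parameter (standing in for the materializer's self.uri) are illustrative:

```python
import os
import tempfile

from zenml.io import fileio


def copy_csv_to_temp(artifact_uri: str) -> str:
    """Copy data.csv from the artifact store into a temp file using nested `with` blocks."""
    # NamedTemporaryFile already yields an open file object, so the artifact-store
    # read can be nested inside it instead of reopening the temp path afterwards.
    with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as target_file:
        with fileio.open(os.path.join(artifact_uri, "data.csv"), "rb") as source_file:
            target_file.write(source_file.read())
        return target_file.name
```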


def save(self, dataset: CSVDataset) -> None:
    # Ensure we have data to save
    if dataset.df is None:
Contributor:

df = dataset.read_data()

with ...:
  df.to_csv

Fewer lines, and it passes mypy.

return processed_data
```

2. **Create specialized steps**: Implement separate steps for different dataset types to handle specific processing requirements while keeping your code modular. This approach allows you to tailor your processing to the unique characteristics of each data source.
Contributor:

Isn't this the opposite of the first best practice?

Contributor (author):

took it out


```python
@step
def process_data(dataset: Dataset) -> ProcessedData:
Contributor:

This is not supposed to be runnable, right? I can't seem to find the ProcessedData class anywhere.

htahir1 requested a review from schustmi October 18, 2024 14:01
Comment on lines +148 to +149
@step(output_materializer=CSVDatasetMaterializer)
def extract_data_local(data_path: str = "data/raw_data.csv") -> CSVDataset:
Contributor:

Either the return type should be Dataset or the output materializer is not necessary, no?

Contributor (author):

I wanted to be clear about which materializer gets picked up... I think we need to do it?

Contributor:

In this case the output is the exact class, so I don't think the materializer needs to be specified.
I think what we really need is to actually run this code.
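For illustration, a sketch of that point, assuming CSVDataset and CSVDatasetMaterializer are defined as in the docs (or as in the sketch above) with CSVDataset listed in ASSOCIATED_TYPES:

```python
from zenml import step


@step
def extract_data_local(data_path: str = "data/raw_data.csv") -> CSVDataset:
    # Because the return annotation is the concrete CSVDataset class and the
    # materializer registers it via ASSOCIATED_TYPES, ZenML can resolve the
    # materializer automatically; `output_materializer=...` is then optional.
    return CSVDataset(data_path)
```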

htahir1 merged commit b8e4faa into develop Oct 18, 2024
8 of 9 checks passed
htahir1 deleted the doc/handle-big-data branch October 18, 2024 14:40
Labels: internal (To filter out internal PRs and issues)
3 participants