fix docs #18

Merged · 1 commit · Apr 29, 2024
2 changes: 2 additions & 0 deletions README.md
@@ -35,12 +35,14 @@ Users can incorporate their logic for custom data transformation and then use the
distributed computing framework to scalably apply the transform to their data.

Features of the toolkit:

- Collection of [scalable transformations](transforms) to expedite user onboarding
- [Data processing library](data-processing-lib) designed to facilitate effortless addition and deployment of new scalable transformations
- Operates efficiently and seamlessly from laptop scale to cluster scale, supporting data processing at any data size
- [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp) of transforms

Data modalities supported:

* Code - support for code datasets as downloaded .zip files of GitHub repositories converted to
[parquet](https://arrow.apache.org/docs/python/parquet.html) files.
* Language - Future releases will provide transforms specific to natural language and, like the code transforms, will operate on parquet files.
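As a sketch of the code-modality flow described above, a downloaded repository archive can be walked with the standard library before its files are converted to parquet. The archive layout, helper name, and extension filter below are illustrative assumptions, not part of the toolkit:

```python
import io
import zipfile

def list_code_files(zip_bytes: bytes, extensions=(".py", ".java", ".c")) -> list[str]:
    """Return paths of source files inside a downloaded repository zip.

    The extension filter is an assumption for illustration; the toolkit's
    real selection logic may differ.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [name for name in zf.namelist() if name.endswith(extensions)]

# Build a tiny in-memory "repository" zip to demonstrate.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("repo-main/hello.py", "print('hi')\n")
    zf.writestr("repo-main/README.md", "# demo\n")

print(list_code_files(buf.getvalue()))  # only the .py file survives the filter
```

In the toolkit itself, the selected file contents would then be written into parquet tables for the transforms to consume.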
3 changes: 3 additions & 0 deletions data-processing-lib/doc/advanced-transform-tutorial.md
@@ -13,6 +13,7 @@ removes duplicate documents across all files. In this tutorial, we will show
the operation of our _noop_ transform.

The complete task involves the following:

* EdedupTransform - class that implements the specific transformation
* EdedupRuntime - class that implements custom TransformRuntime to create supporting Ray objects and enhance job output
statistics
@@ -39,6 +40,7 @@ First, let's define the transform class. To do this we extend
the base abstract/interface class
[AbstractTableTransform](../src/data_processing/transform/table_transform.py),
which requires definition of the following:

* an initializer (i.e. `__init__()`) that accepts a dictionary of configuration
data. For this example, the configuration data will only be defined by
command line arguments (defined below).
@@ -138,6 +140,7 @@ First, let's define the transform runtime class. To do this we extend
the base abstract/interface class
[DefaultTableTransformRuntime](../src/data_processing/ray/transform_runtime.py),
which requires definition of the following:

* an initializer (i.e. `__init__()`) that accepts a dictionary of configuration
data. For this example, the configuration data will only be defined by
command line arguments (defined below).
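The division of labor this tutorial describes, a runtime that provisions shared objects before the transforms start and enhances the job statistics afterwards, can be sketched without Ray. A plain set stands in for the Ray-side hash store, and the method names below are simplified assumptions modeled on the tutorial's description, not the library's exact API:

```python
class SketchRuntime:
    """Stand-in for a custom TransformRuntime such as EdedupRuntime.

    The real class creates supporting Ray objects (e.g. hash actors); a
    local set stands in here so the sketch runs without Ray installed.
    """

    def __init__(self, params: dict):
        self.params = params

    def get_transform_config(self, params: dict) -> dict:
        # Provision shared state for the transforms; with Ray this would be
        # an actor handle or object reference, not a local set.
        return {**params, "hash_store": set()}

    def compute_execution_stats(self, stats: dict) -> dict:
        # Enhance job output statistics, as EdedupRuntime is described doing.
        stats["unique_documents"] = len(self.params.get("hash_store", ()))
        return stats

config = SketchRuntime({}).get_transform_config({"doc_column": "contents"})
config["hash_store"].update({"h1", "h2"})   # transforms would add hashes here
runtime = SketchRuntime(config)
print(runtime.compute_execution_stats({"documents": 5}))
```

The point of the pattern is that per-worker transforms stay simple while cross-worker state lives in objects the runtime owns.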
2 changes: 2 additions & 0 deletions data-processing-lib/doc/overview.md
@@ -9,6 +9,7 @@ more complex transformations requiring coordination among transforming nodes.
This might include operations such as de-duplication, merging, and splitting.
The framework uses a plugin model for its primary functions. The key ones for
developers of data transformations are:

* [Transformation](../src/data_processing/transform/table_transform.py) - a simple, easily implemented interface that defines
the specifics of a given data transformation.
* [Transform Configuration](../src/data_processing/ray/transform_runtime.py) - defines
@@ -18,6 +19,7 @@ command line arguments specific to the transform, and the short name for the transform.
This might include provisioning of shared memory objects or creation of additional actors.

To learn more, consider the following:

* [Transform Tutorials](transform-tutorials.md)
* [Testing transformers with S3](using_s3_transformers.md)
* [Architecture Deep Dive](architecture.md)
3 changes: 3 additions & 0 deletions data-processing-lib/doc/simplest-transform-tutorial.md
@@ -15,11 +15,13 @@ in a single run of the transform.
the operation of our _noop_ transform.

We will **not** be showing the following:

* The creation of a custom TransformRuntime that would enable more global
state and/or coordination among the transforms running in other Ray actors.
This will be covered in an advanced tutorial.

The complete task involves the following:

* NOOPTransform - class that implements the specific transformation
* NOOPTableTransformConfiguration - class that provides configuration for the
NOOPTransform, specifically the command line arguments used to configure it.
@@ -37,6 +39,7 @@ First, let's define the transform class. To do this we extend
the base abstract/interface class
[AbstractTableTransform](../src/data_processing/transform/table_transform.py),
which requires definition of the following:

* an initializer (i.e. `__init__()`) that accepts a dictionary of configuration
data. For this example, the configuration data will only be defined by
command line arguments (defined below).
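The shape of the simplest transform this tutorial builds can be sketched in a few lines. The real NOOPTransform extends AbstractTableTransform and operates on pyarrow Tables; plain lists of dicts stand in here so the sketch runs anywhere, and the `sleep_sec` key is an assumed configuration name for illustration:

```python
class NOOPTransformSketch:
    """Minimal stand-in for NOOPTransform.

    The real class extends AbstractTableTransform and receives/returns
    pyarrow Tables; dict rows stand in here to avoid dependencies.
    """

    def __init__(self, config: dict):
        # Configuration normally arrives via command line arguments;
        # `sleep_sec` is a hypothetical key standing in for the real ones.
        self.sleep_sec = config.get("sleep_sec", 0)

    def transform(self, table: list[dict]):
        # A no-op: return the input unchanged, plus per-table metadata.
        metadata = {"nrows": len(table)}
        return [table], metadata

noop = NOOPTransformSketch({"sleep_sec": 0})
tables, meta = noop.transform([{"doc": "a"}, {"doc": "b"}])
print(meta)  # {'nrows': 2}
```

Note that `transform()` returns a list of tables plus metadata, which is what lets richer transforms split or drop tables while reporting what they did.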
5 changes: 5 additions & 0 deletions data-processing-lib/doc/transform-tutorials.md
@@ -13,6 +13,7 @@ In support of this model the class
[AbstractTableTransform](../src/data_processing/transform/table_transform.py)
is expected to be extended when implementing a transform.
The following methods are defined:

* ```__init__(self, config:dict)``` - an initializer through which the transform can be created
with implementation-specific configuration. For example, the location of a model, maximum number of
rows in a table, column(s) to use, etc.
@@ -37,6 +38,7 @@ not need this feature, a default implementation is provided to return an empty list.
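Why a transform would need this end-of-input hook is easiest to see with a transform that regroups rows across calls; it must be given a last chance to emit its remainder. A dependency-free sketch, with dict rows standing in for pyarrow Tables and assumed method signatures:

```python
class BatchingTransformSketch:
    """Illustrates the flush() hook: a transform that regroups rows into
    fixed-size batches holds a remainder that only flush() can emit."""

    def __init__(self, config: dict):
        self.batch_size = config.get("batch_size", 2)
        self.buffer: list[dict] = []

    def transform(self, table: list[dict]):
        # Accumulate rows, emitting only full batches.
        self.buffer.extend(table)
        out = []
        while len(self.buffer) >= self.batch_size:
            out.append(self.buffer[: self.batch_size])
            self.buffer = self.buffer[self.batch_size :]
        return out, {"emitted": len(out)}

    def flush(self):
        # Called once after the last transform(); the base class's default
        # implementation simply returns an empty list.
        out = [self.buffer] if self.buffer else []
        self.buffer = []
        return out, {"emitted": len(out)}

t = BatchingTransformSketch({"batch_size": 2})
batches, _ = t.transform([{"i": 1}, {"i": 2}, {"i": 3}])
tail, _ = t.flush()
print(len(batches), len(tail))  # 1 1
```

A purely row-wise transform never buffers anything, which is why the default empty-list `flush()` suffices for most implementations.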

### Running in Ray
When a transform is run using the Ray-based framework, a number of other capabilities are involved:

* [Transform Runtime](../src/data_processing/ray/transform_runtime.py) - this provides the ability for the
transform implementor to create additional Ray resources
and include them in the configuration used to create a transform
@@ -53,6 +55,7 @@ This also provides the ability to supplement the statistics collected by
implement `main()` that makes use of a Transform Configuration to start the Ray runtime and execute the transforms.

Roughly speaking, the following steps are completed to establish transforms in the RayWorkers:

1. Launcher parses the CLI parameters using an ArgumentParser configured with its own CLI parameters
along with those of the Transform Configuration,
2. Launcher passes the Transform Configuration and CLI parameters to the [RayOrchestrator](../src/data_processing/ray/transform_orchestrator.py)
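Step 1 above, a launcher-owned ArgumentParser extended with the Transform Configuration's own arguments, might look like the following; the flag names are illustrative assumptions, not the library's actual CLI:

```python
import argparse

def add_launcher_args(parser: argparse.ArgumentParser) -> None:
    # Arguments the launcher always owns (names here are hypothetical).
    parser.add_argument("--num_workers", type=int, default=1)

def add_transform_args(parser: argparse.ArgumentParser) -> None:
    # Contributed by one transform's Transform Configuration.
    parser.add_argument("--noop_sleep_sec", type=int, default=0)

parser = argparse.ArgumentParser(description="launcher sketch")
add_launcher_args(parser)
add_transform_args(parser)

args = parser.parse_args(["--num_workers", "3", "--noop_sleep_sec", "1"])
print(args.num_workers, args.noop_sleep_sec)  # 3 1
```

Sharing one parser is what lets a single command line configure both the framework (worker counts, data access) and the specific transform being run.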
@@ -171,6 +174,7 @@ With these basic concepts in mind, we start with a simple example and
progress to more complex transforms.
Before getting started, you may want to consult the
[transform project root readme](../../transforms/README.md) documentation.

* [Simplest transform](simplest-transform-tutorial.md) -
Here we will take a simple example to show the basics of creating a simple transform
that takes a single input Table, and produces a single Table.
@@ -180,6 +184,7 @@ resources (models, configuration, etc.) for a transform.
* [Porting from GUF 0.1.6](transform-porting.md)

Once a transform has been built, testing can be enabled with the testing framework:

* [Transform Testing](testing-transforms.md) - shows how to test a transform
independent of the Ray framework.
* [End-to-End Testing](testing-e2e-transform.md) - shows how to test the