DRAFT PROPOSAL: Add Pipelines Framework to Parsons #980
Overview
This PR implements a draft of the pipelines API that we discussed at the January contributors meeting. The goal of the pipelines API is to add an extra level of abstraction on top of the existing Parsons connectors and the Parsons Table to make it easier to write and manage ETL pipelines.
Justification
The contributors recently discussed what we can do to take Parsons to the next level and make it even easier to build ETL scripts. Another big goal, one that would significantly benefit TMC and the whole space, is to make it easier for new data staff who don't yet have a firm handle on control flow and data structures to assemble ETL scripts.
In my opinion, Parsons connectors already make it very easy to extract and load data. The two things it does not do much to help the user with are:
During the call we discussed adding a Pipelines system to Parsons. This system would exist on top of the connectors and Parsons Table, and possibly would hide those details from a new user completely. The idea is this:
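To make that concrete, here is a minimal sketch of the kind of code a user would write, assuming the revision 2 constructor syntax described below; the pipe names, pipe bodies, and import path are illustrative, not from the PR:

```python
from parsons import Table
from parsons_pipelines import Pipeline  # assumed import from the prototype module

# A pipe is just a plain function from Parsons Table to Parsons Table;
# the user composes pipes without writing any control flow themselves.
def load_members(_):
    # The first pipe ignores its input and loads the source data.
    return Table.from_csv("members.csv")

def only_active(table):
    return table.select_rows("{status} == 'active'")

# A pipeline is just its pipes, in order; connectors and Tables stay
# behind the scenes once the pipes are written.
pipeline = Pipeline(load_members, only_active)
pipeline.run()
```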
Update - Revision 2
Based on contributor feedback, we identified two distinct user groups with different sets of concerns about the original proposal.
New Engineers/Analysts
There was concern that the syntax was still too complex for analysts/engineers (new users) and would prove too big a barrier for them to use the framework successfully. To accommodate those concerns, the following changes were made in revision 2:
Removal of `lambda` syntax
The original syntax made heavy use of lambda functions, a difficult concept. A `CompoundPipe` constructor was introduced to replace the use of `lambda` to build up a more complex pipe from more basic pipes:
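Something like the following; `CompoundPipe` is from the proposal, but the import path, pipe bodies, and argument order are my assumptions:

```python
from parsons_pipelines import CompoundPipe  # assumed import path

def standardize_phones(table):
    # Normal Parsons code inside a pipe; guard against missing values.
    table.convert_column("phone", lambda v: v.replace("-", "") if v else v)
    return table

def drop_missing_phones(table):
    return table.select_rows("{phone} is not None")

# Revision 1: combining two pipes into one required a lambda.
clean_phones_v1 = lambda table: drop_missing_phones(standardize_phones(table))

# Revision 2: CompoundPipe builds the same combined pipe from the basic
# pipes, applied in the listed order -- no lambda required.
clean_phones = CompoundPipe(standardize_phones, drop_missing_phones)
```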
Simplification of Pipeline syntax
The previous pipe-chain syntax in the Pipeline constructor was a repeated function call, which would have been confusing to newer users:
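The original snippet isn't reproduced inline here, but the repeated-call style had roughly this shape (my reconstruction from the description above, reusing the illustrative pipes from the earlier sketch):

```python
# Revision 1: each pipe is attached by another call on the result of the
# previous call, so the pipeline reads as one long chained expression.
pipeline = Pipeline(load_members)(standardize_phones)(drop_missing_phones)
```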
This has been replaced with a much simpler syntax where the pipes are just listed in the `Pipeline` constructor:
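With the same illustrative pipes, the revision 2 version reads top to bottom:

```python
# Revision 2: the pipes are simply listed, in execution order.
pipeline = Pipeline(
    load_members,
    standardize_phones,
    drop_missing_phones,
)
```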
Power Users
The main concern from experienced engineers/analysts with a high degree of Python skill (power users) was that the framework didn't offer them enough value for its cost. In particular, the original proposal was lacking in data orchestration.
Prefect integration
To meet that need, this revision of Pipelines has incorporated Prefect under the hood with no additional complexity to anyone using the pipeline framework, be they writing pipes or just assembling pipelines.
Each pipe is defined exactly the same as before, but the framework constructs a Prefect task for that pipe behind the scenes.
Each Pipeline is transformed into a Prefect Flow and is logged to Prefect Cloud when run:
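As a sketch of the mechanism rather than the actual implementation: assuming the Prefect 0.x/1.x `Flow`/`task` API that was current at the time, the wrapping could look like this (the `Pipeline` internals shown are my guess):

```python
from prefect import Flow, task

class Pipeline:
    """Sketch: wrap each pipe in a Prefect task and the chain in a Flow."""

    def __init__(self, *pipes):
        self.pipes = pipes

    def run(self):
        with Flow("parsons-pipeline") as flow:
            result = None
            for pipe in self.pipes:
                # task() turns the plain pipe function into a Prefect task;
                # calling it inside the Flow context wires it into the DAG.
                result = task(pipe)(result)
        # flow.register(project_name="parsons")  # records runs in Prefect Cloud
        return flow.run()
```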
In addition to providing modern data orchestration out of the box, the Prefect integration will make it possible for us to integrate the extensive Prefect Integrations library with Pipelines.
Revision 2 Demo
This code sets up and runs the same series of pipelines as before:
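The demo source lives in the PR branch rather than inline here; in rough outline it does something like the following, reusing the illustrative pipes from the sketches above, with `load_lotr_characters` as an undefined stand-in for the Lord of the Rings API pipe:

```python
# The same pipelines as the revision 1 demo, in the revision 2 syntax.
members = Pipeline(load_members, standardize_phones, drop_missing_phones)
characters = Pipeline(load_lotr_characters)

# Each run executes as a Prefect flow and shows up in Prefect Cloud.
members.run()
characters.run()
```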
Here are the runs in my Prefect Cloud:
The declarative Pipeline syntax is easy to comprehend, and alongside it the Prefect viewer makes it easy to see the data flow.
Prefect error handling shows you which pipe failed, what the error was, and even what time it failed.
Original Draft - Revision 1
Prototype
This prototype is in the `/pipelines` folder in the PR branch. All of the code is contained in the `parsons_pipelines.py` file, which is a runnable file that executes three pipelines based on a stored CSV file and the open-source Lord of the Rings API. Here are some highlights:
Declarative pipeline syntax, made by composing pipes
Pipe definitions are normal Parsons code
Load data in via pipes, either using Parsons or `requests`
Trivially compose & name commonly-combined pipes
Group pipelines together in a Dashboard to facilitate logging, reporting, etc. (see the sketch after this list)
Call the Dashboard with a logger to get free logging of the output of every step in every pipeline
Generate an HTML report with your pipelines' results
Run an individual pipeline to retrieve its data and capture it as a Parsons Table
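To ground the Dashboard highlights above, here is a sketch of the usage they describe; the `Dashboard` constructor arguments and the `report` method name are my assumptions, and `members` and `characters` are the illustrative pipelines from the earlier sketches:

```python
import logging

from parsons_pipelines import Dashboard  # assumed import path

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipelines")

# Group pipelines so they can be run, logged, and reported on together;
# passing a logger gets free logging of every step's output.
dashboard = Dashboard([members, characters], logger=logger)
dashboard.run()
dashboard.report("pipelines_report.html")  # assumed HTML report hook

# An individual pipeline can also be run on its own and its output
# captured as a Parsons Table.
members_table = members.run()
```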
Next Steps
If the Parsons contributors decide to move forward with the proposal, I believe these features are necessary to implement an MVP of the pipelines framework:
These features would be nice to implement, but I don't believe they need to be completed to justify releasing the framework: