Introduce new StreamingRunner #44
Merged
This cherry-picks from #41 but improves the syntax a bit.
This cherry-picks from #41 but with a more DRY code.
A much better choice than what was originally implemented in #41, since:
- it lets you decide which runner to pick on a per-ETL basis
- it works inside Sidekiq (and not only on the command line)
This is a UX measure to make sure nobody will unknowingly shadow an existing "config" variable with the new system. Basically this avoids a breaking change in the behaviour.
thbar changed the title from "Yielding Runner (improved version with config-based opt-in)" to "StreamingRunner (improved version with config-based opt-in)" on Dec 24, 2017
I've realized that `lazy` is not useful yet at this stage, and that it also brings an extra cost. I will switch back to `lazy` when I have an actual need for it (future work on parallelization etc).
I think the previous commit should make the runner work on all supported platforms, unmodified.
thbar changed the title from "StreamingRunner (improved version with config-based opt-in)" to "Introduce new StreamingRunner" on Dec 25, 2017
This is an amazing feature! Thank you so much for taking the time to add it! ❤️
@vfonic thanks for your feedback! Much appreciated ^_^ 💙 💚
This PR introduces a new "Runner" implementation named `StreamingRunner`. The "Runner" is the part of Kiba which is responsible for carrying out the actual data processing.
Previously, a given "class transform" receiving one input row was only able to generate one output row, or no output row.
With this new runner, leveraging Ruby's `Enumerator`, each "class transform" can generate N output rows, by calling `yield row`, in addition to the row returned by `process` as usual.
While this can appear like something simple, it has a massive consequence, which is the ability to separate concerns when writing reusable Kiba components. This leads the way to writing more focused, generic & composable transforms when working with Kiba.
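To make the mechanism concrete, here is a minimal, dependency-free sketch of how an `Enumerator` can collect both the yielded rows and the returned row. The `transform_stream` helper and the `Duplicator` class are hypothetical names chosen for illustration; this is a sketch of the idea, not necessarily how Kiba implements it.

```ruby
# Hypothetical transform: yields the input row untouched, then returns a marked copy.
class Duplicator
  def process(row)
    yield row                 # extra output row, handed to the block given by the caller
    row.merge(copy: true)     # returned row, emitted as usual
  end
end

# Hypothetical helper: wraps a stream of rows and a transform into a new stream.
def transform_stream(stream, transform)
  Enumerator.new do |out|
    stream.each do |input_row|
      # Rows yielded by the transform are pushed into the output stream...
      returned = transform.process(input_row) { |yielded| out << yielded }
      # ...and so is the returned row, unless the transform returned nil.
      out << returned if returned
    end
  end
end

rows = transform_stream([{ id: 1 }, { id: 2 }], Duplicator.new).to_a
```

Each input row produces two output rows here: the yielded one first, then the returned one.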
Let's pick an example.
Imagine you have created a Kiba source able to extract XML elements from a group of files on disk (with each file containing N elements).
It would typically look like this:
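As a rough illustration, such a source might be sketched as follows. The `XMLSource` name and the `dir_pattern:` and `tag:` arguments are assumptions, and a regular expression stands in for a real XML parser (such as Nokogiri) to keep the sketch dependency-free.

```ruby
# Hypothetical all-in-one source (names and arguments are assumptions).
class XMLSource
  def initialize(dir_pattern:, tag:)
    @dir_pattern = dir_pattern
    tag = Regexp.escape(tag)
    # A regex stands in for a real XML parser in this sketch.
    @pattern = %r{<#{tag}>.*?</#{tag}>}m
  end

  def each
    Dir[@dir_pattern].sort.each do |filename|   # list the files
      content = File.read(filename)             # read each file
      elements = content.scan(@pattern)         # select the elements
      elements.each { |element| yield element } # emit one row per element
    end
  end
end
```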
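Such a class has 4 responsibilities: listing the files on disk, reading each file, selecting the XML elements of interest, and emitting each selected element as its own row.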
With Kiba 1 you can achieve some level of splitting here by using the decorator technique I outlined here, but this can only take you so far.
With the new Kiba runner, you can first rewrite the code above as 4 independent components (1 source & 3 transforms):
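A minimal sketch of what the four components could look like is given below. `DirectoryLister` and `EnumerableExploder` are the names used in this description; `FileReader`, `ElementSelector`, and all constructor arguments are hypothetical, and a regular expression again stands in for a real XML parser.

```ruby
# Source: lists the files matching a pattern, one row per filename.
class DirectoryLister
  def initialize(dir_pattern:)
    @dir_pattern = dir_pattern
  end

  def each
    Dir[@dir_pattern].sort.each { |filename| yield filename }
  end
end

# Transform (hypothetical name): turns a filename into the file's content.
class FileReader
  def process(filename)
    File.read(filename)
  end
end

# Transform (hypothetical name): selects elements; a regex stands in for an XML parser.
class ElementSelector
  def initialize(tag:)
    tag = Regexp.escape(tag)
    @pattern = %r{<#{tag}>.*?</#{tag}>}m
  end

  def process(content)
    content.scan(@pattern)
  end
end

# Transform: yields each item of an enumerable row as its own row.
class EnumerableExploder
  def process(row)
    row.each { |item| yield item }
    nil # nothing left to return: every item was already yielded
  end
end
```

Only `EnumerableExploder` relies on `yield` inside `process`; the other two transforms return exactly one row per input row.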
Which can then be used:
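With Kiba's `source` and `transform` DSL keywords, the job declaration could then read roughly as follows. This is a fragment rather than a standalone script; `FileReader`, `ElementSelector`, and the keyword arguments are hypothetical placeholders.

```ruby
# Fragment of a Kiba job declaration; not runnable on its own.
# FileReader, ElementSelector and the keyword arguments are hypothetical.
source DirectoryLister, dir_pattern: 'data/*.xml'
transform FileReader
transform ElementSelector, tag: 'invoice'
transform EnumerableExploder
```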
While it can appear to be more complicated at first, each of these 4 components can now be mixed and matched with other components in completely unrelated scenarios.
For instance:
- `DirectoryLister` could be used to list anything (JSON files etc).
- `EnumerableExploder`, similarly, could be used for pretty much anything.
This opens the door to providing more composable & more reusable components to Kiba users, or as part of Kiba Common or Kiba Pro.
How to enable the `StreamingRunner`
Using the new runner is opt-in. You'll have to do:
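Given the config-based opt-in described in this PR, the declaration is expected to look like the fragment below. Treat the exact module and key names as an assumption to verify against the Kiba documentation for your version.

```ruby
# Inside a Kiba job declaration (fragment; assumes the kiba gem is loaded).
# Module and key names are an assumption, not confirmed by this description.
extend Kiba::DSLExtensions::Config

config :kiba, runner: Kiba::StreamingRunner
```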
From there, you can write "yielding class transforms", like:
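A hypothetical example, assuming input rows are hashes carrying a `:tags` array (both the `TagExploder` name and the row shape are made up for illustration):

```ruby
# Hypothetical transform: one output row per tag found in the input row.
class TagExploder
  def process(row)
    base = row.reject { |key, _value| key == :tags }
    row[:tags].each { |tag| yield base.merge(tag: tag) }
    nil # returning nil means no extra row beyond the yielded ones
  end
end

# Outside Kiba, a plain block stands in for what the runner would pass to `process`.
rows = []
TagExploder.new.process({ id: 1, tags: %w[a b] }) { |row| rows << row }
```

Calling `process` with an explicit block like this is also a convenient way to unit test a yielding transform without running a full job.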
Limitations
You can only `yield` rows from "class transforms". Calling `yield` from a "block transform" will generate an error.
Benchmark
Benchmarks I ran on real-life examples showed that the impact is negligible at this point.
A synthetic benchmark used in 82bc8e5 shows that the new runner takes 5% to 30% more time. It is synthetic because there are no real sources nor real destinations.
Notes on current implementation
This was initiated in a private repo, later published as #41, which has been reworked & finalized here.