Introduce new StreamingRunner #44

thbar · 2017-12-22T23:36:08Z

This PR introduces a new "Runner" implementation named StreamingRunner. The "Runner" is the part of Kiba which is responsible for carrying out the actual data processing.

Previously, a given "class transform" receiving one input row was only able to generate one output row, or no output row.

With this new runner, leveraging Ruby's Enumerator, each "class transform" can generate N output rows, by calling yield row, in addition to the row returned by process as usual.
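To make the mechanism concrete, here is a minimal sketch of how an Enumerator can gather both the rows a transform yields and the row it returns. The `rows_from` helper and the `Duplicator` transform are hypothetical names made up for illustration; this is not presented as Kiba's actual runner code.

```ruby
# Hypothetical helper (not part of Kiba): wraps a single call to
# transform.process in an Enumerator, so that rows passed to the block and
# the returned row all end up in the same output stream.
def rows_from(transform, row)
  Enumerator.new do |out|
    returned = transform.process(row) { |yielded| out << yielded }
    # Assumption for this sketch: a nil return value means "no extra row".
    out << returned unless returned.nil?
  end
end

# Hypothetical transform producing two output rows for one input row.
class Duplicator
  def process(row)
    yield row.merge(copy: 1) # first output row, yielded
    row.merge(copy: 2)       # second output row, returned as usual
  end
end

# Prints the yielded row first, then the returned row.
rows_from(Duplicator.new, { id: 7 }).each { |r| p r }
```

Because the Enumerator only calls `process` when it is iterated, rows can be handed to the next step one at a time.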

While this may look like a small change, it has a major consequence: it lets you separate concerns when writing reusable Kiba components. This paves the way for more focused, generic & composable transforms when working with Kiba.

Let's pick an example.

Imagine you have created a Kiba source able to extract XML elements from a group of files on disk (with each file containing N elements).

It would typically look like this:

require 'nokogiri'

class MySource
  attr_reader :dir_pattern

  def initialize(dir_pattern:)
    @dir_pattern = dir_pattern
  end

  def each
    Dir[dir_pattern].sort.each do |file|
      doc = Nokogiri::XML(IO.binread(file))
      doc.search('/invoices/invoice').each do |item|
        yield(item)
      end
    end
  end
end

Such a class has 4 responsibilities:

The directory iteration
The XML parsing
The XML node search
The act of exploding each detected sub-node as an independent row

With Kiba 1 you can achieve some level of splitting here by using the decorator technique I outlined here, but this can only take you so far.

With the new Kiba runner, you can first rewrite the code above as 4 independent components (1 source & 3 transforms):

class DirectoryLister
  attr_reader :dir_pattern

  def initialize(dir_pattern:)
    @dir_pattern = dir_pattern
  end

  def each
    Dir[dir_pattern].sort.each do |filename|
      yield(filename)
    end
  end
end

class XMLReader
  def process(filename)
    Nokogiri::XML(IO.binread(filename))
  end
end

class XMLSearcher
  attr_reader :selector

  def initialize(selector:)
    @selector = selector
  end

  def process(doc)
    doc.search(selector)
  end
end

class EnumerableExploder
  def process(row)
    row.each do |item|
      yield(item)
    end
    nil # tell the pipeline to ignore the final value
  end
end

These can then be used as follows:

source DirectoryLister, dir_pattern: '*.xml'
transform XMLReader
transform XMLSearcher, selector: '/invoices/invoice'
transform EnumerableExploder

While this can appear more complicated at first, each of these 4 components can now be mixed and matched with other components in completely unrelated scenarios.

For instance:

The DirectoryLister could be used to list anything (JSON files etc).
The EnumerableExploder, similarly, could be used for pretty much anything.
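As a rough illustration of that reuse, the sketch below runs the same EnumerableExploder on plain strings instead of XML nodes. `CommaSplitter` and `run_pipeline` are hypothetical names introduced for this example only; `run_pipeline` is a simplified stand-in for a runner (it assumes nil results are dropped) and is not Kiba's implementation.

```ruby
# Same component as above, unchanged.
class EnumerableExploder
  def process(row)
    row.each { |item| yield(item) }
    nil # tell the pipeline to ignore the final value
  end
end

# Hypothetical transform: turns "a,b" into ["a", "b"].
class CommaSplitter
  def process(line)
    line.split(',')
  end
end

# Hypothetical stand-in for a runner: chains the transforms, collecting the
# rows each one yields plus any non-nil row it returns.
def run_pipeline(rows, transforms)
  stream = transforms.reduce(rows.each) do |input, transform|
    Enumerator.new do |out|
      input.each do |row|
        returned = transform.process(row) { |yielded| out << yielded }
        out << returned unless returned.nil?
      end
    end
  end
  stream.to_a
end

p run_pipeline(['a,b', 'c'], [CommaSplitter.new, EnumerableExploder.new])
```

The exploder neither knows nor cares where its enumerable came from, which is what makes it reusable.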

This opens the door to providing more composable & more reusable components to Kiba users, or as part of Kiba Common or Kiba Pro.

How to enable the `StreamingRunner`

Using the new runner is opt-in. You'll have to do:

extend Kiba::DSLExtensions::Config

config :kiba, runner: Kiba::StreamingRunner

From there, you can write "yielding class transforms", like:

class MyTransform
  def process(row)
    yield({key: 1}) # parentheses keep the hash from being parsed as a block
    yield({key: 2})
    {key: 3}
  end
end

Limitations

You can only yield rows from "class transforms". Calling yield from a "block transform" will generate an error.
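A likely reason is a general Ruby rule rather than anything specific to Kiba: inside a block, yield targets the block given to the enclosing method, not a block passed to the block itself. The sketch below shows this in plain Ruby; `make_block_transform` is a hypothetical name, and the exact error Kiba reports may differ.

```ruby
# Hypothetical helper: builds a "block transform" as a plain proc.
def make_block_transform
  # This yield refers to the block given to make_block_transform,
  # not to any block supplied when the proc is called later.
  proc { |row| yield(row) }
end

transform = make_block_transform # no block given here

begin
  transform.call({ key: 1 })
rescue LocalJumpError => e
  puts "block transform could not yield: #{e.message}"
end
```

A class transform does not hit this problem because `process` is a regular method, so the runner can pass it a block to yield to.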

Benchmark

Benchmarks I ran on real-life ETL jobs showed that the impact is negligible at this point.

A non-real-life benchmark used in 82bc8e5 shows that the new runner takes 5% to 30% more time. It's a non-real-life example because there are no real sources nor real destinations.

Notes on current implementation

This was initiated in a private repo, later published as #41, which has been reworked & finalized here.

vfonic · 2018-11-05T14:44:28Z

This is an amazing feature! Thank you so much for taking the time to add it! ❤️

thbar · 2018-11-05T16:32:16Z

@vfonic thanks for your feedback! Much appreciated ^_^ 💙 💚

thbar added 8 commits December 22, 2017 23:23

Add test ArrayDestination

2d09468

Implement YieldingTransform tests

ca31e00

This cherry-picks from #41 but improves the syntax a bit.

Implement YieldingRunner

d30cc0b

This cherry-picks from #41 but with more DRY code.

Attempt to fix JRuby 1.7 build

e389a32

Implement a mix-inspired config mechanism

55694c6

Implement config-based runner choice

c79c326

A much better choice than what was originally implemented in #41, since:
- it allows deciding which runner to pick on a per-ETL basis
- it will work inside Sidekiq (vs only on the command line)

Use config-based engine choice

aff7a29

Clean up tests

9b3d107

thbar added enhancement wip labels Dec 22, 2017

thbar self-assigned this Dec 22, 2017

thbar mentioned this pull request Dec 22, 2017

Yielding Runner #41

Closed

thbar added 2 commits December 23, 2017 23:36

Merge branch 'master' into yielding-runner

04126fd

Merge branch 'master' into yielding-runner

749bb94

thbar added a commit that referenced this pull request Dec 23, 2017

Extract the logic closing the destinations

b370767

This will be useful to DRY up #44.

thbar added 11 commits December 23, 2017 23:55

Merge branch 'master' into yielding-runner

d27fa6a

Remove close (already called by regular runner)

9f5e126

Extract method

2694981

Add minitest-focus to development dependencies

1b209c7

Rename for clarity

97418ac

Group code on one line

4e31d88

Rewrite in a slightly more functional fashion

48b8cbd

Rename for clarity

66a4371

Remove duplicate lazy invocation

4939cab

Rename YieldingRunner to StreamingRunner

1190565

Make config access opt-in

d230720

This is a UX measure to make sure nobody will unknowingly shadow an existing "config" variable with the new system. Basically this avoids a breaking change in the behaviour.

thbar changed the title ~~Yielding Runner (improved version with config-based opt-in)~~ StreamingRunner (improved version with config-based opt-in) Dec 24, 2017

thbar added 3 commits December 24, 2017 11:06

Use merge! for more concise code

9435737

Use Enumerator instead of Enumerator::Lazy

8d6761c

I've realized that lazy is not useful yet at this stage, and that it also brings an extra cost. I will switch back to lazy when I have an actual need (future work on parallelization etc).

Remove JRUBY_OPTS

96839b4

I think the previous commit should make the runner work on all supported platforms, unmodified.

thbar added 2 commits December 25, 2017 00:42

Merge branch 'master' into yielding-runner

797a7a9

Remove whitespace

627dbf2

thbar changed the title ~~StreamingRunner (improved version with config-based opt-in)~~ Introduce new StreamingRunner Dec 25, 2017

thbar removed the wip label Dec 25, 2017

thbar merged commit 8e1b8ab into master Dec 25, 2017

thbar deleted the yielding-runner branch December 25, 2017 00:26

thbar mentioned this pull request May 16, 2019

Make StreamingRunner the default #75

Closed
