Yielding Runner #41

thbar · 2017-04-22T17:16:35Z

This PR contains a work-in-progress "alternate runner", which is actually a candidate runner for Kiba 2.

The new candidate runner brings 2 major benefits.

Ability to yield multiple rows from a given class transform

The new runner allows "class transforms" (transforms written as classes, rather than blocks) to yield an arbitrary number of rows.

It allows the source presented in this blog post to be rewritten as a transform like this:

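class TagSplitter
  def process(row)
    row.fetch(:buyers).split(':').each do |buyer|
      # assumes rows respond to deep_copy (not a core Hash method)
      yield(row.deep_copy.merge(buyer: buyer))
    end
    nil # tell the pipeline to ignore the final value
  end
end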

More importantly, it allows such transforms to be stacked, each yielding sub-rows.
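To make the stacking idea concrete, here is a minimal sketch in plain Ruby. The two transforms and the `stack` helper are hypothetical and only stand in for what a yielding runner would do; they are not part of Kiba or of this PR.

```ruby
# Hypothetical transform: emits one row per buyer.
class BuyerSplitter
  def process(row)
    row.fetch(:buyers).split(':').each do |buyer|
      yield(row.merge(buyer: buyer))
    end
    nil # only the yielded rows should be emitted
  end
end

# Hypothetical transform: emits one row per tag.
class TagSplitter
  def process(row)
    row.fetch(:tags).split(',').each do |tag|
      yield(row.merge(tag: tag))
    end
    nil
  end
end

# Manual stand-in for the runner: every row yielded by the first
# transform is fed to the second one. Return values are ignored here
# because both transforms return nil.
def stack(row, first, second)
  results = []
  first.process(row) do |sub_row|
    second.process(sub_row) { |sub_sub_row| results << sub_sub_row }
  end
  results
end

rows = stack({ buyers: 'alice:bob', tags: 'a,b' }, BuyerSplitter.new, TagSplitter.new)
rows.each { |r| puts "#{r[:buyer]} / #{r[:tag]}" }
```

One input row with two buyers and two tags ends up as four output rows, one per (buyer, tag) pair.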

Increased ability to write reusable Kiba components

Let's pick an example 😄

Imagine you have created a Kiba source able to extract XML elements from a group of files on disk (with each file containing N elements).

It would typically look like this:

class MySource
  attr_reader :dir_pattern

  def initialize(dir_pattern:)
    @dir_pattern = dir_pattern
  end

  def each
    Dir[dir_pattern].sort.each do |file|
      doc = Nokogiri::XML(IO.binread(file))
      doc.search('/invoices/invoice').each do |item|
        yield(item)
      end
    end
  end
end

Such a class has 4 responsibilities:

The directory iteration
The XML parsing
The XML node search
The act of exploding each detected sub-node as an independent row

With Kiba 1 you can achieve some level of splitting here by using the decorator technique I outlined here, but this can only take you so far.

With the new Kiba runner, you can first rewrite the code above as 4 independent components (1 source & 3 transforms):

class DirectoryLister
  attr_reader :dir_pattern

  def initialize(dir_pattern:)
    @dir_pattern = dir_pattern
  end

  def each
    Dir[dir_pattern].sort.each do |filename|
      yield(filename)
    end
  end
end

class XMLReader
  def process(filename)
    Nokogiri::XML(IO.binread(filename))
  end
end

class XMLSearcher
  attr_reader :selector

  def initialize(selector:)
    @selector = selector
  end

  def process(doc)
    doc.search(selector)
  end
end

class EnumerableExploder
  def process(row)
    row.each do |item|
      yield(item)
    end
    nil # tell the pipeline to ignore the final value
  end
end

Which can then be used:

source DirectoryLister, dir_pattern: '*.xml'
transform XMLReader
transform XMLSearcher, selector: '/invoices/invoice'
transform EnumerableExploder

While this can appear more complicated at first, each of these 4 components can now be mixed and matched with other components in completely unrelated scenarios.

For instance:

The DirectoryLister could be used to list anything (JSON files etc).
The EnumerableExploder, similarly, could be used for pretty much anything.

This opens the door to providing more composable and more reusable components to Kiba users, or as part of Kiba Common or Kiba Pro.
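As an illustration of that reuse, the sketch below points the same DirectoryLister and EnumerableExploder at JSON files instead of XML. The JSONReader transform and the `list_json_items` helper are hypothetical: the helper wires the components by hand in place of a runner, only so that the example runs without Kiba.

```ruby
require 'json'
require 'tmpdir'

class DirectoryLister
  def initialize(dir_pattern:)
    @dir_pattern = dir_pattern
  end

  def each
    Dir[@dir_pattern].sort.each { |filename| yield(filename) }
  end
end

# Hypothetical transform: parses a JSON file into Ruby data.
class JSONReader
  def process(filename)
    JSON.parse(File.read(filename))
  end
end

class EnumerableExploder
  def process(row)
    row.each { |item| yield(item) }
    nil # only the yielded items should be emitted
  end
end

# Hand-rolled stand-in for the runner, for this example only.
def list_json_items(dir_pattern)
  items = []
  reader = JSONReader.new
  exploder = EnumerableExploder.new
  DirectoryLister.new(dir_pattern: dir_pattern).each do |filename|
    exploder.process(reader.process(filename)) { |item| items << item }
  end
  items
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.json'), JSON.generate([1, 2]))
  File.write(File.join(dir, 'b.json'), JSON.generate([3]))
  p list_json_items(File.join(dir, '*.json'))
end
```

Only the reader changed compared to the XML scenario; the lister and the exploder are untouched.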

Notes on current implementation

The new implementation relies on nested Enumerator::Lazy instances.
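For readers unfamiliar with the technique, here is a rough sketch of how nested lazy enumerators can let a transform both return and yield rows. It is an illustration written for this description, not the code in this PR: `apply_transform`, `build_pipeline` and the three small classes are hypothetical names.

```ruby
# Wrap one transform around a stream of rows. Rows passed to the block
# given to `process` and any non-nil return value are both emitted.
def apply_transform(stream, transform)
  Enumerator.new do |out|
    stream.each do |row|
      returned = transform.process(row) { |yielded| out << yielded }
      out << returned unless returned.nil?
    end
  end.lazy
end

# Nest one lazy enumerator per transform on top of the source.
def build_pipeline(source, transforms)
  transforms.inject(source.to_enum(:each).lazy) do |stream, transform|
    apply_transform(stream, transform)
  end
end

# Hypothetical source: yields each element of a range.
class RangeSource
  def initialize(range)
    @range = range
  end

  def each
    @range.each { |i| yield(i) }
  end
end

# Hypothetical yielding transform: emits each row several times.
class Repeater
  def initialize(times)
    @times = times
  end

  def process(row)
    @times.times { yield(row) }
    nil
  end
end

# Hypothetical classic transform: returns a single row.
class Doubler
  def process(row)
    row * 2
  end
end

pipeline = build_pipeline(RangeSource.new(1..3), [Repeater.new(2), Doubler.new])
p pipeline.to_a
```

Nothing is computed until the pipeline is consumed (here with `to_a`), which is what allows rows to flow one at a time through the whole stack of transforms.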

I still need to benchmark its performance against Kiba 1, and improve the code a bit more, before deciding whether this will remain in Kiba 2.

This is a candidate implementation that allows class transforms to yield rows.

Kiba 2 requires Ruby 2.0+ for Enumerator::Lazy

travis-ci/travis-ci#5861 (comment) 5

thbar · 2017-04-27T15:06:55Z

In 82bc8e5 I added a simple benchmark to compare runners.

This is a very simplistic benchmark that does not reflect real-life use; I only wanted to get a feel for how costly using Enumerator could be.

ruby 2.3.2p217
===========
TestRunner#test_benchmark : 2.12 sec
TestAlternateRunner#test_benchmark : 3.21 sec

ruby 2.4.0p0
=========
TestRunner#test_benchmark : 1.89 sec
TestAlternateRunner#test_benchmark : 2.76 sec

jruby 1.9.7.0
=========

TestRunner#test_benchmark : 0.7 sec
TestAlternateRunner#test_benchmark : 2.33 sec

I will carry out more tests and try to improve the alternate runner.
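For anyone who wants a similar rough feel outside the test suite, the following sketch compares a direct loop with a chain of lazy enumerators using Ruby's Benchmark module. It is not the benchmark added in 82bc8e5: the method names, the row count and the trivial transforms are assumptions chosen for illustration, and the timings it prints will vary by machine and Ruby version.

```ruby
require 'benchmark'

# Direct approach: call each transform in turn for every row.
def run_direct(rows, transforms)
  rows.each_with_object([]) do |row, out|
    out << transforms.inject(row) { |r, t| t.call(r) }
  end
end

# Lazy approach: nest one lazy `map` per transform, then consume.
def run_lazy(rows, transforms)
  transforms.inject(rows.lazy) { |stream, t| stream.map { |r| t.call(r) } }.to_a
end

row_count = 100_000 # arbitrary size, for illustration only
increment = ->(row) { row + 1 }
transforms = [increment] * 4

direct_time = Benchmark.realtime { run_direct(1..row_count, transforms) }
lazy_time = Benchmark.realtime { run_lazy(1..row_count, transforms) }
puts format('direct: %.3f sec, lazy: %.3f sec', direct_time, lazy_time)
```

Both methods produce the same rows, so any difference in timing comes from the enumeration strategy rather than from the work done per row.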

thbar · 2017-06-20T08:07:13Z

I've been using this on multiple internal reporting systems and things work just great. I will come back to refactor and clean up.

Ideally, I would find a way to selectively enable the (more costly) lazy enumerator for specific transforms, instead of fully switching the runner class.

Extract runner tests to allow upcoming reuse in #41

This cherry-picks from #41 but improves the syntax a bit.

This cherry-picks from #41 but with a more DRY code.

A much better choice than what was originally implemented in #41, since:

- it allows deciding which runner to pick on a per-ETL basis
- it will work inside Sidekiq (vs. only on the command line)

thbar · 2017-12-22T23:37:39Z

Closing in favour of #44 (which is still a WIP but improved already).

thbar · 2017-12-25T00:36:20Z

A more polished version is now available on master after merging #44. It will land in Kiba 2 (to be released soon).

thbar added 17 commits April 14, 2017 23:07

Add note on how to compare with public repo

da18e98

Add alternate runner allowing yielding transforms

ad09fa0

Add a way to use the alternate runner

8a1e70d

Add helper for shared tests

1caa880

Extract shared tests

95bd93a

Make tests reusable

388de20

Apply tests for Kiba::AlternateRunner

a2b2482

Add tests for yielding transform

78e0362

Merge branch 'master' into wip-kiba-2

f2bb6c4

Extract runner test

2e818e2

Backport Kiba::AlternateRunner from private repo

82d0c78

This is a candidate implementation that allows class transforms to yield rows.

Refactor CLI to allow selection of alternate runner

d2f521c

Force JRuby to use 1.9 syntax

447ae5c

Fix JRUBY_OPTS

d9575a4

Kiba 2 requires Ruby 2.0+ for Enumerator::Lazy

Try to fix the build for JRuby

f3b4ace

travis-ci/travis-ci#5861 (comment) 5

Start work on Kiba 1 vs Kiba 2 benchmark

67aeb65

Add first benchmark

82bc8e5

thbar mentioned this pull request Dec 20, 2017

Add CSV destination (requires Ruby 2.3+) thbar/kiba-common#3

Merged

thbar added enhancement wip labels Dec 22, 2017

thbar changed the title ~~[WIP] Kiba 2 runner~~ Yielding Runner Dec 22, 2017

thbar self-assigned this Dec 22, 2017

thbar mentioned this pull request Dec 22, 2017

Extract runner tests to allow upcoming reuse #42

Merged

thbar added a commit that referenced this pull request Dec 22, 2017

Merge pull request #42 from thbar/extract-runner-tests

68302be

Extract runner tests to allow upcoming reuse in #41

thbar added a commit that referenced this pull request Dec 22, 2017

Refactor to support upcoming yielding runner tests (#41)

b59d3c0

Merge branch 'master' into wip-kiba-2

75b8339

thbar added a commit that referenced this pull request Dec 22, 2017

Implement YieldingTransform tests

ca31e00

This cherry-picks from #41 but improves the syntax a bit.

thbar added a commit that referenced this pull request Dec 22, 2017

Implement YieldingRunner

d30cc0b

This cherry-picks from #41 but with a more DRY code.

thbar mentioned this pull request Dec 22, 2017

Introduce new StreamingRunner #44

Merged

thbar closed this Dec 22, 2017

thbar deleted the wip-kiba-2 branch December 25, 2017 00:29
