Yielding Runner #41

Closed
wants to merge 18 commits into from

Owner

@thbar thbar commented Apr 22, 2017

This PR introduces a work-in-progress "alternate runner", which is a candidate runner for Kiba 2.

The new candidate runner brings two major benefits.

Ability to yield multiple rows from a given class transform

The new runner allows "class transforms" (transforms written as classes, rather than blocks) to yield an arbitrary number of rows.

It allows the source presented in this blog post to be rewritten as a transform like this:

class TagSplitter
  def process(row)
    row.fetch(:buyers).split(':').each do |buyer|
      yield(row.deep_copy.merge(buyer: buyer))
    end
    nil # tell the pipeline to ignore the final value
  end
end

More importantly, it allows such transforms to be stacked, each one yielding sub-rows to the next.
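To make the stacking concrete, here is a minimal sketch in plain Ruby. The `run_transforms` helper is a hypothetical stand-in for the runner (it is not Kiba's implementation), and `BuyerSplitter` and `TagExploder` are illustrative names. It assumes the convention used in this PR: yielded rows go downstream, and a non-nil return value is passed downstream too.

```ruby
# Hypothetical stand-in for the runner, NOT Kiba's actual code.
# Each transform may yield any number of rows; a non-nil return
# value is also passed on to the next transform.
def run_transforms(rows, transforms)
  transforms.reduce(rows) do |stream, transform|
    stream.flat_map do |row|
      emitted = []
      returned = transform.process(row) { |sub_row| emitted << sub_row }
      emitted << returned unless returned.nil?
      emitted
    end
  end
end

# Illustrative transform: one output row per buyer.
class BuyerSplitter
  def process(row)
    row.fetch(:buyers).split(':').each do |buyer|
      yield(row.merge(buyer: buyer))
    end
    nil # ignore the value returned by `each`
  end
end

# Illustrative transform: one output row per tag.
class TagExploder
  def process(row)
    row.fetch(:tags).split(',').each do |tag|
      yield(row.merge(tag: tag))
    end
    nil
  end
end

rows = [{ buyers: 'ann:bob', tags: 'x,y' }]
result = run_transforms(rows, [BuyerSplitter.new, TagExploder.new])
pairs = result.map { |row| [row[:buyer], row[:tag]] }
```

Each row yielded by the first transform is fed to the second one, so a single input row with two buyers and two tags ends up as four sub-rows.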

Increased ability to write reusable Kiba components

Let's pick an example 😄

Imagine you have created a Kiba source able to extract XML elements from a group of files on disk (with each file containing N elements).

It would typically look like this:

class MySource
  attr_reader :dir_pattern

  def initialize(dir_pattern:)
    @dir_pattern = dir_pattern
  end

  def each
    Dir[dir_pattern].sort.each do |file|
      doc = Nokogiri::XML(IO.binread(file))
      doc.search('/invoices/invoice').each do |item|
        yield(item)
      end
    end
  end
end

Such a class has 4 responsibilities:

  • The directory iteration
  • The XML parsing
  • The XML node search
  • The act of exploding each detected sub-node as an independent row

With Kiba 1 you can achieve some level of splitting here by using the decorator technique I outlined here, but this can only take you so far.

With the new Kiba runner, you can first rewrite the code above as 4 independent components (1 source & 3 transforms):

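class DirectoryLister
  attr_reader :dir_pattern

  def initialize(dir_pattern:)
    @dir_pattern = dir_pattern
  end

  # Yield each matching filename, in sorted order
  def each
    Dir[dir_pattern].sort.each do |filename|
      yield(filename)
    end
  end
end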

class XMLReader
  def process(filename)
    Nokogiri::XML(IO.binread(filename))
  end
end

class XMLSearcher
  attr_reader :selector

  def initialize(selector:)
    @selector = selector
  end

  # Return the set of nodes matching the selector
  def process(doc)
    doc.search(selector)
  end
end

class EnumerableExploder
  def process(row)
    row.each do |item|
      yield(item)
    end
    nil # tell the pipeline to ignore the final value
  end
end   

Which can then be used:

source DirectoryLister, dir_pattern: '*.xml'
transform XMLReader
transform XMLSearcher, selector: '/invoices/invoice'
transform EnumerableExploder

While this can appear more complicated at first, each of these 4 components can now be mixed and matched with other components in completely unrelated scenarios.

For instance:

  • The DirectoryLister could be used to list anything (JSON files etc).
  • The EnumerableExploder, similarly, could be used for pretty much anything.
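As a rough illustration of that reuse, the sketch below feeds JSON files through the same `DirectoryLister` and `EnumerableExploder`. `JSONReader` and `list_json_items` are hypothetical names that are not part of this PR, and the components are wired together by hand rather than through the Kiba DSL so the example runs on its own.

```ruby
require 'json'
require 'tmpdir'

class DirectoryLister
  def initialize(dir_pattern:)
    @dir_pattern = dir_pattern
  end

  def each
    Dir[@dir_pattern].sort.each { |filename| yield(filename) }
  end
end

# Hypothetical transform (not in the PR): parse one JSON file.
class JSONReader
  def process(filename)
    JSON.parse(File.read(filename))
  end
end

class EnumerableExploder
  def process(row)
    row.each { |item| yield(item) }
    nil # tell the pipeline to ignore the final value
  end
end

# Hand-wired pipeline, standing in for the runner.
def list_json_items(dir_pattern)
  items = []
  reader = JSONReader.new
  exploder = EnumerableExploder.new
  DirectoryLister.new(dir_pattern: dir_pattern).each do |filename|
    exploder.process(reader.process(filename)) { |item| items << item }
  end
  items
end

# Demo on two temporary files, each holding a JSON array.
items = Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.json'), JSON.generate([{ 'id' => 1 }, { 'id' => 2 }]))
  File.write(File.join(dir, 'b.json'), JSON.generate([{ 'id' => 3 }]))
  list_json_items(File.join(dir, '*.json'))
end
```

Only the reader changes between the XML and JSON cases; the lister and the exploder are reused untouched.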

This opens the door to providing more composable and more reusable components to Kiba users, either directly or as part of Kiba Common or Kiba Pro.

Notes on current implementation

The new implementation relies on nested Enumerator::Lazy instances.
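For readers unfamiliar with the technique, here is a minimal sketch of how nested `Enumerator::Lazy` instances can support yielding transforms. It is an assumption-laden illustration, not the code in this PR: `lazy_pipeline`, `Repeater` and `AddOne` are hypothetical names.

```ruby
# Hypothetical helper, NOT Kiba's actual runner code.
# Each transform wraps the previous stream in a lazy flat_map; the
# inner lazy enumerator collects yielded rows plus any non-nil
# return value of `process`.
def lazy_pipeline(source, transforms)
  transforms.reduce(source.lazy) do |stream, transform|
    stream.flat_map do |row|
      Enumerator.new do |out|
        returned = transform.process(row) { |sub_row| out << sub_row }
        out << returned unless returned.nil?
      end.lazy
    end
  end
end

# Illustrative yielding transform: emits two rows per input row.
class Repeater
  def process(row)
    yield(row)
    yield(row * 10)
    nil
  end
end

# Illustrative classic transform: returns exactly one row.
class AddOne
  def process(row)
    row + 1
  end
end

# The source is infinite, so this only terminates because the
# pipeline pulls rows lazily, one at a time.
pipeline = lazy_pipeline(1..Float::INFINITY, [Repeater.new, AddOne.new])
first_rows = pipeline.first(4)
```

Because nothing is evaluated until rows are requested, yielding and non-yielding transforms can be mixed freely, at the cost of the extra enumerator layers.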

I must still benchmark this behaviour in terms of performance compared to Kiba 1, and also improve the code a bit more, before being able to decide if this will remain in Kiba 2 or not.

@thbar
Owner Author

thbar commented Apr 27, 2017

In 82bc8e5 I added a simple benchmark to compare the runners.

This is a very simplistic benchmark which does not reflect real-life use; I only wanted to get a feel for how costly using Enumerator could be.
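For a rough idea of the kind of overhead involved, the sketch below times a plain loop against an equivalent `Enumerator::Lazy` chain using the standard `Benchmark` module. This is not the benchmark from the commit above; the row count and method names are arbitrary assumptions, and timings will vary by machine and Ruby version.

```ruby
require 'benchmark'

row_count = 200_000 # arbitrary size, chosen only for illustration

# Plain eager iteration.
def eager_sum(n)
  total = 0
  (1..n).each { |i| total += i * 2 }
  total
end

# Same computation through a lazy enumerator.
def lazy_sum(n)
  (1..n).lazy.map { |i| i * 2 }.reduce(0, :+)
end

eager_result = nil
lazy_result = nil
eager_time = Benchmark.realtime { eager_result = eager_sum(row_count) }
lazy_time = Benchmark.realtime { lazy_result = lazy_sum(row_count) }

puts format('eager: %.3f sec, lazy: %.3f sec', eager_time, lazy_time)
```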

ruby 2.3.2p217
==============
TestRunner#test_benchmark : 2.12 sec
TestAlternateRunner#test_benchmark : 3.21 sec

ruby 2.4.0p0
============
TestRunner#test_benchmark : 1.89 sec
TestAlternateRunner#test_benchmark : 2.76 sec

jruby 1.9.7.0
=============
TestRunner#test_benchmark : 0.7 sec
TestAlternateRunner#test_benchmark : 2.33 sec

I will carry out more tests and will try to improve the alternate runner.

@thbar
Owner Author

thbar commented Jun 20, 2017

I've been using this on multiple internal reporting systems and things work just great. I will come back to refactor and clean up.

Ideally, I would find a way to selectively enable the (more costly) lazy enumerator for specific transforms, instead of switching the runner class entirely.

@thbar thbar changed the title from "[WIP] Kiba 2 runner" to "Yielding Runner" Dec 22, 2017
@thbar thbar self-assigned this Dec 22, 2017
thbar added a commit that referenced this pull request Dec 22, 2017
Extract runner tests to allow upcoming reuse in #41
thbar added a commit that referenced this pull request Dec 22, 2017
This cherry-picks from #41 but improves the syntax a bit.
thbar added a commit that referenced this pull request Dec 22, 2017
This cherry-picks from #41 but with a more DRY code.
thbar added a commit that referenced this pull request Dec 22, 2017
A much better choice than what was originally implemented in #41, since:
- it allows to decide which runner to pick on a per-ETL basis
- it will work inside sidekiq (vs only on command line)
@thbar
Owner Author

thbar commented Dec 22, 2017

Closing in favour of #44 (which is still a WIP but improved already).

@thbar thbar closed this Dec 22, 2017
@thbar thbar deleted the wip-kiba-2 branch December 25, 2017 00:29
@thbar
Owner Author

thbar commented Dec 25, 2017

A more polished version is now available on master after merging #44. It will land in Kiba 2 (to be released soon).
