-
-
Notifications
You must be signed in to change notification settings - Fork 88
SQL Bulk Lookup
A common requirement during ETL processing is to search a table column for a given (unique) value, find the corresponding row, then get the value of a second column in that same row.
This can be very valuable to:
- Lookup dimension keys using names e.g. find the department named
Sales Department
and get the primary key value e.g101
- Replace a given business key by the corresponding surrogate key (see wikipedia page on surrogate keys) in e.g. a datawarehouse.
- Help handle relationships conversion (foreign keys) between tables during migrations between databases (e.g. one could keep a
legacy_id
in a new product table, then an invoice line item referring to thelegacy_id
can easily replace that id with the new target id)
Kiba Pro SQLBulkLookup
provides an easy and efficient way to replace those relations. It handles a large group of rows (batch) at once to avoid N queries, but rather 1 per rows batch.
Currently tested against: PostgreSQL 9.5+, MySQL 5.5+, MRI Ruby 2.4-2.7.
Requirements: make sure to add those to your Gemfile
:
-
sequel
gem -
pg
gem (if you connect to Postgres) -
mysql2
gem (if you connect to MySQL)
Given the following products
table:
id | legacy_product_id |
---|---|
1 | 100 |
2 | 107 |
3 | 126 |
One can setup a transform like this:
require 'kiba-common/sources/enumerable' # for demo
require 'kiba-pro/transforms/sql_bulk_lookup'
job = Kiba.parse do
# fake a source which would refer to some db specific ids, lacking the primary key we need
source Kiba::Common::Sources::Enumerable, -> [{external_id: 100}, {external_id: 107}, {external_id: 126}]
transform Kiba::Pro::Transforms::SQLBulkLookup,
# NOTE: a live Sequel connection must be passed here
database: db,
table: :products,
# number of rows to batch in a single SQL lookup query
buffer_size: 2_500,
row_input: :external_id, # name of the Hash key above
sql_input: :legacy_product_id, # name of the SQL column we'll be searching for
sql_output: :id, # name of the SQL column we want to retrieve
row_output: :product_id # name of the Hash key to add in the row with the retrieved value
# SNIP
end
Such a transform will give us rows with the following content:
{external_id: 100, product_id: 1}
{external_id: 107, product_id: 2}
{external_id: 126, product_id: 3}
By default, to be safe, the transform will raise an error and interrupt the processing if:
- No matching record is found for the lookup criteria.
- More than one matching record is found for the lookup criteria.
- A matching record is found, but the column referred to by
sql_output:
has aNULL
value
This fail-fast behaviour is here by design to ensure that by default, you will not unknowingly create orphan records, which can be harmful in your reports or datawarehouses.
If your scenario can safely accept missing ids, you can turn on the following option:
transform Kiba::Pro::Transforms::SQLBulkLookup,
# SNIP
raise_if_no_match: false
There is no option at the moment to support cases where you would have more than one match. A future "joiner" component may handle that. Please get in touch if you are interested!
Home | Core Concepts | Defining jobs | Running jobs | Writing sources | Writing transforms | Writing destinations | Implementation Guidelines | Kiba Pro
This wiki is tracked by git and publicly editable. You are welcome to fix errors and typos. Any defacing or vandalism of content will result in your changes being reverted and you being blocked.