Insert batches #450

Merged
merged 7 commits into from
Mar 9, 2021

c0c0n3 (Member) commented Feb 10, 2021

Proposed changes

This PR enables splitting the SQL rows to insert into batches. If the list of rows the Translator has to insert is larger than a configured value M, the rows get split into smaller batches (lists), each no larger than M (barring a single row that is itself bigger than M), and each batch gets inserted separately, i.e. the Translator issues a separate SQL bulk insert for each batch. We do this since some backends (e.g. Crate) limit how much data you can shovel into a single SQL (bulk) insert statement; see #445 about it.
Splitting happens as explained in the notes below. It uses a cost function to compute how much data each row to insert holds in memory, and a maximum batch size M (= cost in bytes) read from the env; see the notes below about configuration.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in downstream modules

Further comments

Splitting spec

We split a stream into batches so that the cumulative cost of each batch is within a set cost goal. Given an input stream s and a cost function c, we want to produce a sequence of streams b such that joining the b streams yields s and, for each b stream of length > 1, mapping c over its elements and summing the costs yields a value ≤ M, where M is a configured cost goal. In symbols:

    (1)    s = b[0] + b[1] + ...
    (2)    b[k] = [x1, x2, ...] ⟹ M ≥ c(x1) + c(x2) + ...

Notice that, to make batches satisfying (1) and (2), some b[k] may have to contain just one element x with c(x) > M. That doesn't violate either condition, since (2) only constrains batches of length > 1.
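To make conditions (1) and (2) concrete, here is a minimal sketch of one way to produce such batches. It is not the code in this PR: the name `split_by_cost` and its parameters are made up for illustration, it groups greedily from left to right, and it buffers one batch at a time as a list instead of working on streams all the way down.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def split_by_cost(items: Iterable[T], cost: Callable[[T], int],
                  max_cost: int) -> Iterator[List[T]]:
    """Hypothetical helper: group items, left to right, into batches whose
    total cost stays within max_cost. An item costing more than max_cost
    ends up in a batch of its own."""
    batch: List[T] = []
    batch_cost = 0
    for item in items:
        c = cost(item)
        # Close the current batch if adding this item would exceed the goal.
        if batch and batch_cost + c > max_cost:
            yield batch
            batch, batch_cost = [], 0
        batch.append(item)
        batch_cost += c
    if batch:
        yield batch


# Toy data: the cost of a row is just its length, and M = 3.
rows = ["aa", "b", "cccccc", "d", "ee"]
batches = list(split_by_cost(rows, cost=len, max_cost=3))
# batches == [["aa", "b"], ["cccccc"], ["d", "ee"]]
```

Joining the batches gives back `rows`, as (1) requires, and the only batch whose cost exceeds 3 is the singleton `["cccccc"]`, which (2) allows.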

Implementation

We use Python streams to process data in constant space and linear time. Working with Python streams is anything but easy in my opinion, so the implementation looks quite involved, but the concept is fairly simple. In fact, for the mathematically inclined soul out there, the Python implementation does what this recursively defined function does (using Haskell-y syntax for lists), only in a more obscure way:

ϕ []          = []
ϕ [x]         = [[x]]

ϕ [x, y, ...] = [ x:t, u, ...]       if c(x) + Σ c(t[i]) ≤ M
ϕ [x, y, ...] = [ [x], t, u, ...]    otherwise

   where  [t, u, ...] = ϕ [y, ...]
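For readers who prefer Python to Haskell-y notation, the definition of ϕ can be transcribed almost line by line. This is only a reading aid under stated assumptions, not the implementation in this PR: the name `phi` and its parameters are made up here, it works on in-memory sequences rather than streams, and it recurses once per element, so it is not meant for large inputs.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def phi(xs: Sequence[T], cost: Callable[[T], int], m: int) -> List[List[T]]:
    """Hypothetical transcription of ϕ: batch the tail first, then either
    prepend the head to the tail's first batch or give it its own batch."""
    if not xs:                      # ϕ [] = []
        return []
    if len(xs) == 1:                # ϕ [x] = [[x]]
        return [[xs[0]]]
    x = xs[0]
    rest = phi(xs[1:], cost, m)     # [t, u, ...] = ϕ [y, ...]
    t = rest[0]
    if cost(x) + sum(cost(e) for e in t) <= m:
        return [[x] + t] + rest[1:]  # [x:t, u, ...]
    return [[x]] + rest              # [[x], t, u, ...]


# Toy data: the cost of a row is just its length, and M = 3.
result = phi(["aa", "b", "cccccc", "d", "ee"], len, 3)
# result == [["aa", "b"], ["cccccc"], ["d", "ee"]]
```

Since ϕ recurses on the tail before looking at the head, batches fill up starting from the end of the list. For instance, with costs 2, 1, 2 and M = 3 it yields `[[2], [1, 2]]`.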

Notice this isn't a solution to #193 but is certainly one piece of the puzzle if we want to piece together a stream-based architecture. Why should we care? Well, even if we split the insert into batches, we still have two huge datasets in memory: the Python representation of the input NGSI JSON doc and its translation to tabular format. Ouch, not exactly a big-data friendly design. In an ideal world, the notify endpoint would work in constant space and linear time...

Configuration

There's a new INSERT_MAX_SIZE env var to turn on the splitting into batches. If set, this variable limits how much data you can shovel into a single SQL bulk insert to a value M; see above for the details of how data gets split into batches of size at most M. We read this variable on each API call to the notify endpoint, so it's sort of dynamic that way: a new value affects every later insert operation. Accepted values are sizes in bytes (B) or 2^10 multiples (KiB, MiB, GiB), e.g. 10 B, 1.2 KiB, 0.9 GiB. (Technically, anything bitmath can digest will do, e.g. MB, kB, and friends.) If the variable isn't set (or the set value isn't valid), the Translator processes SQL inserts normally without splitting data into batches.
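The "unset or invalid means no batching" rule could look roughly like the sketch below. Per the text above, the PR relies on bitmath to parse sizes; this sketch swaps in a small standard-library regex so it runs without extra dependencies, and it only understands B, KiB, MiB and GiB. The function name `read_insert_max_size` and the `env` parameter are assumptions made for illustration.

```python
import os
import re
from typing import Mapping, Optional

# Multipliers for the binary units mentioned above (stand-in for bitmath).
_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(B|KiB|MiB|GiB)\s*$")


def read_insert_max_size(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Hypothetical helper: return the configured batch size in whole bytes,
    or None when INSERT_MAX_SIZE is unset or not a valid size, in which
    case the caller would insert without batching."""
    raw = env.get("INSERT_MAX_SIZE")
    if raw is None:
        return None
    match = _SIZE_RE.match(raw)
    if match is None:
        return None
    number, unit = match.groups()
    # Fractions of a byte are truncated.
    return int(float(number) * _UNITS[unit])


# Passing a dict instead of os.environ keeps the example self-contained.
size = read_insert_max_size({"INSERT_MAX_SIZE": "1.2 KiB"})
# size == 1228
```

Calling the function on every request, rather than once at start-up, is what would make the setting "sort of dynamic" in the sense described above.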

github-actions bot (Contributor) commented Feb 10, 2021

CLA Assistant Lite bot All contributors have signed the CLA ✍️

@amotl amotl mentioned this pull request Feb 10, 2021
@c0c0n3 c0c0n3 requested a review from chicco785 March 9, 2021 17:25
c0c0n3 (Member, Author) commented Mar 9, 2021

@chicco785 need your approval to merge :-)

chicco785 (Contributor) left a comment

LGTM

@chicco785 chicco785 merged commit 41bb25f into master Mar 9, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Mar 9, 2021
@c0c0n3 c0c0n3 deleted the batch-inserts branch March 9, 2021 18:03