Aidbox bulk API #434
Nesmeshnoy
started this conversation in
General
Bulk load is hard: there are many requirements, limitations, and tradeoffs around performance, validation, and transactional consistency. This is a proposal for a new Bulk API that gives the user explicit options.
Validation Problem
There are two problems with validation during bulk upload:
The first is performance, especially when you want to validate references and terminology, since those checks turn into database queries.
The second is deciding when the upload operation should fail: on the first error, after n errors, or only after inspecting all errors in the dataset.
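The failure policies listed above can be expressed as a single error threshold. The following is only a sketch of that idea, not Aidbox's API: `validate_stream`, `ValidationReport`, `max_errors`, and the `require_id` check are hypothetical names, and the check is purely structural (no reference or terminology lookups).

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class ValidationReport:
    checked: int = 0                      # number of resources inspected
    errors: list = field(default_factory=list)  # (index, message) pairs
    aborted: bool = False                 # True if the threshold stopped the run


def validate_stream(resources: Iterable[dict],
                    check: Callable[[dict], Optional[str]],
                    max_errors: Optional[int] = None) -> ValidationReport:
    """Validate resources one by one.

    max_errors=1    -> fail on the first error
    max_errors=n    -> stop after n errors
    max_errors=None -> inspect the whole dataset
    """
    report = ValidationReport()
    for index, resource in enumerate(resources):
        report.checked += 1
        message = check(resource)
        if message is not None:
            report.errors.append((index, message))
            if max_errors is not None and len(report.errors) >= max_errors:
                report.aborted = True
                break
    return report


def require_id(resource: dict) -> Optional[str]:
    # Hypothetical check used for illustration only.
    return None if "id" in resource else "missing id"
```

With `max_errors=1` the run stops at the first bad resource. With `max_errors=None` every resource is checked, which costs more but reports every problem in one pass.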
Consistency Problem
You may want to roll back the whole upload on any error. The bulk upload also takes time; writes and updates that happen during that window may be overwritten by the bulk dataset, i.e. lost.
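The "roll back the whole upload on any error" half of this problem can be illustrated with a single database transaction. This is a minimal sketch using Python's built-in `sqlite3` as a stand-in database; the `patient` table and the `load_all_or_nothing` function are hypothetical. It does not address the second half, concurrent writes being overwritten.

```python
import sqlite3


def load_all_or_nothing(conn: sqlite3.Connection, rows) -> bool:
    """Insert every row, or none of them if any row violates a constraint."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            conn.executemany(
                "INSERT INTO patient (id, name) VALUES (?, ?)", rows
            )
    except sqlite3.IntegrityError:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (id TEXT PRIMARY KEY, name TEXT NOT NULL)")

# The second row breaks the NOT NULL constraint, so the first is rolled back too.
ok = load_all_or_nothing(conn, [("p1", "Ann"), ("p2", None)])
count_after_failure = conn.execute("SELECT COUNT(*) FROM patient").fetchone()[0]
```

A single long transaction keeps the dataset atomic, but it is held open for the whole upload.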
Protocol & Performance Problems
We do not want the upload to consume all of the server's memory, which requires some kind of stream processing. When loading a huge amount of data, every operation (even just parsing JSON) can become a performance problem. In practice, HTTP is poorly suited to uploading huge files as a stream; most implementations (like AWS S3) split files into chunks and assemble the resulting file on the server. We solve this by inverting the upload into a streaming-friendly download.
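To make the memory argument concrete, here is a sketch of consuming a download as a stream and handing it on in fixed-size batches, so memory use depends on the batch size and not on the file size. It assumes newline-delimited JSON (one resource per line), which the proposal does not specify; `iter_batches` and `batch_size` are hypothetical names.

```python
import io
import json
from typing import BinaryIO, Iterator, List


def iter_batches(stream: BinaryIO, batch_size: int = 1000) -> Iterator[List[dict]]:
    """Yield lists of parsed resources, holding at most batch_size in memory."""
    batch: List[dict] = []
    for raw_line in stream:          # file-like objects iterate line by line
        line = raw_line.strip()
        if not line:                 # tolerate blank lines
            continue
        batch.append(json.loads(line))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:                        # flush the final, possibly short, batch
        yield batch


# An in-memory stream stands in for a real download here. The response object
# returned by urllib.request.urlopen is also a binary file-like object and
# could be passed in the same way.
sample = io.BytesIO(b'{"id": "1"}\n{"id": "2"}\n\n{"id": "3"}\n')
batches = list(iter_batches(sample, batch_size=2))
```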
Error Introspection Problem
Investigating errors in bulk can be challenging. Ideally, users want to see as many errors as possible so they can fix them in one iteration. There may be a lot of errors, so grouping and introspection tools can simplify debugging.
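One possible form of such grouping is to collapse identical errors into a count plus a few sample resource ids. The error record shape below (`resourceType`, `id`, `path`, `message`) and the `group_errors` function are assumptions made for illustration; the proposal does not define an error format.

```python
from typing import Dict, List, Tuple


def group_errors(errors: List[dict], sample_size: int = 3) -> List[Tuple[tuple, dict]]:
    """Group errors by (resourceType, path, message), most frequent first."""
    groups: Dict[tuple, dict] = {}
    for err in errors:
        key = (err["resourceType"], err["path"], err["message"])
        group = groups.setdefault(key, {"count": 0, "samples": []})
        group["count"] += 1
        # Keep only a few example ids so the report stays small.
        if len(group["samples"]) < sample_size:
            group["samples"].append(err["id"])
    return sorted(groups.items(), key=lambda item: item[1]["count"], reverse=True)


errors = [
    {"resourceType": "Patient", "id": "p1", "path": "birthDate", "message": "invalid date"},
    {"resourceType": "Patient", "id": "p2", "path": "birthDate", "message": "invalid date"},
    {"resourceType": "Observation", "id": "o1", "path": "subject", "message": "reference not found"},
    {"resourceType": "Patient", "id": "p3", "path": "birthDate", "message": "invalid date"},
]
grouped = group_errors(errors, sample_size=2)
```

Sorting by count puts the most widespread problem first, which is usually the one worth fixing before re-running the upload.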
Parts of the Solution
Basic steps of bulk upload may be:
These atomic steps may be composed into a complex operation like aidbox.bulk.import, which would consist of load, validate and, if there are no errors, merge and drop stage.
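The composition described above could look roughly like the following. This is an in-memory sketch only: the function names mirror the step names in the text, but their signatures, the list used as a stage, and the dict used as the target store are all hypothetical. It also assumes the stage is dropped only after a successful merge, so a failed stage stays available for inspection.

```python
from typing import Callable, Dict, Iterable, List, Optional

Resource = dict
Check = Callable[[Resource], Optional[str]]


def load(rows: Iterable[Resource]) -> List[Resource]:
    """Copy incoming rows into a stage without touching the target."""
    return list(rows)


def validate(stage: List[Resource], check: Check) -> List[str]:
    """Return all error messages found in the stage."""
    return [msg for msg in map(check, stage) if msg is not None]


def merge(stage: List[Resource], target: Dict[str, Resource]) -> int:
    """Upsert staged resources into the target, keyed by id."""
    for resource in stage:
        target[resource["id"]] = resource
    return len(stage)


def drop(stage: List[Resource]) -> None:
    """Discard the staged data."""
    stage.clear()


def bulk_import(rows: Iterable[Resource], target: Dict[str, Resource], check: Check) -> dict:
    """load -> validate -> (if no errors) merge -> drop stage."""
    stage = load(rows)
    errors = validate(stage, check)
    if errors:
        # Assumption: keep the stage so the errors can be investigated.
        return {"status": "failed", "errors": errors, "merged": 0, "stage": stage}
    merged = merge(stage, target)
    drop(stage)
    return {"status": "ok", "errors": [], "merged": merged, "stage": stage}
```

Keeping the steps separate means a caller can also run them individually, for example load and validate only, to preview errors without changing the target.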
The general idea is to explicitly introduce a Staging Table (Staging Resource) and basic operations on it:
Open questions