-
Notifications
You must be signed in to change notification settings - Fork 365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
create entries batch multiple entries into single insert #550
Conversation
- merge all cataloger configuration parameters into params.Catalog Updated "cataloger.batch_read.read_entry_max_wait" changed to "cataloger.batch_read_entry_max_wait" "cataloger.batch_read.scan_timeout" changed to "cataloger.batch_read_scan_timeout" "cataloger.batch_read.batch_delay" changed to "cataloger.batch_read_delay" "cataloger.batch_read.entries_read_at_once" changed to "cataloger.batch_read_entries_at_once" "cataloger.batch_read.readers" changed to "cataloger.batch_read_readers" Added "cataloger.create_entries_insert_size" "cataloger.cache.enabled" "cataloger.cache.size" "cataloger.cache.expiry" "cataloger.cache.jitter"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, nice and clear!
Not sure I'm a fan of squashing the params out of cataloger.batchParams
, because I like parameter objects. We could even have a single config
call that passed the BatchParams
for a configuration, in line with what we did elsewhere.
Blocker is behaviour on creating the same object multiple times. Single clock is important but I won't block on it.
catalog/cataloger_create_entries.go
Outdated
for _, entry := range entries[i:j] { | ||
creationDate := entry.CreationDate | ||
if entry.CreationDate.IsZero() { | ||
creationDate = time.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a different clock than on the database. I think we should use "now" on the database, like we now do on CreateEntry.
catalog/cataloger_create_entries.go
Outdated
if _, err := insertEntry(tx, branchID, &entries[i]); err != nil { | ||
return nil, fmt.Errorf("entry at %d: %w", i, err) | ||
// single insert per batch | ||
for i := 0; i < len(entries); i += c.CreateEntriesInsertSize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When there are duplicate create-entries, this behaves unpredictably:
- If a dupe appears inside a batch, we receive an error from INSERT (and I assume a strange error message).
- If a dupe appears between batches, the latest version wins.
I don't particularly care whether it should succeed or fail, but if I had to rank them I'd go with:
- The behaviour on AWS S3.
- Latest version wins, always.
- A decipherable error, always, and no file is created.
- A decipherable error, always, and some of the files may be created.
In order to do any of these we will have to examine all entries in the batch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going with "Latest version wins"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stated changes but forgot to Request Changes. A wise man once said (almost):
You see, you know how to state the changes, you just don't know how to request the changes. And that's really the most important part of the request: the blocking. Anybody can just state the changes.
catalog params split into parts in code and configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Niiiice! Thanks.
ent := entriesMap[entries[i].Path] | ||
if &entries[i] == ent { | ||
entriesToInsert = append(entriesToInsert, ent) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC the goal here is to keep the insert the last object for each key in the order of the first appearance of each key. This is a bit odd. To keep the files in their expected ordering, I would write something like this (untested, of course):
for i := len(entries)-1, j := len(entries); i >= 0; i-- {
if &entries[i] == entriesMap[entries[i].Path] {
j--
entries[j] := entries[i]
}
}
entries = entries[j:]
It is a bit more subtle, but it gives a better ordering I think. If you don't care about the ordering, just iterate over the map.
If you want the last-last ordering above but without editing the slice in-place:
entriesToInsert := make([]*Entry, len(entriesMap))
for i := len(entries)-1, j := len(entries) - 1; i >= 0; i-- {
if &entries[i] == entriesMap[entries[i].Path] {
entriesToInsert[j] := entries[i]
j--
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
F2Fing, the existing code does exactly what I want. Sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like the first loop - I used slice of pointers and keep just the right references in the new slice that we process.
Added
"cataloger.batch_write.insert_size"
"cataloger.cache.enabled"
"cataloger.cache.size"
"cataloger.cache.expiry"
"cataloger.cache.jitter"
Closes #547