Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create entries batch multiple entries into single insert #550

Merged
merged 4 commits into from
Sep 2, 2020

Conversation

nopcoder
Copy link
Contributor

@nopcoder nopcoder commented Sep 1, 2020

  • merge all cataloger configuration parameters into params.Catalog

Added

"cataloger.batch_write.insert_size"
"cataloger.cache.enabled"
"cataloger.cache.size"
"cataloger.cache.expiry"
"cataloger.cache.jitter"

Closes #547

- merge all cataloger configuration parameters into params.Catalog

Updated

"cataloger.batch_read.read_entry_max_wait" changed to "cataloger.batch_read_entry_max_wait"
"cataloger.batch_read.scan_timeout" changed to "cataloger.batch_read_scan_timeout"
"cataloger.batch_read.batch_delay" changed to "cataloger.batch_read_delay"
"cataloger.batch_read.entries_read_at_once" changed to "cataloger.batch_read_entries_at_once"
"cataloger.batch_read.readers" changed to "cataloger.batch_read_readers"

Added

"cataloger.create_entries_insert_size"
"cataloger.cache.enabled"
"cataloger.cache.size"
"cataloger.cache.expiry"
"cataloger.cache.jitter"
@nopcoder nopcoder requested a review from johnnyaug September 1, 2020 14:49
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, nice and clear!

Not sure I'm a fan of squashing the params out of cataloger.batchParams, because I like parameter objects. We could even have a single config call that passed the BatchParams for a configuration, in line with what we did elsewhere.

Blocker is behaviour on creating the same object multiple times. Single clock is important but I won't block on it.

for _, entry := range entries[i:j] {
creationDate := entry.CreationDate
if entry.CreationDate.IsZero() {
creationDate = time.Now()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a different clock than on the database. I think we should use "now" on the database, like we now do on CreateEntry.

if _, err := insertEntry(tx, branchID, &entries[i]); err != nil {
return nil, fmt.Errorf("entry at %d: %w", i, err)
// single insert per batch
for i := 0; i < len(entries); i += c.CreateEntriesInsertSize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there are duplicate create-entries, this behaves unpredictably:

  • If a dupe appears inside a batch, we receive an error from INSERT (and I assume a strange error message).
  • If a dupe appears between batches, the latest version wins.

I don't particularly care whether it should succeed or fail, but if I had to rank them I'd go with:

  1. The behaviour on AWS S3.
  2. Latest version wins, always.
  3. A decipherable error, always, and no file is created.
  4. A decipherable error, always, and some of the files may be created.

In order to do any of these we will have to examine all entries in the batch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going with "Latest version wins"

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stated changes but forgot to Request Changes. A wise man once said (almost):

You see, you know how to state the changes, you just don't know how to request the changes. And that's really the most important part of the request: the blocking. Anybody can just state the changes.

@nopcoder
Copy link
Contributor Author

nopcoder commented Sep 2, 2020

catalog params split into parts in code and configuration

@nopcoder nopcoder requested a review from arielshaqed September 2, 2020 12:22
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Niiiice! Thanks.

ent := entriesMap[entries[i].Path]
if &entries[i] == ent {
entriesToInsert = append(entriesToInsert, ent)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the goal here is to keep the insert the last object for each key in the order of the first appearance of each key. This is a bit odd. To keep the files in their expected ordering, I would write something like this (untested, of course):

for i := len(entries)-1, j := len(entries); i >= 0; i-- {
	if &entries[i] == entriesMap[entries[i].Path] {
		j--
		entries[j] := entries[i]
	}
}
entries = entries[j:]

It is a bit more subtle, but it gives a better ordering I think. If you don't care about the ordering, just iterate over the map.

If you want the last-last ordering above but without editing the slice in-place:

entriesToInsert := make([]*Entry, len(entriesMap))
for i := len(entries)-1, j := len(entries) - 1; i >= 0; i-- {
	if &entries[i] == entriesMap[entries[i].Path] {
		entriesToInsert[j] := entries[i]
		j--
	}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F2Fing, the existing code does exactly what I want. Sorry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like the first loop - I used slice of pointers and keep just the right references in the new slice that we process.

@nopcoder nopcoder merged commit 607e9e2 into master Sep 2, 2020
@nopcoder nopcoder deleted the chore/create-entries-batch branch September 2, 2020 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CreateEntries support batch insert
2 participants