
Simplify backfill process #21

Merged
merged 4 commits into main on Jun 28, 2022
Conversation

duncanjbrown
Contributor

@duncanjbrown commented Jun 24, 2022

Depends on #20.

The old way 🪨

When bulk-importing existing records into BQ, we used to:

  • get $batch_size records from the table using find_in_batches
  • loop over them to build a heap of events in memory
  • enqueue one SendEvents job with all the events in
  • before the next batch, sleep for $sleep_time so we don't enqueue too many (??)
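The old loop can be sketched in plain Ruby (illustrative only: the real code used ActiveRecord's `find_in_batches` and a `SendEvents` Sidekiq job; the names and data below are stand-ins):

```ruby
BATCH_SIZE = 2
SLEEP_TIME = 0 # the real loop slept between batches to avoid flooding the queue

records = (1..5).to_a # stand-in for rows from the table
enqueued = []         # stand-in for the Sidekiq queue

records.each_slice(BATCH_SIZE) do |batch|
  # build a heap of events in memory, one per record
  events = batch.map { |id| { entity_id: id, event_type: "import" } }
  enqueued << events  # stand-in for SendEvents.perform_async(events)
  sleep SLEEP_TIME    # so we don't enqueue too many at once
end
```

Everything here runs in one long-lived process, which is why a dropped ssh session or mid-loop crash left the backfill half done.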

This wasn't great because we ran it via rake over ssh, and the loop would often outlive the ssh session timeout, or crash partway through with half the set enqueued. That meant we accreted extra "features", like being able to restart the process from a given ID (which would be helpfully logged when the process crashed).

The new way ✨

Break up the finding of the records from the building of the events:

  • get $batch_size records from the table using find_in_batches
  • per batch, pluck all the IDs for the batch and give them as an array arg to a single LoadEntityBatch job

And in LoadEntityBatch:

  • load the records for the IDs
  • reduce them to a single payload with $batch_size events in
  • send that payload
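The split can be sketched like so (illustrative: the real code plucks ids with ActiveRecord and enqueues a Sidekiq `LoadEntityBatch` job; the table and method here are stand-ins):

```ruby
BATCH_SIZE = 2
TABLE = { 1 => "alice", 2 => "bob", 3 => "carol" } # stand-in for the DB table

# What LoadEntityBatch#perform does: load the records for the ids,
# reduce them to a single payload, and send it.
def load_entity_batch(ids)
  records = ids.map { |id| TABLE.fetch(id) }
  payload = records.map { |r| { event_type: "import", data: r } }
  payload # stand-in for sending one payload to BQ
end

# Outer loop: only ids cross the job boundary.
payloads = TABLE.keys.each_slice(BATCH_SIZE).map do |ids|
  load_entity_batch(ids) # stand-in for LoadEntityBatch.perform_async(ids)
end
```

Because the outer loop only plucks ids and enqueues, it finishes quickly; the heavy lifting happens inside each job.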

Also, cache the analytics data from the YAML so we don't read and parse it once per record
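A minimal sketch of the caching fix (illustrative: Rails' `config_for` re-reads and re-parses the YAML on every call, hence the memoization; the module name and YAML content are stand-ins):

```ruby
require "yaml"

module AnalyticsFields
  RAW = "candidates:\n  - id\n  - email\n" # stand-in for config/analytics.yml

  def self.all
    @all ||= YAML.safe_load(RAW) # parse once, then reuse for every record
  end
end
```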

Advantages 📈

  • atomic: either the batch succeeds or it doesn't, so we eliminate dupes
  • much simpler outer loop so no need for all that recovery code
  • sidekiq does all our retrying for us
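Why the batch is all-or-nothing, in miniature (illustrative: Sidekiq supplies the real retry behaviour; a plain `retry` stands in here):

```ruby
ids = [1, 2, 3]
attempts = 0
begin
  attempts += 1
  raise "transient error" if attempts == 1 # first run fails...
  sent = ids.dup # ...and the whole batch is re-sent, never half of it
rescue RuntimeError
  retry # stand-in for Sidekiq's retry queue
end
```

A failed job re-runs with the same id array, so a retry can never produce a partially-sent batch, which is what eliminates the dupes.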

Next 🔮

The outer loop should now be fast enough to run in-process from a web request, so we could try to build a little UI on top of it!

Clearer and less verbose
Instead of chewing through every batch in the course of running
LoadEntities, put the ids for each batch into an async job and let that
job put together a fat payload of all the import events for all those
ids - less network, and atomic retries (because it's all or nothing).

Remove 'sleep' option because it's unnecessary - risk of overwhelming
the queue is much smaller.
- custom queue
- retry behaviour
config_for doesn't cache (!)
@duncanjbrown merged commit 41c55ab into main on Jun 28, 2022
@duncanjbrown deleted the load-entities branch on June 28, 2022 14:22