
migrate event table primary keys from integer to bigint #6032

Merged (2 commits) on Mar 27, 2020

Conversation

@ryanpetrello (Contributor) commented Feb 21, 2020

see: #6010

tl;dr

  1. Rename the old event table (to _old_main_jobevent) and copy its schema to a new table (main_jobevent), but with id as a bigint; at that point the Django migration itself is done (a rough sketch of this step follows below).
  2. AWX starts; you can't see any events/stdout yet because main_jobevent is empty.
  3. The dispatcher startup code notices that the old table (which has all the data) still exists and shovels its rows into the new table in chunks until the old table is empty.
  4. The old (now empty) table is dropped, so the dispatcher startup code no longer attempts data shoveling.
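For illustration only, here is a minimal sketch of what step 1 could look like as a raw-SQL Django migration. The table names come from the description above; the dependency, the use of CREATE TABLE ... LIKE, and the single-table scope are assumptions and may differ from the actual migration in this PR.

```python
# Hypothetical sketch of the rename-and-recreate step for one event table;
# the real migration in this PR covers every event table, not just main_jobevent.
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [('main', '0xxx_previous_migration')]  # placeholder dependency

    operations = [
        migrations.RunSQL([
            # Keep the existing data around under a new name...
            'ALTER TABLE main_jobevent RENAME TO _old_main_jobevent;',
            # ...then recreate an empty table with the same shape, but with a
            # bigint primary key, so new events can be written immediately.
            'CREATE TABLE main_jobevent (LIKE _old_main_jobevent INCLUDING ALL);',
            'ALTER TABLE main_jobevent ALTER COLUMN id TYPE bigint;',
        ]),
    ]
```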

@ryanpetrello (Contributor, Author) commented Feb 21, 2020

I have not actually tested how long this takes on large datasets (yet) - that's what I'm experimenting with next.

@ryanpetrello changed the title from "migrate event table primary keys from integer to bigint" to "WIP: migrate event table primary keys from integer to bigint" on Feb 21, 2020
@ryanpetrello changed the title from "WIP: migrate event table primary keys from integer to bigint" to "RFC: migrate event table primary keys from integer to bigint" on Feb 21, 2020
chunk = 10000
with connection.cursor() as cursor:
    while offset < total_rows:
        sql = f'INSERT INTO {tblname} SELECT * FROM _old_{tblname} ORDER BY id DESC OFFSET {offset} LIMIT {chunk};'
@ryanpetrello (Contributor, Author) commented Feb 21, 2020

I'm doing this in chunked commits so that:

  1. Users with large amounts of data can see progress (i.e., they don't have to wait for one giant INSERT to finish).
  2. Each chunk inserts into the NEW table and deletes from the OLD one in its own discrete commit, so if the process is interrupted at some point (e.g., the process restarts), you can just re-run the task and it picks up where it left off (see the sketch after this list).
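A minimal sketch of that per-chunk commit pattern (not the PR's exact code; the helper name and the id = ANY(...) delete are assumptions made for illustration):

```python
# Sketch: copy newest-first in discrete, restartable chunks.
from django.db import connection, transaction


def copy_in_chunks(tblname, chunk=10000):
    with connection.cursor() as cursor:
        while True:
            with transaction.atomic():
                # Copy the newest rows that still live in the old table.
                cursor.execute(
                    f'INSERT INTO {tblname} '
                    f'SELECT * FROM _old_{tblname} ORDER BY id DESC LIMIT {chunk} RETURNING id;'
                )
                copied_ids = [row[0] for row in cursor.fetchall()]
                if not copied_ids:
                    break  # old table is empty; migration is complete
                # Delete the copied rows in the *same* commit, so an interrupted
                # run can simply be started again without duplicating data.
                cursor.execute(f'DELETE FROM _old_{tblname} WHERE id = ANY(%s);', [copied_ids])
```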

@Ladas (Contributor) commented Mar 27, 2020

It would be interesting to see benchmarks without chunking. I'd guess the chunking slows this down versus just letting Postgres do it in one statement (though I'm not sure by how much). It would also be interesting to see the speed of bigger chunks, e.g. 100k.

@ryanpetrello (Contributor, Author) commented Mar 27, 2020

I'd show you benchmarks, but I gave up on getting final numbers, because the answer is "mind-numbingly slower" :D. The cost of a transaction per insert is very high.

@ryanpetrello (Contributor, Author)

Based on some feedback from @chrismeyersfsu, I did make the "chunk size" configurable so users can lower it if they like.
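Presumably something along these lines; the setting name below is made up for illustration and is probably not what the PR actually uses:

```python
# Hypothetical setting name; see the PR's settings changes for the real one and its default.
from django.conf import settings

chunk = getattr(settings, 'JOB_EVENT_MIGRATION_CHUNK_SIZE', 1000000)
```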

@@ -660,6 +660,25 @@ def update_host_smart_inventory_memberships():
        smart_inventory.update_computed_fields()


@task()
def migrate_legacy_event_data(tblname, total_rows):
@ryanpetrello (Contributor, Author)

I should probably remove this total_rows parameter and just calculate it by hand here, so that you don't have to know it beforehand to .apply_async() this task manually.
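Computing it inside the task would only take a couple of lines, roughly (a sketch, assuming it lives in the same module where connection is already imported):

```python
# Sketch: derive the row count inside the task instead of requiring callers to pass it.
with connection.cursor() as cursor:
    cursor.execute(f'SELECT COUNT(*) FROM _old_{tblname};')
    total_rows = cursor.fetchone()[0]
```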

@ghjm (Contributor) commented Feb 21, 2020

The process looks sound to me. I was going to suggest importing newest-first because that's what the customer is most likely to want to look at, but I see you've already done that. Perhaps it would be nice to emit a log event that says "all data converted successfully", which customers can look for to know there weren't any errors (or, if out of an abundance of caution they don't want to put the system back into production until the conversion is done, to know when it has finished).

@ryanpetrello (Contributor, Author) commented Feb 21, 2020

@ghjm good idea about the "finished" message; I added a log line for it.

https://github.com/ansible/awx/pull/6032/files#diff-9d4ea1dd908b35fb92eaede4bd10bb46R680

@ryanpetrello (Contributor, Author)

One of my main goals with this is to implement the copying as an idempotent task you can launch. That way if the task is interrupted halfway through migration, you can just kick it off again and it'll pick up where it left off.

@ghjm (Contributor) commented Feb 21, 2020

How do you re-launch it?

@ghjm (Contributor) commented Feb 21, 2020

Maybe there should be some flag somewhere that notes the migration is unfinished, and a scheduled task that re-launches it if you rebooted or whatever while it was incomplete.

@ryanpetrello (Contributor, Author) commented Feb 21, 2020

> Maybe there should be some flag somewhere that notes the migration is unfinished, and a scheduled task that re-launches it if you rebooted or whatever while it was incomplete.

We could probably handle this with some sort of periodic task that used an advisory lock. Basically, wake up every minute or two, and if the "data is migrated!" flag isn't set, then try to obtain the lock and kick off the task again.

We'd need to be absolutely certain it never runs more than once at the same time, though.
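As a sketch of the idea being floated here (task and table names are illustrative, and this is not what the PR ended up doing; see the dispatcher-startup approach further down):

```python
# Hypothetical periodic task; safe to enqueue repeatedly because the migration
# task itself takes a per-table advisory lock (wait=False) and returns if it's held.
@task()
def relaunch_unfinished_bigint_migrations():
    with connection.cursor() as cursor:
        for tblname in ('main_jobevent', 'main_inventoryupdateevent'):  # illustrative table list
            cursor.execute('SELECT to_regclass(%s);', [f'_old_{tblname}'])
            if cursor.fetchone()[0] is not None:  # old table still exists, so not migrated yet
                migrate_legacy_event_data.apply_async([tblname])
```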

@softwarefactory-project-zuul: Build failed.

@softwarefactory-project-zuul: Build failed.

@ryanpetrello (Contributor, Author) commented Feb 21, 2020

@ghjm I've uploaded some changes that I think should make this work nicely. I'll spin up a 100M+ event table next week and restart services in the middle, and make sure event counts match before and after.

  1. When the dispatcher starts (or restarts) on any node (which happens post-upgrade), we look to see if any of the _old_<table> tables still exist, and if they do, we kick off the background migration task (a rough sketch of this check follows below). If none of the tables exist, it's basically just a no-op that we pay on dispatcher restart.

  2. The background migration task is wrapped in a postgres advisory lock, so if several nodes in a cluster all restart the dispatcher at the same time and all enqueue the same task, only one will win.
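A rough sketch of what the startup check in (1) might look like; the function name and the discovery query are assumptions, not necessarily the PR's code:

```python
# Sketch: on dispatcher startup, enqueue the background migration for any event
# table whose renamed _old_ copy is still present; a no-op once they're all gone.
from django.db import connection


def enqueue_unfinished_bigint_migrations():
    with connection.cursor() as cursor:
        cursor.execute(
            r"SELECT table_name FROM information_schema.tables "
            r"WHERE table_schema = 'public' AND table_name LIKE '\_old\_%';"
        )
        for (old_table,) in cursor.fetchall():
            # e.g. '_old_main_jobevent' -> 'main_jobevent'
            tblname = old_table[len('_old_'):]
            # Assumes total_rows is computed inside the task, per the earlier comment.
            migrate_legacy_event_data.apply_async([tblname])
```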

@softwarefactory-project-zuul: Build succeeded.

while total_rows:
    with transaction.atomic():
        cursor.execute(
            f'INSERT INTO {tblname} SELECT * FROM _old_{tblname} ORDER BY id DESC LIMIT {chunk} RETURNING id;'
@ryanpetrello (Contributor, Author)

This RETURNING statement yields the primary key of the last insert.

cursor.execute(
    f'INSERT INTO {tblname} SELECT * FROM _old_{tblname} ORDER BY id DESC LIMIT {chunk} RETURNING id;'
)
last_insert_pk = cursor.fetchone()
@ryanpetrello (Contributor, Author)

If there is no last insert pk, that means we didn't insert anything, which means the old table is empty (and we're done migrating).

with advisory_lock(f'bigint_migration_{tblname}', wait=False) as acquired:
    if acquired is False:
        return
    chunk = 1000000
@chrismeyersfsu (Member) commented Mar 27, 2020

Make this a setting so support can edit it if a customer migration fails. I'm thinking of the case where Postgres gets overloaded in some way by the massive data-move size because the user's job events are large.

@softwarefactory-project-zuul: Build succeeded.

@softwarefactory-project-zuul: Build succeeded (gate pipeline).

@softwarefactory-project-zuul (bot) merged commit 155a1d9 into ansible:devel on Mar 27, 2020