Update import job to send unique timestamp and import_entity_table_check event #108
Conversation
This is looking really good @ericaporter. I don't think you need to turn any of the other smaller modules into services. Just a few more simple collaboration specs, I think?
class GenericChecksumCalculator
  include ServicePattern

  ALLOWED_ORDER_COLUMNS = %w[CREATED_AT UPDATED_AT].freeze
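A minimal, self-contained sketch of how a service along these lines could behave. The real service builds its ordering and checksum in SQL against the database; here rows are an in-memory list so the example runs standalone, and everything except the class name and ALLOWED_ORDER_COLUMNS is an assumption:

```ruby
require "digest"

# Sketch only: the real implementation queries the database and
# computes the checksum in SQL rather than in memory.
class GenericChecksumCalculator
  ALLOWED_ORDER_COLUMNS = %w[CREATED_AT UPDATED_AT].freeze

  def initialize(rows, order_column:)
    unless ALLOWED_ORDER_COLUMNS.include?(order_column.upcase)
      raise ArgumentError, "order column must be one of #{ALLOWED_ORDER_COLUMNS}"
    end
    @rows = rows
    @order_column = order_column.downcase.to_sym
  end

  # Orders rows by the chosen column (with id as a tiebreaker) and
  # returns an MD5 checksum of the concatenated ids.
  def call
    ordered_ids = @rows.sort_by { |r| [r[@order_column], r[:id]] }.map { |r| r[:id] }
    Digest::MD5.hexdigest(ordered_ids.join)
  end
end
```

Validating the order column up front is what makes ALLOWED_ORDER_COLUMNS load-bearing: any column outside the allow-list is rejected before it can reach a query.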
I thought maybe ID should be in ALLOWED_ORDER_COLUMNS, but then the where clause would be wrong if that was done. Maybe ALLOWED_ORDER_COLUMNS should be renamed to something like BUILD_WHERE_CLAUSE_FOR_ORDER_COLUMNS.
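A hypothetical illustration of the reviewer's point: if the where clause is derived from the timestamp semantics of the order column, adding ID to the same list would produce a nonsensical filter, so the constant really describes "columns we build a where clause for". The method name, the renamed constant, and the exact filter below are all assumptions, not the gem's actual code:

```ruby
# Columns whose timestamp semantics drive the where clause.
BUILD_WHERE_CLAUSE_FOR_ORDER_COLUMNS = %w[CREATED_AT UPDATED_AT].freeze

# Hypothetical helper: only timestamp order columns get a filter;
# ordering by ID yields no where clause at all.
# (A real implementation would use bound parameters, not interpolation.)
def build_where_clause(order_column, checksum_calculated_at)
  unless BUILD_WHERE_CLAUSE_FOR_ORDER_COLUMNS.include?(order_column.upcase)
    return "" # e.g. ID: no timestamp cut-off makes sense
  end
  "WHERE #{order_column.downcase} <= '#{checksum_calculated_at}'"
end
```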
The collaboration specs for the services are a nice-to-have.
This is looking good @ericaporter - With hindsight I can see why PG gets confused by the ID column ordering in the Checksum subquery. Well done for finding the problem and persisting with this.
The current DfE Analytics import job (aka Backfill) in Ruby imports all entities to BQ. The job needs further enhancements to allow DfE Analytics to do a reliable sync of a complete table, i.e. updating each row to the latest data and deleting any extraneous rows in BQ.
This PR includes the following changes:

- Adds a single unique import entity id to ALL events for a given import, as a numerical timestamp in the format YYYYMMDDHHMMSS. The field for this in the event could be event_tag.
- For each entity, once the import has been completed, a new checksum event is sent (import_entity_table_check), which carries the unique entity import id in its event_tag.
- A checksum event can now be ordered by ID if created_at / updated_at are missing. Note that this change will also affect the nightly checksum job.
- The entity table check and checksum calculation logic has been refactored into service objects.
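The unique import id and the final checksum event described above can be sketched as follows. This is a rough illustration under stated assumptions: only the YYYYMMDDHHMMSS format, the event_tag field, and the import_entity_table_check event type come from this PR; the method names and the remaining event fields are hypothetical:

```ruby
require "time"

# One id per import run: a numerical timestamp in YYYYMMDDHHMMSS format,
# shared by every event the import sends.
def unique_import_id(now = Time.now.utc)
  now.strftime("%Y%m%d%H%M%S")
end

# Hypothetical checksum event sent once an entity's import completes.
# The shared event_tag lets BQ match it to the row events from the same run.
def import_entity_table_check_event(entity_name, checksum, import_id)
  {
    "event_type" => "import_entity_table_check",
    "entity_table_name" => entity_name,
    "event_tag" => import_id,
    "data" => { "checksum" => checksum }
  }
end
```

Because every event in a run carries the same event_tag, a consumer can group all row events and the trailing checksum event for one import without any extra bookkeeping.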