Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update import job to send unique timestamp and import_entity_table_check event #108

Merged
merged 26 commits into from
Feb 29, 2024

Conversation

ericaporter
Copy link
Contributor

@ericaporter ericaporter commented Jan 24, 2024

The current DfE Analytics import job (aka Backfill) in ruby imports all BQ entities to BQ. The job needs further enhancements to allow DfE Analytics to do a reliable sync of a complete table ie update to the latest data for each row and delete any extraneous rows in BQ.

This PR includes the following changes:

  1. Adds a single unique import entity id to ALL events for a given import as a numerical timestamp in the format: YYYYMMDDHHMMSS. The field for this in the event could be event_tag

  2. For each entity, once the import has been completed, a new checksum event is sent (import_entity_table_check)
    which contains the unique entity import id to the event_tag

  3. A checksum event can now be ordered by ID if created_at / updated_at are missing. Note that this changed will also affect the nightly checksum job

  4. The entity table check and checksum calculation logic has been refactored into service objects

@ericaporter ericaporter changed the title Update import job to send unique timestamp and checksum Update import job to send unique timestamp and import_entity_table_check event Jan 24, 2024
@ericaporter ericaporter requested a review from asatwal January 24, 2024 09:57
Copy link
Collaborator

@asatwal asatwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good @ericaporter . Don't think you need to turn any other of the other smaller modules into services. Just a few more simple collaboration specs I think ?

class GenericChecksumCalculator
include ServicePattern

ALLOWED_ORDER_COLUMNS = %w[CREATED_AT UPDATED_AT].freeze
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought maybe ID should be in ALLOWED_ORDER_COLUMNS, but then the where clause would be wrong if that was done. Maybe ALLOWED_ORDER_COLUMNS should be renamed to something like BUILD_WHERE_CLAUSE_FOR_ORDER_COLUMNS.

Copy link
Collaborator

@asatwal asatwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The collaboration specs for the services are a nice to have.

Copy link
Collaborator

@asatwal asatwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good @ericaporter - With hindsight I can see why PG gets confused by the ID column ordering in the Checksum subquery. Well done for finding the problem and persisting with this.

@ericaporter ericaporter merged commit 7b7a313 into main Feb 29, 2024
7 checks passed
@ericaporter ericaporter deleted the update-import-job-entity-check branch February 29, 2024 14:21
@ericaporter ericaporter mentioned this pull request Feb 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants