Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nightly checksum and checksum mismatch detection assertion #85

Merged
merged 13 commits into from
Sep 27, 2023

Conversation

ericaporter
Copy link
Contributor

@ericaporter ericaporter commented Sep 13, 2023

Ticket - Nightly checksum and checksum mismatch detection assertion

Changes:

  1. To allow the addition of EntityTableCheckJob to InitialiseEvents, InitialiseEvents has been renamed to 'InitialisationEvents'
  2. EntityTableCheckJob has been added to InitialisationEvents
  3. To avoid running as a cron job, we are rescheduling the job to run again every 24hrs
  4. The current timestamp is already sent as occurred_at so it wasn't explicitly added
  5. Some Coordination is required with Dataform Analytics to expect the event and ensure the checksum is correct

N.B. Updates required to request specs - WIP

Ticket details:
As part of dfe-analytics, create a scheduled job that runs shortly before midnight each night
Make this job stream an event with type ‘entity_table_check’ (or similar) for each model listed in analytics.yml which contains the following fields in DATA:
entity_table_name
number_of_rows
Current Timestamp (When query was run)
checksum e.g. something that matches the output of the GoogleSQL

@ericaporter ericaporter requested a review from asatwal September 13, 2023 11:54
Copy link
Collaborator

@asatwal asatwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @ericaporter. A few comments for your perusal.

lib/dfe/analytics/entity_table_check_job.rb Show resolved Hide resolved
lib/dfe/analytics/entity_table_check_job.rb Outdated Show resolved Hide resolved
lib/dfe/analytics/entity_table_check_job.rb Outdated Show resolved Hide resolved
spec/dfe/analytics_spec.rb Outdated Show resolved Hide resolved
spec/dfe/analytics/entity_table_check_job_spec.rb Outdated Show resolved Hide resolved
Copy link
Collaborator

@asatwal asatwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good - Just a few more comments

Comment on lines 29 to 39
{
number_of_rows: model.count,
checksum: checksum_data(model)[:checksum],
timestamp: checksum_data(model)[:timestamp]
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only include records created or updated before checksum_calculated_at, that way we don't have a moving target and the window for any inserts and updates creeping in is very small.
Also ordering by updated_at would mean we would detect missing updates.
@stevenleggdfe on the dataform side you would also have to use the checksum_calculated_at in a similar way.

Suggested change
{
number_of_rows: model.count,
checksum: checksum_data(model)[:checksum],
timestamp: checksum_data(model)[:timestamp]
}
checksum_calculated_at = Time.now.in_time_zone(TIME_ZONE).iso8601(6)
table_ids =
model
.where('updated_at < ?', checksum_calculated_at)
.order(updated: :asc)
.pluck(:id).join
{
row_count: table_ids.count,
checksum: Digest::SHA256.hexdigest(table_ids),
checksum_calculated_at: checksum_calculated_at
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that works for me. I like the thinking.

Copy link
Contributor Author

@ericaporter ericaporter Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asatwal I don't think table_ids.count will work here, (or table_ids.size) after we join them, if the ids are double or treble figures etc., I think we would be getting the number of digits and not the distinct number of records. I've updated it to use table_ids count prior to ordering and joining and tried to reuse this for the checksum instead of doing two separate lookups. I've also had to parse the checksum_calculated_at in order to compare with updated_at

Copy link
Collaborator

@asatwal asatwal Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops my bad ! Indeed and well spotted. Is this better:

        checksum_calculated_at = Time.now.in_time_zone(TIME_ZONE).iso8601(6)
        table_ids =
          model
          .where('updated_at < ?', checksum_calculated_at)
          .order(updated: :asc)
          .pluck(:id)

        {
          row_count: table_ids.count,
          checksum: Digest::SHA256.hexdigest(table_ids.join),
          checksum_calculated_at: checksum_calculated_at
        }

Your change is fine though and works and actually that parsing is needed. I'll let you decide which one is more efficient.

lib/dfe/analytics/entity_table_check_job.rb Outdated Show resolved Hide resolved
lib/dfe/analytics/entity_table_check_job.rb Outdated Show resolved Hide resolved
spec/dfe/analytics/entity_table_check_job_spec.rb Outdated Show resolved Hide resolved
spec/dfe/analytics/entity_table_check_job_spec.rb Outdated Show resolved Hide resolved
Copy link
Collaborator

@asatwal asatwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments on the new config item.

Also we'll need to add some blurb to the README.md file - Just some brief notes on how to enable and what it does.

Thanks

self.class.set(wait_until: WAIT_TIME).perform_later
end

def allow_reschedule?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For config, there is already a pattern in lib\dfe\analytics.rb. I suggest we follow exactly the same pattern as the config item pseudonymise_web_request_user_id, except we call the config item entity_table_checks_enabled and also disable by default. Disabling by default means that we would enable these checks service by service. This is the safest options, although it will be more work to enable this for all the services.

Any opinions on disabling by default @ericaporter @stevenleggdfe ?

Following the above pattern, we should have a method called DfE::Analytics.entity_table_checks_enabled? to check the config item.

Also we can remove the allow_reschedule? method here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asatwal @ericaporter I'd be up for disabling by default personally unless we're worried about load on the database generating the checksum?

Also if we're adding configuration for this then can we add the configuration parameter to initialise_analytics as well so that dfe-analytics-dataform can switch the checksum checking on and off depending?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaulting it to false will give us more control and make sure this works as expected in production 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, initially anyway. I wonder if in time we may want to get more forceful with teams and switch the default to true - but only once we're confident it works and doesn't have a significant performance impact.

end

@@initialise_event_sent = true # rubocop:disable Style:ClassVars
DfE::Analytics::EntityTableCheckJob.perform_later
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check if the job is enabled here:

DfE::Analytics::EntityTableCheckJob.perform_later if DfE::Analytics.entity_table_checks_enabled?

end

private

def initialisation_data
def initialise_analytics_data
{
analytics_version: DfE::Analytics::VERSION,
config: {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to send the entity_table_checks_enabled config item here in the initialise event. I'm not sure though if Analytics dataform will use it though. What do you think @stevenleggdfe ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please @asatwal @ericaporter ! This would enable us to switch checks on/off in Dataform depending

})])
end

it 'does not send the event if updated_at is greater than checksum_calculated_at' do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

Copy link
Collaborator

@asatwal asatwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good. Nice work.

@ericaporter ericaporter merged commit 9caf19f into main Sep 27, 2023
@ericaporter ericaporter deleted the nightly-checksum branch September 27, 2023 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants