-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nightly checksum and checksum mismatch detection assertion #85
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @ericaporter. A few comments for your perusal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good - Just a few more comments
{ | ||
number_of_rows: model.count, | ||
checksum: checksum_data(model)[:checksum], | ||
timestamp: checksum_data(model)[:timestamp] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should only include records created or updated before checksum_calculated_at
, that way we don't have a moving target and the window for any inserts and updates creeping in is very small.
Also ordering by updated_at
would mean we would detect missing updates.
@stevenleggdfe on the dataform side you would also have to use the checksum_calculated_at in a similar way.
{ | |
number_of_rows: model.count, | |
checksum: checksum_data(model)[:checksum], | |
timestamp: checksum_data(model)[:timestamp] | |
} | |
checksum_calculated_at = Time.now.in_time_zone(TIME_ZONE).iso8601(6) | |
table_ids = | |
model | |
.where('updated_at < ?', checksum_calculated_at) | |
.order(updated: :asc) | |
.pluck(:id).join | |
{ | |
row_count: table_ids.count, | |
checksum: Digest::SHA256.hexdigest(table_ids), | |
checksum_calculated_at: checksum_calculated_at | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that works for me. I like the thinking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asatwal I don't think table_ids.count will work here, (or table_ids.size) after we join them, if the ids are double or treble figures etc., I think we would be getting the number of digits and not the distinct number of records. I've updated it to use table_ids count prior to ordering and joining and tried to reuse this for the checksum instead of doing two separate lookups. I've also had to parse the checksum_calculated_at in order to compare with updated_at
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops my bad ! Indeed and well spotted. Is this better:
checksum_calculated_at = Time.now.in_time_zone(TIME_ZONE).iso8601(6)
table_ids =
model
.where('updated_at < ?', checksum_calculated_at)
.order(updated: :asc)
.pluck(:id)
{
row_count: table_ids.count,
checksum: Digest::SHA256.hexdigest(table_ids.join),
checksum_calculated_at: checksum_calculated_at
}
Your change is fine though and works and actually that parsing is needed. I'll let you decide which one is more efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments on the new config item.
Also we'll need to add some blurb to the README.md file - Just some brief notes on how to enable and what it does.
Thanks
self.class.set(wait_until: WAIT_TIME).perform_later | ||
end | ||
|
||
def allow_reschedule? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For config, there is already a pattern in lib\dfe\analytics.rb
. I suggest we follow exactly the same pattern as the config item pseudonymise_web_request_user_id
, except we call the config item entity_table_checks_enabled
and also disable by default. Disabling by default means that we would enable these checks service by service. This is the safest options, although it will be more work to enable this for all the services.
Any opinions on disabling by default @ericaporter @stevenleggdfe ?
Following the above pattern, we should have a method called DfE::Analytics.entity_table_checks_enabled?
to check the config item.
Also we can remove the allow_reschedule?
method here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asatwal @ericaporter I'd be up for disabling by default personally unless we're worried about load on the database generating the checksum?
Also if we're adding configuration for this then can we add the configuration parameter to initialise_analytics as well so that dfe-analytics-dataform can switch the checksum checking on and off depending?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defaulting it to false will give us more control and make sure this works as expected in production 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, initially anyway. I wonder if in time we may want to get more forceful with teams and switch the default to true - but only once we're confident it works and doesn't have a significant performance impact.
end | ||
|
||
@@initialise_event_sent = true # rubocop:disable Style:ClassVars | ||
DfE::Analytics::EntityTableCheckJob.perform_later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check if the job is enabled here:
DfE::Analytics::EntityTableCheckJob.perform_later if DfE::Analytics.entity_table_checks_enabled?
end | ||
|
||
private | ||
|
||
def initialisation_data | ||
def initialise_analytics_data | ||
{ | ||
analytics_version: DfE::Analytics::VERSION, | ||
config: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be useful to send the entity_table_checks_enabled
config item here in the initialise event. I'm not sure though if Analytics dataform will use it though. What do you think @stevenleggdfe ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please @asatwal @ericaporter ! This would enable us to switch checks on/off in Dataform depending
})]) | ||
end | ||
|
||
it 'does not send the event if updated_at is greater than checksum_calculated_at' do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good. Nice work.
…eckJob.perform later conditional
fca3b8c
to
4f4cfae
Compare
Ticket - Nightly checksum and checksum mismatch detection assertion
Changes:
EntityTableCheckJob
toInitialiseEvents
,InitialiseEvents
has been renamed to 'InitialisationEvents'InitialisationEvents
occurred_at
so it wasn't explicitly addedN.B. Updates required to request specs - WIP
Ticket details:
As part of dfe-analytics, create a scheduled job that runs shortly before midnight each night
Make this job stream an event with type ‘entity_table_check’ (or similar) for each model listed in analytics.yml which contains the following fields in DATA:
entity_table_name
number_of_rows
Current Timestamp (When query was run)
checksum e.g. something that matches the output of the GoogleSQL