Update lookup to SQL if postgres db #96
Conversation
Force-pushed from b188602 to 838864a
A few more comments for your perusal. I think it's a good idea to merge this PR before the scheduling one.
checksum_sql_query = <<-SQL
  SELECT COUNT(*) as row_count,
         MD5(STRING_AGG(CHECKSUM_TABLE.ID, '')) as checksum,
         CURRENT_TIMESTAMP(6) as checksum_calculated_at
Will the CURRENT_TIMESTAMP(6) here be the same as the CURRENT_TIMESTAMP(6) in the WHERE clause? I thought it could possibly differ if the query takes a long time to run.
@asatwal - My thought process was that this SELECT would be transactional and so the timestamp would be the same throughout, but your suggestion might be a better way. Maybe @stevenleggdfe might have some insight?
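(For what it's worth, Postgres pins CURRENT_TIMESTAMP to the start of the transaction, so repeated references within a single statement should indeed match. A quick sketch to check this, not part of the PR, using pg_sleep to simulate a slow query:)

-- Both CURRENT_TIMESTAMP(6) values are identical despite the sleep,
-- because Postgres fixes CURRENT_TIMESTAMP at transaction start:
SELECT CURRENT_TIMESTAMP(6) AS t1, pg_sleep(2), CURRENT_TIMESTAMP(6) AS t2;

-- clock_timestamp(), by contrast, advances during execution:
SELECT clock_timestamp() AS t1, pg_sleep(2), clock_timestamp() AS t2;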
@asatwal @ericaporter what would you think about MAX(updated_at) AS checksum_calculated_at instead? Or would that hit performance too heavily?
It looks like MAX(updated_at) would be much more accurate than getting the timestamp at the beginning of the calculation (my suggestion), but I guess it could potentially cause performance issues. I am wondering if @asatwal's suggestion might sit somewhere between these two where trade-offs are concerned?
updated_at is not an indexed column on most database tables, so this may well cause performance problems on large tables. I think we need to avoid any chance of performance problems after our last experience.
I don't believe it will make much of a difference to accuracy: once we have a timestamp, any transactions updated after it will be excluded, both in our calculations and at the BQ end. I can't think of a scenario where MAX(UPDATED_AT) would be more accurate, unless I'm not thinking hard enough :-)
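(For context, a hypothetical mitigation if MAX(updated_at) were ever adopted: an index would avoid the full table scan. This is purely illustrative and not something the PR adds; trainees is an example table name:)

-- Hypothetical: an index on updated_at lets MAX(updated_at) be answered
-- from the index rather than by scanning every row:
CREATE INDEX index_trainees_on_updated_at ON trainees (updated_at);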
I guess I'm thinking real edge cases like if some update is processing in parallel and hasn't been written to the db yet. Not sure if that's even possible though or whether that would cause a problem anyway. So if it creates performance issues let's just stick with current_timestamp(6) I think.
> I guess I'm thinking real edge cases like if some update is processing in parallel and hasn't been written to the db yet. Not sure if that's even possible though or whether that would cause a problem anyway. So if it creates performance issues let's just stick with current_timestamp(6) I think.
Indeed, maybe a possibility that you've taken the current_timestamp and a long parallel process that was running before you took the timestamp finishes and writes to the database just as you're calculating the checksum. Even in that case the updated_at should be slightly later than your current_timestamp, assuming that updated_at is calculated on the write to the database and not before. I think it's a risk we can take for now.
  Rails.logger.info('DfE::Analytics: Entity checksum: Only Postgres databases supported on PRODUCTION')
  return
end
Might be safer to get the current timestamp like this:

checksum_calculated_at =
  ActiveRecord::Base.connection.execute('SELECT CURRENT_TIMESTAMP(6)')[0]['current_timestamp']

And then pass this timestamp into the query as was done previously. Please see my comment below on why I think this may be safer.
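(A sketch of the resulting subquery, with the pre-fetched timestamp interpolated in place of CURRENT_TIMESTAMP(6); the trainees table and the timestamp literal are illustrative values borrowed from later in this thread:)

-- Both fetch paths then filter on the same pre-computed instant rather
-- than evaluating CURRENT_TIMESTAMP(6) separately inside each query:
SELECT trainees.id::TEXT AS ID,
       trainees.updated_at AS UPDATED_AT
FROM trainees
WHERE trainees.updated_at < '2020-11-05 09:25:19.863415'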
@asatwal I had to adapt this a little to make it compatible with SQLite3 as well as Postgres, kept the concept though
  [result['row_count'].to_i, result['checksum'], result['checksum_calculated_at']]
end

def fetch_generic_checksum_data(model)
current_timestamp could be passed into this method rather than calculated here. Please see my comment above.
Now passing in the current_timestamp to both fetches
SELECT #{model.table_name}.id::TEXT as ID
FROM #{model.table_name}
WHERE #{model.table_name}.updated_at < CURRENT_TIMESTAMP(6)
ORDER BY #{model.table_name}.updated_at ASC
@ericaporter suggest omitting the ORDER BY - I'd assume this adds additional processing that we don't use?
I see why it's here, but I think you can't rely on the order in this subquery carrying through to the STRING_AGG in the parent query. It's non-deterministic to my knowledge.
@stevenleggdfe Ah, good to know! So omit the subquery ORDER BY and update the STRING_AGG to be something like MD5(STRING_AGG(CHECKSUM_TABLE.ID, '' ORDER BY CHECKSUM_TABLE.UPDATED_AT))?
Yes, assuming that works in Postgres. That would be how I'd do it in GoogleSQL.
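(It does work in Postgres - the ORDER BY inside the aggregate is what makes the concatenation deterministic. A minimal, self-contained demonstration, independent of the PR's tables:)

-- STRING_AGG with ORDER BY inside the aggregate produces a stable
-- result regardless of the order the input rows arrive in:
SELECT STRING_AGG(x, '' ORDER BY x) AS agg
FROM (VALUES ('b'), ('a'), ('c')) AS t(x);
-- => 'abc' on every run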
> @ericaporter suggest omitting the ORDER BY - I'd assume this adds additional processing that we don't use?
> I see why it's here, but I think you can't rely on the order in this subquery carrying through to the STRING_AGG in the parent query. It's non-deterministic to my knowledge.

Good spot @stevenleggdfe
def fetch_postgresql_checksum_data(model)
  checksum_sql_query = <<-SQL
    SELECT COUNT(*) as row_count,
           MD5(STRING_AGG(CHECKSUM_TABLE.ID, '')) as checksum,
@ericaporter I think this is missing an ORDER BY inside the STRING_AGG() - not in the child query (see comment below)
See comments
Force-pushed from d19b54a to 18a6b46
No further comments from me, but not mine to approve :-)
@@ -25,18 +21,74 @@ def perform
  reschedule_job
end

def build_event_for(model)
  DfE::Analytics::Event.new
@ericaporter @stevenleggdfe - If the row_count is zero, i.e. the table is empty, do we want to handle that situation? Either not send the event at all, OR make sure the code doesn't error, both in dfe-analytics and in dataform.
@stevenleggdfe what should I avoid for dataform?
Good question @asatwal.
My preferred approach would be that if a table is empty we do send a checksum event to BigQuery, with a row_count of 0 and a NULL value for the checksum (not an empty value array - BQ can't store that - and not an empty string, since MD5("") produces a non-null, non-empty value in GoogleSQL). Is that feasible?
Note that in GoogleSQL SELECT MD5("") produces 1B2M2Y8AsgTpgAmY7PhCfg==!
This might be important in cases where all the rows have been deleted in the database but for some reason we still have them in BQ because something went wrong with the deletion events.
So for the Postgres SQL, if run locally:

SELECT count(*) as row_count,
       MD5(STRING_AGG(CHECKSUM_TABLE.ID, '' ORDER BY CHECKSUM_TABLE.UPDATED_AT)) as checksum
FROM
  (SELECT trainees.id::TEXT as ID,
          updated_at as UPDATED_AT
   FROM trainees
   WHERE updated_at < '2020-11-05 09:25:19.863415') CHECKSUM_TABLE;

Result:

row_count | checksum
0         | NULL

In Ruby:

Digest::MD5.hexdigest('')
=> "d41d8cd98f00b204e9800998ecf8427e"

In Postgres SQL:

SELECT MD5('');

Result:

md5
d41d8cd98f00b204e9800998ecf8427e

Which is different from BQ. Which makes me think: do we have the same MD5 in GoogleSQL as in Ruby and Postgres?
@asatwal good spot. It's not the same, it appears.
However I think the difference is that we're missing some conversion to/from hexadecimal. In GoogleSQL SELECT TO_HEX(MD5("")) produces d41d8cd98f00b204e9800998ecf8427e, which matches.
I'll update the dfe-analytics-dataform logic to match.
Actually in GoogleSQL you need the following to get the same result:
SELECT TO_HEX(MD5(''));
Result:
md5
d41d8cd98f00b204e9800998ecf8427e
@ericaporter oh sorry, my bad! [] is fine. [NULL] is not. [""] is not great either.
I have kept MD5(COALESCE(STRING_AGG(CHECKSUM_TABLE.ID, '' ORDER BY CHECKSUM_TABLE.UPDATED_AT ASC), '')) on the Postgres side and Digest::MD5.hexdigest(table_ids.join) on the Ruby side, and for an empty table both will return [0, "d41d8cd98f00b204e9800998ecf8427e"].
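(A sketch of the empty-table case under that approach, reusing the trainees query from earlier in this thread: COALESCE substitutes '' when STRING_AGG returns NULL, so the SQL and Ruby paths agree on the empty-string digest.)

-- On an empty (or fully filtered) table, STRING_AGG yields NULL;
-- COALESCE maps it to '', so MD5 returns the empty-string digest
-- rather than NULL:
SELECT COUNT(*) AS row_count,
       MD5(COALESCE(STRING_AGG(CHECKSUM_TABLE.ID, '' ORDER BY CHECKSUM_TABLE.UPDATED_AT ASC), '')) AS checksum
FROM (SELECT trainees.id::TEXT AS ID,
             trainees.updated_at AS UPDATED_AT
      FROM trainees
      WHERE updated_at < '2020-11-05 09:25:19.863415') CHECKSUM_TABLE;
-- => row_count = 0, checksum = 'd41d8cd98f00b204e9800998ecf8427e'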
👍 LGTM
Force-pushed from 425cd13 to d61a829
If the production database is Postgres, the entity table data / checksum lookup now runs entirely in SQL, so that it executes within the database. This speeds up the query, reduces the memory burden on Rails, and avoids retrieving a large number of IDs into the application.
It was necessary to keep the original ActiveRecord lookup because the test database is SQLite3, which is incompatible with the updated SQL.