Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update lookup to SQL if postgres db #96

Merged
merged 8 commits into from
Nov 17, 2023
Merged

Conversation

ericaporter
Copy link
Contributor

@ericaporter ericaporter commented Nov 2, 2023

If the production database is Postgres, to speed up the query, the entity table data / checksum lookup is updated to be entirely in SQL so that it is executed within the db. This will reduce the memory burden on Rails and also prevent the retrieval of a large number of IDs.

It was necessary to keep the original Active Record lookup given that the test db is SQLite3 and the updated SQL is incompatible.

@ericaporter ericaporter force-pushed the update-entity-table-lookup branch from b188602 to 838864a Compare November 3, 2023 15:34
Copy link
Collaborator

@asatwal asatwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments for your perusal. I think it's a good idea to merge this PR before the scheduling one.

lib/dfe/analytics/entity_table_check_job.rb Outdated Show resolved Hide resolved
lib/dfe/analytics/entity_table_check_job.rb Outdated Show resolved Hide resolved
lib/dfe/analytics/entity_table_check_job.rb Outdated Show resolved Hide resolved
lib/dfe/analytics/entity_table_check_job.rb Show resolved Hide resolved
checksum_sql_query = <<-SQL
SELECT COUNT(*) as row_count,
MD5(STRING_AGG(CHECKSUM_TABLE.ID, '')) as checksum,
CURRENT_TIMESTAMP(6) as checksum_calculated_at
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the CURRENT_TIMESTAMP(6) here be the same as CURRENT_TIMESTAMP(6) in the where clause. I thought it possibly could be different if the query takes a long time to run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asatwal - My thought process was that this SELECT would be transactional and so the timestamp would be the same throughout, but your suggestion might be a better way. Maybe @stevenleggdfe might have some insight?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asatwal @ericaporter what would you think about MAX(updated_at) AS checksum_calculated_at instead? Or would that hit performance too heavily?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like MAX(updated_at) would be much more accurate than getting the timestamp at the beginning of the calculation (my suggestion), but I guess could potentially cause performance issues. I am wondering if @asatwal's suggestion might be somewhere in between these two where trade offs are concerned?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated_at is not an indexed column on most database tables, so this may well cause performance problems on large tables. I think we need to avoid any chance of performance problems after our last experience.

I don't believe it will make much of difference to accuracy as once we have a timestamp and any transactions updated after this timestamp will be excluded, both in our calculations and the BQ End. Can't think of a scenario where MAX(UPDATED_AT) would be more accurate, unless I'm not thinking hard enough :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm thinking real edge cases like if some update is processing in parallel and hasn't been written to the db yet. Not sure if that's even possible though or whether that would cause a problem anyway. So if it creates performance issues let's just stick with current_timestamp(6) I think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm thinking real edge cases like if some update is processing in parallel and hasn't been written to the db yet. Not sure if that's even possible though or whether that would cause a problem anyway. So if it creates performance issues let's just stick with current_timestamp(6) I think.

Indeed, maybe a possibility that you've taken the current_timestamp and a long parallel process that was running before you took the timestamp finishes and writes to the database just as you're calculating the checksum. Even in that case the updated_at should be slightly later than your current_timestamp assuming that updated_at is calculated on the write to the database and not before. I think it's a risk we can take for now.

Rails.logger.info('DfE::Analytics: Entity checksum: Only Postgres databases supported on PRODUCTION')
return
end

Copy link
Collaborator

@asatwal asatwal Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be safer to get the current timestamp like this:

checksum_calculated_at = 
  ActiveRecord::Base.connection.execute('SELECT CURRENT_TIMESTAMP(6)')[0]['current_timestamp']

And then pass this timestamp into the query like it was done previous. Please see comment below on why I think this may be safer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asatwal I had to adapt this a little to make it compatible with SQLite3 as well as Postgres, kept the concept though

[result['row_count'].to_i, result['checksum'], result['checksum_calculated_at']]
end

def fetch_generic_checksum_data(model)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current_timestamp could be passed into this method rather than calculating. Please see comment above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now passing in the current_timestamp to both fetches

SELECT #{model.table_name}.id::TEXT as ID
FROM #{model.table_name}
WHERE #{model.table_name}.updated_at < CURRENT_TIMESTAMP(6)
ORDER BY #{model.table_name}.updated_at ASC
Copy link
Contributor

@stevenleggdfe stevenleggdfe Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericaporter suggest omitting the ORDER BY - I'd assume this adds additional processing that we don't use?

I see why it's here but I think you can't rely on the ORDER in this subquery carrying through to the STRING_AGG in the parent query. It's non deterministic to my knowledge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenleggdfe Ah good to know! So omit the subquery ORDER and update the STRING_AGG to be something likeMD5(STRING_AGG(CHECKSUM_TABLE.ID, '' ORDER BY CHECKSUM_TABLE.UPDATED_AT)) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, assuming that works in Postgres. That would be how I'd do it in GoogleSQL.

Copy link
Collaborator

@asatwal asatwal Nov 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericaporter suggest omitting the ORDER BY - I'd assume this adds additional processing that we don't use?

I see why it's here but I think you can't rely on the ORDER in this subquery carrying through to the STRING_AGG in the parent query. It's non deterministic to my knowledge.

Good spot @stevenleggdfe

def fetch_postgresql_checksum_data(model)
checksum_sql_query = <<-SQL
SELECT COUNT(*) as row_count,
MD5(STRING_AGG(CHECKSUM_TABLE.ID, '')) as checksum,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericaporter I think this is missing an ORDER BY inside the STRING_AGG() - not in the child query (see comment below)

Copy link
Contributor

@stevenleggdfe stevenleggdfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments

@ericaporter ericaporter force-pushed the update-entity-table-lookup branch from d19b54a to 18a6b46 Compare November 9, 2023 16:07
Copy link
Contributor

@stevenleggdfe stevenleggdfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No further comments from but not mine to approve :-)

@@ -25,18 +21,74 @@ def perform
reschedule_job
end

def build_event_for(model)
DfE::Analytics::Event.new
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericaporter @stevenleggdfe - If the row_count is zero ie the table is empty, do we want to handle that situation ? Either not send the event at all OR make sure the code doesn't error both in dfe-analytics and in dataform.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenleggdfe what should I avoid for dataform?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question @asatwal .

My preferred approach would be that if a table is empty we do send a checksum event to BigQuery with a row_count of 0 and a NULL value (not an empty value array - BQ can't store that - or an empty string as MD5("") produces a non null or empty value in GoogleSQL) for the checksum. Is that feasible?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note in Google SQL SELECT MD5("") produces 1B2M2Y8AsgTpgAmY7PhCfg== !

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be important in cases where all the rows have been deleted in the database but for some reason we still have them in BQ because something went wrong with the deletion events.

Copy link
Collaborator

@asatwal asatwal Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for the Postgres SQL if run locally:

SELECT count (*)  as row_count, MD5(STRING_AGG(CHECKSUM_TABLE.ID, '' ORDER BY CHECKSUM_TABLE.UPDATED_AT)) as checksum
FROM 
(SELECT trainees.id::TEXT as ID,
        updated_at as UPDATED_AT
 FROM trainees
 WHERE updated_at < '2020-11-05 09:25:19.863415') CHECKSUM_TABLE;

Result:
row_count checksum
0 NULL

In Ruby:
Digest::MD5.hexdigest('')
=> "d41d8cd98f00b204e9800998ecf8427e"

In Postgres SQL:

SELECT MD5('');

Result:
md5
d41d8cd98f00b204e9800998ecf8427e

Which is different from BQ. Which makes me think do we have the same MD5 in Google SQL as Ruby and Postgres ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asatwal good spot. It's not the same it appears.

However I think the difference is that we're missing some conversion to/from hexadecimal.

In GoogleSQL SELECT TO_HEX(MD5("")) produces d41d8cd98f00b204e9800998ecf8427e, which matches.

I'll update the dfe-analytics-dataform logic to match.

Copy link
Collaborator

@asatwal asatwal Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually in GoogleSQL you need the following to get the same result:

SELECT TO_HEX(MD5(''));

Result:
md5
d41d8cd98f00b204e9800998ecf8427e

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericaporter oh sorry, my bad! [] is fine. [NULL] is not. [""] is not great either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have kept MD5(COALESCE(STRING_AGG(CHECKSUM_TABLE.ID, '' ORDER BY CHECKSUM_TABLE.UPDATED_AT ASC), '')) and Digest::MD5.hexdigest(table_ids.join) and both will return [0, “d41d8cd98f00b204e9800998ecf8427e”]

Copy link
Collaborator

@asatwal asatwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM

@ericaporter ericaporter force-pushed the update-entity-table-lookup branch from 425cd13 to d61a829 Compare November 17, 2023 14:47
@ericaporter ericaporter merged commit 3ad193f into main Nov 17, 2023
7 checks passed
@ericaporter ericaporter deleted the update-entity-table-lookup branch November 17, 2023 18:19
@ericaporter ericaporter mentioned this pull request Nov 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants