Update lookup to SQL if postgres db #96

ericaporter · 2023-11-02T19:58:53Z

If the production database is Postgres, to speed up the query, the entity table data / checksum lookup is updated to be entirely in SQL so that it is executed within the db. This will reduce the memory burden on Rails and also prevent the retrieval of a large number of IDs.

It was necessary to keep the original Active Record lookup given that the test db is SQLite3 and the updated SQL is incompatible.

asatwal

A few more comments for your perusal. I think it's a good idea to merge this PR before the scheduling one.

lib/dfe/analytics/entity_table_check_job.rb

asatwal · 2023-11-07T17:09:36Z

lib/dfe/analytics/entity_table_check_job.rb

+        checksum_sql_query = <<-SQL
+          SELECT COUNT(*) as row_count,
+            MD5(STRING_AGG(CHECKSUM_TABLE.ID, '')) as checksum,
+            CURRENT_TIMESTAMP(6) as checksum_calculated_at


Will the CURRENT_TIMESTAMP(6) here be the same as CURRENT_TIMESTAMP(6) in the where clause. I thought it possibly could be different if the query takes a long time to run.

@asatwal - My thought process was that this SELECT would be transactional and so the timestamp would be the same throughout, but your suggestion might be a better way. Maybe @stevenleggdfe might have some insight?

@asatwal @ericaporter what would you think about MAX(updated_at) AS checksum_calculated_at instead? Or would that hit performance too heavily?

It looks like MAX(updated_at) would be much more accurate than getting the timestamp at the beginning of the calculation (my suggestion), but I guess could potentially cause performance issues. I am wondering if @asatwal's suggestion might be somewhere in between these two where trade offs are concerned?

updated_at is not an indexed column on most database tables, so this may well cause performance problems on large tables. I think we need to avoid any chance of performance problems after our last experience.

I don't believe it will make much of difference to accuracy as once we have a timestamp and any transactions updated after this timestamp will be excluded, both in our calculations and the BQ End. Can't think of a scenario where MAX(UPDATED_AT) would be more accurate, unless I'm not thinking hard enough :-)

I guess I'm thinking real edge cases like if some update is processing in parallel and hasn't been written to the db yet. Not sure if that's even possible though or whether that would cause a problem anyway. So if it creates performance issues let's just stick with current_timestamp(6) I think.

I guess I'm thinking real edge cases like if some update is processing in parallel and hasn't been written to the db yet. Not sure if that's even possible though or whether that would cause a problem anyway. So if it creates performance issues let's just stick with current_timestamp(6) I think.

Indeed, maybe a possibility that you've taken the current_timestamp and a long parallel process that was running before you took the timestamp finishes and writes to the database just as you're calculating the checksum. Even in that case the updated_at should be slightly later than your current_timestamp assuming that updated_at is calculated on the write to the database and not before. I think it's a risk we can take for now.

asatwal · 2023-11-07T17:22:08Z

lib/dfe/analytics/entity_table_check_job.rb

+          Rails.logger.info('DfE::Analytics: Entity checksum: Only Postgres databases supported on PRODUCTION')
+          return
+        end
+


Might be safer to get the current timestamp like this:

checksum_calculated_at = ActiveRecord::Base.connection.execute('SELECT CURRENT_TIMESTAMP(6)')[0]['current_timestamp']

And then pass this timestamp into the query like it was done previous. Please see comment below on why I think this may be safer.

@asatwal I had to adapt this a little to make it compatible with SQLite3 as well as Postgres, kept the concept though

asatwal · 2023-11-07T17:27:11Z

lib/dfe/analytics/entity_table_check_job.rb

+        [result['row_count'].to_i, result['checksum'], result['checksum_calculated_at']]
+      end
+
+      def fetch_generic_checksum_data(model)


current_timestamp could be passed into this method rather than calculating. Please see comment above.

Now passing in the current_timestamp to both fetches

stevenleggdfe · 2023-11-08T11:40:52Z

lib/dfe/analytics/entity_table_check_job.rb

+            SELECT #{model.table_name}.id::TEXT as ID
+            FROM #{model.table_name}
+            WHERE #{model.table_name}.updated_at < CURRENT_TIMESTAMP(6)
+            ORDER BY #{model.table_name}.updated_at ASC


@ericaporter suggest omitting the ORDER BY - I'd assume this adds additional processing that we don't use?

I see why it's here but I think you can't rely on the ORDER in this subquery carrying through to the STRING_AGG in the parent query. It's non deterministic to my knowledge.

@stevenleggdfe Ah good to know! So omit the subquery ORDER and update the STRING_AGG to be something likeMD5(STRING_AGG(CHECKSUM_TABLE.ID, '' ORDER BY CHECKSUM_TABLE.UPDATED_AT)) ?

Yes, assuming that works in Postgres. That would be how I'd do it in GoogleSQL.

@ericaporter suggest omitting the ORDER BY - I'd assume this adds additional processing that we don't use?

I see why it's here but I think you can't rely on the ORDER in this subquery carrying through to the STRING_AGG in the parent query. It's non deterministic to my knowledge.

Good spot @stevenleggdfe

stevenleggdfe · 2023-11-08T11:42:58Z

lib/dfe/analytics/entity_table_check_job.rb

+      def fetch_postgresql_checksum_data(model)
+        checksum_sql_query = <<-SQL
+          SELECT COUNT(*) as row_count,
+            MD5(STRING_AGG(CHECKSUM_TABLE.ID, '')) as checksum,


@ericaporter I think this is missing an ORDER BY inside the STRING_AGG() - not in the child query (see comment below)

stevenleggdfe

See comments

stevenleggdfe

No further comments from but not mine to approve :-)

asatwal · 2023-11-10T09:25:30Z

lib/dfe/analytics/entity_table_check_job.rb

@@ -25,18 +21,74 @@ def perform
        reschedule_job
      end

+      def build_event_for(model)
+        DfE::Analytics::Event.new


@ericaporter @stevenleggdfe - If the row_count is zero ie the table is empty, do we want to handle that situation ? Either not send the event at all OR make sure the code doesn't error both in dfe-analytics and in dataform.

@stevenleggdfe what should I avoid for dataform?

Good question @asatwal .

My preferred approach would be that if a table is empty we do send a checksum event to BigQuery with a row_count of 0 and a NULL value (not an empty value array - BQ can't store that - or an empty string as MD5("") produces a non null or empty value in GoogleSQL) for the checksum. Is that feasible?

Note in Google SQL SELECT MD5("") produces 1B2M2Y8AsgTpgAmY7PhCfg== !

This might be important in cases where all the rows have been deleted in the database but for some reason we still have them in BQ because something went wrong with the deletion events.

So for the Postgres SQL if run locally:

SELECT count (*) as row_count, MD5(STRING_AGG(CHECKSUM_TABLE.ID, '' ORDER BY CHECKSUM_TABLE.UPDATED_AT)) as checksum FROM (SELECT trainees.id::TEXT as ID, updated_at as UPDATED_AT FROM trainees WHERE updated_at < '2020-11-05 09:25:19.863415') CHECKSUM_TABLE;

Result:
row_count checksum
0 NULL

In Ruby:
Digest::MD5.hexdigest('')
=> "d41d8cd98f00b204e9800998ecf8427e"

In Postgres SQL:

SELECT MD5('');

Result:
md5
d41d8cd98f00b204e9800998ecf8427e

Which is different from BQ. Which makes me think do we have the same MD5 in Google SQL as Ruby and Postgres ?

@asatwal good spot. It's not the same it appears.

However I think the difference is that we're missing some conversion to/from hexadecimal.

In GoogleSQL SELECT TO_HEX(MD5("")) produces d41d8cd98f00b204e9800998ecf8427e, which matches.

I'll update the dfe-analytics-dataform logic to match.

Actually in GoogleSQL you need the following to get the same result:

SELECT TO_HEX(MD5(''));

Result:
md5
d41d8cd98f00b204e9800998ecf8427e

@ericaporter oh sorry, my bad! [] is fine. [NULL] is not. [""] is not great either.

I have kept MD5(COALESCE(STRING_AGG(CHECKSUM_TABLE.ID, '' ORDER BY CHECKSUM_TABLE.UPDATED_AT ASC), '')) and Digest::MD5.hexdigest(table_ids.join) and both will return [0, “d41d8cd98f00b204e9800998ecf8427e”]

asatwal

👍 LGTM

…n prod

ericaporter force-pushed the update-entity-table-lookup branch from b188602 to 838864a Compare November 3, 2023 15:34

asatwal reviewed Nov 3, 2023

View reviewed changes

ericaporter requested a review from stevenleggdfe November 7, 2023 09:06

asatwal reviewed Nov 7, 2023

View reviewed changes

stevenleggdfe reviewed Nov 8, 2023

View reviewed changes

stevenleggdfe requested changes Nov 8, 2023

View reviewed changes

ericaporter force-pushed the update-entity-table-lookup branch from d19b54a to 18a6b46 Compare November 9, 2023 16:07

ericaporter requested review from asatwal and stevenleggdfe November 9, 2023 16:10

stevenleggdfe reviewed Nov 9, 2023

View reviewed changes

asatwal reviewed Nov 10, 2023

View reviewed changes

asatwal approved these changes Nov 13, 2023

View reviewed changes

stevenleggdfe approved these changes Nov 13, 2023

View reviewed changes

ericaporter added 8 commits November 17, 2023 14:34

Update lookup to SQL if postgres db

86d35fa

Refactor lookup

c4f0c0d

Update Postgres checksum timestamp, add conditional for postgres db o…

75cbe11

…n prod

Revert to original conditional, rubocop unhappy

1202d4d

Update timestamp to pass into both postgres and SQLite3/other database

04b7040

Sanitize table name

20c7c00

Handle row_count equals zero

8731369

Rubocp updates and add ASC

d61a829

ericaporter force-pushed the update-entity-table-lookup branch from 425cd13 to d61a829 Compare November 17, 2023 14:47

ericaporter merged commit 3ad193f into main Nov 17, 2023
7 checks passed

ericaporter deleted the update-entity-table-lookup branch November 17, 2023 18:19

ericaporter mentioned this pull request Nov 17, 2023

V1.11.2 release #97

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update lookup to SQL if postgres db #96

Update lookup to SQL if postgres db #96

ericaporter commented Nov 2, 2023 •

edited

Loading

asatwal left a comment

asatwal Nov 7, 2023

ericaporter Nov 7, 2023

stevenleggdfe Nov 8, 2023

ericaporter Nov 8, 2023

asatwal Nov 8, 2023

stevenleggdfe Nov 8, 2023

asatwal Nov 9, 2023

asatwal Nov 7, 2023 •

edited

Loading

ericaporter Nov 9, 2023

asatwal Nov 7, 2023

ericaporter Nov 9, 2023

stevenleggdfe Nov 8, 2023 •

edited

Loading

ericaporter Nov 8, 2023

stevenleggdfe Nov 8, 2023

asatwal Nov 9, 2023 •

edited

Loading

stevenleggdfe Nov 8, 2023

stevenleggdfe left a comment

stevenleggdfe left a comment

asatwal Nov 10, 2023

ericaporter Nov 10, 2023

stevenleggdfe Nov 10, 2023

stevenleggdfe Nov 10, 2023

stevenleggdfe Nov 10, 2023

asatwal Nov 10, 2023 •

edited

Loading

stevenleggdfe Nov 10, 2023

asatwal Nov 10, 2023 •

edited

Loading

stevenleggdfe Nov 10, 2023

ericaporter Nov 13, 2023

asatwal left a comment

Update lookup to SQL if postgres db #96

Update lookup to SQL if postgres db #96

Conversation

ericaporter commented Nov 2, 2023 • edited Loading

asatwal left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

asatwal Nov 7, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stevenleggdfe Nov 8, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

asatwal Nov 9, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stevenleggdfe left a comment

Choose a reason for hiding this comment

stevenleggdfe left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

asatwal Nov 10, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

asatwal Nov 10, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

asatwal left a comment

Choose a reason for hiding this comment

ericaporter commented Nov 2, 2023 •

edited

Loading

asatwal Nov 7, 2023 •

edited

Loading

stevenleggdfe Nov 8, 2023 •

edited

Loading

asatwal Nov 9, 2023 •

edited

Loading

asatwal Nov 10, 2023 •

edited

Loading

asatwal Nov 10, 2023 •

edited

Loading