Allow ordering by id if created_at/updated_at is null #164

ericaporter · 2024-09-05T08:29:12Z

Ticket
Publish has a table, "course_subjects", that has been out of sync with BigQuery for a long time.
Investigation showed this was because a number of its records have a null updated_at. These records were being excluded from the row_count during the entity table check job / import entity rake task, so the row_count and checksum did not match.
This is thought to apply to several other tables in Publish and to at least one other service.

This PR updates:

- the entity_table_check_job, to order by created_at if updated_at has null values, and finally to default to id if updated_at/created_at are missing or have null values
- the where clause, to include records whose order_by column is null, so that we get an accurate row count
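
The fallback order described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: `pick_order_column`, `ORDER_CANDIDATES` and the `null_check` lambda are hypothetical names, and the lambda stands in for a database query such as `null_values_in_column?`.

```ruby
# Hypothetical sketch of the fallback: updated_at, then created_at, then id.
# `null_check` is an assumed stand-in for a query that reports whether a
# column contains any NULL values.
ORDER_CANDIDATES = %w[updated_at created_at].freeze

def pick_order_column(columns, null_check)
  ORDER_CANDIDATES.each do |candidate|
    next unless columns.include?(candidate) # column missing from the table
    next if null_check.call(candidate)      # column has NULLs, so skip it

    return candidate
  end

  # Neither timestamp column is usable; fall back to id if the table has one.
  columns.include?('id') ? 'id' : nil
end

columns = %w[id name created_at updated_at]
has_nulls = ->(column) { column == 'updated_at' }
puts pick_order_column(columns, has_nulls) # prints created_at
```

Passing the null check in as a lambda keeps the sketch runnable without a database connection; in the gem the check would presumably run against the real table.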

stevenleggdfe · 2024-09-06T08:21:55Z

lib/dfe/analytics/services/postgres_checksum_calculator.rb

@@ -65,7 +67,8 @@ def build_select_and_order_clause(order_column, table_name_sanitized)
        def build_where_clause(order_column, table_name_sanitized, checksum_calculated_at_sanitized)
          return '' unless WHERE_CLAUSE_ORDER_COLUMNS.map(&:downcase).include?(order_column.downcase)

-          "WHERE DATE_TRUNC('milliseconds', #{table_name_sanitized}.#{order_column.downcase}) < DATE_TRUNC('milliseconds', #{checksum_calculated_at_sanitized}::timestamp)"
+          # Add IS NULL to include records with null updated_at / created_at values
+          "WHERE (#{table_name_sanitized}.#{order_column.downcase} IS NULL OR DATE_TRUNC('milliseconds', #{table_name_sanitized}.#{order_column.downcase}) < DATE_TRUNC('milliseconds', #{checksum_calculated_at_sanitized}::timestamp))"


@ericaporter will this addition ever actually do anything? If the IS NULL condition is ever true, then either the WHERE clause wouldn't be applied at all, or the order column would already have been switched to created_at (if created_at does not contain null values).

I think that's fair, I can remove. This was the first thing I updated but it won't do anything if we are excluding order columns with null values, as much as I like a belt and braces approach...

OK to keep as belt and braces I believe.
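
For context on why the original filter dropped these rows: in SQL a comparison against NULL evaluates to NULL rather than true, and WHERE keeps only rows where the predicate is true. The plain-Ruby imitation below is a sketch of that behaviour with made-up rows; `sql_less_than` and `filter_rows` are hypothetical helpers, not part of the gem.

```ruby
# Mimics SQL's `column < cutoff`: a NULL operand yields NULL (nil), not false.
def sql_less_than(value, cutoff)
  return nil if value.nil?

  value < cutoff
end

# WHERE keeps a row only when the predicate is true; NULL behaves like false.
# `include_nulls` imitates adding `column IS NULL OR ...` to the clause.
def filter_rows(rows, cutoff, include_nulls:)
  rows.select do |row|
    result = sql_less_than(row[:updated_at], cutoff)
    result = true if include_nulls && row[:updated_at].nil?
    result == true
  end
end

cutoff = Time.utc(2024, 9, 5)
rows = [
  { id: 1, updated_at: Time.utc(2024, 9, 1) },
  { id: 2, updated_at: nil },
  { id: 3, updated_at: Time.utc(2024, 9, 6) }
]

puts filter_rows(rows, cutoff, include_nulls: false).map { |r| r[:id] }.inspect # [1]
puts filter_rows(rows, cutoff, include_nulls: true).map { |r| r[:id] }.inspect  # [1, 2]
```

As discussed in the thread, the guard only matters if a column containing nulls is still chosen as the order column, which is why it is described as belt and braces.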

stevenleggdfe · 2024-09-06T08:24:46Z

lib/dfe/analytics/services/entity_table_checks.rb

@@ -78,6 +78,14 @@ def determine_order_column(entity_name, columns)
          end
        end

+        def null_values_in_column?(column)
+          connection.select_value(<<-SQL).to_i.positive?
+            SELECT COUNT(*)


@ericaporter maybe check whether being specific as to any field name instead of using * in the COUNT() is more efficient? Certainly in column stores avoiding COUNT(*) is a huge efficiency boost but this is running in Postgres which is a row store.

e.g. SELECT COUNT(#{column})...

@stevenleggdfe I opted for COUNT(*) because I thought COUNT(column) would ignore null values...

So it does! That is one scary difference between PostgreSQL and GoogleSQL. Ignore me :-)

Maybe we can do COUNT(ID), because ID should never be NULL. The concern here is that we may make the Checksum inefficient especially on large tables.

@asatwal @stevenleggdfe we do another COUNT(*) in the main checksum calculation and I considered having both as COUNT(id), but I couldn't work out if there was a possibility we might have null id values that we should be including. If the performance benefits would outweigh this likelihood, I'm happy to update, but as this fix is about not excluding nulls, I felt it was tempting fate a little 😄

Given Postgres is a row store I'd keep COUNT(*). I'm not aware of much by way of performance increase you'd get from COUNT(id) in a row store.
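
The difference between the two aggregates can be shown with a small plain-Ruby imitation. This is a sketch under assumptions: the query in the diff above is truncated, so the example assumes it counts rows where the checked column is NULL, and `count_star` and `count_column` are hypothetical stand-ins for the SQL aggregates rather than real database calls.

```ruby
# Plain-Ruby stand-ins for the two SQL aggregates (no database involved).
# COUNT(*) counts rows; COUNT(column) counts rows where column is not NULL.
def count_star(rows)
  rows.size
end

def count_column(rows, column)
  rows.count { |row| !row[column].nil? }
end

rows = [
  { id: 1, updated_at: '2024-09-01' },
  { id: 2, updated_at: nil },
  { id: 3, updated_at: nil }
]

# Assumed filter: only the rows whose updated_at is NULL.
null_rows = rows.select { |row| row[:updated_at].nil? }

puts count_star(null_rows)                # 2
puts count_column(null_rows, :updated_at) # 0, so the nulls would go undetected
puts count_column(null_rows, :id)         # 2, as long as id itself is never NULL
```

Counting the checked column itself would always report zero nulls, while counting id gives the same answer as COUNT(*) only if id can never be NULL, which is the uncertainty raised in the thread.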

asatwal

Looks good @ericaporter - Just a few comments.

asatwal · 2024-09-09T09:38:21Z

lib/dfe/analytics/services/entity_table_checks.rb

@@ -78,6 +78,14 @@ def determine_order_column(entity_name, columns)
          end
        end

+        def null_values_in_column?(column)
+          connection.select_value(<<-SQL).to_i.positive?
+            SELECT COUNT(*)


Maybe we can do COUNT(ID), because ID should never be NULL. The concern here is that we may make the Checksum inefficient especially on large tables.

asatwal · 2024-09-09T10:35:55Z

lib/dfe/analytics/services/postgres_checksum_calculator.rb

@@ -65,7 +67,8 @@ def build_select_and_order_clause(order_column, table_name_sanitized)
        def build_where_clause(order_column, table_name_sanitized, checksum_calculated_at_sanitized)
          return '' unless WHERE_CLAUSE_ORDER_COLUMNS.map(&:downcase).include?(order_column.downcase)

-          "WHERE DATE_TRUNC('milliseconds', #{table_name_sanitized}.#{order_column.downcase}) < DATE_TRUNC('milliseconds', #{checksum_calculated_at_sanitized}::timestamp)"
+          # Add IS NULL to include records with null updated_at / created_at values
+          "WHERE (#{table_name_sanitized}.#{order_column.downcase} IS NULL OR DATE_TRUNC('milliseconds', #{table_name_sanitized}.#{order_column.downcase}) < DATE_TRUNC('milliseconds', #{checksum_calculated_at_sanitized}::timestamp))"


OK to keep as belt and braces I believe.

asatwal

👍

ericaporter force-pushed the allow-row-count-to-use-ids branch from 4e0668f to f62bc40 Compare September 5, 2024 18:21

ericaporter requested review from asatwal and stevenleggdfe September 5, 2024 18:50

Allow ordering by id if created_at/updated_at is null

50f041e

ericaporter force-pushed the allow-row-count-to-use-ids branch from 092f147 to 50f041e Compare September 6, 2024 08:14

stevenleggdfe reviewed Sep 6, 2024

asatwal reviewed Sep 9, 2024

asatwal approved these changes Sep 10, 2024

ericaporter merged commit e742845 into main Sep 10, 2024
5 checks passed

ericaporter deleted the allow-row-count-to-use-ids branch September 10, 2024 14:26

ericaporter mentioned this pull request Sep 10, 2024

release v1.14.2 #165

Merged
