Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow ordering by id if created_at/updated_at is null #164

Merged
merged 1 commit into from
Sep 10, 2024

Conversation

ericaporter
Copy link
Contributor

@ericaporter ericaporter commented Sep 5, 2024

Ticket
Publish have a table "course_subjects" that has been long term out of sync with Big Query
Investigation showed this was because they have a number of records where updated_at is null and they were being excluded from the row_count during entity table check job / import entity rake task causing the row_count and checksum to not match
This is thought to apply to several other tables in Publish and to at least one other service

This PR updates:

  1. the entity_table_check_job to firstly order by created_at if updated_at has null values, and finally to default to ID if updated_at/created_at are missing or have null values
  2. the where clause to include order_by records with null values so that we get an accurate row count

@ericaporter ericaporter force-pushed the allow-row-count-to-use-ids branch from 092f147 to 50f041e Compare September 6, 2024 08:14
@@ -65,7 +67,8 @@ def build_select_and_order_clause(order_column, table_name_sanitized)
def build_where_clause(order_column, table_name_sanitized, checksum_calculated_at_sanitized)
return '' unless WHERE_CLAUSE_ORDER_COLUMNS.map(&:downcase).include?(order_column.downcase)

"WHERE DATE_TRUNC('milliseconds', #{table_name_sanitized}.#{order_column.downcase}) < DATE_TRUNC('milliseconds', #{checksum_calculated_at_sanitized}::timestamp)"
# Add IS NULL to include records with null updated_at / created_at values
"WHERE (#{table_name_sanitized}.#{order_column.downcase} IS NULL OR DATE_TRUNC('milliseconds', #{table_name_sanitized}.#{order_column.downcase}) < DATE_TRUNC('milliseconds', #{checksum_calculated_at_sanitized}::timestamp))"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericaporter will this addition ever actually do anything, since if it is true then the WHERE clause wouldn't operate, or would be switched to created_at if created_at does not contain null values?

Copy link
Contributor Author

@ericaporter ericaporter Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fair, I can remove. This was the first thing I updated but it won't do anything if we are excluding order columns with null values, as much as I like a belt and braces approach...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to keep as belt and braces I believe.

@@ -78,6 +78,14 @@ def determine_order_column(entity_name, columns)
end
end

def null_values_in_column?(column)
connection.select_value(<<-SQL).to_i.positive?
SELECT COUNT(*)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericaporter maybe check whether being specific as to any field name instead of using * in the COUNT() is more efficient? Certainly in column stores avoiding COUNT(*) is a huge efficiency boost but this is running in Postgres which is a row store.

e.g. SELECT COUNT(#{column})...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenleggdfe I opted for COUNT(*) because I thought COUNT{column} would ignore null values...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it does! That is one scary difference between PostgreSQL and GoogleSQL. Ignore me :-)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can do COUNT(ID), because ID should never be NULL. The concern here is that we may make the Checksum inefficient especially on large tables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asatwal @stevenleggdfe we do another COUNT(*) in the main checksum calculation and I considered having both as count(id) but I couldn't work out of there was a possibility we might have null id values that we should be including. If the performance benefits would outweigh this likelihood, I'm happy to update, but as this fix is about not excluding nulls, I felt it was tempting fate a little 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given Postgres is a row store I'd keep COUNT(*). I'm not aware of much by way of performance increase you'd get from COUNT(id) in a row store.

Copy link
Collaborator

@asatwal asatwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @ericaporter - Just a few comments.

@@ -78,6 +78,14 @@ def determine_order_column(entity_name, columns)
end
end

def null_values_in_column?(column)
connection.select_value(<<-SQL).to_i.positive?
SELECT COUNT(*)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can do COUNT(ID), because ID should never be NULL. The concern here is that we may make the Checksum inefficient especially on large tables.

@@ -65,7 +67,8 @@ def build_select_and_order_clause(order_column, table_name_sanitized)
def build_where_clause(order_column, table_name_sanitized, checksum_calculated_at_sanitized)
return '' unless WHERE_CLAUSE_ORDER_COLUMNS.map(&:downcase).include?(order_column.downcase)

"WHERE DATE_TRUNC('milliseconds', #{table_name_sanitized}.#{order_column.downcase}) < DATE_TRUNC('milliseconds', #{checksum_calculated_at_sanitized}::timestamp)"
# Add IS NULL to include records with null updated_at / created_at values
"WHERE (#{table_name_sanitized}.#{order_column.downcase} IS NULL OR DATE_TRUNC('milliseconds', #{table_name_sanitized}.#{order_column.downcase}) < DATE_TRUNC('milliseconds', #{checksum_calculated_at_sanitized}::timestamp))"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to keep as belt and braces I believe.

Copy link
Collaborator

@asatwal asatwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@ericaporter ericaporter merged commit e742845 into main Sep 10, 2024
5 checks passed
@ericaporter ericaporter deleted the allow-row-count-to-use-ids branch September 10, 2024 14:26
@ericaporter ericaporter mentioned this pull request Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants