Allow ordering by id if created_at/updated_at is null #164
Conversation
@@ -65,7 +67,8 @@ def build_select_and_order_clause(order_column, table_name_sanitized)
   def build_where_clause(order_column, table_name_sanitized, checksum_calculated_at_sanitized)
     return '' unless WHERE_CLAUSE_ORDER_COLUMNS.map(&:downcase).include?(order_column.downcase)

-    "WHERE DATE_TRUNC('milliseconds', #{table_name_sanitized}.#{order_column.downcase}) < DATE_TRUNC('milliseconds', #{checksum_calculated_at_sanitized}::timestamp)"
+    # Add IS NULL to include records with null updated_at / created_at values
+    "WHERE (#{table_name_sanitized}.#{order_column.downcase} IS NULL OR DATE_TRUNC('milliseconds', #{table_name_sanitized}.#{order_column.downcase}) < DATE_TRUNC('milliseconds', #{checksum_calculated_at_sanitized}::timestamp))"
@ericaporter will this addition ever actually do anything? If the column is true (null), the WHERE clause wouldn't operate, or the ordering would be switched to created_at when created_at contains no null values.
I think that's fair, I can remove. This was the first thing I updated but it won't do anything if we are excluding order columns with null values, as much as I like a belt and braces approach...
OK to keep as belt and braces I believe.
@@ -78,6 +78,14 @@ def determine_order_column(entity_name, columns)
     end
   end

+  def null_values_in_column?(column)
+    connection.select_value(<<-SQL).to_i.positive?
+      SELECT COUNT(*)
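The hunk is truncated after the `SELECT COUNT(*)`; a plausible shape for the full helper, given the method name, is sketched below. The `WHERE ... IS NULL` filter and the `table_name`/`connection` arguments are assumptions for illustration, not the PR's exact code:

```ruby
# Hypothetical completion of null_values_in_column? for illustration.
# Builds the SQL separately so it can be inspected without a database.
def null_count_sql(table_name, column)
  <<-SQL
    SELECT COUNT(*)
    FROM #{table_name}
    WHERE #{column} IS NULL
  SQL
end

# `connection` is assumed to respond to select_value, ActiveRecord-style.
def null_values_in_column?(connection, table_name, column)
  connection.select_value(null_count_sql(table_name, column)).to_i.positive?
end
```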
@ericaporter maybe check whether being specific as to any field name instead of using * in the COUNT() is more efficient? Certainly in column stores avoiding COUNT(*) is a huge efficiency boost but this is running in Postgres which is a row store.
e.g. SELECT COUNT(#{column})
...
@stevenleggdfe I opted for COUNT(*) because I thought COUNT(#{column}) would ignore null values...
So it does! That is one scary difference between PostgreSQL and GoogleSQL. Ignore me :-)
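The distinction being confirmed here (in PostgreSQL, `COUNT(column)` skips NULLs while `COUNT(*)` counts every row) can be mimicked with a small Ruby analogy over hypothetical rows:

```ruby
# Hypothetical rows mimicking PostgreSQL's counting semantics.
rows = [
  { id: 1, updated_at: Time.at(0) },
  { id: 2, updated_at: nil }, # would be missed by COUNT(updated_at)
  { id: 3, updated_at: nil },
]

count_star   = rows.size                               # like COUNT(*): every row
count_column = rows.count { |r| !r[:updated_at].nil? } # like COUNT(updated_at): non-null only
```

So a `COUNT(updated_at)` null check would always return zero, which is exactly why the helper uses `COUNT(*)`.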
Maybe we can do COUNT(ID), because ID should never be NULL. The concern here is that we may make the Checksum inefficient especially on large tables.
@asatwal @stevenleggdfe we do another COUNT(*) in the main checksum calculation, and I considered having both as COUNT(id), but I couldn't work out if there was a possibility we might have null id values that we should be including. If the performance benefits would outweigh this likelihood, I'm happy to update, but as this fix is about not excluding nulls, I felt it was tempting fate a little 😄
Given Postgres is a row store I'd keep COUNT(*). I'm not aware of much by way of performance increase you'd get from COUNT(id) in a row store.
Looks good @ericaporter - Just a few comments.
👍
Ticket
Publish has a table, "course_subjects", that has been out of sync with BigQuery long term.
Investigation showed this was because it has a number of records where updated_at is null; these were being excluded from the row_count during the entity table check job / import entity rake task, causing the row_count and checksum to not match.
This is thought to apply to several other tables in Publish and to at least one other service
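The mismatch described above can be illustrated with a small Ruby analogy (hypothetical data, not from the service): a strict `<` comparison against the checksum timestamp silently drops rows whose timestamp was never set.

```ruby
require 'time'

cutoff = Time.parse('2024-01-01T00:00:00Z') # stand-in for checksum_calculated_at
rows = [
  { id: 1, updated_at: Time.parse('2023-06-01T00:00:00Z') },
  { id: 2, updated_at: nil }, # never updated: silently excluded by `<`
]

# Old behaviour: only rows with a non-null updated_at before the cutoff count.
old_count = rows.count { |r| r[:updated_at] && r[:updated_at] < cutoff }
# Fixed behaviour: null timestamps are included too.
new_count = rows.count { |r| r[:updated_at].nil? || r[:updated_at] < cutoff }
```

With the fix, the local row_count matches the full table size, so the checksums can agree again.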
This PR updates: