Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cautionary note about performance in get_row_count method #998

Merged

Conversation

matthewkrausse
Copy link
Contributor

This pull request adds a cautionary note about the performance of the get_row_count method in the BigQuery materialization. The note warns that using SELECT COUNT(*) can be expensive for large tables, especially those with many columns, as BigQuery scans all table data to perform the count. The note aims to inform developers about the potential performance impact and encourage them to consider alternative approaches when dealing with large tables.

@Jason94
Copy link
Collaborator

Jason94 commented Feb 21, 2024

@matthewkrausse Do you think it might be worth changing this method to use the INFORMATION_SCHEMA if it detects it's a table? Apparently that can be out of date sometimes, so maybe we could add a parameter to force it to do a COUNT(*) query?

https://www.metaplane.dev/blog/four-efficient-techniques-to-retrieve-row-counts-in-bigquery-tables-and-views

@matthewkrausse
Copy link
Contributor Author

@Jason94 Yeah, that was what prompted this PR, I made an issue #995. I think that could be a good option but not sure how useful it is given the use case. Usually when I want the row count of a table it is to make a comparison. Thoughts?

@Jason94
Copy link
Collaborator

Jason94 commented Feb 23, 2024

@Jason94 Yeah, that was what prompted this PR, I made an issue #995. I think that could be a good option but not sure how useful it is given the use case. Usually when I want the row count of a table it is to make a comparison. Thoughts?

That's a good point. I'm sure how much of a concern the computational resources of running a big COUNT() would be. Maybe make COUNT() a default, and add an extra boolean flag for something like fast_mode=False that would switch it to use INFORMATION_SCHEMA?

Maybe that's not something that would actually be helpful to the way people are using this method in production code? Maybe we could keep it simple and just always use COUNT.

@matthewkrausse
Copy link
Contributor Author

Yeah, I think it's useful when someone is just looking to see if a table has 0 records, 10 records or 10 million records.But, I don't feel like anyone is doing that programatically. They are likely just using the BQ Studio.

@shaunagm
Copy link
Collaborator

shaunagm commented Feb 27, 2024

So I'm happy to merge this as is, but another approach would be to create another method like get_row_count_estimate that uses the information_schema.tables mentioned in #995. Then in this doc warning, we could say "if an estimate is fine and you don't want the performance hit, use get_row_count_estimate" (and the docstring for get_row_count_estimate could point people to get_row_count for when accuracy is important or they don't care about performance.

@matthewkrausse
Copy link
Contributor Author

So I'm happy to merge this as is, but another approach would be to create another method like get_row_count_estimate that uses the information_schema.tables mentioned in #995. Then in this doc warning, we could say "if an estimate is fine and you don't want the performance hit, use get_row_count_estimate" (and the docstring for get_row_count_estimate could point people to get_row_count for when accuracy is important or they don't care about performance.

Let's go ahead and merge this as is.

Copy link
Collaborator

@shaunagm shaunagm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shaunagm shaunagm merged commit 3536c44 into move-coop:main May 23, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants